   RE: [xml-dev] Opinions


I've been bothered by the "format" problem for a while now.
Here's a draft article + notes I started to write a bit ago,
but haven't touched in months:


> The TypeURI is a type identifier in URI rather than MIME syntax.


I haven't seen anything like this published, and I would be glad
to do an in-depth analysis/critique if you are serious about
pursuing this.

- Chris 

-----Original Message-----
From: Paul Prescod [mailto:paul@prescod.net]
Sent: Thursday, March 20, 2003 2:01 PM
To: 'xml-dev'
Subject: [xml-dev] Opinions

I'm curious whether anyone has proposed something like this before. I 
don't recall stumbling upon it. It just came to me during a bout of 
insomnia. Don't sweat the details...these are late night ramblings.



The Extensible Data Header is a standardized way for text documents to
self-identify their text encoding, MIME type and other metadata.

Problem Statement:

One of the most persistently annoying issues in data management is
keeping metadata with the data it describes. The most difficult (and
important) sort of data to track is the "format" (encoding and media
type) of files. There are a variety of platform specific ways to solve
parts of the problem (file extensions, filesystem attributes, shebang
lines) but none of them survive the various mechanisms for transmitting
data entities, from FTP to HTTP to Jabber.

XML has demonstrated the wide applicability of a solution: transmit the
metadata as part of the same stream as the data. Furthermore, XML
defines (explicitly and implicitly) a bootstrapping process whereby you
can detect the fact that the data is XML through its XML declaration,
its XML version through its version declaration, its encoding through
its encoding declaration and its vocabulary through a DOCTYPE or
namespace declaration. This series of bootstraps has been wildly
successful. With XML 1.1, it is possible for a PalmOS-based XML parser
to reliably detect and decode an SVG document encoded in EBCDIC and
using Macintosh newline conventions (if Macintosh newline conventions
are even possible in EBCDIC). XDH aims to extend this level of
self-descriptiveness to other data formats.


<?text/rtf version="1.5" encoding="ASCII"

<?application/zip version="1.0" encoding="ASCII" dataEncoding="binary"



An XDH Document is a stream of bytes starting with a region of text
known as a Header.

document ::= (header | extendedHeader) separator Body

A header is a stream of bytes in some Unicode encoding (including
historical national encodings such as ASCII, Shift-JIS, etc.). The
algorithm for auto-detecting the encoding is the same as that for XML.

The production for header describes the post-decoding character stream:

header ::= typeDeclaration metadata?

typeDeclaration ::= '<?' TypeDecl?

TypeDecl ::= mimeType | TypeURI

TypeURI ::= URI

DocURI ::= URI

metadata ::= a single element with element type "xml:meta"

The mimeType is a MIME type.

The TypeURI is a type identifier in URI rather than MIME syntax.
Ideally, it can be dereferenced to return information that could be both
human and machine readable. Two media types with different TypeURIs are
presumed to be different for the purposes of this specification (just as
if they were declared with two distinct MIME types).

The DocURI is a pointer to human or machine readable documentation about
the data format and type. It is distinguished from the TypeURI in that
it is not considered an identifier. You could point to one URI for
information about the ZIP file format and I could point to another.

VersionInfo is any string that meets the XML production of the same
name. Its meaning is designed to be defined by the description of the
MIME type.

The Encoding declaration is as defined in XML. It has the same defaults
as XML.

The DataEncodingDecl is a pseudo-attribute named "dataEncoding". It
defines the Unicode encoding not for the header but for the Body. The
value "binary" is used to indicate that no Unicode decoding should be
attempted for the Body. If the DataEncodingDecl is omitted, it defaults
to the same encoding as the header.
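
A minimal Python sketch of how a processor might read the type
declaration and apply the dataEncoding default described above. The
draft's typeDeclaration production is incomplete, so the syntax here
(a type followed by name="value" pseudo-attributes, closed by "?>") is
an assumption inferred from the examples in this message. The function
name parse_type_declaration and the header_encoding parameter are
hypothetical, and encoding auto-detection is assumed to have happened
already.

```python
import re

# Assumed shape, inferred from the examples in the draft:
#   <?TYPE name="value" name="value" ...?>
_DECL = re.compile(
    r'<\?\s*(?P<type>\S+)(?P<attrs>(?:\s+[\w.-]+="[^"]*")*)\s*\?>'
)
_ATTR = re.compile(r'([\w.-]+)="([^"]*)"')


def parse_type_declaration(line, header_encoding="UTF-8"):
    """Return the declared type and pseudo-attributes, or None if absent.

    header_encoding stands in for whatever the XML-style auto-detection
    produced; it is used when no encoding pseudo-attribute is present.
    """
    match = _DECL.search(line)
    if match is None:
        return None
    attrs = dict(_ATTR.findall(match.group("attrs")))
    encoding = attrs.get("encoding", header_encoding)
    return {
        "type": match.group("type"),  # MIME type or TypeURI, not distinguished
        "version": attrs.get("version"),
        "encoding": encoding,
        # An omitted dataEncoding defaults to the header's encoding.
        "data_encoding": attrs.get("dataEncoding", encoding),
    }
```

With the ZIP example closed by "?>", this yields a data_encoding of
"binary"; with the Python example it falls back to the header encoding.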

The XmlVersionDecl declares what version of XML is in use. It defaults 
to 1.1 (???).

The metadata is just an XML element with arbitrary children and 
attributes. Each child element and attribute must have an XML namespace 
and processors should ignore elements or attributes in namespaces they 
are not programmed to recognize.
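
The "ignore what you do not recognize" rule could be sketched in Python
roughly as follows. The namespace URIs, the KNOWN_NAMESPACES set and the
function names are illustrative assumptions, not part of the draft, and
only child elements are filtered here (attributes would need the same
treatment).

```python
import xml.etree.ElementTree as ET

# Hypothetical set of namespaces this particular processor understands.
KNOWN_NAMESPACES = {"http://purl.org/dc/elements/1.1/"}


def namespace_of(qname):
    """Extract the namespace from ElementTree's "{namespace}local" form."""
    if qname.startswith("{"):
        return qname[1:].split("}", 1)[0]
    return None


def recognized_children(meta_xml, known=KNOWN_NAMESPACES):
    """Return only the metadata children whose namespace is known."""
    root = ET.fromstring(meta_xml)
    return [child for child in root if namespace_of(child.tag) in known]


sample = (
    '<xml:meta xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:ex="http://example.org/hypothetical">'
    "<dc:title>Report</dc:title><ex:flag>1</ex:flag>"
    "</xml:meta>"
)
```

Calling recognized_children(sample) keeps the dc:title child and
silently drops ex:flag, since its namespace is not in the known set.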

If the Body is in a different encoding than the header (especially 
binary) then the separator must be the character sequence FF, SUB, EOT 
which should serve to visually separate the text from the binary data in 
the terminal programs of most computers.
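
As a rough Python illustration of the separator rule, assuming the
header is in an ASCII-compatible encoding so that FF, SUB, EOT appear as
the three bytes 0x0C 0x1A 0x04 (a UTF-16 header would encode them
differently). The names SEPARATOR and split_document are hypothetical.

```python
# FF (form feed), SUB (substitute), EOT (end of transmission) as ASCII bytes.
SEPARATOR = b"\x0c\x1a\x04"


def split_document(raw):
    """Split raw bytes into (header, body) at the first separator.

    Returns None when no separator is present, in which case the Body
    would have to be located by the same-encoding rule instead.
    """
    header, found, body = raw.partition(SEPARATOR)
    if not found:
        return None
    return header, body
```

The body is returned as undecoded bytes, which matches the "binary"
case where no Unicode decoding should be attempted.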

If the Body is in the same encoding as the header then the first line of 
the Body is either the line immediately following the "xml:meta" element 
or (if there is no such element) the line immediately following the 
typeDeclaration. If the Body begins with text of the form "<xml:meta" 
then the metadata element defined by this specification may not be omitted.

The Extended Header

The extended header is designed to support pre-existing uses for the 
first lines of files. It basically defines syntactic variations of the 
base header that are allowed for file formats designed before XDH (for 
instance programming language files).

extendedHeader ::= shebangLine? CCommentStart? header CCommentEnd?
shebangLine ::= "#!" Char* #xA
CCommentStart ::= S? "/*" S?
CCommentEnd ::= S? "*/" S?

In an extended header, any line may begin with a shellComment or 
CPlusComment. If so, the comment marker is ignored and the rest of the 
line is treated as if the marker were not there.

shellComment ::= S? ("#" S?)+
CPlusComment ::= S? ("//" S?)+

For example:

	# <?application/x-python version="2.3"?>
	import x
	import y
	print "z"
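
One way a processor might peel off the extended-header decorations
before looking for the type declaration, sketched in Python. This is an
illustration of the productions above, not a complete parser: it only
looks at the first line after an optional shebang, and the function
names are hypothetical.

```python
import re

# shellComment ::= S? ("#" S?)+    CPlusComment ::= S? ("//" S?)+
_LINE_COMMENT = re.compile(r"^\s*(?:(?:#\s*)+|(?://\s*)+)")


def normalize_header_line(line):
    """Strip a leading comment marker and any C comment delimiters."""
    line = _LINE_COMMENT.sub("", line, count=1).strip()
    if line.startswith("/*"):  # CCommentStart
        line = line[2:].lstrip()
    if line.endswith("*/"):  # CCommentEnd
        line = line[:-2].rstrip()
    return line


def first_header_line(text):
    """Skip an optional shebangLine and return the normalized next line."""
    lines = text.splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    return normalize_header_line(lines[0]) if lines else ""
```

Applied to the Python example above, first_header_line returns the bare
declaration with the leading "# " removed, ready to be parsed as an
ordinary header.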

Backwards Compatibility

This specification does not change the definition of any pre-existing 
media types. They should be interpreted as per their various 
specifications. For example, most Unix systems will not support UCS-2 
shell scripts even though this specification might allow such a declaration.

The specification does, however, allow the addition of metadata to those 
media types for software applications that understand this specification.

It is anticipated that new specifications will make normative references 
to this one so that this mechanism can replace the various ad hoc 
mechanisms for self-description and inline metadata.
