- To: "'xml-dev'" <firstname.lastname@example.org>
- Subject: Opinions
- From: Paul Prescod <email@example.com>
- Date: Thu, 20 Mar 2003 11:01:26 -0800
- User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.3a) Gecko/20021212
I'm curious whether anyone has proposed something like this before. I
don't recall stumbling upon it. It just came to me during a bout of
insomnia. Don't sweat the details...these are late night ramblings.
The Extensible Data Header is a standardized way for text documents to
self-identify their text encoding, MIME type and other metadata.
One of the most persistently annoying issues in data management is
keeping metadata with the data it describes. The most difficult (and
important) sort of data to track is the "format" (encoding and media
type) of files. There are a variety of platform specific ways to solve
parts of the problem (file extensions, filesystem attributes, shebang
lines) but none of them survive the various mechanisms for transmitting
data entities, from FTP to HTTP to Jabber.
XML has demonstrated the wide applicability of a solution: transmit the
metadata as part of the same stream as the data. Furthermore, XML
defines (explicitly and implicitly) a bootstrapping process whereby you
can detect the fact that the data is XML through its XML declaration,
its XML version through its version declaration, its encoding through
its encoding declaration and its vocabulary through a DOCTYPE or
namespace declaration. This series of bootstraps has been wildly
successful. With XML 1.1, it is possible for a PalmOS-based XML parser
to reliably detect and decode an SVG document encoded in EBCDIC and
using Macintosh newline conventions. (if Macintosh newline conventions
are possible in EBCDIC??). XDH aims to extend this level of
self-descriptiveness to other data formats.
<?text/rtf version="1.5" encoding="ASCII"
<?application/zip version="1.0" encoding="ASCII" dataEncoding="binary"
An XDH Document is a stream of bytes starting with a region of text
known as a Header.
document ::= (header | extendedHeader) separator Body
A header is a stream of bytes in some Unicode encoding (including
historical national encodings such as ASCII, Shift-JIS, etc.). The
algorithm for auto-detecting the encoding is the same as that for XML.
The production for header describes the post-decoding character stream:
header ::= typeDeclaration metadata?
typeDeclaration ::= '<?' TypeDecl?
TypeDecl ::= mimeType | TypeURI
TypeURI ::= URI
DocURI ::= URI
metadata ::= a single element with element type "xml:meta"
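As a rough illustration of the auto-detection step mentioned above, here is a
minimal Python sketch modelled on the byte-order-mark and first-byte checks in
the XML specification's encoding appendix. The function name, the returned
labels, and the choice to test only the two bytes of "<?" (rather than XML's
four bytes of "<?xm") are assumptions made for this example, not part of the
proposal.

```python
import codecs

def sniff_header_encoding(data: bytes) -> str:
    """Guess the encoding family of an XDH header from its first bytes."""
    # Byte order marks first. The 4-byte marks must be tested before the
    # 2-byte ones, because the UTF-32LE mark begins with the UTF-16LE mark.
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No mark: look at how the leading "<?" is laid out in memory.
    patterns = [
        (b"\x00\x00\x00<", "utf-32-be"),
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00<\x00?", "utf-16-be"),
        (b"<\x00?\x00", "utf-16-le"),
        (b"<?", "ascii-compatible"),  # read the encoding declaration next
        (b"\x4c\x6f", "ebcdic"),      # "<?" in EBCDIC code pages
    ]
    for prefix, name in patterns:
        if data.startswith(prefix):
            return name
    # Assumed fallback, mirroring XML's default when nothing matches.
    return "utf-8"
```

An extended header (one that starts with a shebang line or a comment marker)
would fall through to the fallback here; a fuller version would need to skip
those markers before sniffing.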
The mimeType is a MIME type.
The TypeURI is a type identifier in URI rather than MIME syntax.
Ideally, it can be dereferenced to return information that could be both
human and machine readable. Two media types with different TypeURIs are
presumed to be different for the purposes of this specification (just as
if they were declared with two distinct MIME types).
The DocURI is a pointer to human or machine readable documentation about
the data format and type. It is distinguished from the TypeURI in that
it is not considered an identifier. You could point to one URI for
information about the ZIP file format and I could point to another.
VersionInfo is any string that meets the XML production of the same
name. Its meaning is designed to be defined by the description of the
The Encoding declaration is as defined in XML. It has the same defaults
The DataEncodingDecl is a pseudo-attribute named "dataEncoding". It
defines the Unicode encoding not for the header but for the Body. The
value "binary" is used to indicate that no Unicode decoding should be
attempted for the Body. If the DataEncodingDecl is omitted, it defaults
to the same encoding as the header.
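To make the defaulting rule concrete, here is a small Python sketch that parses
a type declaration and works out the Body encoding. It assumes the declaration
is closed with "?>" and that pseudo-attributes are double-quoted; the function
names, the regular expressions, and the "utf-8" fallback for a missing encoding
declaration are illustrative assumptions only.

```python
import re

# Assumed shape: <?type name="value" name="value" ... ?>
DECL_RE = re.compile(
    r'<\?(?P<type>[^\s?]+)'
    r'(?P<attrs>(?:\s+[A-Za-z][\w.-]*\s*=\s*"[^"]*")*)'
    r'\s*\?>'
)
ATTR_RE = re.compile(r'([A-Za-z][\w.-]*)\s*=\s*"([^"]*)"')

def parse_type_declaration(text):
    """Return (type, pseudo-attribute dict) for a declaration at the start of text."""
    match = DECL_RE.match(text)
    if match is None:
        raise ValueError("no type declaration found")
    return match.group("type"), dict(ATTR_RE.findall(match.group("attrs")))

def body_encoding(attrs, detected="utf-8"):
    """Return the encoding to use for the Body, or None for binary data."""
    # The header encoding is the declared one, else whatever was auto-detected.
    header_encoding = attrs.get("encoding", detected)
    # dataEncoding defaults to the header's encoding when omitted.
    data_encoding = attrs.get("dataEncoding", header_encoding)
    return None if data_encoding == "binary" else data_encoding
```

For example, parsing a zip declaration with dataEncoding="binary" yields None
from body_encoding, meaning the Body bytes should be left undecoded, while an
RTF declaration with only encoding="ASCII" yields "ASCII" for both regions.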
The XmlVersionDecl declares what version of XML is in use. It defaults
to 1.1 (???).
The metadata is just an XML element with arbitrary children and
attributes. Each child element and attribute must have an XML namespace
and processors should ignore elements or attributes in namespaces they
are not programmed to recognize.
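The "ignore what you do not recognize" rule might look like this in Python,
using the standard library's ElementTree. The function name and the idea of
passing in a set of known namespace URIs are assumptions for this sketch; the
Dublin Core and example.org namespaces in the usage note are only sample data.

```python
import xml.etree.ElementTree as ET

def known_metadata(meta, known_namespaces):
    """Keep only the children and attributes of the metadata element
    whose namespace URI appears in known_namespaces."""
    def namespace_of(name):
        # ElementTree writes qualified names as "{uri}local".
        return name[1:].split("}", 1)[0] if name.startswith("{") else None

    elements = [child for child in meta
                if namespace_of(child.tag) in known_namespaces]
    attributes = {key: value for key, value in meta.attrib.items()
                  if namespace_of(key) in known_namespaces}
    return elements, attributes
```

A processor that only understands Dublin Core could then do:

```python
DC = "http://purl.org/dc/elements/1.1/"
meta = ET.fromstring(
    '<xml:meta xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:x="http://example.org/unknown">'
    '<dc:title>Report</dc:title><x:flag/></xml:meta>'
)
elements, attributes = known_metadata(meta, {DC})
```

Here the x:flag element is silently skipped rather than treated as an error.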
If the Body is in a different encoding than the header (especially
binary) then the separator must be the character sequence FF, SUB, EOT
(aka "^L^Z^D" aka "FORM FEED", "SUBSTITUTE", "END OF TRANSMISSION")
which should serve to visually separate the text from the binary data in
the terminal programs of most computers.
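Splitting a document at that separator could be sketched as follows. The
function name and the header_encoding parameter are assumptions; since the text
speaks of a character sequence, the sketch encodes the three control characters
in the header's encoding before searching for them.

```python
SEPARATOR_CHARS = "\x0c\x1a\x04"  # FORM FEED, SUBSTITUTE, END OF TRANSMISSION

def split_document(data: bytes, header_encoding: str = "ascii"):
    """Split raw bytes into (header bytes, Body bytes) at the first separator."""
    separator = SEPARATOR_CHARS.encode(header_encoding)
    header, found, body = data.partition(separator)
    if not found:
        raise ValueError("separator not found")
    return header, body
```

The header bytes can then be decoded as text, while the Body is passed on
untouched (for dataEncoding="binary") or decoded with its own encoding.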
If the Body is in the same encoding as the header then the first line of
the Body is either the line immediately following the "xml:meta" element
or (if there is no such element) the line immediately following the
typeDeclaration. If the Body begins with text of the form "<xml:meta"
then the metadata element defined by this specification may not be omitted.
The Extended Header
The extended header is designed to support pre-existing uses for the
first lines of files. It basically defines syntactic variations of the
base header that are allowed for file formats designed before XDH (for
instance programming language files).
extendedHeader ::= shebangLine? CCommentStart? header CCommentEnd?
shebangLine ::= "#!" Char* #xA
CCommentStart ::= S? "/*" S?
CCommentEnd ::= S? "*/" S?
In an extended header, any line may begin with a shellComment or
CPlusComment. If so, the comment marker is ignored and the rest of the
line is treated as if the marker did not exist.
shellComment ::= S? ("#" S?)+
CPlusComment ::= S? ("//" S?)+
# <?application/x-python version="2.3"?>
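One way to reduce an extended header to the base header is sketched below in
Python. It assumes the caller has already isolated the header region as
decoded text; the function name and regular expressions are made up for this
illustration, and the text/x-c type in the usage note is just sample input.

```python
import re

# shellComment ::= S? ("#" S?)+    CPlusComment ::= S? ("//" S?)+
LINE_COMMENT = re.compile(r"^[ \t]*(?:(?:#[ \t]*)+|(?://[ \t]*)+)")
C_COMMENT_START = re.compile(r"^\s*/\*\s*")
C_COMMENT_END = re.compile(r"\s*\*/\s*$")

def unwrap_extended_header(header_text: str) -> str:
    """Strip the shebang line and comment markers around a base header."""
    lines = header_text.split("\n")
    # Drop an optional leading shebang line.
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    # Remove a leading shell or C++ comment marker from every line.
    lines = [LINE_COMMENT.sub("", line) for line in lines]
    text = "\n".join(lines)
    # Remove an optional enclosing C comment.
    text = C_COMMENT_START.sub("", text)
    text = C_COMMENT_END.sub("", text)
    return text
```

Given a shebang line followed by the Python example above, this returns just
the declaration; a C file could wrap the same kind of declaration as
/* <?text/x-c version="1"?> */ and get the same treatment.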
This specification does not change the definition of any pre-existing
media types. They should be interpreted as per their various
specifications. For example, most Unix systems will not support UCS-2
shell scripts even though this specification might allow such a declaration.
The specification does, however, allow the addition of metadata to those
media types for software applications that understand this specification.
It is anticipated that new specifications will make normative references
to this one so that this mechanism can replace the various ad hoc
mechanisms for self-description and inline metadata.