- From: "Wayne Steele" <xmlmaster@hotmail.com>
- To: xml-dev@xml.org
- Date: Fri, 31 Mar 2000 20:56:38 PST
There's been some discussion lately about a binary representation for XML
documents.
None of the binary-XML proposals I've seen so far look that useful to me, so let
me present one that I think would make sense.
If anyone finds this interesting, perhaps we can move forward and implement
it. As a placeholder for a real name, I call this 'binxml'.
Binxml is a compression format for XML documents. A well-formed XML Document
(A) is mapped into binxml, which is stored or transmitted to another
application. At a future point, the binxml is mapped back into an XML
Document (B).
Documents A and B should be identical for any significant purpose.
People may disagree about what is significant and what is not. I have
preserved all the obvious things, as well as the Internal DTD subset, and
prolog and suffix PIs and comments.
I have NOT preserved Whitespace in these places:
Inside of the DTD
Outside of the Document Element
Between attributes
I have also not preserved the exact placement of namespace nodes, but I have
allowed you to keep the prefixes. This might be a problem for some DTDs.
I'm assuming the document is well-formed to begin with. It should remain
equally valid or invalid, except for the possible changes in namespace
nodes.
I have not created an encoding for external DTD subsets. I don't see the
same needs for compression wrt external DTDs. Just exchange them in plain
text, like you do now.
The code points 0x00 - 0x08, 0x0B, 0x0C, 0x0E - 0x1F have been declared to
be illegal in XML documents, so I have used these as binxml tokens.
You can use whatever unicode encoding you want, as long as it doesn't use
the listed code points for special purposes. Binxml preserves it.
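To make the reserved set concrete, here is a minimal Python sketch that builds the code points listed above and checks that ordinary text avoids them. The names BINXML_TOKENS and is_token_free are made up for illustration and are not part of the proposal.

```python
# Code points that XML 1.0 forbids in documents and that binxml
# therefore reuses as tokens: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
BINXML_TOKENS = frozenset(
    list(range(0x00, 0x09)) + [0x0B, 0x0C] + list(range(0x0E, 0x20))
)


def is_token_free(text):
    """Return True if no character of `text` collides with a binxml token."""
    return all(ord(ch) not in BINXML_TOKENS for ch in text)
```

Tab (0x09), line feed (0x0A) and carriage return (0x0D) are legal XML characters, so they stay out of the set.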
Here's the actual mapping:
a binxml file (or stream, or whatever) looks like this:
1. byte-order-mark; if you're using UTF-16
2. a "magic" string; Everybody else seems to be doing it. Actual value
TBD.
3. XMLDECL; A single token in lieu of the XML Declaration
4. encoding string; Optional. The document's EncName.
5. String Table Section; Mandatory.
6. Prolog PIs, Comments; If present. The XML Decl is not included.
7. DocTypeDecl and DTD; If present.
8. Prolog PIs, Comments; If present.
9. Document Element and contents; Mandatory.
10. Suffix PIs and Comments; If present.
1. Byte order mark.
This is just like XML. Because binxml tokens are defined as unicode code
points, the encoding needs to be determined up front. If there is no BOM,
UTF-8 will be assumed, until the end of the encoding string.
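A reader might sniff the byte order mark along these lines. This is only a sketch: the function name sniff_bom is hypothetical, and it assumes that UTF-16 is the only encoding announced by a BOM (a UTF-8 or UTF-32 BOM is not considered).

```python
import codecs


def sniff_bom(data):
    """Return (encoding, bom_length) for the first bytes of a binxml stream.

    With no UTF-16 BOM, UTF-8 is assumed until the encoding string (if any)
    says otherwise.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", 2
    return "utf-8", 0
```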
2. "magic" string.
This is just an additional check that you've got the right file type. How
many characters is about right for this? Three?
How about: "bx0" for binxml version zero.
3. XMLDECL
The XML Declaration in the original document is mapped to this one token.
I am assuming XML version "1.0". If another one comes out, we can just add
new codes here.
There are three possibilities for the standalone declaration: yes, no, and
not present.
The most common encoding declarations are 'UTF-8' and 'UTF-16', so I have
made special allowance for them.
If the document has no encoding declaration, use an entry that says
'encoding follows', but omit section 4.
If the document has no XMLDECL, use 0x9.
Values:
0x1 standalone="yes" encoding="utf-8"
0x2 standalone="no" encoding="utf-8"
0x3 standalone unspecified; encoding="utf-8"
0x4 standalone="yes" encoding="utf-16"
0x5 standalone="no" encoding="utf-16"
0x6 standalone is unspecified; encoding="utf-16"
0x7 standalone="yes"; encoding follows
0x8 standalone="no"; encoding follows
0x9 standalone is unspecified; encoding follows
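The table above is regular enough to compute. Here is a small Python sketch of the mapping; the function name xmldecl_token is made up, and treating the encoding name case-insensitively is my assumption (the table shows lower case, the text upper case).

```python
# Row offset within each group of three token values.
_STANDALONE_ROW = {"yes": 0, "no": 1, None: 2}


def xmldecl_token(standalone, encoding):
    """Map an XML declaration to the single XMLDECL token value (0x1-0x9).

    `standalone` is "yes", "no" or None; `encoding` is a name or None.
    """
    enc = (encoding or "").lower()
    if enc == "utf-8":
        base = 0x1
    elif enc == "utf-16":
        base = 0x4
    else:
        base = 0x7  # "encoding follows"; section 4 is omitted if there is none
    return base + _STANDALONE_ROW[standalone]
```

A document with no XML declaration at all comes out as xmldecl_token(None, None), which is 0x9 as described above.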
4. Encoding String.
This section may only be present if the XMLDECL token is 0x7,0x8, or 0x9.
Valid characters are [a-zA-Z0-9_.:] and '-'.
The encoding takes effect (and ends section 4) with the first character
outside of this range. The next character should be a binxml token, and they
are all outside this range.
Optionally, you may follow the Encoding string with a NUL (0x00). This might
be needed to mark where the encoding begins for some really weird ones.
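Reading section 4 could look roughly like this. The sketch works on text that has already been decoded per the BOM rule in section 1; the names ENC_CHARS and read_encoding_string are hypothetical.

```python
import string

# Characters allowed in the encoding string: [a-zA-Z0-9_.:] and '-'.
ENC_CHARS = frozenset(string.ascii_letters + string.digits + "_.:-")


def read_encoding_string(stream, pos=0):
    """Read the encoding string from `stream` starting at `pos`.

    Returns (encoding_name, next_pos), where next_pos is just past the name
    and past the optional NUL terminator if one is present.
    """
    end = pos
    while end < len(stream) and stream[end] in ENC_CHARS:
        end += 1
    name = stream[pos:end]
    if end < len(stream) and stream[end] == "\x00":
        end += 1  # optional NUL after the encoding string
    return name, end
```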
5. String Table Section.
Each entry is sequentially numbered, starting with one. There are five entry
types.
When you see [index], it means a reference to one of these entries. [index]
is the size of one unicode code point, so it can be as large as 0x10FFFF, if
you use surrogates. I'm hoping this will be enough for everyone's documents.
This section ends when you hit a binxml token other than 0x0 - 0x4.
NamespaceEntry (no prefix specified):
0x1, followed by the text of the namespace URI.
When unencoded, any prefix may be used for the namespace declaration in
the final document.
Elements and attributes in this namespace will of course use that prefix.
NamespaceEntry (prefix specified):
0x1, followed by the text of namespace URI, 0x0, text of prefix
When unencoded, the same prefix must be used in the output document.
Personally, I frown upon giving special meaning to prefixes, but XSLT
seems to need this.
NameEntry
0x2, followed by the text of the Name
QNameEntry
0x3, [index], followed by the text of the BaseName
The [index] here is for the corresponding namespace to qualify this QName.
CDataEntry
0x4, followed by the text
If the text needs to have an Entity Reference in it, you may include it
with two characters: 0x0, followed by the [index].
EntityReference
0x0, [index]
[index] is the Name for this EntRef.
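As an illustration of the entry layouts above, here is a rough Python sketch that writes String Table entries as text. All function names are made up. One assumption of mine: an [index] that would land in the surrogate range is simply rejected, since a lone surrogate cannot be written out in UTF-8.

```python
def index_char(i):
    """Encode an [index] as a single Unicode code point."""
    if not 1 <= i <= 0x10FFFF or 0xD800 <= i <= 0xDFFF:
        raise ValueError("index cannot be represented as a code point")
    return chr(i)


def namespace_entry(uri, prefix=None):
    """0x1, URI; with a prefix: 0x1, URI, 0x0, prefix."""
    entry = "\x01" + uri
    if prefix is not None:
        entry += "\x00" + prefix
    return entry


def name_entry(name):
    """0x2, followed by the text of the Name."""
    return "\x02" + name


def qname_entry(ns_index, base_name):
    """0x3, [index] of the namespace entry, then the BaseName."""
    return "\x03" + index_char(ns_index) + base_name


def cdata_entry(text):
    """0x4, followed by the text."""
    return "\x04" + text


def build_string_table(entries):
    """Concatenate entries; the n-th entry (1-based) is referenced as [index] n."""
    return "".join(entries)
```

Note that a small [index] has the same value as a token code point, so a reader has to consume each [index] by position rather than by value.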
6. Prolog PIs, Comments
If there are Processing Instructions and/or Comments in the document before
any DocType declaration, they go here.
Do NOT put the XML Declaration here. It is addressed in section 3.
This section ends when you hit either 0x7 (a DocType declaration), or 0x8 or
0xB (the Document Element).
PI
0x5, [index], text content of the PI
The [index] is for the Name or QName that is the target of the PI. It is
possible for there to be no text content.
Comment
0x6, followed by the content of the comment
7. DocType Declaration and DTD
This section (if present) always starts with a DocType declaration.
This may be followed by a PUBID and SYSID (in any order), if these are
present in the document.
Next are any declarations in the Internal DTD Subset (if any).
This section ends with 0x5, 0x6 (a PI or Comment following the DTD, go to
section 8), or 0x8, 0xB (Document Element).
DocType Declaration
0x7, followed by the name of the doctype
PUBID
0x1, followed by the text of the formal public identifier
SYSID
0x2, followed by the URI for the System ID
I'm going to skip the internal DTD subset, and come back to it later.
8. Prolog PIs, Comments
This is just like section 6, except it can't be followed by a DocType
declaration.
This is for PIs and Comments that follow the DTD, but precede the Document
Element.
9. Document Element and Contents
This is, of course, the meat of the XML Document. In most binxml, this will
immediately follow the String Table.
Everything in this section is represented in the same order it appears in
the source document. Attributes immediately follow their containing element.
The two different Attribute types may be freely interchanged. Attributes
that declare namespaces (ie, namespace nodes) are not represented. This
section ends at the end of the first element.
ElementStart
0x8, [index]
[index] is for the Name or QName of this element. Any Attributes must
follow next. Everything else following, until an EndElement token is
reached, is contained by this element.
EmptyElementStart
0xB, [index]
Like ElementStart, except this element has no child elements or other
content - attributes only. Any element start token immediately following
this one is a sibling, not a child.
EndElement
0x6
AttributeInterned
0xC, [index], [index2]
[index] is the Name or QName of this attribute. Only use a QName if the
document had this attribute EXPLICITLY qualified (ie, a global attribute).
[index2] is the entry for the value of this attribute. It does not have to
be a CDataEntry - it may be any other kind as well.
AttributeLiteral
0x7, [index], text value of attribute
This attribute has the value inline instead of in the String Table. If you
need an Entity Reference inside the attribute value, you may include it.
EntityReferenceInsideAttribute
0x0, [index]
The other tokens can be present in any order inside the content of an
element. If text exists without a starting token, it is just a regular text
node.
CData
0x4, text inside the CData Section
PI
0x11, [index], text inside the PI
Comment
0x10, text of the comment
EntityReference
0x5, [index]
Text
0x3, the text itself
This token is only used when a text node immediately follows a comment, a
PI, a CDATA Section, or a literal attribute value. Otherwise text identifies
itself without any token.
Interned Cdata
0x2, [index]
The index is to a String Table entry of any type. The contents of that
Entry are copy/pasted right here.
This may appear inside of Text, a Comment, PI, or literal Attribute Value.
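To show how section 9 fits together, here is a rough Python sketch that encodes a parsed element tree. It is deliberately limited: every name becomes a NameEntry, every attribute is written as an AttributeLiteral, and namespaces, comments, PIs, CDATA sections and entity references are not handled. The function name encode_document_element is hypothetical.

```python
import xml.etree.ElementTree as ET

ELEMENT_START, EMPTY_ELEMENT_START, END_ELEMENT = "\x08", "\x0b", "\x06"
ATTRIBUTE_LITERAL, TEXT = "\x07", "\x03"


def encode_document_element(root):
    """Return (string_table, body) for an ElementTree element."""
    names = {}  # name -> 1-based String Table index, in insertion order

    def index(name):
        if name not in names:
            names[name] = len(names) + 1
        return chr(names[name])

    out = []

    def text(value, after_literal):
        if value:
            # The Text token is only needed right after a literal attribute
            # value; otherwise text identifies itself.
            out.append((TEXT if after_literal else "") + value)

    def visit(elem):
        has_content = bool(elem.text) or len(elem) > 0
        start = ELEMENT_START if has_content else EMPTY_ELEMENT_START
        out.append(start + index(elem.tag))
        for name, value in elem.attrib.items():
            out.append(ATTRIBUTE_LITERAL + index(name) + value)
        after_literal = bool(elem.attrib)
        if has_content:
            text(elem.text, after_literal)
            for child in elem:
                visit(child)
            out.append(END_ELEMENT)
            after_literal = False
        # Text that follows this element belongs to the parent's content.
        text(elem.tail, after_literal)

    visit(root)
    table = "".join("\x02" + name for name in names)  # one NameEntry per name
    return table, "".join(out)
```

For example, encode_document_element(ET.fromstring('<doc id="7"><title>Hi</title><br/>end</doc>')) yields a table of four NameEntries (doc, id, title, br) and a body in document order.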
10. Suffix PIs and Comments
If you have any PIs or Comments after the Document Element that you care
about, put them here. This is just like sections 6 or 8.
DTDs, which I said I would come back to.
After the DocType declaration (section 7), may follow any number of these
DTD Tokens, in (mostly) document order.
There will be no Marked sections or Parameter Entities, as they aren't
allowed inside the internal subset.
Attlist declarations are folded into the element they go with.
A different token is used for an element declaration depending on the
content type.
ElementDecl, Content Type 'EMPTY'
0x3, [index]
ElementDecl, Content Type 'ANY'
0x4, [index]
ElementDecl, Detailed Content Type Specified
0x6, [index], followed by Content Stuff
Content Stuff
in any order, one of the characters "(),|?+*" or 0x7 followed by [index],
or 0x0 (meaning #PCDATA)
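The Content Stuff encoding could be produced from a DTD content model roughly as follows. This is a sketch under my own assumptions: the function name encode_content_model is made up, and name_index is a hypothetical mapping from element names to their String Table indexes.

```python
import re

_PUNCT = "(),|?+*"


def encode_content_model(model, name_index):
    """Encode a content model such as "(title, para*)" as Content Stuff."""
    out = []
    # Whitespace matches none of the alternatives, so it is skipped.
    for tok in re.findall(r"#PCDATA|[(),|?+*]|[^\s(),|?+*#]+", model):
        if tok == "#PCDATA":
            out.append("\x00")
        elif len(tok) == 1 and tok in _PUNCT:
            out.append(tok)  # punctuation is kept as-is
        else:
            out.append("\x07" + chr(name_index[tok]))  # 0x7, [index]
    return "".join(out)
```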
Any Attributes for this Element must be declared next. A different token or
token-pair is used depending on the type of the attribute. There are forty
attribute types: the cross product of {REQUIRED, IMPLIED, default value,
fixed default value} and {
CDATA,ID,IDREF,IDREFS,ENTITY,ENTITIES,NMTOKEN,NMTOKENS,enumerated,
enumerated notations}. I have tried to optimize it so the most commonly used
declarations take just one token, while the most obscure ones take two.
Any Fixed, Default, or enumerated attribute values must be in the String
Table. The indexes for these below are shown as [fixed index] or [default
index]. Enumerated type may have any number of index entries, terminated by
a 0x0. For fixed or Default enumerated types, the first one listed is the
default.
REQUIRED_CDATA 0x17, [index]
IMPLIED_CDATA 0x18, [index]
FIXED_CDATA 0x19, [index], [fixed index]
DEFAULT_CDATA 0x1A, [index], [default index]
REQUIRED_ID 0xC, 0x1, [index]
IMPLIED_ID 0x1B, [index]
FIXED_ID 0xC, 0x2, [index], [fixed index]
DEFAULT_ID 0xC, 0x3, [index], [default index]
REQUIRED_IDREF 0xC, 0x4, [index]
IMPLIED_IDREF 0x1C, [index]
FIXED_IDREF 0xC, 0x5, [index], [fixed index]
DEFAULT_IDREF 0xC, 0x6, [index], [default index]
REQUIRED_IDREFS 0xC, 0x7, [index]
IMPLIED_IDREFS 0x1D, [index]
FIXED_IDREFS 0xC, 0x8, [index], [fixed index]
DEFAULT_IDREFS 0xC, 0x9, [index], [default index]
REQUIRED_ENTITY 0xC, 0xa, [index]
IMPLIED_ENTITY 0xC, 0xb, [index]
FIXED_ENTITY 0xC, 0xc, [index], [fixed index]
DEFAULT_ENTITY 0xC, 0xd, [index], [default index]
REQUIRED_ENTITIES 0xC, 0xe, [index]
IMPLIED_ENTITIES 0xC, 0xf, [index]
FIXED_ENTITIES 0xC, 0x10, [index], [fixed index]
DEFAULT_ENTITIES 0xC, 0x11, [index], [default index]
REQUIRED_NMTOKEN 0xC, 0x12, [index]
IMPLIED_NMTOKEN 0xC, 0x13, [index]
FIXED_NMTOKEN 0xC, 0x14, [index], [fixed index]
DEFAULT_NMTOKEN 0xC, 0x15, [index], [default index]
REQUIRED_NMTOKENS 0xC, 0x16, [index]
IMPLIED_NMTOKENS 0xC, 0x17, [index]
FIXED_NMTOKENS 0xC, 0x18, [index], [fixed index]
DEFAULT_NMTOKENS 0xC, 0x19, [index], [default index]
REQUIRED_ENUM 0x1E, [index], [value index 1] ... [value index n], 0x00
IMPLIED_ENUM 0x1F, [index], [value index 1] ... [value index n], 0x00
FIXED_ENUM 0x1, [index], [value index 1] ... [value index n], 0x00
DEFAULT_ENUM 0x2, [index], [default index], [value index 1] ... [value index n], 0x00
REQUIRED_NOTATIONENUM 0xC, 0x1a, [index], [value index 1] ... [value index n], 0x00
IMPLIED_NOTATIONENUM 0xC, 0x1b, [index], [value index 1] ... [value index n], 0x00
FIXED_NOTATIONENUM 0xC, 0x1c, [index], [fixed index], [value index 1] ... [value index n], 0x00
DEFAULT_NOTATIONENUM 0xC, 0x1d, [index], [default index], [value index 1] ... [value index n], 0x00
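The token assignments in the table above can be captured in a lookup. A Python sketch follows; the names ONE_TOKEN, TWO_TOKEN and attdecl_tokens are made up, and only the leading token or token pair is returned, not the [index] arguments that follow it.

```python
DEFAULTS = ("REQUIRED", "IMPLIED", "FIXED", "DEFAULT")

# Declarations that get a single token of their own.
ONE_TOKEN = {
    ("REQUIRED", "CDATA"): 0x17, ("IMPLIED", "CDATA"): 0x18,
    ("FIXED", "CDATA"): 0x19, ("DEFAULT", "CDATA"): 0x1A,
    ("IMPLIED", "ID"): 0x1B, ("IMPLIED", "IDREF"): 0x1C,
    ("IMPLIED", "IDREFS"): 0x1D,
    ("REQUIRED", "ENUM"): 0x1E, ("IMPLIED", "ENUM"): 0x1F,
    ("FIXED", "ENUM"): 0x01, ("DEFAULT", "ENUM"): 0x02,
}

# Everything else is 0xC plus a second token, numbered in table order.
TWO_TOKEN = {}
_second = 0x01
for _type in ("ID", "IDREF", "IDREFS", "ENTITY", "ENTITIES",
              "NMTOKEN", "NMTOKENS", "NOTATIONENUM"):
    for _default in DEFAULTS:
        if (_default, _type) not in ONE_TOKEN:
            TWO_TOKEN[(_default, _type)] = _second
            _second += 1


def attdecl_tokens(default, att_type):
    """Return the token tuple that introduces an attribute declaration."""
    key = (default, att_type)
    if key in ONE_TOKEN:
        return (ONE_TOKEN[key],)
    return (0x0C, TWO_TOKEN[key])
```

The two dictionaries together hold forty entries, matching the forty attribute types counted above.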
Other things you might see in the Internal DTD Subset:
PUBID and SYSID are just like in section 7, both are optional, and may occur
in either order.
NotationDeclaration
0x14, [index], PUBID?, SYSID?
PI
0x15, [index], content
Comment
0x16, content
Internal Entity Decl
0x12, [index], replacement text
If you need to embed another entity reference in the replacement text,
stick in ( 0x13, [index] )
Entity Reference inside an Entity Decl
0x13, [index]
Parsed External Entity Decl
0xF, [index], PUBID?, SYSID?
Unparsed External Entity Decl
0xE, [index], [ndata index], PUBID?, SYSID?
Interned Cdata
0x2, [index]
This may only appear (in the DTD) inside of the content of a comment or PI
Whew!
Not that complicated, but kind of tedious.
I hope there are no tokens which would be ambiguous - if there are, it's an
error of mine.
Open Questions:
Should further compression be done for text content?
Should it be allowed for the string table to be sprinkled throughout the
document, to make it easier to stream-encode XML?
Feel free to tell me if you think this is crap, I can take it.
Constructive comments are even more welcome.
-Wayne Steele