xml-dev - binxml proposal

binxml proposal
From: "Wayne Steele" <xmlmaster@hotmail.com>
To: xml-dev@xml.org
Date: Fri, 31 Mar 2000 20:56:38 PST
There's been some discussion lately about a binary representation for XML 
documents.
None of the binary-xml proposals I've seen so far look that useful to me, so 
let me present one that I think would make sense.
If anyone finds this interesting, perhaps we can move forward and implement 
it. As a placeholder for a real name, I call this 'binxml'.

Binxml is a compression format for XML documents. A well-formed XML Document 
(A) is mapped into binxml, which is stored or transmitted to another 
application. At a future point, the binxml is mapped back into an XML 
Document (B).

Documents A and B should be identical for any significant purpose.

People may disagree about what is significant and what is not. I have 
preserved all the obvious things, as well as the Internal DTD subset, and 
prolog and suffix PIs and comments.

I have NOT preserved Whitespace in these places:
	Inside of the DTD
	Outside of the Document Element
	Between attributes

I have also not preserved the exact placement of namespace nodes, but I have 
allowed you to keep the prefixes. This might be a problem for some DTDs.

I'm assuming the document is well-formed to begin with. It should remain 
equally valid or invalid, except for the possible changes in namespace 
nodes.

I have not created an encoding for external DTD subsets. I don't see the 
same needs for compression wrt external DTDs. Just exchange them in plain 
text, like you do now.

The code points 0x00 - 0x08, 0x0B, 0x0C, 0x0E - 0x1F have been declared to 
be illegal in XML documents, so I have used these as binxml tokens.
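
As a rough illustration (not part of the proposal itself), the reserved code points can be written down as a set and checked against the Char production of XML 1.0. The names BINXML_TOKENS and is_xml_char are hypothetical.

```python
# Code points the proposal reserves as binxml tokens (hypothetical name).
BINXML_TOKENS = set(range(0x00, 0x09)) | {0x0B, 0x0C} | set(range(0x0E, 0x20))


def is_xml_char(cp):
    """Return True if cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# None of the reserved token values is a legal XML character.
assert not any(is_xml_char(cp) for cp in BINXML_TOKENS)
```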

You can use whatever unicode encoding you want, as long as it doesn't use 
the listed code points for special purposes. Binxml preserves it.


Here's the actual mapping:

a binxml file (or stream, or whatever) looks like this:

	1. byte-order-mark;			if you're using UTF-16
	2. a "magic" string;			Everybody else seems to be doing it. Actual value TBD.
	3. XMLDECL;					A single token in lieu of the XML Declaration
	4. encoding string;			Optional. The document's EncName.
	5. String Table Section;		Mandatory.
	6. Prolog PIs, Comments;		If present. The XML Decl is not included.
	7. DocTypeDecl and DTD;			If present.
	8. Prolog PIs, Comments;		If present.
	9. Document Element and contents; 	Mandatory.
	10. Suffix PIs and Comments;		If present.


1. Byte order mark.
This is just like XML. Because binxml tokens are defined as unicode code 
points, the encoding needs to be determined up front. If there is no BOM, 
UTF-8 will be assumed, until the end of the encoding string.

2. "magic" string.
This is just an additional check that you've got the right file type. How 
many characters is about right for this? Three?
How about: "bx0" for binxml version zero.
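
A minimal sketch of how a reader might handle sections 1 and 2 together, assuming the tentative magic string "bx0" (the actual value is still TBD) and assuming the magic string is written in the same encoding the BOM announces. The function name sniff_header is hypothetical.

```python
import codecs

MAGIC = "bx0"  # tentative value floated above; the real value is TBD


def sniff_header(data):
    """Return (encoding, offset just past the BOM and magic string)."""
    if data.startswith(codecs.BOM_UTF16_LE):
        encoding, bom_len = "utf-16-le", 2
    elif data.startswith(codecs.BOM_UTF16_BE):
        encoding, bom_len = "utf-16-be", 2
    else:
        # No BOM: UTF-8 is assumed until the end of the encoding string.
        encoding, bom_len = "utf-8", 0
    magic = MAGIC.encode(encoding)
    if not data.startswith(magic, bom_len):
        raise ValueError("not a binxml stream: magic string missing")
    return encoding, bom_len + len(magic)
```

The returned offset is where the XMLDECL token (section 3) would be expected.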

3. XMLDECL
The XML Declaration in the original document is mapped to this one token.
I am assuming XML version "1.0". If another one comes out, we can just add 
new codes here.
There are three possibilities for the standalone declaration: yes, no, and 
not present.
The most common encoding declarations are 'UTF-8' and 'UTF-16', so I have 
made special allowance for them.
If the document has no encoding declaration, use an entry that says 
'encoding follows', but omit section 4.
If the document has no XMLDECL, use 0x9.


Values:
	0x1		standalone="yes" encoding="utf-8"
	0x2		standalone="no" encoding="utf-8"
	0x3		standalone unspecified; encoding="utf-8"
	0x4		standalone="yes" encoding="utf-16"
	0x5		standalone="no" encoding="utf-16"
	0x6		standalone is unspecified; encoding="utf-16"
	0x7		standalone="yes"; encoding follows
	0x8		standalone="no"; encoding follows
	0x9		standalone is unspecified; encoding follows
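
The table above can be sketched as a simple lookup. This is only an illustration; the names XMLDECL_TOKENS and decode_xmldecl are hypothetical, None stands for "standalone unspecified", and an encoding of None means "encoding follows" (section 4, which may be omitted).

```python
# token -> (standalone, encoding); None means "unspecified" / "encoding follows"
XMLDECL_TOKENS = {
    0x1: ("yes", "utf-8"),
    0x2: ("no", "utf-8"),
    0x3: (None, "utf-8"),
    0x4: ("yes", "utf-16"),
    0x5: ("no", "utf-16"),
    0x6: (None, "utf-16"),
    0x7: ("yes", None),
    0x8: ("no", None),
    0x9: (None, None),  # also used when the document has no XML Declaration
}


def decode_xmldecl(token):
    """Return (standalone, encoding, encoding_may_follow) for an XMLDECL token."""
    try:
        standalone, encoding = XMLDECL_TOKENS[token]
    except KeyError:
        raise ValueError("unknown XMLDECL token: %#x" % token)
    return standalone, encoding, encoding is None
```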

4. Encoding String.
This section may only be present if the XMLDECL token is 0x7, 0x8, or 0x9.
Valid characters are [a-zA-Z0-9_.:] and '-'.
The encoding takes effect (and ends section 4) with the first character 
outside of this range. The next character should be a binxml token, and they 
are all outside this range.
Optionally, you may follow the Encoding string with a NUL (0x00). This might 
be needed to mark where the encoding begins for some really weird ones.
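
A small sketch of reading the encoding string under these rules, assuming the stream has already been decoded into a Python string of code points. The name read_encoding_string is hypothetical, and this sketch always treats a NUL directly after the name as the optional terminator.

```python
import string

# Characters allowed in the encoding string: [a-zA-Z0-9_.:] and '-'
ENC_CHARS = set(string.ascii_letters + string.digits + "_.:-")


def read_encoding_string(text, pos):
    """Return (encoding name, position of the first character after section 4)."""
    end = pos
    while end < len(text) and text[end] in ENC_CHARS:
        end += 1
    name = text[pos:end]
    if end < len(text) and text[end] == "\x00":
        end += 1  # skip the optional NUL terminator
    return name, end
```

An empty name means section 4 was omitted.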

5. String Table Section.
Each entry is sequentially numbered, starting with one. There are five entry 
types.
When you see [index], it means a reference to one of these entries. [index] 
is the size of one unicode code point, so it can be as large as 0x10FFFF, if 
you use surrogates. I'm hoping this will be enough for everyone's documents.
This section ends when you hit a binxml token other than 0x0 - 0x4.

	NamespaceEntry (no prefix specified):
		0x1, followed by the text of the namespace URI.
		When unencoded, any prefix may be used for the namespace declaration in 
the final document.
		Elements and attributes in this namespace will of course use that prefix.

	NamespaceEntry (prefix specified):
		0x1, followed by the text of namespace URI, 0x0, text of prefix
		When unencoded, the same prefix must be used in the output document.
		Personally, I frown upon giving special meaning to prefixes, but XSLT 
seems to need this.

	NameEntry
		0x2, followed by the text of the Name

	QNameEntry
		0x3, [index], followed by the text of the BaseName
		The [index] here is for the corresponding namespace to qualify this QName.

	CDataEntry
		0x4, followed by the text
		If the text needs to have an Entity Reference in it, you may include it 
with two characters: 0x0, followed by the [index].

	EntityReference
		0x0, [index]
		[index] is the Name for this EntRef.
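
To make the entry layouts concrete, here is a rough sketch of an encoder for the String Table. It assumes each [index] is written as the single character chr(index), and it covers only the namespace, name, QName, and CData entries. The class and method names are hypothetical.

```python
class StringTable:
    """Builds section 5 entries; indices are 1-based, as described above."""

    def __init__(self):
        self.entries = []

    def _add(self, encoded):
        self.entries.append(encoded)
        return len(self.entries)  # index of the entry just added

    def add_namespace(self, uri, prefix=None):
        entry = "\x01" + uri
        if prefix is not None:
            entry += "\x00" + prefix  # prefix must be preserved on output
        return self._add(entry)

    def add_name(self, name):
        return self._add("\x02" + name)

    def add_qname(self, ns_index, base_name):
        return self._add("\x03" + chr(ns_index) + base_name)

    def add_cdata(self, text):
        return self._add("\x04" + text)

    def serialize(self):
        return "".join(self.entries)
```

For example, interning a namespace with prefix "x", then a QName in it, then a plain Name, yields indices 1, 2, and 3 in that order.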



6. Prolog PIs, Comments
If there are Processing Instructions and/or Comments in the document before 
any DocType declaration, they go here.
Do NOT put the XML Declaration here. It is addressed in section 3.
This section ends when you hit either 0x7 (a DocType declaration), or 0x8 or 
0xB (the Document Element).

	PI
		0x5, [index], text content of the PI
		The [index] is for the Name or QName that is the target of the PI. It is 
possible for there to be no text content.

	Comment
		0x6, followed by the content of the comment


7. DocType Declaration and DTD
This section (if present) always starts with a DocType declaration.
This may be followed by a PUBID and SYSID (in any order), if these are 
present in the document.
Next are any declarations in the Internal DTD Subset (if any).
This section ends with 0x5, 0x6 (a PI or Comment following the DTD, go to 
section 8), or 0x8, 0xB (Document Element).

	DocType Declaration
		0x7, followed by the name of the doctype

	PUBID
		0x1, followed by the text of the formal public identifier

	SYSID
		0x2, followed by the URI for the System ID

I'm going to skip the internal DTD subset, and come back to it later.



8. Prolog PIs, Comments
This is just like section 6, except it can't be followed by a DocType 
declaration.
This is for PIs and Comments that follow the DTD, but precede the Document 
Element.



9. Document Element and Contents
This is, of course, the meat of the XML Document. In most binxml, this will 
immediately follow the String Table.
Everything in this section is represented in the same order it appears in 
the source document. Attributes immediately follow their containing element. 
The two different Attribute types may be freely interchanged. Attributes 
that declare namespaces (ie, namespace nodes) are not represented. This 
section ends at the end of the first element.

	ElementStart
		0x8, [index]
		[index] is for the Name or QName of this element. Any Attributes must 
follow next. Everything else following, until an EndElement token is 
reached, is contained by this element.

	EmptyElementStart
		0xB, [index]
		Like ElementStart, except this element has no child elements or other 
content - attributes only. Any element start token immediately following 
this one is a sibling, not a child.

	EndElement
		0x6

	AttributeInterned
		0xC, [index], [index2]
		[index] is the Name or QName of this attribute. Only use a QName if the 
document had this attribute EXPLICITLY qualified (ie, a global attribute). 
[index2] is the entry for the value of this attribute. It does not have to 
be a CDataEntry - it may be any other kind as well.

	AttributeLiteral
		0x7, [index], text value of attribute
		This attribute has the value inline instead of in the String Table. If you 
need an Entity Reference inside the attribute value, you may include it.

	EntityReferenceInsideAttribute
		0x0, [index]


The other tokens can be present in any order inside the content of an 
element. If text exists without a starting token, it is just a regular text 
node.


	CData
		0x4, text inside the CData Section

	PI
		0x11, [index], text inside the PI

	Comment
		0x10, text of the comment

	EntityReference
		0x5, [index]

	Text
		0x3, the text itself
		This token is only used when a text node immediately follows a comment, a 
PI, a CDATA Section, or a literal attribute value. Otherwise text identifies 
itself without any token.

	Interned Cdata
		0x2, [index]
		The index is to a String Table entry of any type. The contents of that 
Entry are copy/pasted right here.
		This may appear inside of Text, a Comment, PI, or literal Attribute Value.
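
A rough sketch of how section 9 might be emitted for a simple document, using only ElementStart, EmptyElementStart, EndElement, AttributeLiteral, and Text. It assumes every [index] is written as chr(index), that a dict of String Table indices (here called names) already exists, and that the document uses no namespaces, comments, PIs, or CDATA sections. The function name encode_element is hypothetical.

```python
import xml.etree.ElementTree as ET

ELEMENT_START, EMPTY_ELEMENT_START, END_ELEMENT = "\x08", "\x0b", "\x06"
ATTRIBUTE_LITERAL, TEXT = "\x07", "\x03"


def encode_element(elem, names):
    """Encode elem and its content; names maps Name -> String Table index."""
    has_content = bool(elem.text) or len(elem) > 0
    start = ELEMENT_START if has_content else EMPTY_ELEMENT_START
    out = [start + chr(names[elem.tag])]
    for attr, value in elem.attrib.items():
        out.append(ATTRIBUTE_LITERAL + chr(names[attr]) + value)
    if elem.text:
        # Text right after a literal attribute value needs the Text token.
        out.append((TEXT if elem.attrib else "") + elem.text)
    for child in elem:
        out.append(encode_element(child, names))
        if child.tail:
            # The child's output ends in a literal attribute value only if
            # it was an empty element that carried attributes.
            child_is_empty = not child.text and len(child) == 0
            needs_token = child_is_empty and bool(child.attrib)
            out.append((TEXT if needs_token else "") + child.tail)
    if has_content:
        out.append(END_ELEMENT)
    return "".join(out)
```

With names = {"a": 1, "x": 2, "b": 3}, the document <a x="1">hi<b/>yo</a> becomes: ElementStart 1, AttributeLiteral 2 "1", Text "hi", EmptyElementStart 3, "yo", EndElement.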


10. Suffix PIs and Comments
	If you have any PIs or Comments after the Document Element that you care 
about, put them here. This is just like sections 6 or 8.



DTDs, which I said I would come back to.
	After the DocType declaration (section 7), may follow any number of these 
DTD Tokens, in (mostly) document order.
	There will be no Marked sections or Parameter Entities, as they aren't 
allowed inside the internal subset.
	Attlist declarations are folded into the element they go with.
	A different token is used for an element declaration depending on the 
content type.

	ElementDecl, Content Type 'EMPTY'
		0x3, [index]

	ElementDecl, Content Type 'ANY'
		0x4, [index]

	ElementDecl, Detailed Content Type Specified
		0x6, [index], followed by Content Stuff

	Content Stuff
		in any order, one of the characters "(),|?+*" or 0x7 followed by [index], 
or 0x0 (meaning #PCDATA)
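
As an illustration of Content Stuff, a content model such as "(#PCDATA | b)*" could be encoded along these lines. This is only a sketch: whitespace is dropped, each [index] is assumed to be written as chr(index), and the names encode_content_model and names are hypothetical.

```python
import re

PUNCTUATION = set("(),|?+*")
# Split a content model into #PCDATA, punctuation characters, and element names.
TOKEN_RE = re.compile(r"#PCDATA|[(),|?+*]|[^\s(),|?+*]+")


def encode_content_model(spec, names):
    """Encode a content model; names maps element Name -> String Table index."""
    out = []
    for tok in TOKEN_RE.findall(spec):
        if tok == "#PCDATA":
            out.append("\x00")
        elif tok in PUNCTUATION:
            out.append(tok)  # punctuation is kept as-is
        else:
            out.append("\x07" + chr(names[tok]))
    return "".join(out)
```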


	Any Attributes for this Element must be declared next. A different token or 
token-pair is used depending on the type of the attribute. There are forty 
attribute types: the cross product of {REQUIRED, IMPLIED, default value, 
fixed default value} and { 
CDATA,ID,IDREF,IDREFS,ENTITY,ENTITIES,NMTOKEN,NMTOKENS,enumerated , 
enumerated notations}. I have tried to optimize it so the most commonly used 
declarations take just one token, while the most obscure ones take two.
Any Fixed, Default, or enumerated attribute values must be in the String 
Table. The indexes for these below are shown as [fixed index] or [default 
index]. Enumerated types may have any number of index entries, terminated by 
a 0x0. For fixed or Default enumerated types, the first one listed is the 
default.


	REQUIRED_CDATA			0x17, [index]
	IMPLIED_CDATA			0x18, [index]
	FIXED_CDATA				0x19, [index], [fixed index]
	DEFAULT_CDATA			0x1A, [index], [default index]

	REQUIRED_ID				0xC,	0x1, [index]
	IMPLIED_ID				0x1B, [index]
	FIXED_ID				0xC,	0x2, [index], [fixed index]
	DEFAULT_ID				0xC,	0x3, [index], [default index]

	REQUIRED_IDREF			0xC,	0x4, [index]
	IMPLIED_IDREF			0x1C, [index]
	FIXED_IDREF				0xC,	0x5, [index], [fixed index]
	DEFAULT_IDREF			0xC,	0x6, [index], [default index]

	REQUIRED_IDREFS			0xC,	0x7, [index]
	IMPLIED_IDREFS			0x1D, [index]
	FIXED_IDREFS			0xC,	0x8, [index], [fixed index]
	DEFAULT_IDREFS			0xC,	0x9, [index], [default index]

	REQUIRED_ENTITY			0xC,	0xa, [index]
	IMPLIED_ENTITY			0xC,	0xb, [index]
	FIXED_ENTITY			0xC,	0xc, [index], [fixed index]
	DEFAULT_ENTITY			0xC,	0xd, [index], [default index]

	REQUIRED_ENTITIES			0xC,	0xe, [index]
	IMPLIED_ENTITIES			0xC,	0xf, [index]
	FIXED_ENTITIES			0xC,	0x10, [index], [fixed index]
	DEFAULT_ENTITIES			0xC,	0x11, [index], [default index]

	REQUIRED_NMTOKEN			0xC,	0x12, [index]
	IMPLIED_NMTOKEN			0xC,	0x13, [index]
	FIXED_NMTOKEN			0xC,	0x14, [index], [fixed index]
	DEFAULT_NMTOKEN			0xC,	0x15, [index], [default index]

	REQUIRED_NMTOKENS			0xC,	0x16, [index]
	IMPLIED_NMTOKENS			0xC,	0x17, [index]
	FIXED_NMTOKENS			0xC,	0x18, [index], [fixed index]
	DEFAULT_NMTOKENS			0xC,	0x19, [index], [default index]

	REQUIRED_ENUM			0x1E, [index], [value index 1] ... [value index n], 0x00
	IMPLIED_ENUM			0x1F, [index], [value index 1] ... [value index n], 0x00
	FIXED_ENUM				0x1, [index],  [value index 1] ... [value index n], 0x00
	DEFAULT_ENUM			0x2, [index], [default index],  [value index 1] ... [value index n], 0x00

	REQUIRED_NOTATIONENUM		0xC,	0x1a, [index], [value index 1] ... [value index n], 0x00
	IMPLIED_NOTATIONENUM		0xC,	0x1b, [index], [value index 1] ... [value index n], 0x00
	FIXED_NOTATIONENUM		0xC,	0x1c, [index], [fixed index], [value index 1] ... [value index n], 0x00
	DEFAULT_NOTATIONENUM		0xC,	0x1d, [index], [default index], [value index 1] ... [value index n], 0x00
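
The forty declarations above can be regenerated mechanically, which is a handy way to check that no two attribute declarations share a token sequence. This is a sketch only: the names are hypothetical, and it relies on the observation that the second token of each 0xC pair simply counts up from 0x1 in table order.

```python
DEFAULTS = ("REQUIRED", "IMPLIED", "FIXED", "DEFAULT")
TYPES = ("CDATA", "ID", "IDREF", "IDREFS", "ENTITY", "ENTITIES",
         "NMTOKEN", "NMTOKENS", "ENUM", "NOTATIONENUM")

# Declarations that get a single token of their own.
SINGLE = {
    ("REQUIRED", "CDATA"): 0x17, ("IMPLIED", "CDATA"): 0x18,
    ("FIXED", "CDATA"): 0x19, ("DEFAULT", "CDATA"): 0x1A,
    ("IMPLIED", "ID"): 0x1B, ("IMPLIED", "IDREF"): 0x1C,
    ("IMPLIED", "IDREFS"): 0x1D,
    ("REQUIRED", "ENUM"): 0x1E, ("IMPLIED", "ENUM"): 0x1F,
    ("FIXED", "ENUM"): 0x01, ("DEFAULT", "ENUM"): 0x02,
}


def build_attdecl_table():
    """Map (default kind, attribute type) -> tuple of leading token values."""
    table = {}
    second = 0x01  # second token of the next 0xC pair
    for attr_type in TYPES:
        for default in DEFAULTS:
            key = (default, attr_type)
            if key in SINGLE:
                table[key] = (SINGLE[key],)
            else:
                table[key] = (0x0C, second)
                second += 1
    return table


ATTDECL = build_attdecl_table()
assert len(ATTDECL) == 40
assert len(set(ATTDECL.values())) == 40  # no two declarations collide
```

This only checks attribute declarations against each other, not against the other tokens that may appear in the DTD section.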

Other things you might see in the Internal DTD Subset:
PUBID and SYSID are just like in section 7, both are optional, and may occur 
in either order.


	NotationDeclaration
		0x14, [index], PUBID?, SYSID?

	PI
		0x15, [index], content

	Comment
		0x16, content

	Internal Entity Decl
		0x12, [index], replacement text
		If you need to embed another entity reference in the replacement text, 
stick in ( 0x13, [index] )

	Entity Reference inside an Entity Decl
		0x13, [index]

	Parsed External Entity Decl
		0xF, [index], PUBID?, SYSID?

	Unparsed External Entity Decl
		0xE, [index], [ndata index], PUBID?, SYSID?

	Interned Cdata
		0x2, [index]
		This may only appear (in the DTD) inside of the content of a comment or PI


Whew!
Not that complicated, but kind of tedious.
I hope there are no tokens which would be ambiguous - if there are, it's an 
error of mine.

Open Questions:
	Should further compression be done for text content?
	Should it be allowed for the string table to be sprinkled throughout the 
document, to make it easier to stream-encode XML?

Feel free to tell me if you think this is crap, I can take it.
Constructive comments are even more welcome.



-Wayne Steele



