Re: "Binary XML" proposals

On Tue, 10 Apr 2001, Tim Bray wrote:

> So, Sean may have used strong language, but in point of fact
> he was correct, so it's forgivable.  Get some data on how
> much space and time a binary representation will save, then
> you'll be able to make intelligent quantitative decisions 
> on where it's worthwhile deploying it.

Well, the encoding I am considering will fit a document into a number of
bytes that can be calculated thus:

1) Count the number of discrete namespace URIs, attribute names, PI
targets, and element names in the document. The same element name under
two different namespaces counts as the *same* element name for this
purpose. Add the number of bytes (UTF-8) in all of these names (don't
include namespace prefixes on names), plus two per name (one for the byte
tag saying "this is a symbol def", one for the NUL terminator).

2) Count the number of processing instructions. For each PI, allocate
seven bytes (tag + 16 bit symbol number for PI target name + 32 bit
content length) plus the number of bytes required to encode the string
inside the PI.

3) Count the number of start-elements. Allocate five bytes each (1 byte
tag, 16 bit namespace symbol ID, 16 bit element name ID).

4) Count the number of end-elements. Allocate a byte each.

5) Count the number of spans of CDATA, including whitespace (for now we'll
assume all whitespace is significant rather than looking in DTDs of
DSLs). Allocate five bytes (tag byte + 32 bit length) plus the length of
the data (expand all character entity references to UTF-8!) per CDATA.

6) Count the number of attributes, and allocate for each a one-byte tag,
16 bits of namespace ID, 16 bits of name ID, and 32 bits of length (nine
bytes in all), plus the size of the string in UTF-8.

I won't bother with the rules for entities for now...
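
To make the arithmetic in rules 1-6 concrete, here is a rough Python
sketch that walks a document with the standard library's SAX parser and
totals the bytes. It is only an estimate under some assumptions the rules
leave open: all names share a single symbol table (so an element and an
attribute both called "id" are defined once), a name in no namespace uses
a reserved namespace ID that needs no symbol definition, and adjacent
character chunks reported by the parser count as one span. The names
SizeEstimator and estimate_size are made up for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class SizeEstimator(ContentHandler):
    """Totals the byte counts from rules 1-6 (entities are ignored)."""

    def __init__(self):
        super().__init__()
        self.symbols = set()   # rule 1: URIs, element/attribute names, PI targets
        self.body_bytes = 0    # rules 2-6
        self._text = []        # pending chunks of one character-data span

    def _flush_text(self):
        # Rule 5: tag byte + 32-bit length + UTF-8 data per span.
        if self._text:
            data = "".join(self._text).encode("utf-8")
            self.body_bytes += 5 + len(data)
            self._text = []

    def startElementNS(self, name, qname, attrs):
        self._flush_text()
        uri, local = name
        if uri:  # assumption: "no namespace" needs no symbol definition
            self.symbols.add(uri)
        self.symbols.add(local)
        self.body_bytes += 5  # rule 3: tag + namespace ID + name ID
        for (a_uri, a_local), value in attrs.items():
            if a_uri:
                self.symbols.add(a_uri)
            self.symbols.add(a_local)
            # Rule 6: tag + namespace ID + name ID + length, then the value.
            self.body_bytes += 9 + len(value.encode("utf-8"))

    def endElementNS(self, name, qname):
        self._flush_text()
        self.body_bytes += 1  # rule 4

    def characters(self, content):
        # The parser may split one span into several calls; merge them.
        self._text.append(content)

    def processingInstruction(self, target, data):
        self._flush_text()
        self.symbols.add(target)
        # Rule 2: tag + target symbol + length, then the PI content.
        self.body_bytes += 7 + len(data.encode("utf-8"))

    def endDocument(self):
        self._flush_text()

    def total(self):
        # Rule 1: UTF-8 name bytes + tag byte + NUL terminator per symbol.
        symbol_bytes = sum(len(s.encode("utf-8")) + 2 for s in self.symbols)
        return symbol_bytes + self.body_bytes


def estimate_size(xml_bytes):
    """Return the estimated encoded size, in bytes, of an XML document."""
    handler = SizeEstimator()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # report (uri, localname) pairs
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.total()


if __name__ == "__main__":
    doc = b'<a x="1"><b>hi</b></a>'
    print(len(doc), estimate_size(doc))  # prints "22 38"
```

For that tiny document the symbol definitions and fixed-width headers
outweigh the content, so the estimate (38 bytes) is larger than the text
form (22 bytes); the interesting numbers will come from running something
like this over real documents with many repeated names.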

> Until then, it's just amusing speculation.  -Tim

Everything has to start with speculation :-) But as things stand there are
numerous proprietary or domain-specific binary XML hacks appearing,
presumably because people feel that text-encoded XML is not efficient
enough. Even if they are wrong, it would be good to offer a lightning
conductor for that wrongness in a standardised binary encoding with a
decent and widely available set of tools rather than having it proliferate
behind the skirting boards, no?


                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software