OASIS Mailing List Archives





   Re: [xml-dev] Compiled XML


On Tuesday 26 March 2002 20:02, Mike Champion wrote:

> The first issue is definitely one best handled by compression, but whether
> a generic compression such as gzip or a compression scheme that exploits
> the specific regularity of XML is the better choice is still debated.
> Well, at least for wireless devices one can make a credible case that an
> XML-specific compression scheme makes more efficient use of the various
> limited resources on a wireless device.  In general, though, you would be
> well-advised to not try to compress arbitrary text better than gzip can.
> You'll fail.

Ahem! Arithmetic encoding and block sorting come to mind for a start - 
combine those two and you can shave off a good few tens of percent over gzip, 
IIRC... and even using gzip, you can do better by gzipping the element 
content separately from the XML syntax.

goes into "person_name_Alaric_alaric@alaric-snell.com", gzipped (the _ is 
U+001E, RECORD SEPARATOR - those control characters come in handy!) along 
with (preceded by?) a string of packed 3-bit codes, where the possible values 
are:

000 - text node; read a string from the data stream up to a U+001E
001 - open element; read the element name from the data stream up to a U+001E
010 - close element
011 - attribute; read the attribute name from the data stream up to a U+001E,
      then the attribute value up to a U+001E
100 - processing instruction, also used for <?xml version='1.0'?> (this is a
      purely *syntactic* encoding); content read from the data stream
101 - comment; content read from the data stream
110 - <!DOCTYPE [content read from data stream]>
111 - End of document
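
A minimal Python sketch of the encoding side, to make the table concrete. The
event tuples, the function names and the input document
<person name="Alaric">alaric@alaric-snell.com</person> are assumptions chosen
to match the data string quoted above; the post itself fixes neither the bit
order nor the padding, so packing most significant bit first and zero-padding
the final byte are also assumptions. Each string is terminated by U+001E here
so that "read up to a U+001E" always finds its end.

```python
import gzip

RS = "\x1e"  # U+001E, separator between strings in the data stream

# 3-bit command codes, in the order given in the table above
TEXT, OPEN, CLOSE, ATTR, PI, COMMENT, DOCTYPE, END = range(8)

def pack_codes(codes):
    """Pack 3-bit codes into bytes, most significant bit first (assumed)."""
    bits = nbits = 0
    out = bytearray()
    for code in codes:
        bits = (bits << 3) | code
        nbits += 3
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1          # keep only the unwritten bits
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)

def encode(events):
    """Split (code, *strings) events into a command stream and a data stream."""
    codes, strings = [], []
    for code, *args in events:
        codes.append(code)
        strings.extend(args)
    codes.append(END)
    data = "".join(s + RS for s in strings)
    return pack_codes(codes), data

# Hypothetical input: <person name="Alaric">alaric@alaric-snell.com</person>
events = [
    (OPEN, "person"),
    (ATTR, "name", "Alaric"),
    (TEXT, "alaric@alaric-snell.com"),
    (CLOSE,),
]
commands, data = encode(events)
compressed = gzip.compress(data.encode("utf-8"))  # gzip the data stream alone
```

For this input the command stream is the five codes 001 011 000 010 111, which
fit in two bytes; the data stream holds only names and content.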

As one potential optimisation (gzip has a limited window size of 32 KB, so it 
sometimes needs some hand-holding with repeated strings), you could define that 
a string in the data stream consisting of U+001B (ESCAPE) followed by a 16-bit 
unsigned integer in network byte order is a repeat of the string that many 
strings ago - this is useful for dealing with element and attribute names and 
even some repeated content.
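
A rough sketch of that back-reference idea in Python. Several details are
assumptions the post leaves open: a reference is exactly three bytes with no
trailing U+001E (the integer itself may contain the byte 0x1E, so the reader
must not scan for a separator), back-referenced strings count as strings for
later distances, literal strings never start with U+001B, and strings of two
bytes or fewer are written literally because a reference would not be shorter.
The function names are illustrative only.

```python
import struct

RS = b"\x1e"   # U+001E, terminates a literal string
ESC = b"\x1b"  # U+001B, introduces a back-reference

def write_strings(strings):
    """Serialise strings, replacing repeats with ESC + 16-bit distance."""
    out = bytearray()
    last_seen = {}  # encoded string -> index of its most recent occurrence
    for i, s in enumerate(strings):
        raw = s.encode("utf-8")
        prev = last_seen.get(raw)
        if prev is not None and i - prev <= 0xFFFF and len(raw) > 2:
            out += ESC + struct.pack(">H", i - prev)  # network byte order
        else:
            out += raw + RS
        last_seen[raw] = i
    return bytes(out)

def read_strings(data):
    """Inverse of write_strings."""
    strings = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 1] == ESC:
            (distance,) = struct.unpack_from(">H", data, pos + 1)
            strings.append(strings[-distance])  # "that many strings ago"
            pos += 3
        else:
            end = data.index(RS, pos)
            strings.append(data[pos:end].decode("utf-8"))
            pos = end + 1
    return strings

names = ["item", "id", "1", "item", "id", "2"]
encoded = write_strings(names)
decoded = read_strings(encoded)
```

Here the second "item" becomes a three-byte reference to the string three
positions back, while "id" stays literal because it is too short to benefit.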

Decoding consists of opening the command and data streams side by side (for 
streaming, ideally they would be in two intertwined gzipped streams) and 
converting the command stream into SAX events, pulling stuff from the data 
stream when required. Encoding consists of converting SAX events to command 
stream codes, merging adjacent character events and removing whitespace.
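
The decoding loop might look roughly like this in Python. Plain tuples stand
in for SAX callbacks, the two streams are assumed to be already decompressed,
the 3-bit codes are assumed to be packed most significant bit first, and every
string in the data stream is assumed to end with U+001E; none of those details
are pinned down above, and the names are illustrative.

```python
RS = "\x1e"  # U+001E, terminates each string in the data stream

# Event name and number of strings to pull from the data stream, per code
NAMES = ["text", "open", "close", "attribute", "pi", "comment", "doctype", "end"]
ARITY = [1, 1, 0, 2, 1, 1, 1, 0]

def read_codes(packed):
    """Yield 3-bit codes from packed bytes, most significant bit first."""
    bits = nbits = 0
    for byte in packed:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 3:
            nbits -= 3
            yield (bits >> nbits) & 0b111
        bits &= (1 << nbits) - 1  # keep only the bits not yet consumed

def decode(packed, data):
    """Walk the command stream, pulling strings from the data stream."""
    fields = iter(data.split(RS))
    for code in read_codes(packed):
        if NAMES[code] == "end":
            return  # anything after End of document is padding
        args = [next(fields) for _ in range(ARITY[code])]
        yield (NAMES[code], *args)

commands = bytes([0x2C, 0x2E])  # codes 001 011 000 010 111, zero-padded
data = "person\x1ename\x1eAlaric\x1ealaric@alaric-snell.com\x1e"
events = list(decode(commands, data))
```

A real decoder would call startElement, characters and so on instead of
yielding tuples, and would have to buffer attributes until the next
non-attribute code, since SAX delivers them together with the element start.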

That was just off the top of my head - there is potential for improvement, 
of course.

> The binary XML issue comes up every few months and generates a lot of
> dispute. The "mainstream" position seems to be that XML is really not all
> that hard to parse, the parsers are well-optimized, the overhead of doing
> the byte swapping and other binary format conversion to transfer parsed
> data from one platform to another outweighs any theoretical advantage of
> having a "compiled" form,

Endianness conversion is less of a hassle than converting to and from 
ASCII-coded decimal, I would like to note :-)

Endianness conversion is as little as a single instruction on most CPUs, 
while converting from base 10 involves... integer multiplication! Looping! 
Exception handling! Ew!
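
To illustrate the contrast, a small Python sketch with both conversions
written out by hand. The function names are made up for this example, and
Python hides the actual machine instructions, so this only shows the shape of
the work: a fixed byte shuffle on one side, a loop with a multiply and an
error path on the other.

```python
def swap32(x):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return int.from_bytes(x.to_bytes(4, "big"), "little")

def parse_decimal(text):
    """Convert ASCII decimal digits to an integer, one digit at a time."""
    if not text:
        raise ValueError("empty number")
    value = 0
    for ch in text:                        # looping
        digit = ord(ch) - ord("0")
        if not 0 <= digit <= 9:            # exception handling
            raise ValueError(f"bad digit {ch!r}")
        value = value * 10 + digit         # integer multiplication
    return value

swapped = swap32(0x12345678)
parsed = parse_decimal("305419896")
```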

> and that the whole issue is a red herring.  I
> expect that the holders of the minority view (Hi, Al!)

Hi, Mike! How's the weather? :-)

> will let you know
> their response.


                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software  


