On Tuesday 26 March 2002 20:02, Mike Champion wrote:
> The first issue is definitely one best handled by compression, but whether
> a generic compression such as gzip or a compression scheme that exploits
> the specific regularity of XML is still debated. Well, at least for
> wireless devices one can make a credible case that an XML-specific
> compression scheme is more efficient of the various limited resources on a
> wireless device. In general, though, you would be well-advised to not try
> to compress arbitrary text better than gzip can. You'll fail.
Ahem! Arithmetic encoding and block sorting come to mind for a start -
combine those two and you can shave off a good few tens of percent over gzip,
IIRC... and even using gzip, you can do better by gzipping the element
content separately from the XML syntax.
Eg,
<person><name>Alaric</name><email>alaric@alaric-snell.com</email></person>
goes into "person_name_Alaric_email_alaric@alaric-snell.com", gzipped (the _ is
U+001E, RECORD SEPARATOR - those control characters come in handy!) along
with (preceded by?) a string of packed 3-bit codes, where the possible values
are:
000 - text node; read a string from the data stream up to a U+001E
001 - open element; read the element name from the data stream up to a U+001E
010 - close element
011 - attribute; read the attribute name from the data stream up to a U+001E,
then the attribute value up to a U+001E
100 - processing instruction, also used for <?xml version='1.0'?>; this is a
purely *syntactic* encoding. Content read from data stream.
101 - comment; content read from the data stream
110 - <!DOCTYPE [content read from data stream]>
111 - End of document
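As a rough sketch of the two streams for the <person> example, here is one way the packing might look in Python. The bit order (most significant bit first), the zero padding of the last byte and the trailing U+001E after every string are my assumptions; the description above does not pin them down. Note that the data stream carries the element name "email" too, since the open-element code reads a name.

```python
import gzip

RS = "\x1e"  # U+001E RECORD SEPARATOR, terminates each string in the data stream

# The eight 3-bit codes from the table above, in order 000..111.
TEXT, OPEN, CLOSE, ATTR, PI, COMMENT, DOCTYPE, END = range(8)


def pack_codes(codes):
    """Pack 3-bit codes into bytes, MSB first, zero-padding the last byte (assumed layout)."""
    bits = 0
    nbits = 0
    for code in codes:
        bits = (bits << 3) | code
        nbits += 3
    pad = (-nbits) % 8
    bits <<= pad
    return bits.to_bytes((nbits + pad) // 8, "big")


# <person><name>Alaric</name><email>alaric@alaric-snell.com</email></person>
codes = [OPEN, OPEN, TEXT, CLOSE, OPEN, TEXT, CLOSE, CLOSE, END]
strings = ["person", "name", "Alaric", "email", "alaric@alaric-snell.com"]

command_stream = pack_codes(codes)
data_stream = (RS.join(strings) + RS).encode("utf-8")

# The two streams are compressed separately, so markup and content do not
# pollute each other's statistics.
compressed_commands = gzip.compress(command_stream)
compressed_data = gzip.compress(data_stream)
```

On a document this small the gzip headers swamp any saving; the point is only to show the layout.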
As one potential optimisation (gzip has a limited window size, so needs some
hand holding with repeated strings sometimes), you could define that a string
in the data stream of the form 'U+001B' (ESCAPE) followed by a 16 bit network
byte order unsigned integer is considered as a repeat of the string that many
strings ago - this is useful for dealing with element and attribute names and
even some repeated content.
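A minimal sketch of that back-reference trick, under a few assumptions of mine: a distance of 1 means the immediately preceding string, a back-reference is still followed by the usual U+001E, a resolved back-reference counts as a string in the history, and literal strings never begin with U+001B. The function names are made up for illustration.

```python
import struct

RS = b"\x1e"   # U+001E, string terminator
ESC = b"\x1b"  # U+001B, introduces a back-reference


def encode_strings(strings):
    """Write strings to a data stream, replacing repeats with ESC + 16-bit distance."""
    out = bytearray()
    last_seen = {}  # raw string -> index of its most recent occurrence
    for index, s in enumerate(strings):
        raw = s.encode("utf-8")
        distance = index - last_seen[raw] if raw in last_seen else None
        # A back-reference costs 3 bytes, so only use it when it is shorter.
        if distance is not None and distance <= 0xFFFF and len(raw) > 3:
            out += ESC + struct.pack("!H", distance)  # "!H" = network order, unsigned 16 bit
        else:
            out += raw
        out += RS
        last_seen[raw] = index
    return bytes(out)


def decode_strings(data):
    """Inverse of encode_strings: expand back-references into the original strings."""
    history = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 1] == ESC:
            # Read the two distance bytes unconditionally; they may contain 0x1E.
            (distance,) = struct.unpack_from("!H", data, pos + 1)
            raw = history[-distance]
            pos += 3
            if data[pos:pos + 1] != RS:
                raise ValueError("back-reference not followed by U+001E")
            pos += 1
        else:
            end = data.index(RS, pos)
            raw = data[pos:end]
            pos = end + 1
        history.append(raw)
    return [raw.decode("utf-8") for raw in history]
```

The decoder has to check for the escape before it scans for the terminator, because the distance bytes are binary and can themselves be 0x1E.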
Decoding consists of opening the command and data streams side by side (for
streaming, ideally they would be in two intertwined gzipped streams) and
converting the command stream into SAX events, pulling stuff from the data
stream when required. Encoding consists of converting SAX events to command
stream codes, merging adjacent character events and removing whitespace.
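The decoding side might be sketched like this. It yields plain tuples rather than calling a real SAX ContentHandler, and it assumes the codes are packed most significant bit first with zero padding after the end-of-document code; both are my assumptions, as are the event names. A real SAX adapter would also have to buffer attribute events so they can be delivered together with the start of their element.

```python
RS = "\x1e"  # U+001E, terminates each string in the data stream


def unpack_codes(packed):
    """Yield 3-bit codes, MSB first, from the packed command stream."""
    bits = int.from_bytes(packed, "big")
    nbits = len(packed) * 8
    for shift in range(nbits - 3, -1, -3):
        yield (bits >> shift) & 0b111


def decode(packed, data):
    """Walk the command stream, pulling strings from the data stream as needed."""
    strings = iter(data.split(RS))
    for code in unpack_codes(packed):
        if code == 0b000:
            yield ("characters", next(strings))
        elif code == 0b001:
            yield ("startElement", next(strings))
        elif code == 0b010:
            yield ("endElement",)
        elif code == 0b011:
            name = next(strings)
            value = next(strings)
            yield ("attribute", name, value)
        elif code == 0b100:
            yield ("processingInstruction", next(strings))
        elif code == 0b101:
            yield ("comment", next(strings))
        elif code == 0b110:
            yield ("doctype", next(strings))
        else:
            # 111: end of document; any padding bits after it are ignored.
            yield ("endDocument",)
            return


# The <person> example: codes 001 001 000 010 001 000 010 010 111, then padding.
packed = bytes([0x24, 0x22, 0x12, 0xE0])
data = RS.join(["person", "name", "Alaric", "email", "alaric@alaric-snell.com"]) + RS
events = list(decode(packed, data))
```

Because the decoder is a generator, it never needs the whole document in memory, which is the property you want for streaming.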
That was just off the top of my head - there is potential for improvement,
of course.
> The binary XML issue comes up every few months and generates a lot of
> dispute. The "mainstream" position seems to be that XML is really not all
> that hard to parse, the parsers are well-optimized, the overhead of doing
> the byte swapping and other binary format conversion to transfer parsed
> data from one platform to another outweighs any theoretical advantage of
> having a "compiled" form,
Endianness conversion is less of a hassle than converting to and from
ASCII-coded decimal, I would like to note :-)
Endianness conversion is as little as a single instruction on most CPUs,
while converting from base 10 involves... integer multiplication! Looping!
Exception handling! Ew!
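To make the contrast concrete, here is a small sketch of the two conversions. Python hides the actual machine cost, so this only shows the shape of the work: a fixed byte shuffle on one side, a loop with a multiply and an error path on the other. The function names are mine.

```python
import struct


def swap32(value):
    """Reverse the byte order of a 32-bit unsigned integer: a fixed, branch-free shuffle."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def parse_decimal(text):
    """Parse ASCII decimal digits by hand: a loop, a multiply per digit and an error path."""
    if not text:
        raise ValueError("empty number")
    result = 0
    for ch in text:
        digit = ord(ch) - ord("0")
        if not 0 <= digit <= 9:
            raise ValueError(f"not a digit: {ch!r}")
        result = result * 10 + digit
    return result
```

Both swap32(0x78563412) and parse_decimal("305419896") produce the same integer, 0x12345678, but only the second can fail on its input.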
> and that the whole issue is a red herring. I
> expect that the holders of the minority view (Hi, Al!)
Hi, Mike! How's the weather? :-)
> will let you know
> their response.
ABS
--
Alaric B. Snell
http://www.alaric-snell.com/ http://RFC.net/ http://www.warhead.org.uk/
Any sufficiently advanced technology can be emulated in software