
RE: Request: Techniques for reducing the size of XML instances

On Thu, 26 Jul 2001, HUGHES,MARK (Non-HP-FtCollins,ex1) wrote:

>   Seconded.  gzip is simple, fast, ubiquitous, standard, and gives you far
> better compression than any binary substitution scheme ever will.
>   After all, gzip compresses both the tags *and* the content, and can
> identify repeated sequences of <tag>content</tag>.  Binary encodings
> can only compress the tags...

Whoa! Crazy incorrect statements alert!

1) gzip is not simple or fast. Have you studied the algorithms required to
implement it, or its memory requirements? You can shave them down by using
a naive brute-force implementation, but then it's slow. And CPU
power, memory, and memory bandwidth are *not* cheap in many important
application domains.

http://www.gzip.org/zlib/zlib_tech.html - with the default settings, 256 KB
of working memory is needed for compression and 44 KB for decompression.
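For what it's worth, that memory/ratio trade-off is tunable; a quick sketch
using Python's zlib bindings (memLevel and wbits are zlib's own knobs, the
sample data here is made up):

```python
import zlib

data = b"<item>hello world</item>" * 1000

# Default settings: 32 KB LZ77 window (wbits=15), memLevel=8.
default = zlib.compressobj(6, zlib.DEFLATED, 15, 8)
big = default.compress(data) + default.flush()

# Shrink the window to 512 bytes (wbits=9) and the internal hash
# tables (memLevel=1) to cut working memory dramatically, usually
# at some cost in compression ratio.
small = zlib.compressobj(6, zlib.DEFLATED, 9, 1)
tiny = small.compress(data) + small.flush()

# Both streams decompress back to the original; the decompressor's
# window must be at least as large as the compressor's.
assert zlib.decompress(big, 15) == data
assert zlib.decompress(tiny, 9) == data
```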

Not that deflate is a bad algorithm or zlib a bad implementation, but the
bit twiddling and block searching required for LZ77 and Huffman encoding
are awkward operations on von Neumann architectures that happen to have
byte- or word-aligned memory access... decoding is better on the LZ77 front
(it's a heavily read-optimised algorithm), but the Huffman stuff still
requires bit shuffling, because a Huffman data stream is inherently
bit-aligned rather than byte-aligned.
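To make that concrete, here is a minimal sketch (in Python, names invented
for illustration) of the bit buffer every Huffman decoder ends up carrying
around, since variable-length codes don't respect byte boundaries:

```python
class BitReader:
    """Reads an LSB-first bit stream, as deflate does, from bytes."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0       # next byte to fetch
        self.bitbuf = 0    # bits fetched but not yet consumed
        self.bitcount = 0

    def read_bits(self, n: int) -> int:
        # Refill the buffer a byte at a time until n bits are available;
        # this shift-and-mask dance is the "bit shuffling" in question.
        while self.bitcount < n:
            self.bitbuf |= self.data[self.pos] << self.bitcount
            self.pos += 1
            self.bitcount += 8
        value = self.bitbuf & ((1 << n) - 1)
        self.bitbuf >>= n
        self.bitcount -= n
        return value

# 0xAB is 10101011 in binary: the low 3 bits are 011, the rest 10101.
r = BitReader(b"\xab")
assert r.read_bits(3) == 3
assert r.read_bits(5) == 21
```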

2) gzip won't necessarily give "far better compression". A trivial way of
getting better compression than gzipping XML is to gzip binary XML.
deflate is great at compressing the text of the content, and if it doesn't
have redundant whitespace and tag syntax to represent, it can shave
off a good few percent extra. Even better would be to take apart the gzip
algorithm: LZ77 splits the input byte stream into a literal data stream
and a historical match stream; instead, it could emit a literal data
stream, a syntactic stream of SAX events minus the strings, and a
historical match stream, for the ultimate tuning of the Huffman engine.
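A back-of-the-envelope experiment along those lines (Python, with a made-up
one-byte-token "binary XML" encoding, so the numbers are only indicative):

```python
import zlib

# Textual XML: the same record repeated many times.
record = b"<point><x>3.14</x><y>2.72</y></point>"
text_xml = b"<doc>" + record * 500 + b"</doc>"

# A toy binary encoding: each distinct tag becomes a one-byte token.
# (A real binary-XML format would carry a tag table alongside.)
tokens = {b"<point>": b"\x01", b"</point>": b"\x02",
          b"<x>": b"\x03", b"</x>": b"\x04",
          b"<y>": b"\x05", b"</y>": b"\x06",
          b"<doc>": b"\x07", b"</doc>": b"\x08"}
binary_xml = text_xml
for tag, tok in tokens.items():
    binary_xml = binary_xml.replace(tag, tok)

# Fewer literal bytes to encode per repeat means the binary form
# still comes out ahead after deflate has done its work.
print(len(zlib.compress(text_xml)), len(zlib.compress(binary_xml)))
```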

But is it worth the bother? The applications where binary XML is desired
- high-throughput transaction processing systems, low-bandwidth
communications, and low-power processors with small memory - are not
generally places where complex compression algorithms are worthwhile, with
the exception of low-bandwidth comms where both ends have CPU cycles to
spare. For an embedded processor, the cost of parsing textual XML and the
cost of handling gzip can both be out of the question compared to dealing
with a very simple binary SAX stream; the storage overhead of this
over gzipped data often matters less than the speed of processing!
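For a sense of how simple such a stream can be, here is a toy encoding
(Python; the event codes and length-prefixed layout are invented for
illustration, not any proposed standard):

```python
# Toy binary SAX stream: one event-code byte, then length-prefixed strings.
START, END, TEXT = 0x01, 0x02, 0x03

def encode(events):
    """events: list of ('start', name) / ('end',) / ('text', data)."""
    out = bytearray()
    for ev in events:
        if ev[0] == "start":
            name = ev[1].encode()
            out += bytes([START, len(name)]) + name
        elif ev[0] == "end":
            out.append(END)
        elif ev[0] == "text":
            data = ev[1].encode()
            out += bytes([TEXT, len(data)]) + data
    return bytes(out)

def decode(stream):
    # No tokeniser, no entity handling, no charset sniffing: just
    # fixed-size reads, which is why it suits tiny processors.
    events, i = [], 0
    while i < len(stream):
        code = stream[i]; i += 1
        if code == END:
            events.append(("end",))
        else:
            n = stream[i]; i += 1
            s = stream[i:i + n].decode(); i += n
            events.append(("start", s) if code == START else ("text", s))
    return events

doc = [("start", "greeting"), ("text", "hello"), ("end",)]
assert decode(encode(doc)) == doc
```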


                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software