
RE: Request: Techniques for reducing the size of XML instances

>From: Al Snell [mailto:alaric@alaric-snell.com]
>>   Seconded.  gzip is simple, fast, ubiquitous, standard, and gives you
>> better compression than any binary substitution scheme ever will.
>Whoa! Crazy incorrect statements alert!

  My statements were neither crazy nor incorrect.  You, sirrah, are
mistaken.

  gzip is simple, because it's included as a standard library in almost
every language and platform these days.  In Python or Java, for instance,
you can open a gzip stream and treat it like any other stream.  Nobody
normally reimplements it from scratch, though any programmer worth the
name should be able to do it from the spec; it's not rocket science, just
a data structures and pattern matching problem.
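
  To illustrate the "treat it like any other stream" point, here's a
minimal Python sketch (the XML payload is just a made-up example):

```python
import gzip
import io

# Any XML payload; this one is a made-up example.
xml = b"<doc><item id='1'>hello</item><item id='2'>world</item></doc>"

# Write through a gzip stream exactly as you would any other file object.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(xml)
compressed = buf.getvalue()

# Reading back is just as transparent.
with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as f:
    assert f.read() == xml
```

Swap the BytesIO for a socket or file and nothing else changes; that's
the whole point of it being a stream.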

  It's faster than almost any other compression scheme that gives
anywhere near the same compression ratio.  There are a few close
contenders, but they fall down in the other factors, especially ubiquity.
You can't beat "already installed" for ease of use.

  I did not state that it was cheap on memory; it's not, but it's also
not a significant factor these days.  RAM is dirt cheap by the MB now.
These are not the '80s any more.  Maybe it's time to upgrade your
Apple ][.

>2) gzip won't necessarily give "far better compression". A trivial way of
>getting better compression than gzipping XML is to gzip binary XML.
>deflate is great at compressing the text of content, and if it doesn't
>have redundant whitespace and tag syntax to have to represent it can
>shave off a good few percent extra. Even better would be to take [...]
  "deflate is great at compressing the text of content" - BUT MARKUP IS
TEXT.

  Not to mention that you're suggesting doing TWO passes of encoding,
which you somehow think will be faster than ONE pass; first replace all
the tags with binary markers, then gzip, instead of just letting gzip
find the repeated sequences on its own...  *Which is what it's designed
to do.*

  I've done that very experiment, and got marginally better results out
of gzip than I did from pre-encoding the XML.  Binary XML is a dead end,
obsolete before it was even started.  Please stop wasting people's time
on it.
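
  The experiment is easy to rerun yourself.  A sketch, with a made-up
document and a deliberately naive tag-substitution pre-pass standing in
for "binary XML" (the marker table is hypothetical, not any real binary
XML scheme):

```python
import zlib

# A toy document with highly repetitive markup (made-up example).
xml = b"<record><name>a</name><value>1</value></record>" * 200

# Hypothetical pre-pass: replace each tag with a one-byte marker
# before compressing, as a stand-in for a binary XML encoding.
table = {b"<record>": b"\x01", b"</record>": b"\x02",
         b"<name>": b"\x03", b"</name>": b"\x04",
         b"<value>": b"\x05", b"</value>": b"\x06"}
binary = xml
for tag, marker in table.items():
    binary = binary.replace(tag, marker)

plain_gz = zlib.compress(xml, 9)     # one pass: deflate finds the tags
binary_gz = zlib.compress(binary, 9) # two passes: substitute, then deflate

# deflate already models the repeated tags itself, so the pre-pass
# typically buys very little; compare the sizes on your own data.
print(len(xml), len(plain_gz), len(binary_gz))
```

Results will vary with the document, which is exactly the point: the
pre-pass is extra work for a difference in the noise.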

>Not that deflate is a bad algorithm or zlib a bad implementation, but the
>bit twiddling and block searching required for LZ77 and Huffman encoding
>are awkward operations on von Neumann architectures that happen to have
>byte or word aligned memory access... decoding is better on the LZ77 front
>(it's a heavily read-optimised algorithm), but the Huffman stuff still
>requires bit shuffling because a Huffman data stream is inherently [...]

  <shrug>  Until recently, I'd never met any programmer who couldn't do
bit-twiddling as instinctively as they did addition.  Some of the latest
generation don't have any practice at it, which is a shame, but they
also don't have the inappropriately conservative memory and CPU habits
that we old fogies have.  But it's still very much not rocket science,
and it's only an issue if you're reimplementing it.  Why would you be
reimplementing it?
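
  For the record, here is roughly all the "bit shuffling" a deflate
decoder needs - a minimal LSB-first bit reader (deflate packs bits
LSB-first within each byte).  A sketch, not zlib's actual code:

```python
class BitReader:
    """Minimal LSB-first bit reader, the helper a Huffman/deflate
    decoder needs.  Illustrative sketch only."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0   # current byte index
        self.bit = 0   # current bit index within that byte

    def read_bit(self) -> int:
        b = (self.data[self.pos] >> self.bit) & 1
        self.bit += 1
        if self.bit == 8:
            self.bit = 0
            self.pos += 1
        return b

    def read_bits(self, n: int) -> int:
        # Assemble n bits, least significant first.
        value = 0
        for i in range(n):
            value |= self.read_bit() << i
        return value

r = BitReader(bytes([0b10110100]))
assert r.read_bits(4) == 0b0100  # low nibble comes out first
assert r.read_bits(4) == 0b1011
```

A couple of shifts and a mask.  Not rocket science.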

>But is it worth the bother?

  That's my fundamental question about binary XML, and IMO the answer is:
almost never.  Obviously you disagree, but I think the whole idea is
solving yesterday's problems, instead of focusing on tomorrow's.

>The applications where binary XML is desired - high throughput
>transaction processing systems, low bandwidth communications, and
>low-power processors with small memory - are not generally places where
>complex compression algorithms are worthwhile, with the exception of the
>low bandwidth comms if both ends have CPU cycles to spare. For an
>embedded processor, the costs of parsing textual XML and the costs of
>handling gzip can both be out of the question compared to dealing with a
>very simple binary SAX stream; the storage requirements of this over
>gzipped data are often less important than the speed of [...]

  Again, this is an area where you're looking back to Elder Days...
Most new embedded systems now have pretty decent CPUs, certainly fast
enough to run gzip if they have to trade off a little speed and RAM to
make up for a slow network pipe.
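
  And the RAM cost is bounded, not open-ended: deflate decompression
needs its 32 KB window plus whatever chunk size you choose.  A sketch of
chunked streaming decompression (the payload is a made-up example):

```python
import gzip
import io
import zlib

payload = b"<data>" + b"x" * 100_000 + b"</data>"  # made-up payload
compressed = gzip.compress(payload)

# Feed the gzip stream through in fixed-size chunks; the decompressor's
# working state stays bounded regardless of total stream length.
# (wbits=16+MAX_WBITS selects gzip framing in zlib.)
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
out = bytearray()  # collected here only so we can verify the round trip
src = io.BytesIO(compressed)
while chunk := src.read(4096):
    out += d.decompress(chunk)
out += d.flush()
assert bytes(out) == payload
```

In a real embedded pipeline you'd hand each decompressed chunk straight
to the parser instead of accumulating it.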

-- <a href="http://kuoi.asui.uidaho.edu/~kamikaze/"> Mark Hughes </a>