On Thursday 06 February 2003 11:58 am, you wrote:
> Alaric B. Snell wrote:
> >>* Competition with compression: xqML in its current format is as
> >> structured as XML so it too compresses well. In an experiment a
> >> 12 kB HTML document zipped to 2 kB. The (handwritten) BE for the
> >> same document took 3 kB and when zipped, it took less than 1000
> >> bytes.
> >Mmmm, it bugs me when people compare gzipped XML with $binary_format. They
> >should compare XML with $binary_format and gzipped XML with gzipped
> >$binary_format. gzipped $binary_format will, in general, be the smallest
> > of them all, and yet faster to read/write than gzipped XML.
> It doesn't necessarily (or even generally) work that way - compact
> binary formats don't generally compress down as well as text, so you end
> up with size(text) > size(binary) > size(compressed-binary) >
> size(compressed-text).
I've always found that compressed binary is smaller than compressed text, as
Tahir found. That makes sense logically too: both formats contain the same
character data, but the binary format has more compact representations of the
elements and so on.
Of course, one could design binary formats which compress badly, but I've
never found that they do by default.
You get a better compression *ratio* with text formats since there's more
redundant information to remove, but that doesn't improve the compression of
the non-redundant data.
> That seemed to be the case with my XMLS format
> (http://www.sosnoski.com/opensrc/xmls - still on hold, though I hope to
> get back to it soon). One of the oddities of how compression works...
Hmm. gzip works roughly like this:
1) Use a sliding window algorithm to find repeated substrings and replace
them with backreferences, expressed as the number of bytes backwards to copy
from and how many bytes to copy. The output of this is a mixture of literal
bytes and backreference tokens.
2) Take the stream of literal bytes and represent them by variable-bit-length
codes. The exact algorithm is a bit complex, but basically, very common
characters like spaces get codes of a couple of bits length, while uncommon
characters like * might get codes more than a byte long. Note also that the
encoding is adaptive - if there are a lot of Xs towards the beginning of the
file but a lot of Ys at the end then Ys will get given shorter and shorter
codes as you delve into the Y-zone.
3) Do the same variable-bit-length trick on the backreference distances and
lengths.
4) Merge the two streams, using a few tricks to separate backreferences from
literal data which I can't remember, so I won't detail them here.
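To make step 1 concrete, here's a toy sketch of the sliding-window pass in
Python (a naive O(n^2) search, nothing like the hash-chain matcher real gzip
uses, and the token format is made up for illustration):

```python
def lz77_tokens(data, window=32 * 1024, min_len=3):
    """Toy version of deflate's step 1: emit literal bytes and
    ("ref", distance, length) backreference tokens."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        # Naive search for the longest match earlier in the window.
        for j in range(max(0, i - window), i):
            l = 0
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, i - j
        if best_len >= min_len:
            out.append(("ref", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def expand(tokens):
    """Invert lz77_tokens, copying backreferenced bytes forward."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

# The repeated "</int>\n<int>" runs collapse into "ref" tokens.
tokens = lz77_tokens(b"<int>1</int>\n<int>2</int>\n<int>3</int>\n")
```

In real deflate the literals and match lengths then share one variable-bit-length
alphabet and the distances get another, which is steps 2 and 3.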
So what do we gain here?
1) If a string is used then used again later, the string can be replaced with
a backreference of a couple of bytes (including the effect of
variable-bit-length encoding on the backreference which in a tag-rich XML
document will soon give short codes to the average distance between <X> and
</X>).
2) If certain bytes are a lot more common than others then those bytes will
be represented by codes a few bits long - in XML, < > ' " = and space might
get represented in three bits each or so.
3) That sliding window used to look up repeated strings is only so big - 32 kB
in gzip's case - so repeats further apart than that can't be exploited.
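The effect of gain 2 is easy to estimate: the Shannon entropy of the byte
frequencies is a lower bound on what the variable-bit-length stage alone can
achieve. A quick sketch (the sample document is made up):

```python
import math
from collections import Counter

xml = b'<row a="1" b="22"/><row a="33" b="4"/>' * 50

counts = Counter(xml)
total = len(xml)
# Bits per byte if every byte got an ideal variable-length code;
# markup-heavy text uses few distinct bytes, so this is low.
entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
print(round(entropy, 2))  # well under 8 bits per byte
```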
How does this compare between binary and text compression, then?
The adaptive variable-bit-length encoder starts off with an initial
dictionary suited roughly for English text, IIRC, so <invocation> will
probably be encoded in around 4-5 bits per character. Likewise with
<methodName>logTemperatureReadings</, but then it replaces "methodName>" with
a backreference of a few bytes. <params> comes out as 4-5 bits per char too.
Then it gets interesting. We have a repeated sequence of " <int>", each of
which will probably soon be less than a byte of backreference since it will
keep coming up with backreference distances varying only in the length of the
integer. Likewise with "</int>\n". So we end up with, say, two bytes of
overhead per integer param. Then the numbers - the only literal bytes
occurring during the bulk of the file - will come out at about 4 bits each,
meaning that for each integer averaging 4 digits long we use up four bytes in
the compressed stream.
The </params>\n</invocation> bit may, if the file is less than 32k long,
refer back to "params>" and "invocation>" from before.
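That back-of-the-envelope figure is easy to check with Python's zlib (a
sketch: the method name and the invocation shape are just made up to resemble
the example above):

```python
import random
import zlib

random.seed(42)
nums = [random.randint(1000, 9999) for _ in range(500)]

xml = ("<invocation><methodName>logTemperatureReadings</methodName><params>\n"
       + "".join(" <int>%d</int>\n" % n for n in nums)
       + "</params></invocation>").encode()

compressed = zlib.compress(xml, 9)
# The tags collapse into backreferences, leaving roughly the
# predicted handful of bytes per integer parameter.
print(len(xml), len(compressed), len(compressed) / len(nums))
```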
A tight binary encoding of this, as generated by PER, would store the
parameter list as a variable-length number for the number of items to expect
then a variable-length number per integer. The VLNs for four-digit numbers
will tend to use two bytes each with reasonable entropy in the bottom byte -
meaning that all 256 values are about equally likely - and the top byte
typically only covering a restricted range; the full range of two bytes
(after losing two bits to the variable length encoding) would be from
0..16383 which the above numbers seem to use only a subset of, meaning that
the range of bytes used in that block of numbers would be slightly shy of
uniform, maybe allowing the compressor to squeeze each number into 14 or 15
bits each if it's lucky.
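For illustration, here's a minimal variable-length number encoding in the same
spirit - seven payload bits per byte, with the top bit flagging a
continuation. This is a common varint scheme rather than PER's actual
encoding, but it shows the two-bytes-for-four-digit-numbers behaviour
described above:

```python
def encode_vln(n):
    """Encode a non-negative integer, 7 bits per byte, low bits first;
    the top bit of each byte says whether another byte follows."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Two bytes carry 14 payload bits, i.e. 0..16383 - so every
# four-digit number fits in two bytes.
assert len(encode_vln(9999)) == 2
assert len(encode_vln(16383)) == 2
assert len(encode_vln(16384)) == 3
```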
Now, that's about half the size of the gzipped XML. In this case, the "
<int>" and "</int>" overheads are mainly squashed away by the sliding window
compressor, which has very little to work with in the binary case unless
there are redundant repeated sequences in the actual data. Also, the
restriction to the character set 0-9 for the numbers means the encoder can
fit each digit into three or four bits, but in the binary case the encoder
gets to work directly against the frequency distribution of the underlying
data.
The moral of this story is that binary formats tend to insert less redundant
information and inefficient encoding that gets between the compressor and the
actual information being compressed. If our temperature data happens to have
an uneven frequency distribution, mainly small integers with a few going up
above 5,000, the gzip algorithm will not be able to deduce this from the
decimal encodings of those numbers; all it sees are strings of digits. But in
binary form, a skewed distribution of integers is accessible to the encoder
as a skewed distribution of input bytes, that it can capitalise upon.
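A rough check of that claim (a sketch: the distribution, the 16-bit packing
and the sizes are all made up for illustration):

```python
import random
import struct
import zlib

random.seed(1)
# Mostly small readings, with a few spikes above 5,000.
nums = [random.randint(0, 200) if random.random() < 0.95
        else random.randint(5000, 6000)
        for _ in range(2000)]

text = "".join("<int>%d</int>\n" % n for n in nums).encode()
binary = b"".join(struct.pack("<H", n) for n in nums)

ctext, cbinary = zlib.compress(text, 9), zlib.compress(binary, 9)
# The skew is visible to the compressor as a skewed byte
# distribution (the high byte is almost always zero).
print(len(ctext), len(cbinary))
```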
> David Mertz has done some research in this area - see his article at
> http://www-106.ibm.com/developerworks/library/x-matters13.html Also see
OK, he brings up bzip2 - the bzip2 algorithm is somewhat weird, but it is
better at compressing repeated strings like element and attribute names than
gzip is.
He doesn't seem to know that zip and gzip use the same underlying algorithm,
deflate, which I've roughly summarised above. Oh well. The "improvement" in
gzip's compression is just because it has less header information, really.
In all of his examples the compressed 'raw' files are smaller than the
compressed xml files.
His structured XML approach doesn't gain much on hamlet, no, since hamlet's
mainly content text anyway. No surprises there.
He's still always encoding variations on the attributes and elements, though;
in that Apache weblog, the bytes-transferred count will still be written in
decimal even in the struct form, not as a binary number, so the compressor is
still working on strings of digits.
> James Cheney's paper "Compressing XML with Multiplexed Hierarchical PPM
> Models" at http://www.cs.cornell.edu/People/jcheney/xmlppm/paper/paper.html
Still just working around the encodings of close and start elements and all
that.
> Try a range of documents and see how the compression works out before
> making any claims.
I did; see your quoted references for more that support my statement :-)
Quoting the IBM one:
The .xml.bz2 is bigger than the .struct.bz2, showing that the binaryish
format compressed smaller than the XML one.
Try a range of documents and see how the compression works out before making
any claims :-)
> For compression of XML text bzip2 looks like the best
> choice from what I've seen, so that should probably be the basis for
Yes. It's interesting to see how it varies between different types of XML
document, too: text-rich (hamlet), tag-rich (the weblog, kind of - it has the
same tag structure repeated over and over again, though I'm thinking more of
things like XSLT here) and data-rich (the weblog again). IIRC my comparisons
involved some large DocBook sources, some XSLT, and a data set that was
inherently tabular in nature, like the weblog example.
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen