   Re: [xml-dev] Xqueeze: Compact XML Alternative


On Thursday 06 February 2003 11:58 am, you wrote:
> Alaric B. Snell wrote:
> >>* Competition with compression: xqML in its current format is as
> >>  structured as XML so it too compresses well. In an experiment[1] a
> >>  12 kB HTML document zipped to 2 kB. The (handwritten) BE for the
> >>  same document took 3 kB and when zipped, it took less than 1000
> >>  bytes.
> >
> >Mmmm, it bugs me when people compare gzipped XML with $binary_format. They
> >should compare XML with $binary_format and gzipped XML with gzipped
> >$binary_format. gzipped $binary_format will, in general, be the smallest
> > of them all, and yet faster to read/write than gzipped XML.

> It doesn't necessarily (or even generally) work that way - compact
> binary formats don't generally compress down as well as text, so you end
> up with size(text) > size(binary) > size(compressed-binary) >
> size(compressed-text).

I've always found that compressed binary is smaller than compressed text, as 
Tahir found. That makes sense logically too; both the binary and text formats 
contain the same CDATA, but the binary format has more compact representations 
of the elements and so on.

Of course, one could design binary formats which compress badly, but I've 
never found that they do by default.

You get a better compression *ratio* with text formats since there's more 
redundant information to remove, but that doesn't improve the compression of 
the non-redundant data.

> That seemed to be the case with my XMLS format
> (http://www.sosnoski.com/opensrc/xmls - still on hold, though I hope to
> get back to it soon). One of the oddities of how compression works...

Hmm. gzip works roughly like this:

1) Use a sliding window algorithm to find repeated substrings and replace 
them with backreferences, expressed as the number of bytes backwards to copy 
from and how many bytes to copy. The output of this is a mixture of literal 
bytes and backreference tokens.

2) Take the stream of literal bytes and represent them by variable-bit-length 
codes. The exact algorithm is a bit complex, but basically, very common 
characters like spaces get codes a couple of bits long, while uncommon 
characters like * might get codes more than a byte long. Note also that the 
encoding is adaptive - if there are a lot of Xs towards the beginning of the 
file but a lot of Ys at the end then Ys will get given shorter and shorter 
codes as you delve into the Y-zone.

3) Do the same variable-bit-length trick on the backreference distances and 
lengths.

4) Merge the two streams, using a few tricks to separate backreferences from 
literal data which I can't remember so won't detail here.
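
To make step 1 concrete, here's a toy Python sketch of the sliding-window 
idea - nothing like the real deflate implementation (which uses hash chains, 
a 32 kB window and a cap of 258 bytes per match), but it shows where the 
backreference tokens come from:

def lz77_tokens(data, window=32768, min_match=3):
    # Emit a mix of literal bytes and (distance, length) backreferences,
    # as in deflate's first stage.  Naive O(n^2) search for clarity.
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))  # copy best_len bytes from best_dist back
            i += best_len
        else:
            tokens.append(data[i])                # literal byte
            i += 1
    return tokens

Feed it something tag-heavy like b"<int>123</int>\n<int>456</int>\n" and the 
second <int> and </int> come back as short (distance, length) pairs rather 
than literal bytes.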

So what do we gain here?

1) If a string is used and then used again later, the second occurrence can be 
replaced with a backreference of a couple of bytes (including the effect of 
variable-bit-length encoding on the backreference, which in a tag-rich XML 
document will soon give short codes to the average distance between <X> and 
</X>).

2) If certain bytes are a lot more common than others then those bytes will 
be represented by codes a few bits long - in XML, < > ' " = and space might 
get represented in three bits each or so.

3) That sliding window used to look up repeated strings is only so big; from 
memory, I don't think it can be more than 32 kB in gzip.
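
As a rough sanity check of (1) and (2) - using Python's zlib, which is the 
same deflate as gzip minus the file header - a run of repeated tag-heavy 
lines compresses to nowhere near a multiple of the single line:

import zlib
one  = b"   <int>1234</int>\n"
many = one * 1000
print(len(zlib.compress(one, 9)), len(zlib.compress(many, 9)))
# after the first copy, each repeat costs only a few bits of
# backreference, so the second number is nowhere near 1000x the first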

How does this compare between binary and text compression, then?

Consider:

<invocation>
 <methodName>logTemperatureReadings</methodName>
 <params>
   <int>123</int>
   <int>456</int>
   <int>34234</int>
   <int>2345</int>
   <int>7644</int>
   <int>2342</int>
   <int>63567</int>
   <int>234</int>
   <int>23435..</int>
   ...
 </params>
</invocation>

The adaptive variable-bit-length encoder starts off with an initial 
dictionary suited roughly for English text, IIRC, so <invocation> will 
probably be encoded in around 4-5 bits per character. Likewise with 
<methodName>logTemperatureReadings</, but then it replaces "methodName>" with 
a backreference of a few bytes. <params> comes out as 4-5 bits per char too.

Then it gets interesting. We have a repeated sequence of "   <int>", each of 
which will probably soon cost less than a byte of backreference, since the 
same backreference distances keep coming up, varying only with the length of 
the integer. Likewise with "</int>\n". So we end up with, say, two bytes of 
overhead per integer param. Then the numbers - the only literal bytes 
occurring during the bulk of the file - will come out at about 4 bits each, so 
an integer averaging 4 digits costs about another two bytes, for a total of 
around four bytes per parameter in the compressed stream.

The </params>\n</invocation> bit may, if the file is less than 32k long, 
refer back to "params>" and "invocation>" from before.
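
If you want to check that back-of-the-envelope figure, a quick Python 
experiment along these lines will do (the random 4-digit numbers are just 
standing in for the temperature readings; exact results will vary with the 
data):

import gzip, random
random.seed(1)
nums = [random.randint(1000, 9999) for _ in range(1000)]
xml = ("<invocation>\n <methodName>logTemperatureReadings</methodName>\n <params>\n"
       + "".join("   <int>%d</int>\n" % n for n in nums)
       + " </params>\n</invocation>\n").encode("ascii")
z = gzip.compress(xml)
print(len(xml), len(z), len(z) / len(nums))  # raw size, gzipped size, gzipped bytes per integer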

A tight binary encoding of this, as generated by PER, would store the 
parameter list as a variable-length number giving the number of items to 
expect, then a variable-length number per integer. The VLNs for four-digit 
numbers will tend to use two bytes each, with reasonable entropy in the bottom 
byte - meaning that all 256 values are about equally likely - and the top byte 
typically covering only a restricted range. The full range of two bytes (after 
losing two bits to the variable-length encoding) would be 0..16383, of which 
the numbers above use only a subset, so the distribution of bytes in that 
block of numbers is slightly shy of uniform, maybe allowing the compressor to 
squeeze each number into 14 or 15 bits if it's lucky.
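
For comparison with the XML experiment above, here's the same list of numbers 
in a crude binary form - not real PER VLNs, just a two-byte count followed by 
fixed two-byte big-endian integers, which is close enough in size for this 
argument - run through gzip:

import gzip, random, struct
random.seed(1)
nums = [random.randint(1000, 9999) for _ in range(1000)]
binary = struct.pack(">H", len(nums)) + b"".join(struct.pack(">H", n) for n in nums)
z = gzip.compress(binary)
print(len(binary), len(z), len(z) / len(nums))  # raw size, gzipped size, gzipped bytes per integer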

Now, that's about half the size of the gzipped XML. In this case, the 
"   <int>" and "</int>" overheads are mostly squashed away by the 
sliding-window compressor, which has very little to work with in the binary 
case unless there are redundant repeated sequences in the actual data. Also, 
in the text case the restriction of the numbers to the character set 0-9 means 
the encoder can fit each digit into three or four bits, but in the binary case 
the encoder gets to work directly against the frequency distribution of the 
underlying data.

The moral of this story is that binary formats tend to insert less of the 
redundant information and inefficient encoding that gets between the 
compressor and the actual information being compressed. If our temperature 
data happens to have an uneven frequency distribution - mainly small integers 
with a few going up above 5,000 - the gzip algorithm will not be able to 
deduce this from the decimal encodings of those numbers; all it sees are 
strings of digits. But in binary form, a skewed distribution of integers is 
visible to the encoder as a skewed distribution of input bytes, which it can 
capitalise upon.
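
You can see that effect with a skewed data set along these lines (again just a 
sketch, with made-up readings; the interesting part is that in the binary form 
the skew shows up directly as a lopsided byte distribution for the 
variable-bit-length stage to exploit):

import gzip, random, struct
random.seed(2)
# mostly small readings, with the occasional spike above 5000
nums = [random.randint(0, 200) if random.random() < 0.95 else random.randint(5000, 9999)
        for _ in range(1000)]
text   = "\n".join(str(n) for n in nums).encode("ascii")
binary = b"".join(struct.pack(">H", n) for n in nums)
print(len(gzip.compress(text)), len(gzip.compress(binary)))
# in the binary form the high byte is nearly always 0, which the
# variable-bit-length stage codes down to about a bit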

> David Mertz has done some research in this area - see his article at
> http://www-106.ibm.com/developerworks/library/x-matters13.html Also see

Ok, he brings up bzip - the bzip algorithm is somewhat weird, but it is 
better at compressing repeated strings like element and attribute names than 
gzip.

He doesn't seem to know that zip and gzip use the same underlying algorithm, 
deflate, which I've roughly summarised above. Oh well. The "improvement" in 
gzip's compression is just because it has less header information, really.
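
It's easy to convince yourself of that: Python's zlib and gzip modules produce 
the same deflate stream inside different wrappers (zlib's is 6 bytes, gzip's 
about 18; a zip archive adds per-file directory entries on top of the same 
compressed data):

import gzip, zlib
data = b"<line>To be, or not to be, that is the question</line>\n" * 500
print(len(zlib.compress(data, 9)), len(gzip.compress(data, 9)))
# same compressed payload, so the two lengths differ only by the
# dozen or so bytes of extra gzip header and trailer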

In all of his examples the compressed 'raw' files are smaller than the 
compressed xml files.

His structured XML approach doesn't gain much on hamlet, no, since hamlet's 
mainly content text anyway. No surprises there.

He's still always encoding variations on the attributes and elements, though; 
in that Apache weblog, the bytes-transferred field will still be written in 
decimal even in the struct form, not as a binary number, so the compressor is 
still being crippled!

> James Cheney's paper "Compressing XML with Multiplexed Hierarchical PPM
> Models" at http://www.cs.cornell.edu/People/jcheney/xmlppm/paper/paper.html

*reads*

Still just working around the encodings of close and start elements and all 
that, though.

> Try a range of documents and see how the compression works out before
> making any claims.

I did; see your quoted references for more that support my statement :-)

Quoting the IBM one:

1764816  weblog.struct
  59955  weblog.struct.bz2
  66994  weblog.xml.bz2
  56156  weblog.txt.bz2
  76183  weblog.struct.gz
 115524  weblog.xml.gz
  59409  weblog.xmi.bz2
  74439  weblog.xmi.gz

The .xml.bz2 is bigger than the .struct.bz2, showing that the binaryish 
format compressed smaller than the XML one.

Try a range of documents and see how the compression works out before making 
any claims :-)

> For compression of XML text bzip2 looks like the best
> choice from what I've seen, so that should probably be the basis for
> comparison.

Yes. It's interesting to see how it varies between different types of XML 
document, too: text-rich (hamlet), tag-rich (the weblog, kind of - it has the 
same tag structure repeated over and over again; I'm thinking more of things 
like XSLT here) and data-rich (the weblog). IIRC my comparisons involved some 
large DocBook sources, some XSLT, and a data set that was inherently tabular 
in nature, like the weblog example.

ABS

-- 
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen




 
