OASIS Mailing List Archives

Re: almost four years ago....



At 4:09 PM +0100 6/16/01, Alaric Snell wrote:
>This is easy to do. GZIP is massively crippled by having no information about
>the structure of the file - it's just a string of bytes that it has to make
>some assumptions about the probable structure of with regards to frequency
>distributions that won't even apply very well to XML; it's trivial to write
>something that compresses better, especially if you use gzip for what it's
>best at (the CDATA) and handle the <> bits yourself.
>

I've heard that one before too. In practice, it isn't nearly as easy 
as people think it is. After a great deal of effort, you may be 
able to shrink 1% or 2% more on some files. However, most people who 
try this end up producing something that is noticeably larger than 
gzip.
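
For anyone who wants to try the split-stream idea on their own files before 
investing real time in it, here is a minimal sketch in Python. It is one 
naive reading of "handle the <> bits yourself", not anybody's actual tool: 
the names split_streams, pack, and unpack are made up for illustration, the 
tag regex is deliberately crude, and the NUL separator is an assumption that 
leans on XML not permitting NUL characters. zlib uses the same DEFLATE 
algorithm as gzip.

```python
import re
import zlib

# Crude tokenizer: anything from "<" to the next ">" counts as markup.
TAG = re.compile(r"(<[^>]*>)")


def split_streams(xml):
    """Split a document into (texts, tags); texts has one more item than tags."""
    parts = TAG.split(xml)
    return parts[0::2], parts[1::2]


def pack(xml):
    """Compress the markup and the character data as two separate streams."""
    texts, tags = split_streams(xml)
    packed_tags = zlib.compress("\0".join(tags).encode("utf-8"), 9)
    packed_texts = zlib.compress("\0".join(texts).encode("utf-8"), 9)
    return packed_tags, packed_texts


def unpack(packed_tags, packed_texts):
    """Rebuild the original document by interleaving the two streams."""
    tags_raw = zlib.decompress(packed_tags).decode("utf-8")
    tags = tags_raw.split("\0") if tags_raw else []
    texts = zlib.decompress(packed_texts).decode("utf-8").split("\0")
    out = [texts[0]]
    for tag, text in zip(tags, texts[1:]):
        out.append(tag)
        out.append(text)
    return "".join(out)


def compare(xml):
    """Return (whole-document size, split-stream size) in compressed bytes."""
    whole = len(zlib.compress(xml.encode("utf-8"), 9))
    split = sum(len(stream) for stream in pack(xml))
    return whole, split
```

Run compare() on a few real documents and see which number comes out 
smaller; the sketch makes no promise either way.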

Of course you could use a better general purpose compression 
algorithm. bzip can grab you 5% or so a lot of the time, though it 
isn't as widely supported. Frankly, if you can't provide at least a 
10% improvement then it's not worth my time to worry about.
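
The gzip versus bzip comparison is easy to check on your own data. A minimal 
sketch, assuming Python's standard gzip and bz2 modules (bz2 implements 
bzip2); compare_sizes and percent_smaller are hypothetical helper names, and 
the percentage you see depends entirely on the file you feed it.

```python
import bz2
import gzip
import sys


def compare_sizes(data):
    """Return the raw, gzip, and bzip2 sizes of data, in bytes."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
    }


def percent_smaller(baseline, candidate):
    """How much smaller candidate is than baseline, as a percentage."""
    return 100.0 * (baseline - candidate) / baseline


# Usage: python compare.py some-file.xml
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        sizes = compare_sizes(f.read())
    print(sizes)
    gain = percent_smaller(sizes["gzip"], sizes["bzip2"])
    print("bzip2 vs gzip: %.1f%% smaller" % gain)
```

A negative percentage means bzip2 came out larger than gzip for that file.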

Better than 10% smaller, I don't think you can do without a lossy 
algorithm. You simply run into the limits of information theory.

>>  3. Human legible/human editable data doesn't matter.
>
>Indeed, we must never use image files, filesystems, or gzip - they'll never
>take off :-)
>

This is a canard. Nobody uses XML for this stuff anyway.

>>  All three beliefs have been empirically proven false time and time
>>  again.
>
>Chuckle!
>

Hey, don't let me stop you from trying! I could be wrong, in which 
case we can all benefit from your efforts. But I think that if you're 
really smart and try really hard and devote months of your life to 
this problem, you aren't even going to get a 10% improvement over 
gzip. (You might not get any improvement at all.) And even if you do 
get that 10% improvement, I suspect you'll discover your system is 
so inconvenient compared to plain or gzipped XML that nobody will use 
it. But after all, it's your life. If you've got the time to spend on 
this, feel free to try. I'm just afraid you'll get the same results 
as the last two dozen people who tried this.
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|                  The XML Bible (IDG Books, 1999)                   |
|              http://metalab.unc.edu/xml/books/bible/               |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/     |
+----------------------------------+---------------------------------+