
[xml-dev] Whitespace handling when compressing XML with WinZip



I have an XML file with most of the data at a tag nesting level of 5-7,
and am compressing it with WinZip 8.

While not touching the markup, I've found the whitespace used
to "step out" the tags has a big impact on the compression ratio
achievable.


Using 4 spaces per nesting level, the file is 6.7 MB and compresses
to 154 KB.

Using 1 tab per nesting level, the file is 4.2 MB and compresses
to 98 KB.

While the change in the input file size is what I'd anticipated, I would
have expected the two zip files to be roughly the same size. Wouldn't
repeated runs of the same character be replaced with a single token,
whether they be spaces or tabs?

Am I missing something obvious? Does anyone understand the
internals of WinZip well enough to explain the discrepancy? The ratio
of the compressed file sizes suggests it's encoding the multiple spaces
with multiple tokens rather than a single token.
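
One way to probe this outside WinZip is a small Python sketch using the
standard zlib module, which implements DEFLATE, the usual ZIP compression
method. The synthetic document, the element names, and the helper names
make_xml and sizes are all made up for illustration, and zlib at level 9 is
only assumed to behave like WinZip 8's compressor, so the numbers show the
direction of the effect rather than reproduce the figures above.

```python
import zlib


def make_xml(indent_unit: str, records: int = 2000, depth: int = 6) -> bytes:
    """Build a synthetic XML document nested `depth` levels deep.

    `indent_unit` is the string repeated once per nesting level,
    e.g. four spaces or a single tab. All element names are hypothetical.
    """
    lines = ["<root>"]
    for i in range(records):
        # Open one element per nesting level above the leaf.
        for level in range(1, depth):
            lines.append(f"{indent_unit * level}<level{level}>")
        # Leaf element carrying slightly varying content.
        lines.append(f"{indent_unit * depth}<value>{i % 97}</value>")
        # Close the elements again, innermost first.
        for level in range(depth - 1, 0, -1):
            lines.append(f"{indent_unit * level}</level{level}>")
    lines.append("</root>")
    return "\n".join(lines).encode("utf-8")


def sizes(data: bytes) -> tuple[int, int]:
    """Return (raw size, DEFLATE-compressed size) in bytes."""
    return len(data), len(zlib.compress(data, 9))


spaces_doc = make_xml("    ")  # 4 spaces per nesting level
tabs_doc = make_xml("\t")      # 1 tab per nesting level

for label, doc in (("4 spaces", spaces_doc), ("1 tab", tabs_doc)):
    raw, packed = sizes(doc)
    print(f"{label:>8}: raw {raw} bytes, compressed {packed} bytes")
```

Comparing the two printed lines shows whether the indentation style still
matters after compression for this particular synthetic input; real data
with different nesting and content may behave differently.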

Thanks
Michael

P.S.

I reran the tests with XMill. It achieved a compressed file size of
60 KB regardless of the change in whitespace. My content data
does have a high degree of redundancy within identically named
elements, which would help it.



