At 9:50 am -0500 11/3/03, Elliotte Rusty Harold wrote:
>At 11:54 PM -0500 3/10/03, firstname.lastname@example.org wrote:
>>The conclusions drawn were explicitly caveated by the fact that only one
>>document type was tested. The binary message set was generated based on a
>>standard which was designed for narrow bandwidth transmission. I agree that
>>this paper would not have passed a peer reviewed journal as the dataset is
>>not generally available and your skepticism is justified. However, I don't
>>know what you mean by the statement "we can't tell whether the data set used
>>to produce these results is similar to the sorts of XML data we're working
>>with or not" since XML documents in the wild exhibit a wide range of
>>characteristics (flat, deep, structured, unstructured).
>It's not that complex. I have my documents that I'm interested in. You have yours. Walter Perry has his. Robin Berjon has his. They are similar in some respects and dissimilar in others. Your results may be applicable to my needs (or Walter's, or Robin's, or other people's) or they may not, depending on how closely the formats you're measuring map to the documents we use. However, since we can't look at your documents, there's no way for us to tell. We simply don't know whether your results are meaningful in our environment or not.
You could try reproducing the experiment with your own data :-) That's
what I did. I can't release my test data either, because it contains
personal information and financial information that cannot easily be sanitised
without destroying the validity of the results (I'm just interested in the
maximum compression available from readily available tools for bulk,
structurally repetitive data in a real high volume application).
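For anyone who wants to try the same measurement on their own data, here is a minimal sketch in Python. It uses the standard library's gzip module at level 9, which should land close to "gzip -9" on the command line but is not guaranteed to be byte-for-byte identical. The function name and the sample records are hypothetical stand-ins, not my test data; substitute the bytes of your own file.

```python
import gzip


def gzip_percent_of_original(data: bytes, level: int = 9) -> float:
    """Return the compressed size as a percentage of the original size."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    compressed = gzip.compress(data, compresslevel=level)
    return 100.0 * len(compressed) / len(data)


# Hypothetical stand-in for structurally repetitive XML; replace with
# the contents of a real file, e.g. open(path, "rb").read().
record = (
    b"<payment><name>Jane Doe</name><dob>1970-01-01</dob>"
    b"<amount>123.45</amount></payment>\n"
)
sample = b"<payments>\n" + record * 1000 + b"</payments>\n"

print(f"{gzip_percent_of_original(sample):.2f}% of original size")
```

Identical repeated records like the sample above compress far better than real data with varying values, so only the figure for your own file means anything.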
For the record, I used a 1.3MB file of structurally repetitive, but otherwise
variable "real world" XML data. Each repetition occupies roughly 1.1KB and
is moderately structured (elements nested maybe three to four deep in places)
with tag names chosen to be readable rather than terse. The bulk of the data
is monetary values (expressed to two decimal places) and personal id info
(names, DoBs, id numbers).
Gzip -9 (i.e. best compression) reduces the dataset to 5.3% of its original
size. Xmill -9 reduces the dataset to 3.48% of its original size. The ability
to get roughly 50% more data into a given bandwidth is not to be sneezed at,
especially given an initial starting point of a near 20-fold reduction in
size.
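
The "near 20-fold" and "roughly 50% more" figures follow directly from the two percentages quoted above. A short Python sketch of the arithmetic (the function names are just illustrative):

```python
def fold_reduction(percent_of_original: float) -> float:
    """Convert 'compressed to X% of original' into an N-fold reduction."""
    return 100.0 / percent_of_original


def extra_capacity_percent(baseline_percent: float, improved_percent: float) -> float:
    """How much more data fits in the same bandwidth with the better compressor."""
    return (baseline_percent / improved_percent - 1.0) * 100.0


gzip_pct = 5.3    # gzip -9 result quoted above
xmill_pct = 3.48  # Xmill -9 result quoted above

print(round(fold_reduction(gzip_pct), 1))                  # 18.9
print(round(fold_reduction(xmill_pct), 1))                 # 28.7
print(round(extra_capacity_percent(gzip_pct, xmill_pct)))  # 52
```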
I know this probably doesn't help you with your own data any more than
the original paper did, but I trust someone may find this small endorsement
of the Xmill approach useful...
Andy Greener Mob: +44 7836 331933
GID Ltd, Reading, UK Tel: +44 118 956 1248
email@example.com Fax: +44 118 958 9005