Adding this complete reply to the list for the record. - Dan
-----Original Message-----
From: Winkowski, Daniel
Sent: Monday, March 10, 2003 11:55 PM
To: Elliotte Rusty Harold; msc@mitre.org; xml-dev@lists.xml.org
Cc: winkowski@mitre.org; msc@mitre.org
Subject: RE: [xml-dev] XML Binary and Compression
Rusty,
The conclusions drawn were explicitly caveated by the fact that only one
document type was tested. The binary message set was generated from a
standard designed for narrow-bandwidth transmission. I agree that this paper
would not have passed review at a peer-reviewed journal, since the dataset is
not generally available, and your skepticism is justified. However, I'm not
sure what you mean by the statement "we can't tell whether the data set used
to produce these results is similar to the sorts of XML data we're working
with or not," since XML documents in the wild exhibit a wide range of
characteristics (flat, deep, structured, unstructured).
The military has been building binary messages optimized for size efficiency
for decades. Our group has spent the past several years expressing a variety
of messages, some based on binary specifications and some on delimited ASCII,
in XML. In every case the XML versions of these messages are larger than the
original binary or ASCII. Is this surprising? I don't think so: metadata is
not transmitted in either binary or delimited ASCII formats. You state that,
in your experience, it is decidedly untrue that binary files are smaller than
the equivalent XML. Quite frankly, this surprises me; our own experience is
just the opposite.
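To make the size difference concrete, here is a minimal Python sketch. The
field names and layout are invented for illustration (they are not taken from
any actual message standard), but the effect is the same: a fixed-layout
binary record carries no metadata on the wire, while the XML equivalent
spells out every field name twice.

import struct

# Fixed-layout binary record: two doubles and an unsigned int. The
# receiver knows this layout from the message specification, so no
# field names or delimiters travel on the wire.
binary = struct.pack(">ddI", 36.879941, 245.041988, 106652)

# The same three values wrapped in descriptive XML tags.
xml = (b"<track>"
       b"<latitude>36.879941</latitude>"
       b"<longitude>245.041988</longitude>"
       b"<altitude>106652</altitude>"
       b"</track>")

print(len(binary), len(xml))  # 20 vs 105 bytes, before any compression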
Finally, regarding your interest in test set A: the same XML element naming
practice was used in both sets. What really distinguishes set A from set B is
that A was created before the completely decoded binary sets were available.
In other words, set A was an approximation of the binary represented in XML;
consequently, it could not be compared against the binary samples. By the
time the complete binary decoding was found to differ from the XML
representation used in set A, the ASN.1 tests had already been conducted and
unfortunately could not be repeated. So set A cannot form the basis for a
binary comparison, but was used instead to compare the various
encoding/compression techniques against one another.
On reflection, I don't think the conclusions reached are all that
surprising. Redundancy-based compression achieves better results as the file
size, and consequently the amount of redundancy, increases. Codecs that take
advantage of schema knowledge achieve efficient localized encodings and also
need not transmit metadata, since this information can be derived at decoding
time. Matching XML documents to the appropriate algorithm can yield
optimizations that rival native binary messages. There is no
one-size-fits-all XML compression/encoding algorithm; optimization
requirements vary (speed, memory, document types, streamed decoding or
navigability, etc.). However, just as gzip is an 80% solution for text, I
hope our study may point to an 80% solution for XML: matching the available
data (XML document, document characteristics, XML schema if any, and user
requirements) to an appropriate algorithm. I urge others to follow up on this
study with their own experiments. All the techniques we used (gzip, ASN.1,
XMill, MPEG-7) are openly available, with the exception of our WBXML-like
(XML Schema aware) algorithm.
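As a quick illustration of the redundancy effect, the following Python sketch
(the record content is invented) gzips the same repeated XML record at
increasing document sizes; the compression ratio improves as the file grows.

import gzip

record = (b"<reading><sensor>ABC</sensor>"
          b"<value>36.879941</value></reading>")

# Compress documents of increasing size built from the same record;
# more repetition means more redundancy for gzip to exploit.
for n in (1, 10, 100, 1000):
    doc = b"<log>" + record * n + b"</log>"
    comp = gzip.compress(doc)
    print(f"{n:5d} records: {len(doc):7d} -> {len(comp):6d} bytes "
          f"({len(comp) / len(doc):.2%})")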
- Dan Winkowski
PS:
FYI, included below is a snippet of an XML instance document with the tags
obfuscated but kept at their original lengths. The point is that the element
names are not abbreviated down to two- or three-letter codes.
<bbb53>
<cccccccccccc54>fe471f81e65b800</cccccccccccc54>
<ddddddddd55>0</ddddddddd55>
<eeeeeee56>0</eeeeeee56>
<ffffffff57>0</ffffffff57>
<gggggggggg58>0</gggggggggg58>
<hhhhhhhh59>701599</hhhhhhhh59>
<iiiiiiiiiiiiiiiii60>36.879941</iiiiiiiiiiiiiiiii60>
<jjjjjjjjjjjjjjjjj61>245.041988</jjjjjjjjjjjjjjjjj61>
<kkkkkkkkkkkkkkkkk62>106652</kkkkkkkkkkkkkkkkk62>
<llllllllll63>0.000000</llllllllll63>
<mmmmmmmmmm64>1800</mmmmmmmmmm64>
<nnnnnnnnnnnnnnnnnnnnnn65>ABC</nnnnnnnnnnnnnnnnnnnnnn65>
<oooooooooo66>357.478638</oooooooooo66>
<ppppppppppp67>0.000000</ppppppppppp67>
<qqqqqqqqqqqqqqqqqqq68>36.669177</qqqqqqqqqqqqqqqqqqq68>
<rrrrrrrrrrrrrrrrrrr69>244.784124</rrrrrrrrrrrrrrrrrrr69>
<sssssssssssssssssssssss70>5.000000</sssssssssssssssssssssss70>
<tttttttttttttttttttttttt71>105</tttttttttttttttttttttttt71>
</bbb53>
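For illustration only, here is a minimal Python sketch of the schema-aware
idea (this is not our WBXML-like codec, and the field names and types are
invented): when both ends share the schema, only the values travel on the
wire, and the tags are regenerated at decoding time.

import struct

# Shared schema: (tag name, struct format) in document order. Both
# sender and receiver hold this; it never travels on the wire.
SCHEMA = [("latitude", ">d"), ("longitude", ">d"), ("altitude", ">I")]

def encode(values):
    # Pack the values positionally; no tag names are transmitted.
    return b"".join(struct.pack(fmt, v)
                    for (_, fmt), v in zip(SCHEMA, values))

def decode(payload):
    # Rebuild the XML from the shared schema plus the packed values.
    parts, offset = [], 0
    for tag, fmt in SCHEMA:
        (value,) = struct.unpack_from(fmt, payload, offset)
        offset += struct.calcsize(fmt)
        parts.append(f"<{tag}>{value}</{tag}>")
    return "".join(parts)

wire = encode([36.879941, 245.041988, 106652])
print(len(wire))      # 20 bytes on the wire
print(decode(wire))   # full tagged XML restored at decode time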
> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
> Sent: Monday, March 10, 2003 10:40 AM
> To: msc@mitre.org; xml-dev@lists.xml.org
> Cc: winkowski@mitre.org; msc@mitre.org
> Subject: RE: [xml-dev] XML Binary and Compression
>
>
> At 9:27 AM -0500 3/10/03, msc@mitre.org wrote:
> >Rusty,
> >
> >The corresponding paper can be found here:
> >
> >http://www.idealliance.org/papers/xml02/dx_xml02/papers/06-02-04/06-02-04.pdf
>
> Thanks. The key point I gather from reading the paper is:
>
> Because of the sensitive nature of the study data, the
> element names used in the sample XML data cannot be
> discussed in this paper. It can be noted, however, that
> the tag names used were unabbreviated, descriptive
> terms.
>
> As mentioned above, the precise structure and content of
> the samples cannot be presented here. However, the
> general structure and data types of the XML documents
> used for the study can be discussed. These are
> illustrated in Figure 1, below. Although the study data
> is not available to the reader, this depiction should
> indicate that the XML sample structure and content is
> sufficiently rich for the study purposes.
>
> In other words the raw data is not available, so it's impossible for
> anybody to independently verify these results. Perhaps more
> importantly, we can't tell whether the data set used to produce these
> results is similar to the sorts of XML data we're working with or
> not. We don't know whether these results would likely be reproducible
> in our own environments.
> --
>
> AND ALSO IN REPLY TO
> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
> Sent: Sunday, March 09, 2003 7:06 AM
> To: xml-dev@lists.xml.org
> Cc: winkowski@mitre.org; msc@mitre.org
> Subject: Re: [xml-dev] XML Binary and Compression
>
>
> >Interesting paper from MITRE
> >
> >
> http://www.idealliance.org/papers/xml02/slides/winkowski/winkowski.pdf
> >
>
> Interesting, but there's really not enough information in the
> PowerPoint slides to fairly judge the work. In particular, I'd really
> want to see the actual data they used. They started with the
> assumption that typical binary files were necessarily smaller than
> the equivalent XML, something that is decidedly untrue in my
> experience.
>
> Test set A was fabricated by the authors, and I suspect they paid a
> lot more attention to making it small than anybody actually does in
> practice. Test set B was "derived directly from binary sample data"
> but they don't seem to ever show you what this binary sample data was
> or what its XML encoding was.
>
> I look forward to a more complete paper that provides sufficient
> information to verify and reproduce the results. Will one be
> published anywhere?
|