- From: Joshua Allen <joshuaa@microsoft.com>
- To: "'Bullard, Claude L (Len)'" <clbullar@ingr.com>,'Mike Sharp' <msharp@lante.com>, xml-dev@xml.org
- Date: Wed, 27 Sep 2000 11:29:20 -0700
Actually, there should be no need for the user to manually create a
compression type. The source code for the compressor is available; you will
notice that it simply scans the XML and uses some knowledge derived from
that analysis to "prep" the document for passing through zlib (the freely
available linkable library implementing gzip's deflate compression). So in a sense you could
consider this "gzip on steroids". (Of course, you would have to convert the
proprietary data type to XML to give XMill something to work with, and would
thereby be embedding some hints about the structure of the data, so you are
right about user involvement.)
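
As a rough illustration of the "prep, then zlib" idea described above, here is a minimal Python sketch. It is not XMill's actual algorithm or file format: the function names (regroup, regrouped_size, plain_size) are hypothetical, attributes and tail text are ignored, and the output cannot be turned back into the original document. It only shows text values being grouped by their enclosing tag before the whole thing is handed to zlib.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def regroup(xml_text):
    """Split a document into a structure stream plus per-tag text containers."""
    structure = []                   # tag open/close tokens, in document order
    containers = defaultdict(list)   # tag name -> text values found under that tag

    def walk(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
            structure.append("#")    # placeholder: "next value from this tag's container"
        for child in elem:
            walk(child)
        structure.append(">")

    walk(ET.fromstring(xml_text))
    return structure, dict(containers)


def regrouped_size(xml_text):
    """Compressed size when similar values are stored next to each other."""
    structure, containers = regroup(xml_text)
    streams = [" ".join(structure)]
    streams += ["\n".join(containers[tag]) for tag in sorted(containers)]
    return len(zlib.compress("\x00".join(streams).encode("utf-8"), 9))


def plain_size(xml_text):
    """Compressed size of the document text as-is."""
    return len(zlib.compress(xml_text.encode("utf-8"), 9))


if __name__ == "__main__":
    # Made-up, highly regular sample data; real documents will behave differently.
    rows = "".join(
        "<row><id>%d</id><city>Town%d</city></row>" % (i, i % 7) for i in range(500)
    )
    doc = "<table>" + rows + "</table>"
    print("plain zlib:", plain_size(doc), "bytes")
    print("regrouped :", regrouped_size(doc), "bytes")
```

Running the script prints both sizes for the sample document, so the effect of regrouping can be checked against your own data rather than taken on faith.
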
I have no data about what using something like this at a transport layer would
do. I would, however, point to what Bruce Martin at IBM did with XBeans,
finding that straight text transmission over the network was significantly
faster than RMI or IIOP when sending XML. Text-based XML was fast
presumably because of gzip-based compression on the wire. Since XMill
essentially uses gzip for the heavy lifting, you could expect the processing
burden to be just about the same, but with better compression. I personally
would have liked to see some COM and Java wrappers for XMill freely
available (not just transmission; think about in-memory caching for
expensive-to-create but infrequently used XML), but I gave up after getting
bogged down in the spaghetti. I think researchers write such sloppy code as
a way to give the rest of us something to feel good about.
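
The in-memory caching idea can be sketched without any XMill wrapper at all. The Python example below uses plain zlib as a stand-in compressor, and the class name CompressedXmlCache and its methods are hypothetical. It keeps each document compressed in memory and only inflates it when it is asked for.

```python
import zlib


class CompressedXmlCache:
    """Keeps XML documents compressed in memory; decompresses on lookup."""

    def __init__(self):
        self._store = {}  # key -> zlib-compressed UTF-8 bytes

    def put(self, key, xml_text):
        self._store[key] = zlib.compress(xml_text.encode("utf-8"))

    def get(self, key):
        blob = self._store.get(key)
        if blob is None:
            return None  # not cached
        return zlib.decompress(blob).decode("utf-8")

    def stored_bytes(self):
        """Total size of the compressed payloads currently held."""
        return sum(len(blob) for blob in self._store.values())


if __name__ == "__main__":
    cache = CompressedXmlCache()
    doc = "<a>" + "<b>x</b>" * 1000 + "</a>"
    cache.put("report", doc)
    print(len(doc), "characters held in", cache.stored_bytes(), "bytes")
    assert cache.get("report") == doc
```

The trade-off is CPU time on every lookup in exchange for a smaller footprint, which suits documents that are expensive to create but rarely read.
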
> -----Original Message-----
> From: Bullard, Claude L (Len) [mailto:clbullar@ingr.com]
> Sent: Wednesday, September 27, 2000 6:28 AM
> To: Joshua Allen; 'Mike Sharp'; xml-dev@xml.org
> Subject: RE: Binary XML
>
>
> Quoting from the page at http://www.research.att.com/sw/tools/xmill/
>
>
> "XMill is a new tool for compressing XML data efficiently. It is based
> on a regrouping strategy that leverages the effect of highly-efficient
> compression techniques in compressors such as gzip. XMill groups XML
> text strings with respect to their meaning and exploits similarities
> between those text strings for compression. Hence, XMill typically
> achieves much better compression rates than conventional compressors
> such as gzip.
>
> XML files are typically much larger than the same data represented in
> some reasonably efficient domain-specific data format. One of the most
> intriguing results of XMill is that the conversion of proprietary data
> formats into XML will in fact improve the compression - i.e. the
> compressed XML file is (up to twice) smaller than the compressed
> original file! And this astonishing compression improvement is
> achieved at about the same compression speed."
>
>
> Those are interesting results. Conventional wisdom
> is that the compression of GZIP is sufficient for
> most text based formats. If I understand this
> page, they say that is almost right except that
> where there are regular patterns, one can add a
> compression based on regrouping that substantially
> improves that without loss of speed. Note that in files where the
> ratio of plain text nodes to markup is high
> (lots of text nodes, less markup), the XMill
> strategies are less effective.
>
> So, is it the case that this kind of compression
> is a big helper where the user analyses the
> file in advance and applies a custom compression
> per document type?
>
>
> Len Bullard
> Intergraph Public Safety
> clbullar@ingr.com
> http://www.mp3.com/LenBullard
>
> Ekam sat.h, Vipraah bahudhaa vadanti.
> Daamyata. Datta. Dayadhvam.h
>
>
> -----Original Message-----
> From: Joshua Allen [mailto:joshuaa@microsoft.com]
> Sent: Tuesday, September 26, 2000 8:04 PM
> To: 'Mike Sharp'; xml-dev@xml.org
> Subject: RE: Binary XML
>
>
> The format used for binary tokenisation in WAP is WBXML:
> http://www.w3.org/TR/wbxml/
>
> Dan Suciu and Hartmut Liefke built a compressor specifically for XML
> that uses information about the tags to get better compression than
> normal text-oriented compression (such as gzip):
> http://www.research.att.com/sw/tools/xmill/
>
> -J
>
>
> > -----Original Message-----
> > From: Mike Sharp [mailto:msharp@lante.com]
> > Sent: Tuesday, September 26, 2000 3:40 PM
> > To: xml-dev@xml.org
> > Subject: Re: Binary XML
> >
> >
> >
> >
> > A WAP gateway does a binary tokenizing compression bit on the
> > original WML, that results in astonishing compression. Don't know
> > how that applies to your comments, but anecdotally, I've seen (and
> > heard about) pretty good compression simply by using HTTP 1.1 and
> > turning compression on. Obviously, this doesn't help if the XML
> > transport isn't over HTTP (semaphore, anyone?).
> >
> > I'd be curious what people think about it--without, as you say,
> > involving the wire protocol. Is it really necessary to map a
> > specific token to a specific element (for example)? I suppose that
> > it would allow a user to de-tokenize the document, returning it to
> > some semblance of readability. But this could be done in a
> > particular implementation, if needed, by referencing some external
> > document map, couldn't it?
> >
> > Of course, the tokenized XML gets tricky if there are external
> > schemas, DTDs or other XML... how do you map the elements in the
> > schema to the same elements in the XML, after they've been
> > tokenized? Or did I miss the point...?
> >
> > Curiously,
> > Mike Sharp
> >
> >
> >
> >
> >
> >
> >
> >
> > "Bullard, Claude L (Len)" <clbullar@ingr.com> on 09/26/2000 01:01:00 PM
> >
> > To: xml-dev@xml.org
> > cc: (bcc: Mike Sharp/Lante)
> >
> > Subject: Binary XML
> >
> >
> >
> > Raising an old horse, possibly dead:
> >
> > Has a standard XML binary token set,
> > possibly based on the InfoSet to
> > enable application to different
> > XML vocabularies been created?
> >
> > Or is the thinking still that
> > this side of the wireless protocols,
> > zipping/unzipping is still sufficient
> > given modem support?
> >
> > Len Bullard
> > Intergraph Public Safety
> > clbullar@ingr.com
> > http://www.mp3.com/LenBullard
> >
> > Ekam sat.h, Vipraah bahudhaa vadanti.
> > Daamyata. Datta. Dayadhvam.h
> >
> >
> >
> >
> >
>