- From: "Bullard, Claude L (Len)" <clbullar@ingr.com>
- To: Joshua Allen <joshuaa@microsoft.com>, 'Mike Sharp' <msharp@lante.com>, xml-dev@xml.org
- Date: Wed, 27 Sep 2000 08:28:04 -0500
Quoting from the page at http://www.research.att.com/sw/tools/xmill/
"XMill is a new tool for compressing XML data efficiently. It is based on a
regrouping strategy that leverages the effect of highly-efficient
compression techniques in compressors such as gzip. XMill groups XML
text strings with respect to their meaning and exploits similarities
between those text strings for compression. Hence, XMill typically
achieves much better compression rates than conventional compressors such as
gzip.
XML files are typically much larger than the same data represented in
some reasonably efficient domain-specific data format. One of the most
intriguing results of XMill is that the conversion of proprietary data
formats into XML will in fact improve the compression - i.e. the
compressed XML file is (up to twice) smaller than the compressed
original file! And this astonishing compression improvement is achieved
at about the same compression speed."
Those are interesting results. Conventional wisdom
is that gzip compression is sufficient for
most text-based formats. If I understand this
page correctly, they say that is almost right, except
that where the data has regular patterns, a
regrouping pass before compression substantially
improves the ratio without loss of speed. Note that in files where the
ratio of plain text nodes to markup is high
(lots of text nodes, less markup), the XMill
strategies are less effective.
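For the curious, the regrouping idea can be sketched in a few lines of
Python (a toy illustration, not XMill itself; the sample document and
container scheme here are my own invention):

```python
import zlib
import xml.etree.ElementTree as ET

# A small but repetitive document: many dates and prices.
doc = "<orders>" + "".join(
    f"<order><date>2000-09-{d:02d}</date><price>{100 + d}.99</price></order>"
    for d in range(1, 25)
) + "</orders>"

# Baseline: compress the document as one undifferentiated byte stream.
baseline = len(zlib.compress(doc.encode()))

# Regrouped: the tag skeleton goes in one container, and the text
# nodes are grouped by element name, so the compressor sees long
# runs of similar strings (all dates together, all prices together).
skeleton = []
containers = {}
for elem in ET.fromstring(doc).iter():
    skeleton.append(elem.tag)
    if elem.text and elem.text.strip():
        containers.setdefault(elem.tag, []).append(elem.text)

regrouped = len(zlib.compress("\n".join(skeleton).encode()))
for texts in containers.values():
    regrouped += len(zlib.compress("\n".join(texts).encode()))

print(f"whole document: {baseline} bytes, regrouped: {regrouped} bytes")
```

On input this small the per-container overhead can eat the gain; the
results quoted above are for real documents, where the grouped
containers compress far better than the interleaved original.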
So, is it the case that this kind of compression
is a big helper where the user analyzes the
file in advance and applies a custom compression
per document type?
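A per-document-type scheme can be illustrated with a toy tokenizer in
the WBXML spirit (not the WBXML wire format; the element names and
byte codes below are invented), where the token map lives outside the
document, along the lines of the external document map Mike mentions
below:

```python
DOC = ("<orders>"
       "<order><date>2000-09-26</date><price>101.99</price></order>"
       "<order><date>2000-09-27</date><price>102.99</price></order>"
       "</orders>")

# External "document map", agreed per vocabulary (hypothetical codes):
# the map itself never travels with the document.
TOKENS = {"orders": 0x05, "order": 0x06, "date": 0x07, "price": 0x08}

def tokenize(xml_text: str) -> str:
    """Replace <name> / </name> with one-byte start/end codes."""
    # Longest names first, so "order" never matches inside "orders".
    for name, tok in sorted(TOKENS.items(), key=lambda kv: -len(kv[0])):
        xml_text = (xml_text
                    .replace(f"<{name}>", chr(tok))
                    .replace(f"</{name}>", chr(tok | 0x80)))
    return xml_text

def detokenize(toks: str) -> str:
    """Reverse the map, restoring a readable document."""
    for name, tok in TOKENS.items():
        toks = (toks
                .replace(chr(tok), f"<{name}>")
                .replace(chr(tok | 0x80), f"</{name}>"))
    return toks

compact = tokenize(DOC)
assert detokenize(compact) == DOC
print(f"{len(DOC)} bytes -> {len(compact)} bytes")
```

The tokenized form stays reversible only as long as text content never
contains the reserved byte values, which is one reason a real scheme
needs an escape mechanism like WBXML's.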
Len Bullard
Intergraph Public Safety
clbullar@ingr.com
http://www.mp3.com/LenBullard
Ekam sat.h, Vipraah bahudhaa vadanti.
Daamyata. Datta. Dayadhvam.h
-----Original Message-----
From: Joshua Allen [mailto:joshuaa@microsoft.com]
Sent: Tuesday, September 26, 2000 8:04 PM
To: 'Mike Sharp'; xml-dev@xml.org
Subject: RE: Binary XML
The format used for binary tokenisation in WAP is WBXML:
http://www.w3.org/TR/wbxml/
Dan Suciu and Hartmut Liefke built a compressor specifically for XML that
uses information about the tags to get better compression than normal
text-oriented compression (such as gzip):
http://www.research.att.com/sw/tools/xmill/
-J
> -----Original Message-----
> From: Mike Sharp [mailto:msharp@lante.com]
> Sent: Tuesday, September 26, 2000 3:40 PM
> To: xml-dev@xml.org
> Subject: Re: Binary XML
>
>
>
>
> A WAP gateway does a binary tokenizing compression bit on the
> original WML, that
> results in astonishing compression. Don't know how that
> applies to your
> comments, but anecdotally, I've seen (and heard about) pretty
> good compression
> simply by using HTTP 1.1 and turning compression on.
> Obviously, this doesn't
> help if the XML transport isn't over HTTP (semaphore, anyone?).
>
> I'd be curious what people think about it--without, as you
> say, involving the
> wire protocol. Is it really necessary to map a specific
> token to a specific
> element (for example)? I suppose that it would allow a user
> to de-tokenize the
> document, returning it to some semblance of readability. But
> this could be done
> in a particular implementation, if needed, by referencing
> some external document
> map, couldn't it?
>
> Of course, the tokenized XML gets tricky if there are
> external schemas, DTD's or
> other XML,..how do you map the elements in the schema to the
> same elements in
> the XML, after they've been tokenized? Or did I miss the point...?
>
> Curiously,
> Mike Sharp
>
> "Bullard, Claude L (Len)" <clbullar@ingr.com> on 09/26/2000
> 01:01:00 PM
>
> To: xml-dev@xml.org
> cc: (bcc: Mike Sharp/Lante)
>
> Subject: Binary XML
>
>
>
> Raising an old horse, possibly dead:
>
> Has a standard XML binary token set,
> possibly based on the InfoSet to
> enable application to different
> XML vocabularies been created?
>
> Or is the thinking still that
> this side of the wireless protocols,
> zipping/unzipping is still sufficient
> given modem support?
>
> Len Bullard
> Intergraph Public Safety
> clbullar@ingr.com
> http://www.mp3.com/LenBullard
>
> Ekam sat.h, Vipraah bahudhaa vadanti.
> Daamyata. Datta. Dayadhvam.h