
   RE: Binary XML

  • From: "Bullard, Claude L (Len)" <clbullar@ingr.com>
  • To: Joshua Allen <joshuaa@microsoft.com>, 'Mike Sharp' <msharp@lante.com>, xml-dev@xml.org
  • Date: Wed, 27 Sep 2000 08:28:04 -0500

Quoting from the page at http://www.research.att.com/sw/tools/xmill/


"XMill is a new tool for compressing XML data efficiently. It is based on a 
regrouping strategy that leverages the effect of highly-efficient 
compression techniques in compressors such as gzip. XMill groups XML 
text strings with respect to their meaning and exploits similarities 
between those text strings for compression. Hence, XMill typically 
achieves much better compression rates than conventional compressors such as
gzip. 

XML files are typically much larger than the same data represented in 
some reasonably efficient domain-specific data format. One of the most 
intriguing results of XMill is that the conversion of proprietary data 
formats into XML will in fact improve the compression - i.e. the  
compressed XML file is (up to twice) smaller than the compressed 
original file! And this astonishing compression improvement is achieved 
at about the same compression speed." 


Those are interesting results.  Conventional wisdom
is that gzip compression is sufficient for
most text-based formats.  If I understand this
page, they say that is almost right, except that
where there are regular patterns, one can add a
regrouping step before compression that substantially
improves on gzip alone without loss of speed.  Note that
in files where the ratio of plain text nodes to markup
is high (lots of text nodes, less markup), the XMill
strategies are less effective.
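
To make the regrouping idea concrete, here is a minimal Python sketch.
It is not XMill's actual algorithm or file format: grouping text by
element name is only my reading of "groups XML text strings with respect
to their meaning", and the names regroup and compressed_sizes are made up
for illustration.  The sketch ignores attributes and tail text, so it is
lossy, and it uses zlib as a stand-in for gzip.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def regroup(xml_text):
    """Split a document into a structure stream and per-tag text containers.

    Assumption: text is grouped by the name of its enclosing element.
    Attributes and tail text are ignored to keep the sketch short.
    """
    structure = []                  # tag opens, text placeholders, closes
    containers = defaultdict(list)  # element name -> list of text values

    def walk(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
            structure.append("#")   # placeholder: the text lives in a container
        for child in elem:
            walk(child)
        structure.append(">")

    walk(ET.fromstring(xml_text))
    return structure, dict(containers)


def compressed_sizes(xml_text):
    """Return (plain, regrouped) compressed sizes in bytes.

    Which of the two is smaller depends on the input; each container is
    compressed as its own stream, which adds overhead on small documents.
    """
    plain = len(zlib.compress(xml_text.encode("utf-8"), 9))
    structure, containers = regroup(xml_text)
    regrouped = len(zlib.compress("\n".join(structure).encode("utf-8"), 9))
    for tag in sorted(containers):
        blob = "\n".join(containers[tag]).encode("utf-8")
        regrouped += len(zlib.compress(blob, 9))
    return plain, regrouped
```

For a log-like document, regroup puts all the <ip> values in one
container and all the <status> values in another, so similar strings sit
next to each other before the general-purpose compressor sees them.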

So, is it the case that this kind of compression 
is a big helper where the user analyses the 
file in advance and applies a custom compression 
per document type?


Len Bullard
Intergraph Public Safety
clbullar@ingr.com
http://www.mp3.com/LenBullard

Ekam sat.h, Vipraah bahudhaa vadanti.
Daamyata. Datta. Dayadhvam.h


-----Original Message-----
From: Joshua Allen [mailto:joshuaa@microsoft.com]
Sent: Tuesday, September 26, 2000 8:04 PM
To: 'Mike Sharp'; xml-dev@xml.org
Subject: RE: Binary XML


The format used for binary tokenisation in WAP is WBXML:
http://www.w3.org/TR/wbxml/

Dan Suciu and Hartmut Liefke built a compressor specifically for XML that
uses information about the tags to get better compression than normal
text-oriented compression (such as gzip)
http://www.research.att.com/sw/tools/xmill/

-J


> -----Original Message-----
> From: Mike Sharp [mailto:msharp@lante.com]
> Sent: Tuesday, September 26, 2000 3:40 PM
> To: xml-dev@xml.org
> Subject: Re: Binary XML
> 
> 
> 
> 
> A WAP gateway does a binary tokenizing compression bit on the original
> WML, that results in astonishing compression.  Don't know how that
> applies to your comments, but anecdotally, I've seen (and heard about)
> pretty good compression simply by using HTTP 1.1 and turning compression
> on.  Obviously, this doesn't help if the XML transport isn't over HTTP
> (semaphore, anyone?).
>
> I'd be curious what people think about it, without, as you say,
> involving the wire protocol.  Is it really necessary to map a specific
> token to a specific element (for example)?  I suppose that it would
> allow a user to de-tokenize the document, returning it to some semblance
> of readability.  But this could be done in a particular implementation,
> if needed, by referencing some external document map, couldn't it?
>
> Of course, the tokenized XML gets tricky if there are external schemas,
> DTDs or other XML: how do you map the elements in the schema to the same
> elements in the XML, after they've been tokenized?  Or did I miss the
> point...?
> 
> Curiously,
> Mike Sharp
> 
> 
> 
> 
> 
> 
> 
> 
> "Bullard, Claude L (Len)" <clbullar@ingr.com> on 09/26/2000 
> 01:01:00 PM
> 
> To:   xml-dev@xml.org
> cc:    (bcc: Mike Sharp/Lante)
> 
> Subject:  Binary XML
> 
> 
> 
> Raising an old horse, possibly dead:
> 
> Has a standard XML binary token set,
> possibly based on the InfoSet to
> enable application to different
> XML vocabularies been created?
> 
> Or is the thinking still that
> this side of the wireless protocols,
> zipping/unzipping is still sufficient
> given modem support?
> 
> Len Bullard
> Intergraph Public Safety
> clbullar@ingr.com
> http://www.mp3.com/LenBullard
> 
> Ekam sat.h, Vipraah bahudhaa vadanti.
> Daamyata. Datta. Dayadhvam.h
> 
> 
> 
> 
> 




 


Copyright 2001 XML.org. This site is hosted by OASIS