   RE: [xml-dev] XML Binary and Compression

  • To: "[public XML-DEV]" <xml-dev@lists.xml.org>
  • Subject: RE: [xml-dev] XML Binary and Compression
  • From: "Alessandro Triglia" <sandro@mclink.it>
  • Date: Thu, 13 Mar 2003 11:15:10 -0500
  • Importance: Normal

> -----Original Message-----
> From: winkowski@mitre.org [mailto:winkowski@mitre.org]
> Sent: Thursday, March 13, 2003 09:18
> To: elharo@metalab.unc.edu; xml-dev@lists.xml.org
> Cc: winkowski@mitre.org; msc@mitre.org
> Subject: RE: [xml-dev] XML Binary and Compression
> Hmm, I'm sorry you don't think schema-based encoding is fair.
> I find it odd that you regard schema-based (encoding) 
> compression as lossy. This term is normally associated with a 
> permanent loss of information. Neither ASN.1 nor MPEG-7 results 
> in the loss of XML content (the original content did not of 
> course contain the XML schema). The deployment of the schema 
> upon which encoding/decoding is based is a management issue. 
> There is no need to transmit it as part of the encoded content.

I generally agree.  I think the point is which portions or features of
an XML document are regarded as essential in a given context (conveying
useful information to the intended consumer) and which are not.

If I have a file that happens to contain an XML document and
compress/decompress it with an LZ algorithm, I will get back a file that
is identical to the original file, byte-by-byte.  Is this always a
requirement?  It may be a requirement *sometimes*, but I doubt it is in
most cases.
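
To make the byte-identity point concrete, here is a minimal Python
sketch (mine, not part of the original post) using the standard zlib
module, an LZ77-family codec:

```python
import zlib

# A generic LZ codec is byte-for-byte lossless: decompression returns
# exactly the input, including every syntactic quirk of the XML.
original = b'<doc   attr="1"><empty></empty></doc>'
restored = zlib.decompress(zlib.compress(original))
assert restored == original  # exact byte identity
```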

If I devise an XML-aware compression algorithm, some "loss" of literal
contents will usually be acceptable:  kind and length of whitespace
inside tags; kind of quotes around attribute values; use of numeric
character references; use of empty-element tags vs. start-tag/end-tag
pairs with nothing in between; and so on.  This list can grow
considerably.  By exploiting its knowledge of XML *and* by regarding
such syntactic "information" as unessential, an XML-aware compression
algorithm can achieve better performance than a generic compression
algorithm.  Strictly speaking, such an algorithm would be "lossy", but
the loss would only affect "information" that is being considered
unessential by the parties that agree on using the algorithm.
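
A minimal Python sketch of the idea (illustrative only; a real
XML-aware codec would be far more sophisticated): parsing and
re-serializing with the standard xml.etree module discards exactly the
kinds of syntactic variation listed above, so two such documents
compress to identical bytes.

```python
import zlib
import xml.etree.ElementTree as ET

def normalize_then_compress(xml_bytes):
    # Re-serializing "loses" whitespace inside tags, quote style,
    # empty-element-tag form, etc., before generic compression.
    canonical = ET.tostring(ET.fromstring(xml_bytes))
    return zlib.compress(canonical)

# Two documents that differ only in unessential syntax...
a = b'<doc  a="1"><e></e></doc>'
b = b"<doc a='1'><e/></doc>"

# ...have identical compressed forms: the syntactic variation is gone.
assert normalize_then_compress(a) == normalize_then_compress(b)
```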

At some point, the list of "unessential" features will have grown so
much that the XML syntax does not matter anymore and we can switch to
referring to the infoset.  But even then, there are many properties in
an infoset that many applications would regard as "unessential" - such
as the way namespace prefixes are used; the precise set of in-scope
namespace declarations; whether each attribute was defaulted from the
DTD; and so on.  An infoset-based compression algorithm could choose not
to represent such information, if the parties are happy.
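
The namespace-prefix example can be seen directly in Python's
xml.etree, which resolves prefixes into {namespace}local names (a
sketch of mine, not from the original post):

```python
import xml.etree.ElementTree as ET

# Different prefixes, same infoset-level names: once prefixes are
# resolved, the original prefix choice is unrecoverable -- and harmless.
a = ET.fromstring('<x:doc xmlns:x="urn:ex"><x:item/></x:doc>')
b = ET.fromstring('<y:doc xmlns:y="urn:ex"><y:item/></y:doc>')

assert a.tag == b.tag == '{urn:ex}doc'
assert a[0].tag == b[0].tag == '{urn:ex}item'
```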

Farther along this route is the use of schemas.  Using schemas to
improve the compression rate rests on the assumptions (which are
effectively an agreement between the parties) that:  (1) the schema is
known by the parties;  (2) the XML document is well-formed and valid
according to the schema.  Additional assumptions may be: (3) when
multiple lexical representations of the same datatype value are
available, it does not matter which one was used in the original XML
document; (4) any DTD present in the original document can be discarded
after the document has been processed for well-formedness, including the
expansion of (internal) entity references.
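
Assumption (4) can be illustrated with a small Python sketch (mine,
not the original author's): after parsing, the internal DTD subset has
done its work, since the entity reference is expanded, so an encoder
need not carry the DTD any further.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares an entity; parsing expands it.
doc = b'<!DOCTYPE d [<!ENTITY who "world">]><d>hello &who;</d>'
tree = ET.fromstring(doc)
assert tree.text == 'hello world'  # entity expanded; the DTD can be dropped
```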

As more and more assumptions are made, and as more and more
"information" is considered unessential, and as more and more essential
information (such as typing information) is factored out of the document
(and represented separately in a more efficient manner), the size of a
compressed representation is likely to decrease.  

I would say that the whole issue is about what the parties are willing
to agree on.  Some applications will want the XML document untouched.
Other applications will be happy with an attribute value of, say,
"3.45e-2"  being changed to  "0.0345"  along the way.  The latter will
benefit from much higher compression rates.
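
The "3.45e-2" vs "0.0345" case is easy to check (a one-line Python
illustration, not part of the original message): both lexical forms
denote the same datatype value, so a codec that encodes the value
rather than the characters cannot, and need not, distinguish them.

```python
# Both spellings parse to the same IEEE-754 double, so encoding the
# value (e.g. as 8 binary bytes) discards only the lexical choice.
assert float("3.45e-2") == float("0.0345") == 0.0345
```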

Alessandro Triglia
OSS Nokalva

> - Dan
> > -----Original Message-----
> > From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
> > Sent: Tuesday, March 11, 2003 10:01 AM
> > To: winkowski@mitre.org; msc@mitre.org; xml-dev@lists.xml.org
> > Cc: winkowski@mitre.org; msc@mitre.org
> > Subject: RE: [xml-dev] XML Binary and Compression
> > 
> > 
> > At 11:54 PM -0500 3/10/03, winkowski@mitre.org wrote:
> > 
> > >On reflection, I don't think that the conclusions reached are all
> > >that surprising. Redundancy based compression achieves better
> > >results as the file size, and consequently the amount of redundancy,
> > >increases. CODECS that take advantage of schema knowledge achieve
> > >efficient localized encodings and also need not transmit metadata
> > >since this information can be derived at decoding time.
> > 
> > I may have missed something in your paper then, because I didn't
> > realize you were doing this. If you're assuming that the same schema
> > is available for both compression and decompression, then you're 
> > doing a lossy compression. The compressed forms of your documents 
> > have less information in them than the uncompressed forms. I don't 
> > consider that to be a fair or useful comparison with raw XML with 
> > metadata present.
> > 
> > Then again, maybe that's not what you meant? If you're somehow
> > embedding a schema in the document you transmit, then it's really 
> > just another way of compressing losslessly and that's OK, though I 
> > would still require that the schema used for compression be derived 
> > from the instance documents rather than applied pre facto under the 
> > assumption of document validity. Hmmm, that's not quite right. What I 
> > really mean is that given a certain schema it must be possible to 
> > losslessly encode both valid and invalid documents.
> > -- 
> > 
> > 
> > +-----------------------+------------------------+-------------------+
> > | Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
> > +-----------------------+------------------------+-------------------+
> > |           Processing XML with Java (Addison-Wesley, 2002)          |
> > |              http://www.cafeconleche.org/books/xmljava             |
> > |  http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA   |
> > +----------------------------------+---------------------------------+
> > |  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
> > |  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
> > +----------------------------------+---------------------------------+
> > 
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org 
> <http://www.xml.org>, an initiative of OASIS 

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this list use the subscription
manager: <http://lists.xml.org/ob/adm.pl>

