   Re: [xml-dev] Parsing efficiency? - why not 'compile'????


On Thu, 27 Feb 2003 08:53:47 +0000
Alaric Snell wrote:

> On Wednesday 26 February 2003 09:52, Tahir Hashmi wrote:
> > # Tight coupling between schema revisions:
> >
> >   XML is quite resilient to changes in the schema, as long as the
> >   changes are made carefully enough that old documents still pass
> >   validation against the new schema. The more the binary encoding
> >   depends on the schema, the more this flexibility is restricted.
> That's not a problem in practice, I think. Say we have a format that works by 
> storing a dictionary of element and attribute names at the beginning of the 
> document (or distributed through it, whenever the name is first encountered, 
> or whatever) and that stores element and attribute text content as a compact 
> binary representation of the type declared in the schema, including a few 
> bits of type declaration in the header for each value.
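A minimal sketch of the kind of format described above, in Python. The tag bytes, the dictionary layout, and the fixed 4-byte integer encoding are all illustrative assumptions, not any real binary-XML format:

```python
import struct

TAG_INT, TAG_STR = 0x01, 0x02  # hypothetical type tags

def encode(doc):
    """doc: list of (element_name, value) pairs."""
    names = sorted({name for name, _ in doc})
    index = {n: i for i, n in enumerate(names)}
    out = bytearray()
    # Name dictionary at the head: count, then length-prefixed UTF-8 names.
    out.append(len(names))
    for n in names:
        raw = n.encode("utf-8")
        out.append(len(raw))
        out += raw
    # Body: (name index, type tag, compact value) per item.
    for name, value in doc:
        out.append(index[name])
        if isinstance(value, int):  # schema said integer: 4-byte binary
            out.append(TAG_INT)
            out += struct.pack(">i", value)
        else:                       # otherwise a length-prefixed string
            raw = str(value).encode("utf-8")
            out.append(TAG_STR)
            out.append(len(raw))
            out += raw
    return bytes(out)

doc = [("price", 4200), ("currency", "EUR")]
blob = encode(doc)
xml = '<price>4200</price><currency>EUR</currency>'
print(len(blob), "bytes vs", len(xml), "bytes of XML")
```

Even this toy version shows the pattern: the names are paid for once in the dictionary, and each value costs a couple of bytes of framing plus its compact binary form.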

That's alright, but a per-document data dictionary wouldn't be
suitable for a server dishing out large numbers of very small
documents, because of the space overhead. Secondly, the
encoder/decoder would have to build a lookup table in memory for
every document. A long-running application loses the opportunity to
cache the lookup table in some high-speed memory, and has to build
and tear down lookup tables constantly. That's why I prefer data
dictionaries per _document_type_: an instance of an application
often deals with a limited set of document types.
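The per-document-type idea can be sketched as follows. Here the wire format is assumed to carry only a document-type identifier, so a long-running process builds the name lookup table once per type rather than once per document; the type names and tables are made-up examples:

```python
# doctype id -> name table, agreed out of band (e.g. derived from the schema)
DICTIONARIES = {
    "invoice-v1": ["price", "currency", "customer"],
}

_cache = {}

def lookup_table(doctype):
    """Return (and memoize) the index->name table for a document type."""
    try:
        return _cache[doctype]
    except KeyError:
        table = _cache[doctype] = list(DICTIONARIES[doctype])
        return table

# Every small document now costs only a dict lookup, not a table build.
first = lookup_table("invoice-v1")
for _ in range(1000):
    assert lookup_table("invoice-v1") is first
```

The cache is the whole point: decoding a million tiny invoices touches the table-building code exactly once.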

> And in this scheme, the encoder is just using the schema as hints on what 
> information it can discard for efficiency. If the schema says that 
> something's an integer, it can drop all aspects of it apart from the integer 
> value by encoding it as a binary number. But if a schema revision widens 
> that integer field into an arbitrary string, then it can start encoding 
> it as an arbitrary string.

... and the decoder recognizes some fundamental data types which it
can read without referring to the schema - I like this approach :-)
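That approach can be illustrated with a toy decoder. Because each value carries a small type tag, the decoder turns bytes back into values without ever consulting the schema; the tag values and layout here are illustrative assumptions only:

```python
import struct

TAG_INT, TAG_STR = 0x01, 0x02  # hypothetical type tags

def decode_values(buf):
    """Decode a stream of tagged values with no schema in sight."""
    values, pos = [], 0
    while pos < len(buf):
        tag = buf[pos]; pos += 1
        if tag == TAG_INT:                    # fundamental type: 4-byte int
            (v,) = struct.unpack_from(">i", buf, pos)
            pos += 4
        elif tag == TAG_STR:                  # length-prefixed UTF-8 string
            n = buf[pos]; pos += 1
            v = buf[pos:pos + n].decode("utf-8")
            pos += n
        else:
            raise ValueError(f"unknown type tag {tag:#x}")
        values.append(v)
    return values

buf = bytes([TAG_INT]) + struct.pack(">i", 4200) + bytes([TAG_STR, 3]) + b"EUR"
print(decode_values(buf))  # [4200, 'EUR']
```

The schema only ever influenced the *encoder's* choice of tag; the decoder's fundamental types are self-describing.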

> >   With schema-based compaction done as aggressively as possible,
> >   how much would be gained over a simple markup binarization
> >   scheme? Perhaps a compaction factor of, say, 5 over XML. Would
> >   this really be significant compared to a factor of, say, 4
> >   achieved by markup binarization? This is an optimization issue -
> >   the smaller the binary scheme, the more computation is required
> >   to extract information out of it. I'm not totally against a
> >   type-aware encoding, but for a standard binary encoding to
> >   evolve, it would have to sit at a "sweet spot" on the size
> >   vs. computation vs. generality plane.
> Robin was quoting better numbers than these factors of 4 or 5... But even 
> then, I think a bandwidth-limited company would be happy to do a relatively 
> zero-cost upgrade away from textual XML in order to get a fivefold increase 
> in capacity :-)

Exactly! That's what I want to emphasize. The numbers 4 and 5 are not
significant in themselves; what's significant is the difference
between them. I'd favour a slightly sub-optimal encoding that's
(ideally) as flexible as XML over one that becomes inflexible just to
improve a little more on what's already a significant gain.
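The arithmetic behind that point is worth spelling out. Using an illustrative 100 KB document, the jump from textual XML to any binary form dominates, and the gap between a factor-4 and a factor-5 scheme is small by comparison:

```python
original = 100_000           # bytes of textual XML (illustrative figure)
factor4 = original / 4       # simple markup binarization
factor5 = original / 5       # aggressive schema-based compaction

print(f"XML -> factor 4 saves {original - factor4:.0f} bytes")
print(f"factor 4 -> factor 5 saves only {factor4 - factor5:.0f} more")
```

So roughly 75,000 of the 80,000 bytes saved come from binarization alone; squeezing out the last 5,000 is what costs the flexibility.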

Tahir Hashmi (VSE, NCST)
tahir AT ncst DOT ernet DOT in

We, the rest of humanity, wish GNU luck and Godspeed

