   Re: [xml-dev] Parsing efficiency? - why not 'compile'????

On Wednesday 26 February 2003 09:52, Tahir Hashmi wrote:

> # Interpreting involved binary constructs could be more difficult:
>   Consider the variable length symbols that I have used in Xqueeze[1]
>   (as also Dennis Sosnoski in XMLS, IIRC). The symbols are easy to
>   understand - unsigned integers serialized as octets in Big-endian
>   order, with the least significant bit of each octet acting as a
>   continuation flag. However, parsing them requires a loop that runs
>   as many times as there are octets in the symbol to read one. Each
>   iteration involves one comparison (check if LSb is 1),
>   multiplication (promotion of the previous octet by 8 bits) and
>   addition (value of the current octet). It's not difficult to see the
>   computation involved in arriving at "Wed Jan 3rd 2003, 14:00 GMT"
>   from a variable length integer that counts the number of seconds
>   since the Epoch[2].

I'm not sure what you're trying to say here. Reading the variable-length 
integer from a file would be more efficient than reading the date string and 
converting it to a number of seconds since the epoch, yes?
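For concreteness, here's a minimal sketch of that kind of variable-length integer, assuming seven payload bits per octet with the least significant bit as the continuation flag (my reading of the scheme described above; Xqueeze's actual layout may differ):

```python
def encode_varint(n):
    """Serialize an unsigned int as big-endian octets, LSb = continuation flag."""
    groups = []
    while True:                 # split into 7-bit groups, least significant first
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()            # big-endian: most significant group first
    out = bytearray()
    for i, g in enumerate(groups):
        cont = 1 if i < len(groups) - 1 else 0   # 1 = more octets follow
        out.append((g << 1) | cont)
    return bytes(out)

def decode_varint(data):
    """Return (value, octets consumed) -- the loop the original post describes."""
    value = 0
    for i, b in enumerate(data):
        value = (value << 7) | (b >> 1)   # promote previous octets, add payload
        if not (b & 1):                   # continuation flag clear: last octet
            return value, i + 1
    raise ValueError("truncated varint")
```

So each decoded octet costs one shift, one OR, and one flag test, as the post says; the question is whether that beats parsing a decimal string, and for anything longer than a couple of digits it clearly does.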

> # Tight coupling between schema revisions:
>   XML is quite resilient to changes in the schema as long as the
>   changes are done smartly enough to allow old documents to pass
>   validation through the new schema. This flexibility would be
>   restricted the greater is the dependence of the binary encoding on
>   the schema.

That's not a problem in practice, I think. Say we have a format that works by 
storing a dictionary of element and attribute names at the beginning of the 
document (or distributed through it, whenever the name is first encountered, 
or whatever) and that stores element and attribute text content as a compact 
binary representation of the type declared in the schema, including a few 
bits of type declaration in the header for each value.

There is enough information in the binary file to recreate the original XML 
document, modulo the PSVI-canonicalisation of 1.<!--hello-->2 becoming 1.2 
and so on, so the binary reader will be unaffected by any schema changes by 
definition; it doesn't need the schema to decode.
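As an illustration of why the reader needs no schema, here's a toy version of such a format (the names, type tags, and layout are all my own invention for this post, not any proposed standard): a name dictionary up front, and each value carrying its own type tag, so a decoder can recover the data from the stream alone:

```python
import struct

# Hypothetical type tags, carried in the stream itself
T_STRING, T_INT = 0, 1

def encode(events, schema_types):
    """events: list of (name, text); schema_types: name -> type tag (from schema)."""
    names = sorted({name for name, _ in events})
    index = {n: i for i, n in enumerate(names)}
    out = bytearray([len(names)])               # header: name dictionary
    for n in names:
        b = n.encode(); out.append(len(b)); out += b
    for name, text in events:                   # body: (name idx, type tag, value)
        tag = schema_types.get(name, T_STRING)
        out += bytes([index[name], tag])
        if tag == T_INT:
            out += struct.pack(">q", int(text))  # compact typed binary value
        else:
            b = text.encode(); out.append(len(b)); out += b
    return bytes(out)

def decode(data):
    """Recover (name, text) pairs with no schema in sight: the type tags suffice."""
    pos = 0
    n_names = data[pos]; pos += 1
    names = []
    for _ in range(n_names):
        ln = data[pos]; pos += 1
        names.append(data[pos:pos + ln].decode()); pos += ln
    events = []
    while pos < len(data):
        name, tag = names[data[pos]], data[pos + 1]; pos += 2
        if tag == T_INT:
            (v,) = struct.unpack(">q", data[pos:pos + 8]); pos += 8
            events.append((name, str(v)))
        else:
            ln = data[pos]; pos += 1
            events.append((name, data[pos:pos + ln].decode())); pos += ln
    return events
```

Only the encoder ever consults the schema; the decoder reads back everything from the dictionary and the per-value tags.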

And in this scheme, the encoder just uses the schema as a hint about what 
information it can discard for efficiency. If the schema says that 
something's an integer, it can drop every aspect of it apart from the integer 
value by encoding it as a binary number. But if a later schema revision 
widens that integer field into an arbitrary string, the encoder simply starts 
encoding those values as arbitrary strings.

And ASN.1 encodings support extensibility in ways XML doesn't! When you 
extend an ASN.1 type, you use an extension marker to denote where the 
extension happened, and the encodings use this information so that older 
versions of the type still decode successfully in newer readers, and older 
readers can still read the parts of newer specs that they know about; in 
both cases the application using the decoder can opt to be warned about the 
version mismatch if it cares. Either way, a change to the ASN.1 types is not 
fatal; it's just reported to the application.
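For reference, the extension marker is just "..." in the ASN.1 module; a sketch with invented type and field names:

```asn1
-- Version 1
Person ::= SEQUENCE {
    name  UTF8String,
    age   INTEGER,
    ...              -- extension marker: future additions go after this point
}

-- Version 2 extends past the marker; a v1 decoder skips the unknown
-- field, and a v2 decoder accepts v1 values that lack it.
Person ::= SEQUENCE {
    name  UTF8String,
    age   INTEGER,
    ...,
    email UTF8String OPTIONAL
}
```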

>   With schema-based compaction done in all the aggressiveness
>   possible, how much would be gained against a simple markup
>   binarization scheme? Perhaps a compaction factor of, say, 5 over
>   XML. Would this be really significant when compared to a factor of,
>   say, 4 compaction achieved by markup binarization? This is an
>   optimization issue - the smaller the binary scheme, the more
>   computation required to extract information out of it. I'm not
>   totally against a type-aware encoding but for a standard binary
>   encoding to evolve, it would have to be in a "sweet spot" on the
>   size vs. computation vs. generality plane.

Robin was quoting better numbers than these factors of 4 or 5... But even 
then, I think a bandwidth-limited company would be happy to make a 
near-zero-cost move away from textual XML in order to get a fivefold 
increase in capacity :-)


Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen
