OASIS Mailing List Archives



   Re: [xml-dev] Parsing efficiency? - why not 'compile'????


Tahir Hashmi wrote:
> Robin Berjon wrote:
>>It would be horrible. Quite simply horrible. But then, it would never have taken 
>>off so we wouldn't be discussing it.
> Let me modify Karl's assumption a little:
>   Let's assume we /now have/ a binary XML specification [snip],
>   everything basically the same, just binary streaming format, but
>   same Infoset, same APIs /as/ for reporting XML content.
> And again ask these questions:
>   What would be the difference? For the programmer? For the platforms?

(note that your question is a bit flawed as we already have standard 
specifications for binary infosets.)

You basically have two groups of people:

   - those that don't need it. For them, it'll make no difference. They wouldn't 
use it. This is not the WXS type of technology that dribbles its way through 
many others.

   - those that do need it. These folks will be able to use XML where they 
couldn't before. And when I say XML, I mean AngleBracketedUnicode. Conversion to 
binary will only happen in the steps where it is needed so that most of what 
those people will see will be actual XML.

> Extreme optimization based on the knowledge of Schema might be
> unattractive because:
> # Interpreting involved binary constructs could be more difficult:
>   Consider the variable length symbols that I have used in Xqueeze[1]
>   (as also Dennis Sosnoski in XMLS, IIRC). The symbols are easy to
>   understand - unsigned integers serialized as octets in Big-endian
>   order, with the least significant bit of each octet acting as a
>   continuation flag. However, parsing them requires a loop that runs
>   as many times as there are octets in the symbol to read one. Each
>   iteration involves one comparison (check if LSb is 1),
>   multiplication (promotion of the previous octet by 8 bits) and
>   addition (value of the current octet). It's not difficult to see the
>   computation involved in arriving at "Wed Jan 3rd 2003, 14:00 GMT"
>   from a variable length integer that counts the number of seconds
>   since the Epoch[2].

Errr... I really am not sure what you mean, notably by "involved binary 
constructs". I think you can distinguish between two situations: a) the 
application wants a date, in which case seconds since the Epoch or a time_t 
struct might be exactly what it wants, it'll be cheaper than strptime(3) for 
sure; b) the application wants a string containing a date in which case you're 
free to store dates as strings in your binary format.
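
To make situation a) concrete, here is a rough sketch of the decoding loop 
described in the quoted text, followed by the conversion from seconds since the 
Epoch to a date. It is not the actual Xqueeze or XMLS wire format. The layout is 
an assumption: each octet carries 7 payload bits in its upper bits, and a low 
bit of 1 means another octet follows. The function names are made up for 
illustration.

```python
from datetime import datetime, timezone


def encode_varint(value):
    """Encode an unsigned integer, most significant 7-bit group first.

    Assumed layout (not taken from any spec): payload in the upper 7 bits
    of each octet, least significant bit set to 1 when more octets follow.
    """
    if value < 0:
        raise ValueError("only unsigned integers are supported")
    groups = []
    while True:
        groups.append(value & 0x7F)
        value >>= 7
        if value == 0:
            break
    groups.reverse()  # big-endian order
    last = len(groups) - 1
    return bytes((g << 1) | (1 if i < last else 0) for i, g in enumerate(groups))


def decode_varint(data, pos=0):
    """Return (value, next_position). One shift, one OR and one test per octet."""
    value = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated variable-length integer")
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet >> 1)  # promote, then add 7 payload bits
        if not (octet & 1):                  # continuation flag clear: done
            return value, pos


def seconds_to_utc(seconds):
    """Interpret an integer as seconds since the Epoch, in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Round trip for 3 January 2003, 14:00 GMT.
stamp = datetime(2003, 1, 3, 14, 0, tzinfo=timezone.utc)
seconds = int(stamp.timestamp())
encoded = encode_varint(seconds)
value, consumed = decode_varint(encoded)
print(len(encoded), "octets ->", seconds_to_utc(value).isoformat())
```

An application that wants a date gets an integer after a handful of shifts and 
can hand it straight to its date type, with no string parsing at all. An 
application that only wants the text can have the date stored as a string 
instead.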

> # Forced validation:
>   The above situation would be even more ironic if the application
>   didn't care about the actual value of the date and was only
>   interested in some string that looked like a date. With XML
>   validation of data types is an option that is being enforced as a
>   requirement in the above scheme. Even where validation is required,
>   how far can a parser validate? A value may be syntactically or
>   semantically acceptable but contextually invalid (lame e.g. - a date
>   of birth being in the future). My point: validation is and should
>   remain an option.

This is completely orthogonal to the subject.

> # Tight coupling between schema revisions:
>   XML is quite resilient to changes in the schema as long as the
>   changes are done smartly enough to allow old documents to pass
>   validation through the new schema. This flexibility would be
>   restricted the greater is the dependence of the binary encoding on
>   the schema. (I still have to reach XML's level of compatibility in
>   Xqueeze Associations (data dictionary). Fortunately, achieving that
>   wouldn't require changes in the grammar of the encoding).

This is a solved problem in BinXML, multiple versions of the same schema can 

> # What is gained in the end?
>   With schema-based compaction done in all the aggressiveness
>   possible, how much would be gained against a simple markup
>   binarization scheme? Perhaps a compaction factor of, say, 5 over
>   XML. Would this be really significant when compared to a factor of,
>   say, 4 compaction achieved by markup binarization? This is an
>   optimization issue - the smaller the binary scheme, the more
>   computation required to extract information out of it. I'm not
>   totally against a type-aware encoding but for a standard binary
>   encoding to evolve, it would have to be in a "sweet spot" on the
>   size vs. computation vs. generality plane.

I'm all for finding a sweet spot, but pulling random numbers out of a hat and 
making broad assumptions about size vs computation won't contribute much to 
getting there. I am talking about factors of 10, 20 or 50 (or more, but testing 
on SOAP is cheating ;) that have been empirically proven, tested, retested, and 
put to work in a wide variety of situations.

As for your remark on the speed of decompaction, note that you may be right for 
a naive implementation of the same thing but there's compsci literature out 
there on making such tasks fast.

Robin Berjon <robin.berjon@expway.fr>
Research Engineer, Expway        http://expway.fr/
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488



Copyright 2001 XML.org. This site is hosted by OASIS