   Re: [xml-dev] Parsing efficiency? - why not 'compile'????

On Tue, 25 Feb 2003 15:59:27 +0100
Robin Berjon wrote:

> Karl Waclawek wrote:
> > Let's assume we would have had a binary XML specification from
> > the beginning, everything basically the same, just binary streaming format,
> > but same Infoset, same APIs for reporting XML content.
> > What would be the difference? For the programmer? For the platforms?
> It would be horrible. Quite simply horrible. But then, it would never have taken 
> off so we wouldn't be discussing it.

Let me modify Karl's assumption a little:

  Let's assume we /now have/ a binary XML specification [snip],
  everything basically the same, just binary streaming format, but
  same Infoset, same APIs /as/ for reporting XML content.

And again ask these questions:

  What would be the difference? For the programmer? For the platforms?

> Binary XML is a contradiction in adjecto. That's why I'm anti-binxml: simply 
> because there is no such thing as "Binary XML". Binary Infosets however are 
> another story completely, and much more interesting :)

True, there's no such thing as "Binary XML". Let's say we're talking
about "Binary XML-like Markup", condensed to "binary markup". One step
in creating a binary markup scheme is to replace the terminals in the
XML grammar (which are essentially combinations of Unicode characters)
with some other form. Binary infosets are not necessarily binary
markup. They're just serializations of some data structure.

Extreme optimization based on the knowledge of Schema might be
unattractive because:

# Interpreting involved binary constructs could be more difficult:

  Consider the variable-length symbols that I have used in Xqueeze[1]
  (as has Dennis Sosnoski in XMLS, IIRC). The symbols are easy to
  understand: unsigned integers serialized as octets in big-endian
  order, with the least significant bit of each octet acting as a
  continuation flag. However, reading one symbol requires a loop that
  runs once for each octet in the symbol. Each iteration involves one
  comparison (check if the LSb is 1), one multiplication (promotion of
  the previous octet by 8 bits) and one addition (value of the current
  octet). It's not difficult to see how much computation is involved
  in arriving at "Fri Jan 3rd 2003, 14:00 GMT" from a variable-length
  integer that counts the number of seconds since the Epoch[2].
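
  The loop described above might look roughly like the following Python
  sketch. It is an illustration, not the Xqueeze or XMLS implementation:
  the bit layout (7 payload bits in the upper bits of each octet, LSb = 1
  meaning "another octet follows") is an assumption, and the names
  encode_varint and decode_varint are hypothetical. Because one bit per
  octet is spent on the flag, this sketch promotes the accumulated value
  by 7 bits per iteration.

```python
from datetime import datetime, timezone


def encode_varint(value):
    """Encode an unsigned integer as big-endian octets.

    Assumed layout (illustrative only): 7 payload bits in the upper bits
    of each octet; the least significant bit is 1 if another octet follows.
    """
    if value < 0:
        raise ValueError("only unsigned integers are supported")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # most significant 7-bit group first
    last = len(groups) - 1
    return bytes((g << 1) | (1 if i < last else 0) for i, g in enumerate(groups))


def decode_varint(data, pos=0):
    """Return (value, next_pos). The loop runs once per octet."""
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        # Promote what has been read so far, then add this octet's payload.
        value = (value << 7) | (octet >> 1)
        # Comparison: a clear continuation flag ends the symbol.
        if octet & 1 == 0:
            return value, pos


# Seconds since the Epoch for 3 January 2003, 14:00 GMT.
seconds = int(datetime(2003, 1, 3, 14, 0, tzinfo=timezone.utc).timestamp())
encoded = encode_varint(seconds)
decoded, _ = decode_varint(encoded)

# Turning the integer back into a readable date is a further step.
print(len(encoded), datetime.fromtimestamp(decoded, tz=timezone.utc).isoformat())
```

  Even in this small form, getting from the octets to a printable date
  takes the per-octet loop plus a calendar conversion, where a textual
  date could simply be passed through as a string.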

# Forced validation:

  The above situation would be even more ironic if the application
  didn't care about the actual value of the date and was only
  interested in some string that looked like a date. In XML,
  validation of data types is optional; the above scheme turns it
  into a requirement. Even where validation is required, how far can
  a parser validate? A value may be syntactically or semantically
  acceptable but contextually invalid (a lame example: a date of
  birth in the future). My point: validation is, and should remain,
  an option.
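
  The three levels of checking mentioned above can be told apart in a
  short Python sketch. The function names and the YYYY-MM-DD date shape
  are assumptions chosen for illustration; nothing here comes from an
  XML parser or from Xqueeze.

```python
import re
from datetime import date

# Syntactic check only: four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def looks_like_date(text):
    """True if the string merely has the shape of a date."""
    return DATE_SHAPE.fullmatch(text) is not None


def is_valid_date(text):
    """True if the string also names a day that exists in the calendar."""
    if not looks_like_date(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def is_plausible_birth_date(text, today):
    """Contextual check that only the application can define."""
    return is_valid_date(text) and date.fromisoformat(text) <= today


today = date(2003, 2, 25)
for candidate in ("2003-02-30", "2999-01-01", "1975-06-01"):
    print(candidate,
          looks_like_date(candidate),
          is_valid_date(candidate),
          is_plausible_birth_date(candidate, today))
```

  An application that only needs the first check pays for nothing more
  when the date stays a string; an encoding that stores the date as an
  integer has already forced the second check on every producer.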

# Tight coupling between schema revisions:

  XML is quite resilient to changes in the schema, as long as the
  changes are made carefully enough that old documents still pass
  validation against the new schema. The more the binary encoding
  depends on the schema, the more this flexibility is restricted.
  (I have yet to reach XML's level of compatibility in Xqueeze
  Associations (the data dictionary). Fortunately, achieving that
  wouldn't require changes to the grammar of the encoding.)
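
  A toy model of this coupling, in Python. It is deliberately simplified
  and is not how Xqueeze or any real binary encoding works: the field
  names, the two schema revisions and all four functions are hypothetical.
  The positional form leaves names out because the schema implies them;
  the tagged form carries names with the values, as markup does.

```python
SCHEMA_V1 = ["name", "born"]
SCHEMA_V2 = ["name", "email", "born"]  # a later revision inserts a field


def encode_positional(record, schema):
    """Schema-dependent: values only, in schema order."""
    return [record.get(field, "") for field in schema]


def decode_positional(values, schema):
    """Only works if reader and writer agree on the schema revision."""
    if len(values) != len(schema):
        raise ValueError("encoding does not match this schema revision")
    return dict(zip(schema, values))


def encode_tagged(record):
    """Self-describing: every value carries its own name."""
    return list(record.items())


def decode_tagged(pairs, known_fields):
    """Keep the fields this reader knows about and skip the rest."""
    return {name: value for name, value in pairs if name in known_fields}


old_document = {"name": "A. N. Other", "born": "1975-06-01"}

# Old document, new schema: the tagged form still decodes.
tagged = decode_tagged(encode_tagged(old_document), set(SCHEMA_V2))

# Old document, new schema: the positional form no longer lines up.
try:
    decode_positional(encode_positional(old_document, SCHEMA_V1), SCHEMA_V2)
    positional_ok = True
except ValueError:
    positional_ok = False

print(tagged, positional_ok)
```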

# What is gained in the end?

  With schema-based compaction applied as aggressively as possible,
  how much would be gained over a simple markup binarization scheme?
  Perhaps a compaction factor of, say, 5 over XML. Would this really
  be significant compared to the factor of, say, 4 achieved by markup
  binarization? This is an optimization issue: the smaller the binary
  encoding, the more computation is required to extract information
  from it. I'm not totally against a type-aware encoding, but for a
  standard binary encoding to evolve, it would have to sit in a
  "sweet spot" in the size vs. computation vs. generality space.
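
  To put the two example factors side by side, here is the arithmetic
  as a small Python sketch. The 100,000-byte document size is a made-up
  figure, and the factors 4 and 5 are the "say" values from the
  paragraph above, not measurements.

```python
def compacted_size(original_bytes, factor):
    """Size after compaction by the given factor."""
    return original_bytes / factor


original = 100_000  # hypothetical XML document size in bytes

markup_only = compacted_size(original, 4)   # simple markup binarization
schema_based = compacted_size(original, 5)  # aggressive schema-based scheme

# Share of the original document saved by each approach.
saved_markup_only = 1 - markup_only / original
saved_schema_based = 1 - schema_based / original

# What the extra aggressiveness buys, relative to the original size.
extra_saving = (markup_only - schema_based) / original

print(markup_only, schema_based, extra_saving)
```

  Under these assumed numbers, going from factor 4 to factor 5 removes
  a further 5% of the original document, on top of the 75% that markup
  binarization had already removed.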

[1] http://xqueeze.sourceforge.net
[2] http://www.alaric-snell.com/xml-dev-threads.html#binxml

PS: I've revised the xqML specifications to allow document parsing
without knowledge of the schema, along with other goodies. I'll
release a draft spec on Monday (3rd March), when my vacation is over.
Random access will be addressed shortly thereafter. :-)

Tahir Hashmi
tahir AT ncst DOT ernet DOT in

We, the rest of humanity, wish GNU luck and Godspeed

