   Re: [xml-dev] Parsing efficiency? - why not 'compile'????

On Thu, 27 Feb 2003 11:02:41 +0100
Robin Berjon wrote:

> Tahir Hashmi wrote:
> > Let me modify Karl's assumption a little:
> > 
> >   Let's assume we /now have/ a binary XML specification [snip],
> >   everything basically the same, just binary streaming format, but
> >   same Infoset, same APIs /as/ for reporting XML content.
> > 
> > And again ask these questions:
> > 
> >   What would be the difference? For the programmer? For the platforms?
> (note that your question is a bit flawed as we already have standard 
> specifications for binary infosets.)

I didn't get it... I mean, isn't a binary substitute what we're trying
to develop? I'm not talking about API, I'm talking about syntax or
serialization or whatever - the thing that can be stored in files or
passed down the wire.

> You basically have two groups of people:
>    - those that don't need it. For them, it'll make no difference. They wouldn't 
> use it. This is not the WXS type of technology that dribbles its way through 
> many others.
>    - those that do need it. These folks will be able to use XML where they 
> couldn't before. And when I say XML, I mean AngleBracketedUnicode. Conversion to 
> binary will only happen in the steps where it is needed so that most of what 
> those people will see will be actual XML.

In the first group, there could be a subgroup that doesn't need binary
markup but may use it simply because it can, without affecting the way
its applications work. That's the group that doesn't need human
read/write-ability for its XML docs - the group of WYSIWYG Office
suites, XML-based instant messaging protocols and so on. I hope not
all the people in this group would be the same as those described by
Elliot ;-)

> > # Interpreting involved binary constructs could be more difficult:
> Errr... I really am not sure what you mean, notably by "involved binary 
> constructs". I think you can distinguish between two situations: a) the 
> application wants a date, in which case seconds since the Epoch or a time_t 
> struct might be exactly what it wants, it'll be cheaper than strptime(3) for 
> sure; b) the application wants a string containing a date in which case you're 
> free to store dates as strings in your binary format.

Consider this: the application is only interested in strings for date
but the schema designer specified a date type because it is the Right
Thing(TM) for a date (so that the schema need not be changed if at some
point of time the same application or another application does get
interested in the value).

In a binary representation, the processor will decode the variable
length binary value to arrive at the number of seconds since epoch,
then re-construct a string for the application. Note that the
processor will be *synthesizing* a string that could be read straight
off the document.

This approach would be better only if the benefits of saved bandwidth
are greater than the cost of synthesizing the date string. And we
can't assume that limited bandwidth is *always* going to be the
motivating factor for using binary markup.
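
To make the cost concrete, here is a minimal sketch of the two paths in
Python. The variable-length integer layout (7 bits per byte, high bit as
a continuation flag) and all function names are my own assumptions for
illustration; no actual binary XML proposal is being quoted.

```python
from datetime import datetime, timezone


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, low bits first (assumed layout)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode the variable-length integer produced by encode_varint."""
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value
        shift += 7
    raise ValueError("truncated variable-length integer")


def date_from_binary(data: bytes) -> str:
    """Binary path: decode seconds since the epoch, then synthesize a string."""
    seconds = decode_varint(data)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def date_from_text(data: bytes) -> str:
    """Text path: the string is read straight off the document."""
    return data.decode("ascii")
```

The binary path does a decode plus a calendar computation to hand back
what the text path gets from a plain byte-to-string conversion; whether
the smaller payload pays for that depends on the deployment.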

> > # Forced validation:
> > 
> >   The above situation would be even more ironic if the application
> >   didn't care about the actual value of the date and was only
> >   interested in some string that looked like a date. With XML
> >   validation of data types is an option that is being enforced as a
> >   requirement in the above scheme. Even where validation is required,
> >   how far can a parser validate? A value may be syntactically or
> >   semantically acceptable but contextually invalid (lame e.g. - a date
> >   of birth being in the future). My point: validation is and should
> >   remain an option.
> This is completely orthogonal to the subject.

This may not be *completely* orthogonal. In the cited case, despite
the date string being typed as date, the application is free to ignore
the value by choosing to not validate it. In strongly typed encoding,
the decoder does type-checking implicitly and takes the pains to
compute a meaningful value whether or not the application required it.
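
A rough Python sketch of the difference, with a hypothetical TextDate
wrapper and a hypothetical encode_typed() standing in for a typed binary
encoder (neither comes from any real library or proposal):

```python
from datetime import date


class TextDate:
    """A date carried as text: checking happens only if the application asks."""

    def __init__(self, raw: str):
        self.raw = raw

    def as_string(self) -> str:
        # No parsing, no validation: the application gets the bytes as written.
        return self.raw

    def validated(self) -> date:
        # Optional step; raises ValueError for strings that are not real dates.
        return date.fromisoformat(self.raw)


def encode_typed(raw: str) -> int:
    """Hypothetical typed encoder: it has to parse the date to store a number.

    The stored number here is Python's proleptic Gregorian ordinal, chosen
    only to keep the example short.
    """
    return date.fromisoformat(raw).toordinal()
```

With text, TextDate("2003-02-30").as_string() hands back the string
untouched even though no such day exists; encode_typed("2003-02-30")
raises ValueError because the typed path cannot skip the check.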

The particular example I gave is illustrative only and as stated
earlier, I'm not against type-awareness. I'm simply being wary of how
much flexibility might possibly be lost, and in some cases
computation be wasted, in the quest of a super-optimized binary

> As for your remark on the speed of decompaction, note that you may be right for 
> a naive implementation of the same thing but there's compsci literature out 
> there on making such tasks fast.

Well yes, naivete may lead to bad design. The point is that the more
logic goes into decoding a format, the higher the bar is raised for
small devices. While one can have small non-validating SAX parsers
for XML, the size of a binary format parser may go up since it would
have to know about synthesizing dates from integers, deducing document
structure from the schema etc., besides the indispensable passing of
strings around. The encoding scheme should require the least possible
context information and minimal parsing logic to be accessible
there. Hope I'm able to explain myself better this time!

Tahir Hashmi (VSE, NCST)
tahir AT ncst DOT ernet DOT in

We, the rest of humanity, wish GNU luck and Godspeed