OASIS Mailing List Archives

   Re: [xml-dev] The privilege of XML parsing - Data types,binary XML and X


On Sat, 07 Dec 2002 08:02:03 +0000, Sean McGrath <sean.mcgrath@propylon.com> wrote:

> I've given a lot of thought recently to what it is about data typing
> in XML and Binary XML that makes me so nervous. What follows is my
> most concerted attempt at articulating what causes me to be
> so nervous and a suggestion for how we might proceed.

Sheesh, you're insightful. You ought to write a book :-)

> Simply put, there is nothing wrong with Binary XML within the confines
> of an application. It is a very useful optimization which can and
> should be treated as a "compiler". You would never throw away your
> source code having passed it through a compiler. The same should be
> the case with your XML. It is the portable representation of your data
> just like the source files are the portable version of your machine
> code.

I think that's a great analogy.  Some alternative syntaxes for "XML
infoset serialization" (not to be confused with "XML" of course) would
exist in addition to the XML source code, for convenience/efficiency.
They wouldn't in any sense replace the source code; "losing the source
code" would be a disaster, and failing to provide the source code
on demand would be a horrible breach of interop etiquette.

> If they end
> up using strongly typed "compiled" XML to get around this, they will have
> tightly bound their XML to their process which is a bad thing.

I agree with the "typed" bit, and I completely agree that tightly binding
application-specific datatypes to shared data puts us back in the Bad Old 

> Standardized marshallings of XML (XML infoset compilers) for Java, .NET etc.
> need to be done so that the notion of binary XML is both catered for
> and COMPREHENSIVELY RELEGATED to the realm of "compiled"
> output. Something you just use for optimization reasons but NEVER use
> as primary storage for your data.

Hmm, I have some minor quibbles, or perhaps need clarification. I do see
some *potential* reasons for a standardized "efficient" Infoset
interchange format that doesn't include platform-specific binary
formats or application-specific datatypes.  To extend the source code /
object code analogy, it might be thought of as P-code.  It would simply
serialize an infoset in a way that makes it significantly easier to parse
than the UnicodeWithAngleBrackets we know and love.

Basically, one might start from the XPath data model (or some successor)
and come up with a vendor/platform/language-neutral format that serializes
it in a way that is faster to parse, based on actual experience of where
XML parsers spend their time. I don't have any solutions to propose, but
some problems that such a P-code might solve could include:

- The inefficiency of resolving namespaces.
- Normalizing Unicode characters.
- Resolving entities, CDATA sections, and other syntax sugar. (I'm not
  sure what to do about &lt; &gt; &amp; but I suspect that there are
  creative solutions possible.)
- Buffer management, string rewriting, object creation. I know that
  these are significant bottlenecks in most parsers.  I don't know
  offhand how a format could help break them, but I know that this is a
  big reason that people who write high-performance SOAP processors drew
  a line in the sand and refused to allow DTDs and all the cruft they
  bring along into SOAP.

Maybe a P-code that pre-resolves the "cruft" might be parsable
significantly faster than XML can be parsed.
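
To make the brainstorm a bit more concrete, here is a rough Python sketch
of what "pre-resolving the cruft" could look like. It is only an
illustration under my own assumptions: the record tags (START, END, TEXT,
ATTR), the length-prefixed layout, and the function names compile_xml and
read_events are all invented for this example and correspond to no
existing or proposed format. It leans on the standard library's
ElementTree parser to resolve namespace prefixes, entities, and CDATA
once, normalizes text to NFC, and writes a flat stream that a reader can
walk without scanning for angle brackets. It makes no claim about actual
speed.

```python
import io
import struct
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical record tags for this sketch; not part of any standard.
START, END, TEXT, ATTR = 1, 2, 3, 4
ARITY = {START: 2, END: 0, TEXT: 1, ATTR: 3}  # string fields per record


def _split(name):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return "", name


def _put(out, tag, *strings):
    """Write a tag byte followed by length-prefixed, NFC-normalized UTF-8."""
    out.write(struct.pack("B", tag))
    for s in strings:
        data = unicodedata.normalize("NFC", s).encode("utf-8")
        out.write(struct.pack(">I", len(data)))
        out.write(data)


def compile_xml(xml_text):
    """'Compile' XML source into a flat byte stream of resolved events.

    Comments and processing instructions are dropped by the default
    ElementTree parser, so this sketch does not round-trip them.
    """
    out = io.BytesIO()

    def walk(elem):
        _put(out, START, *_split(elem.tag))
        for name, value in elem.attrib.items():
            _put(out, ATTR, *_split(name), value)
        if elem.text:  # entities and CDATA are already expanded here
            _put(out, TEXT, elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                _put(out, TEXT, child.tail)
        _put(out, END)

    walk(ET.fromstring(xml_text))
    return out.getvalue()


def read_events(blob):
    """Decode the stream by jumping from length prefix to length prefix."""
    pos = 0
    while pos < len(blob):
        tag = blob[pos]
        pos += 1
        fields = []
        for _ in range(ARITY[tag]):
            (n,) = struct.unpack_from(">I", blob, pos)
            pos += 4
            fields.append(blob[pos:pos + n].decode("utf-8"))
            pos += n
        yield (tag, *fields)


source = '<a:r xmlns:a="urn:x" k="v">1 &lt; 2<![CDATA[ & ]]><a:c/>t</a:r>'
events = list(read_events(compile_xml(source)))
```

The XML text stays the source of record; the byte stream is something you
regenerate from it whenever you like, which is the whole point of the
compiler analogy.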

Sorry to go on so long with this possibly ill-founded brainstorm, but this 
is the kind of thing
I think many advocates of "binary" XML are talking about, and it is not 
tied up with application-specific datatypes or platform-specific numeric 

> I suggest we make one core twist to XML. Let's express the various layers
> to XML parsing in terms of a pipeline and see if it can help
> us accommodate the data typing folk, the binary XML folk etc. without
> throwing out the baby with the bathwater.

Yes!  I think that the "pipeline processing" metaphor could provide a
Gestalt shift that puts more of the stuff that seems to divide the XML
community into a common set of tools/operations; the different
communities use a different set of these, or combine them in different
ways, without getting in each other's way.
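
For what it's worth, here is how I picture the pipeline idea, as a small
Python sketch. The stage names (parse_stage, typing_stage), the pipeline
helper, and the path-keyed converter table are hypothetical, made up for
this message; nothing here is a proposal for an actual API. The base stage
yields untyped strings only, and data typing is a separate stage that a
consumer may bolt on or leave out.

```python
import xml.etree.ElementTree as ET


def parse_stage(xml_text):
    """Base stage: XML text in, (path, text) pairs out. No typing at all."""
    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        if elem.text and elem.text.strip():
            yield here, elem.text.strip()
        for child in elem:
            yield from walk(child, here)

    yield from walk(ET.fromstring(xml_text), "")


def typing_stage(items, types):
    """Optional stage: apply caller-supplied converters keyed by path."""
    for path, text in items:
        yield path, types.get(path, str)(text)


def pipeline(source, *stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        source = stage(source)
    return source


doc = "<order><qty>3</qty><note>rush</note></order>"

# One community stops after the base stage and sees plain strings.
untyped = list(parse_stage(doc))

# Another adds its own typing stage without changing the source document.
typed = list(pipeline(
    parse_stage(doc),
    lambda items: typing_stage(items, {"/order/qty": int}),
))
```

Both consumers read the same XML; the typing lives in the stage one of
them chose to add, not in the shared data.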



Copyright 2001 XML.org. This site is hosted by OASIS