OASIS Mailing List Archives





   The privilege of XML parsing - Data types, binary XML and XML pipelines


I've given a lot of thought recently to what it is about data typing
in XML and Binary XML that makes me so nervous. What follows is my
most concerted attempt at articulating what makes me so nervous,
and a suggestion for how we might proceed.

Executive summary

Welding data typing into the core of XML is a really bad idea. It is
well on the way to splitting the XML world in two, which is a very
bad thing.

Binary XML is a terrible idea that must be embraced before it does
terrible damage.

The best way to deal with these - and other thornies like namespace
expansion, XLink, XInclude etc. - is to infuse the concept of "XML
processing", from the lexical to the application, with a pipeline
processing architecture before we all go completely ga ga or stop
talking to each other or both.

Data Types:

In order for any two systems to communicate they need to have a shared
understanding of at least one, "bootstrap" data type for the bag of
ones and zeros that ultimately goes across the wire.

It is a self-evident fact that decades of computing have failed to
produce a set of universal datatypes - types that one can reasonably
expect to be commodity datatypes on most architectures, most
programming languages, most databases etc.

Originally the only universal datatype was 1 or 0. Then came the
universality of ASCII. Now we are seeing ASCII evolve into Unicode.

The nearest thing we have to a universal datatype is Unicode - or in
programming language terms - the STRING datatype.

The wonderful thing about strings - apart from their universality - is
the fact that you can use a universal string based notation to
represent pretty much any higher order datatype. Programming
languages have used this fact to great effect over the years by
storing programs as STRINGS. These days, there is a general
purpose notation for sharing higher order datatypes in a universal way.
That notation is XML.

XML did not - and should not now be allowed to - fall into the trap of
assuming the existence of a universal set of datatypes, for the
following reasons:

1. No such set of datatypes exists. The world is full of systems that
have only a thematic consensus on things like "int", "date" and so on.
Datatypes for aggregates like "person" or "business" have proven to
be essentially impossible to canonicalize.

2. Applications come and go but data lives forever (or can do). The
trick of making your data outlive your applications is to divorce
application-level data models from the XML. Burying data model
information into the XML binds the XML to the application in a way
which will bite when the application is changed or retired.

3. Doing so significantly increases the semantic consensus required by
communicating processes to share data. The beauty of *HAVING* to
create your own data model[1] from a stream of Unicode with angle
brackets is that you do not have to share any semantics or
expectations other than Unicode with the originator of that XML. Far
from being a burden, it is a *privilege* to be able to parse the XML
and treat the data the way you want to, rather than have a data model
imposed on you.

4. There is no need to infuse this right into the core of XML - it
fits perfectly naturally into a post-parse,
application-domain-specific pipeline, which is where it belongs. Mind you,
those who think their interoperability problems will be solved by agreeing
on a set of basic datatypes are sorely mistaken.
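
Points 3 and 4 can be illustrated with a minimal Python sketch. The
<order> document, its element names and the typing rules below are all
hypothetical, invented for illustration only. The parse step hands over
nothing but strings; the receiving application decides afterwards, in
its own post-parse step, what is a number and what is a date.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

# Hypothetical document: nothing in it declares a datatype.
DOC = "<order><qty>3</qty><price>9.50</price><due>2002-12-10</due></order>"

# Hypothetical receiver-side typing rules, applied after the parse.
# Another receiver is free to choose different ones for the same document.
MY_TYPES = {"qty": int, "price": Decimal, "due": date.fromisoformat}

def parse_untyped(text):
    """Parse step: every leaf value stays a plain string."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}

def apply_types(record, rules):
    """Post-parse step: impose this application's own data model."""
    return {key: rules.get(key, str)(value) for key, value in record.items()}

raw = parse_untyped(DOC)            # all values are strings
typed = apply_types(raw, MY_TYPES)  # int, Decimal and date values
```

A second application could run the same parse step with a different
rules table (say, treating price as a float or leaving due as a string)
without the sender ever knowing or caring.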

Binary XML:

I use binary XML every day. My OpenOffice files are binary (zipped
XML), my serialized pyxie trees (Python pickles) are binary. My RDBS
that contain XML fragments are binary. I often send messages over MOMs
that contain XML + Python pickle.

Simply put, there is nothing wrong with Binary XML within the confines
of an application. It is a very useful optimization which can and
should be treated as a "compiler". You would never throw away your
source code having passed it through a compiler. The same should be
the case with your XML. It is the portable representation of your data
just like the source files are the portable version of your machine
code.

A standardized, zipped XML notation is something the community
needs to think about (perhaps in the context of packaging) because many
programmers see XML transmission size as a problem. If they end
up using strongly typed "compiled" XML to get around this, they will have
tightly bound their XML to their process, which is a bad thing.

Standardized marshallings of XML (XML infoset compilers) for Java, .NET etc.
need to be done so that the notion of binary XML is both catered for
and COMPREHENSIVELY RELEGATED to the realm of "compiled"
output. Something you just use for optimization reasons but NEVER use
as primary storage for your data.
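
As a rough Python sketch of the distinction drawn above (the memo
document and all function names are hypothetical): zipping is a lossless
packaging of the XML text itself, whereas a pickled parse tree is an
application-private "compiled" form that should always be regenerable
from the XML source.

```python
import pickle
import xml.etree.ElementTree as ET
import zlib

SOURCE = "<memo><to>xml-dev</to><body>Keep the source.</body></memo>"

def zip_xml(text):
    """Lossless packaging: the exact XML text can be recovered."""
    return zlib.compress(text.encode("utf-8"))

def unzip_xml(blob):
    return zlib.decompress(blob).decode("utf-8")

def to_tuple(elem):
    """Reduce an element to plain (tag, attributes, text, children)."""
    return (elem.tag, dict(elem.attrib), elem.text,
            [to_tuple(child) for child in elem])

def compile_xml(text):
    """Application-private "compiled" form: a pickled parse tree."""
    return pickle.dumps(to_tuple(ET.fromstring(text)))

def load_compiled(blob):
    return pickle.loads(blob)

zipped = zip_xml(SOURCE)
compiled = compile_xml(SOURCE)
tree = load_compiled(compiled)
```

The zipped form can be handed to anyone who understands zip and XML.
The pickle is only meaningful to a Python process that agrees with
to_tuple, which is exactly why it belongs in a cache and not in primary
storage.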

Pipeline processing:

I think we can keep peace amongst the data heads and the doc heads,
the infoset heads and the lex heads etc. I think the way to do it is
to infuse XML parsing with a layered, phased, time ordered processing
model so that data typing, XIncluding etc. can be incorporated into
a single, flexible framework. Those who don't want infoset annotation
should be able to leave it out of parsing by simply configuring the

This is where XPipe, DSDL, XVIF etc. are coming from. (Note that in
the sense I am using the word "pipe" here, the W3C XML Pipeline Note
is more of a dependency resolver than a pipe.)
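
To make the idea concrete, here is a minimal Python sketch of such a
layered pipeline. It is not XPipe, DSDL or XVIF; the stage names, the
<inc ref="..."/> element (a stand-in for inclusion, not real XInclude)
and the type="int" annotation are all assumptions made for illustration.
Every stage takes XML text and returns XML text, and a stage is left out
simply by not configuring it.

```python
import xml.etree.ElementTree as ET
from functools import reduce

FRAGMENTS = {"sig": "<sig>Sean</sig>"}  # hypothetical include targets

def expand_includes(text):
    """Replace each hypothetical <inc ref="..."/> with its fragment."""
    root = ET.fromstring(text)
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "inc":
                parent[i] = ET.fromstring(FRAGMENTS[child.get("ref")])
    return ET.tostring(root, encoding="unicode")

def annotate_types(text):
    """Hypothetical typing stage: mark all-digit leaf text as int."""
    root = ET.fromstring(text)
    for elem in root.iter():
        if elem.text and elem.text.strip().isdigit():
            elem.set("type", "int")
    return ET.tostring(root, encoding="unicode")

def run_pipeline(text, stages):
    """Feed the document through each configured stage in order."""
    return reduce(lambda doc, stage: stage(doc), stages, text)

DOC = '<memo><n>42</n><inc ref="sig"/></memo>'

# Data heads configure the typing stage; doc heads leave it out.
full = run_pipeline(DOC, [expand_includes, annotate_types])
lexical_only = run_pipeline(DOC, [expand_includes])
```

Because the only interface between stages is a Unicode string, a stage
could just as well be written in another language or run in another
process.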

I have not had the time to devote to XPipe that I would have liked but
I'm a big believer in XML pipelining. My company, Propylon, is about
to announce a
commercial J2EE implementation of XPipe which I'm hoping will fuel
interest in the open source community in this approach to XML
processing. (Anybody going to XML 2002 in Baltimore interested in
seeing this can contact me.)


I suggest we make one core twist to XML. Let's express the various layers
of XML parsing in terms of a pipeline and see if it can help
us accommodate the data typing folk, the binary XML folk etc. without
throwing out the baby with the bathwater.

The baby is that an XML document is *always* just a Unicode string to
start with. It is the world's only universally available, bootstrappable
data type apart from 1's and 0's.


[1] My use of "data model" here is, I have come to realize, at odds with Tim
Bray's usage of the term which is, I think, the reason we have in the
past disagreed on the data model issue.

Given my predisposition to pipeline thinking, I am acutely interested
in being able to build black box XML processing components whose only
interface to the outside world is that universal datatype we all know
and love - UnicodeWithAngleBrackets.

Without a mechanism for specifying what parts of the infoset are
preserved through such a black box, it is difficult to know what
boxes can be hooked up to what other boxes. Interop suffers and complexity

When I say I want a data model for XML, I mean that I want to be able
to say rigorously what parts of the lexical structure my black box
sees on its input side, can be faithfully replicated on the
output side.

In a pipelined world, this would equate to pre- and post-conditions on
pipeline components that express the infoset fidelity of the component.
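
One way such pre- and post-conditions might look, sketched in Python.
The item names ("elements", "comments" and so on), the Box class and the
two example boxes are hypothetical labels of my own, not a vocabulary
taken from the Infoset specification. Each black box declares what it
requires on input and what it faithfully preserves on output; a checker
then reports the first box in a chain whose pre-condition cannot be met.

```python
from dataclasses import dataclass

# Hypothetical fidelity vocabulary for a freshly parsed document.
FULL = frozenset({"elements", "attributes", "comments", "pis"})

@dataclass(frozen=True)
class Box:
    name: str
    requires: frozenset   # pre-condition: items needed on input
    preserves: frozenset  # post-condition: items replicated on output

def check_chain(boxes, available=FULL):
    """Return the name of the first box whose pre-condition fails,
    or None if every box can be hooked up to the one before it."""
    for box in boxes:
        if not box.requires <= available:
            return box.name
        available = available & box.preserves
    return None

stripper = Box("stripper",
               requires=frozenset({"elements"}),
               preserves=frozenset({"elements", "attributes"}))
indexer = Box("comment_indexer",
              requires=frozenset({"comments"}),
              preserves=FULL)

bad = check_chain([stripper, indexer])   # comments are lost too early
good = check_chain([indexer, stripper])  # same boxes, workable order
```

The point of the sketch is only that hook-up errors become detectable
before any document is processed, given some agreed way of naming what
a component preserves.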




Copyright 2001 XML.org. This site is hosted by OASIS