On Sat, 07 Dec 2002 08:02:03 +0000, Sean McGrath <sean.mcgrath@propylon.com>
wrote:
> I've given a lot of thought recently to what it is about data typing
> in XML and Binary XML that makes me so nervous. What follows is my
> most concerted attempt at articulating what causes me to be
> so nervous and a suggestion for how we might proceed.
Sheesh, you're insightful. You ought to write a book :-)
> Simply put, there is nothing wrong with Binary XML within the confines
> of an application. It is a very useful optimization which can and
> should be treated as a "compiler". You would never throw away your
> source code having passed it through a compiler. The same should be
> the case with your XML. It is the portable representation of your data
> just like the source files are the portable version of your machine
> code.
I think that's a great analogy. Alternative syntaxes for "XML infoset
serialization" (not to be confused with "XML", of course) would exist in
addition to the XML source code, for convenience or efficiency.
They wouldn't in any sense replace the source code; "losing the source
code" would be a disaster, and failing to provide the source code
on demand would be a horrible breach of interop etiquette.
>
> If they end
> up using strongly typed "compiled" XML to get around this, they will have
> tightly bound their XML to their process which is a bad thing.
I agree with the "typed" bit, and I completely agree that tightly binding
application-specific datatypes to shared data puts us back in the Bad Old
Days.
> Standardized marshallings of XML (XML infoset compilers) for Java, .NET, etc.
> need to be done so that the notion of binary XML is both catered for
> and COMPREHENSIVELY RELEGATED to the realm of "compiled"
> output. Something you just use for optimization reasons but NEVER use
> as primary storage for your data.
Hmm, I have some minor quibbles, or perhaps need clarification. I do see
some *potential* reasons for a standardized "efficient"
Infoset interchange format that doesn't include platform-specific binary
formats or application-specific datatypes. To extend the source code /
object code analogy, it might be thought of as P-code. It would simply
serialize an infoset in a way that makes it significantly easier to parse
than the UnicodeWithAngleBrackets we know and love.
Basically, one might start from the XPath data model (or some successor)
and come up with a vendor/platform/language-neutral format that serializes
it in a way that is faster to parse, based on actual experience of where
XML parsers spend their time.
I don't have any solutions to propose, but some problems that such a P-code
might solve could include:
- The inefficiency of resolving namespaces.
- Normalizing Unicode characters.
- Resolving entities, CDATA sections, and other syntax sugar. (I'm not
  sure what to do about < > & but I suspect that there are creative
  solutions possible.)
- Buffer management, string rewriting, object creation. I know that these
  are significant bottlenecks in most parsers. I don't know offhand how a
  serialization format could help break them, but I know that this is a
  big reason that people who write high-performance SOAP processors drew
  a line in the sand and refused to allow DTDs and all the cruft they
  bring along into SOAP messages.
Maybe a P-code that pre-resolves the "cruft" might be parsable
significantly faster than XML can be parsed.
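To make the hand-waving a bit more concrete, here is a minimal Python sketch
of what I mean by "pre-resolving the cruft". Everything in it is an
assumption made up for illustration: the token codes, the length-prefixed
layout, and the names to_pcode and from_pcode come from no existing format.
The writer leans on the standard xml.etree.ElementTree parser to resolve
namespaces, entities and CDATA, and normalizes strings to NFC, so that the
reader only has to copy length-prefixed strings. Comments and processing
instructions are silently dropped, so this is nowhere near a full infoset.

```python
import io
import struct
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical one-byte token codes; not taken from any real format.
START, TEXT, END = 1, 2, 3


def _put(out, s):
    """Write a string as NFC-normalized UTF-8 behind a 4-byte length."""
    data = unicodedata.normalize("NFC", s).encode("utf-8")
    out.write(struct.pack(">I", len(data)))
    out.write(data)


def _get(buf):
    """Read one length-prefixed UTF-8 string."""
    (n,) = struct.unpack(">I", buf.read(4))
    return buf.read(n).decode("utf-8")


def _encode(elem, out):
    out.write(bytes([START]))
    _put(out, elem.tag)  # ElementTree already expanded this to "{uri}local"
    out.write(struct.pack(">H", len(elem.attrib)))
    for name, value in elem.attrib.items():
        _put(out, name)
        _put(out, value)
    if elem.text:  # entities and CDATA are already plain text here
        out.write(bytes([TEXT]))
        _put(out, elem.text)
    for child in elem:
        _encode(child, out)
        if child.tail:
            out.write(bytes([TEXT]))
            _put(out, child.tail)
    out.write(bytes([END]))


def to_pcode(xml_text):
    """'Compile' XML source text into the toy token stream."""
    out = io.BytesIO()
    _encode(ET.fromstring(xml_text), out)
    return out.getvalue()


def from_pcode(data):
    """Yield ("start", name, attrs), ("text", s) and ("end",) events."""
    buf = io.BytesIO(data)
    while True:
        head = buf.read(1)
        if not head:
            return
        code = head[0]
        if code == START:
            name = _get(buf)
            (count,) = struct.unpack(">H", buf.read(2))
            attrs = {}
            for _ in range(count):
                key = _get(buf)
                attrs[key] = _get(buf)
            yield ("start", name, attrs)
        elif code == TEXT:
            yield ("text", _get(buf))
        elif code == END:
            yield ("end",)
        else:
            raise ValueError("unknown token code %d" % code)
```

Note that the reader never scans for angle brackets, looks up a prefix, or
expands an entity. Whether that buys a real speedup is exactly the kind of
thing that would have to be measured rather than assumed.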
Sorry to go on so long with this possibly ill-founded brainstorm, but this
is the kind of thing I think many advocates of "binary" XML are talking
about, and it is not necessarily tied up with application-specific
datatypes or platform-specific numeric formats.
>
> I suggest we make one core twist to XML. Let's express the various layers
> to XML parsing in terms of a pipeline and see if it can help
> us accommodate the data typing folk, the binary XML folk etc. without
> throwing out the baby with the bathwater.
Yes! I think that the "pipeline processing" metaphor could provide a Gestalt
shift that turns more of the stuff that seems to divide the XML community
into a common set of tools/operations; the different communities could
use different subsets of these, or combine them in different ways, without
getting in each other's way.
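As a rough illustration of the pipeline idea, here is a small Python sketch.
The event tuples, the stage names (drop_whitespace, type_integers) and the
pipeline helper are all hypothetical, invented for this example and not
drawn from any existing toolkit. Each stage is a generator that consumes a
stream of parse events and yields a possibly modified stream, so a community
that wants data typing adds a stage and everyone else leaves it out.

```python
from functools import reduce


def pipeline(*stages):
    """Compose stages left to right into one events -> events function."""
    def run(events):
        return reduce(lambda stream, stage: stage(stream), stages, events)
    return run


def drop_whitespace(events):
    """Stage: discard text events that contain only whitespace."""
    for event in events:
        if event[0] == "text" and not event[1].strip():
            continue
        yield event


def type_integers(events):
    """Optional stage: turn text that parses as an integer into an int."""
    for event in events:
        if event[0] == "text":
            try:
                yield ("text", int(event[1]))
                continue
            except ValueError:
                pass  # not an integer; pass the text through untouched
        yield event


# Two communities, two pipelines, one shared set of stages.
untyped = pipeline(drop_whitespace)
typed = pipeline(drop_whitespace, type_integers)

source = [("start", "qty"), ("text", "  "), ("text", "42"), ("end", "qty")]
untyped_result = list(untyped(source))
typed_result = list(typed(source))
```

The untyped pipeline hands "42" through as a string, while the typed one
delivers the integer 42; neither had to change the source events or the
other's stages.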