Apologies for restarting this thread. I've just returned from my
vacation, and I'm working my way through a lot of e-mail that built
up. Having read this entire thread now, there's one issue I noticed
that's been feinted at a couple of times, but nobody seems to have
taken it head-on. So please allow me to do that now.
One of the goals of some of the developers pushing binary XML is to
speed up parsing, to provide some sort of preparsed format that is
quicker to parse than real XML. I am extremely skeptical that this
can be achieved in a platform-independent fashion. Possibly some of
the ideas for writing length codes into the data might help, though I
doubt they help that much, or are robust in the face of data that
violates the length codes. Nonetheless this is at least plausible.
However, this is not the primary preparsing of XML I've seen in
existing schemes. A much more common approach assigns types to the
data and then writes the data into the file as a binary value that
can be directly copied to memory. For example, an integer might be
written as a four-byte big-endian int. A floating point number might
be written as an eight-byte IEEE-754 double, and so forth. This might
speed up things a little in a few cases. However, it's really only
going to help on those platforms where the native types match the
binary formats. On platforms with varying native binary types, it
might well be slower than performing string conversions.
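To make the comparison concrete, here is a minimal sketch (the class and method names are mine, not from any binary XML proposal) of the two routes to the same integer: reassembling a four-byte big-endian value from the wire, versus parsing its decimal text form. Note that even the "direct copy" route still does per-byte work on any platform whose native layout differs from the wire format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BinaryVsText {

    // Decode a four-byte big-endian integer, as a typed binary
    // format might store it on the wire.
    static int fromBigEndian(byte[] b) {
        return ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        byte[] wire = {0x00, 0x00, 0x04, (byte) 0xD2}; // 1234, big-endian

        int binary = fromBigEndian(wire);       // binary route
        int text = Integer.parseInt("1234");    // XML 1.0 text route

        System.out.println(binary == text);     // prints "true"
    }
}
```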
Unicode decoding is a related issue. It's been suggested that this is
a bottleneck in existing parsers, and that directly encoding Unicode
characters instead of UTF code points might help. However, since in a
binary format you're shipping around bytes, not characters, it's not
clear to me how this encoding would be any more efficient than
existing encodings such as UTF-8 and UTF-16. If you just want 32-bit
characters then use UTF-32. Possibly you could gain some speed by
slamming bytes into the native string or wstring type (UTF-16 for
Java, possibly other encodings for other languages). However, as with
numeric types this would be very closely tied to the specific
language. What worked well for Java might not work well for C or Perl
and vice versa.
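A small illustration of why "direct" character encoding buys little: the same four-character string costs different amounts of space in each Unicode encoding form, and the fixed-width forms cost more, not less (class and method names here are mine, for illustration only).

```java
import java.nio.charset.Charset;

public class EncodingSizes {

    // Serialized length of a string in a named Unicode encoding form.
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // "café": four characters, é is U+00E9

        System.out.println(byteLength(s, "UTF-8"));    // 5 bytes
        System.out.println(byteLength(s, "UTF-16BE")); // 8 bytes
        System.out.println(byteLength(s, "UTF-32BE")); // 16 bytes
    }
}
```

Shipping fixed-width 32-bit characters quadruples the byte count for ASCII-heavy markup, so any decoding speedup has to be weighed against the extra I/O.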
Nonetheless it should be doable. A Java parser that worked directly
on UTF-16 code units, without fully decoding characters, could
certainly be implemented. Verifying the well-formedness of surrogate
pairs might be more expensive, but is rarely needed in practice. I
think this could be fully implemented within the bounds of XML 1.0. I
don't see why a new serialization format would be necessary to remove
this bottleneck from the process.
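The surrogate check such a parser would need is itself cheap: a single pass over the UTF-16 code units, pairing each high surrogate with a following low surrogate. A sketch (class and method names are mine, not from any existing parser):

```java
public class SurrogateCheck {

    // Scan a UTF-16 char array, as a Java parser would see it, and
    // report whether every surrogate code unit is part of a
    // well-formed high/low pair.
    static boolean surrogatesWellFormed(char[] units) {
        for (int i = 0; i < units.length; i++) {
            char c = units[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= units.length
                        || !Character.isLowSurrogate(units[i + 1])) {
                    return false; // high surrogate with no low half
                }
                i++; // skip the low half of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(surrogatesWellFormed("plain ASCII".toCharArray()));  // true
        System.out.println(surrogatesWellFormed("\uD834\uDD1E".toCharArray())); // true: U+1D11E as a pair
        System.out.println(surrogatesWellFormed(new char[]{'\uD834'}));         // false: lone high surrogate
    }
}
```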
In summary, I am very skeptical that any preparsed format which
accepts schema-invalid documents is going to offer significant
speedups across different platforms and languages. I do not accept as
an axiom that binary formats are naturally faster to parse than text
formats. Possibly this can be proved by experiment, but I tend to
doubt it.
--
Elliotte Rusty Harold
elharo@metalab.unc.edu
Processing XML with Java (Addison-Wesley, 2002)
http://www.cafeconleche.org/books/xmljava
http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA