RE: [xml-dev] No XML Binaries? Buy Hardware
- From: noah_mendelsohn@us.ibm.com
- To: "Michael Kay" <mike@saxonica.com>
- Date: Fri, 23 Feb 2007 11:02:10 -0500
Some readers of this thread may be interested in the paper my research
group published last year at WWW 2006, titled "XML Screamer: An
Integrated Approach to High Performance XML Parsing, Validation and
Deserialization" (the full paper is available online at [1]). It makes
the case that certain factors should be considered *quantitatively*
when making statements about which technologies are fast or slow.
For example, if CPU time is the primary concern (as opposed to, say, size
and/or transmission time), then you really need to ask yourself questions
like: how many CPU instructions per input byte does the implementation I
have in mind execute, and is that number in some sense reasonable? What
we found in the case of XML was that a lot of people were running around
making statements like "regular text-based XML is too slow for
application X", and when you asked them "what parser are you using, and
with what API?" they might answer: Xerces with SAX feeding some Web
Services deserializer. Well, when you look at what a stack like that is
doing, the answer is that it's executing hundreds of instructions per
input byte, on average. Ask why, and you find that some of that overhead
is inherent in what seem to us the best possible approaches (for example,
it seems essential to do at least some form of comparison on each input
byte if you are to check well-formedness), but much of it comes from
things like UTF-8 to UTF-16 conversion of tag names, many of which are
then string-compared again (in their longer UTF-16 form!) after SAX hands
them up to the application or deserializer. With better APIs, you can
extract the necessary information from text XML much, much faster. On the
other hand, for other applications you may really need SAX or DOM. The
point is to measure both binary and text against the particular
applications of interest, using APIs representative of what you'd deploy,
and to optimize each in that context.
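
To make the instruction-count point concrete, here is a minimal C sketch
(illustrative only, not code from our paper; the tag name and the
ASCII-only "conversion" are simplifications) contrasting the two code
paths: comparing a tag name directly against the UTF-8 bytes already
sitting in the input buffer, versus widening those bytes to UTF-16 first
and comparing the widened form.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Path 1: compare the tag name against the raw UTF-8 bytes in the input
 * buffer -- a length check and one memcmp over bytes we already have. */
static int tag_is_price_utf8(const char *buf, size_t len)
{
    static const char name[] = "price";
    return len == sizeof(name) - 1 && memcmp(buf, name, len) == 0;
}

/* Path 2: widen the bytes to UTF-16 first (treating them as ASCII for
 * brevity; real UTF-8 to UTF-16 conversion is more work still), then
 * compare the widened form.  Twice the bytes touched, plus a conversion
 * loop, to reach the same answer. */
static int tag_is_price_utf16(const char *buf, size_t len)
{
    static const uint16_t name[] = { 'p', 'r', 'i', 'c', 'e' };
    uint16_t wide[64];
    if (len > 64)
        return 0;
    for (size_t i = 0; i < len; i++)
        wide[i] = (uint16_t)(unsigned char)buf[i];   /* conversion cost */
    return len == 5 && memcmp(wide, name, sizeof(name)) == 0;
}

The second path is roughly the extra per-name work a convert-then-compare
pipeline ends up doing.
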
The point of the above example was that it was not just XML itself that
was causing the overhead: it was XML along with the particular choice of
APIs and processing layers, perhaps in some cases aggravated by
implementations that just weren't as careful as they might have been.
Indeed, one of the main reasons that Expat is faster, though still not as
fast as we managed to go in our experiments, is that it passes strings
around in the native encoding of the input document. Also, I've got
nothing against SAX: it's fine as a standard for interoperation at medium
speed. The fact that it's so much faster than most DOMs has led to the
misapprehension that it's not a performance bottleneck relative to what
XML can do. In many contexts, it is a bottleneck.
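
As an illustration of that API point, here is a toy C sketch (not Expat's
real interface, and not a real XML parser): the scanner reports each
start-tag name as a pointer and length into the original buffer, so the
application can compare or hash the bytes in place, in the document's
native encoding, with no copy and no re-encoding.

#include <stdio.h>
#include <string.h>

/* Hypothetical zero-copy callback: each start-tag name arrives as a
 * (pointer, length) pair into the original input buffer. */
typedef void (*start_tag_cb)(const char *name, size_t len, void *user);

/* Toy scanner standing in for a parser: find start tags and hand the
 * in-buffer bytes straight to the callback. */
static void scan(const char *doc, start_tag_cb cb, void *user)
{
    for (const char *p = doc; (p = strchr(p, '<')) != NULL; p++) {
        if (p[1] == '/' || p[1] == '?')
            continue;                        /* skip end tags and PIs */
        const char *start = p + 1;
        size_t len = strcspn(start, " \t\r\n/>");
        cb(start, len, user);
    }
}

static void on_start_tag(const char *name, size_t len, void *user)
{
    (void)user;
    /* Compare against the raw bytes in place -- no copy, no conversion. */
    if (len == 5 && memcmp(name, "price", 5) == 0)
        puts("saw a price element");
}

int main(void)
{
    scan("<order><price>42</price></order>", on_start_tag, NULL);
    return 0;
}
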
Am I saying Binary XML is a bad idea? Not at all, though I've said that
I'm unconvinced that standardizing a single binary form of XML is the
right thing to do. I am saying that there's a lot of misinformation out
there about what really leads to good or bad performance either for
regular text XML or for particular binary flavors. You can take the
"best" (by whatever metric) binary XML in the world, force your
application through a sub-optimal API, and your performance may be
limited. You can easily obscure the true differences between the
approaches.
Actually, I believe that a careful, quantitative analysis will show that
particular binary forms are indeed much faster for certain applications,
especially if the APIs are tuned right. There's no question that, for
example, checking end tags is slower than not having to check end tags.
The fact that alignment in XML is variable tends to slow things down
relative to formats in which counts are sent as naturally aligned
integers (especially if you luck out and the sender and receiver agree on
byte order). That's because almost every modern processor is much faster
at loading an aligned number than at working through unaligned
characters; it has to do with how the memory and cache hierarchies are
built. It's also true that binary formats, even those that aren't
schema-aware, tend to be able to use string pools and string handles:
comparing integer handles is almost always much faster than doing string
compares. The sad fact is that most XML implementations, and for that
matter many binary XML implementations, are so suboptimal at this point
that those factors are hidden by other unnecessary overhead. The
resulting comparisons between XML and binary are noisy at best.
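
To illustrate the string-handle point, here is a generic interning sketch
in C (not the mechanism of any particular binary format; a real
implementation would hash rather than scan): once every distinct name has
been given a small integer handle, an equality test is a single integer
compare instead of a byte-by-byte string compare.

#include <stdio.h>
#include <string.h>

/* Minimal string pool: each distinct name is stored once and identified
 * by its index.  Linear search keeps the sketch short. */
#define POOL_MAX 256
static const char *pool[POOL_MAX];
static int pool_len;

static int intern(const char *s)
{
    for (int i = 0; i < pool_len; i++)
        if (strcmp(pool[i], s) == 0)
            return i;                 /* name already has a handle */
    if (pool_len == POOL_MAX)
        return -1;                    /* pool full */
    pool[pool_len] = s;               /* assumes s outlives the pool */
    return pool_len++;
}

int main(void)
{
    int h_price  = intern("price");
    int h_amount = intern("amount");
    int h_again  = intern("price");

    /* One integer compare replaces a byte-by-byte string compare. */
    printf("price == price?  %d\n", h_price == h_again);    /* prints 1 */
    printf("price == amount? %d\n", h_price == h_amount);   /* prints 0 */
    return 0;
}
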
Now, whether the true extra overhead of text XML is really significant
after you finish optimizing it well is a different question. Deploying a
good binary XML implementation onto lots of platforms will take lots of
work. Tuning XML implementations super-well will take lots of work. When
you're done, I do believe the binary will be somewhat, and occasionally
dramatically, faster for many purposes. Whether the difference will be
significant, given the overhead in the rest of the application,
particular choices of API, and so on, will depend on your application. I think
the answer will be "yes" in selected important applications, and "no" in
many others.
The main point of this note is to suggest that these questions need to be
considered quantitatively, and with the sort of low-level tests and
benchmarks that let you account for the instructions your processor is
executing. I'm somewhat tired of hearing about Java implementations of
XML (or binary) that are slow, but for which nobody can say whether the
JIT is doing a good job of inlining. In such cases, you don't know
whether you're measuring XML or a deficient Java optimizer. You may not
even know whether your JIT is doing the same job on both technologies,
because optimizers are notoriously sensitive to the details of particular
applications. To really know what you've got, you have to get into the
running code and see what machine code the JIT has produced (we actually
did that, but found it such a pain that we publicly reported mainly our
C-language results, for which checking the machine code is much easier).
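
For readers who want to do that kind of accounting themselves, one way on
Linux (sketched below; not necessarily the tooling we used) is to read
the hardware instruction counter around the code under test via
perf_event_open and divide by the number of input bytes. The 1 MiB buffer
and the byte-scan loop are placeholders for a real document and a real
parser call.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Count retired user-mode instructions around a piece of work and report
 * instructions per input byte (Linux only). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    static char buf[1 << 20];                 /* placeholder "XML" input */
    memset(buf, 'a', sizeof(buf));

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_HARDWARE;
    attr.size           = sizeof(attr);
    attr.config         = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled       = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv     = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* --- work under test: replace this loop with the real parser call --- */
    volatile unsigned long checksum = 0;
    for (size_t i = 0; i < sizeof(buf); i++)
        checksum += (unsigned char)buf[i];
    /* --------------------------------------------------------------------- */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t instructions = 0;
    if (read(fd, &instructions, sizeof(instructions)) != (ssize_t)sizeof(instructions))
        return 1;
    close(fd);

    printf("%.1f instructions per input byte (checksum %lu)\n",
           (double)instructions / sizeof(buf), (unsigned long)checksum);
    return 0;
}
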
Anyway, I hope the paper is of interest. We had fun doing the work.
Noah
[1] http://www2006.org/programme/item.php?id=5011
--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------