RE: [xml-dev] No XML Binaries? Buy Hardware
- From: noah_mendelsohn@us.ibm.com
- To: "Michael Kay" <mike@saxonica.com>
- Date: Fri, 23 Feb 2007 11:02:10 -0500
Some readers of this thread may be interested in the paper my research
group published last year at WWW 2006, titled "XML Screamer: An
Integrated Approach to High Performance XML Parsing, Validation and
Deserialization" (the full paper is available online at [1]). It makes
the case that certain factors should be considered *quantitatively*
when making statements about which technologies are fast or slow.
For example, if CPU time is the primary concern (as opposed to, say, size
and/or transmission time), then you really need to ask yourself questions
like: how many CPU instructions per input byte does the implementation I
have in mind execute, and is that number in some sense reasonable? What
we found in the case of XML was that a lot of people were running around
making statements like "regular text-based XML is too slow for
application X", and when you asked them "what parser are you using, and
with what API?" they might answer: Xerces with SAX feeding some Web
Services deserializer. Well, when you look at what a stack like that is
doing, the answer is that it's executing hundreds of instructions per
input byte, on average. Ask why, and you find that some of that overhead
is inherent in what seem to us the best possible approaches (for example,
it seems essential to do at least some form of comparison on each input
byte if you are to check well-formedness), but much of it comes from
things like UTF-8 to UTF-16 conversion of tag names, many of which are
then string-compared again (in their longer UTF-16 form!) after SAX hands
them up to the application or deserializer. With better APIs, you can
extract the necessary information from text XML much, much faster. On the
other hand, for other applications you may really need SAX or DOM. The
point is to measure both binary and text against the particular
applications of interest, using APIs representative of what you'd deploy,
and to optimize each in that context.
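
To make the instruction-count point concrete, here is a minimal C sketch
(illustrative only, not code from our paper; the tag name and the
ASCII-only "conversion" are simplifications) contrasting the two code
paths: comparing a tag name directly against the UTF-8 bytes already
sitting in the input buffer, versus widening those bytes to UTF-16 first
and comparing the widened form.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Path 1: compare the tag name against the raw UTF-8 bytes in the input
 * buffer -- a length check and one memcmp over bytes we already have. */
static int tag_is_price_utf8(const char *buf, size_t len)
{
    static const char name[] = "price";
    return len == sizeof(name) - 1 && memcmp(buf, name, len) == 0;
}

/* Path 2: widen the bytes to UTF-16 first (treating them as ASCII for
 * brevity; real UTF-8 to UTF-16 conversion is more work still), then
 * compare the widened form.  Twice the bytes touched, plus a conversion
 * loop, to reach the same answer. */
static int tag_is_price_utf16(const char *buf, size_t len)
{
    static const uint16_t name[] = { 'p', 'r', 'i', 'c', 'e' };
    uint16_t wide[64];
    if (len > 64)
        return 0;
    for (size_t i = 0; i < len; i++)
        wide[i] = (uint16_t)(unsigned char)buf[i];   /* conversion cost */
    return len == 5 && memcmp(wide, name, sizeof(name)) == 0;
}

The second path is roughly the extra per-name work a convert-then-compare
pipeline ends up doing.
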
The point of the above example was that it was not just XML itself that
was causing the overhead: it was XML along with the particular choice of
APIs and processing layers, perhaps in some cases aggravated by
implementations that just weren't as careful as they might have been.
Indeed, one of the main reasons that Expat is faster, though still not as
fast as we managed to go in our experiments, is that it passes strings
around in the native encoding of the input document. Also, I've got
nothing against SAX: it's fine as a standard for interoperation at medium
speed. The fact that it's so much faster than most DOMs has led to the
misapprehension that it's not a performance bottleneck relative to what
XML can do. In many contexts, it is a bottleneck.
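
As an illustration of that API point, here is a toy C sketch (not Expat's
real interface, and not a real XML parser): the scanner reports each
start-tag name as a pointer and length into the original buffer, so the
application can compare or hash the bytes in place, in the document's
native encoding, with no copy and no re-encoding.

#include <stdio.h>
#include <string.h>

/* Hypothetical zero-copy callback: each start-tag name arrives as a
 * (pointer, length) pair into the original input buffer. */
typedef void (*start_tag_cb)(const char *name, size_t len, void *user);

/* Toy scanner standing in for a parser: find start tags and hand the
 * in-buffer bytes straight to the callback. */
static void scan(const char *doc, start_tag_cb cb, void *user)
{
    for (const char *p = doc; (p = strchr(p, '<')) != NULL; p++) {
        if (p[1] == '/' || p[1] == '?')
            continue;                        /* skip end tags and PIs */
        const char *start = p + 1;
        size_t len = strcspn(start, " \t\r\n/>");
        cb(start, len, user);
    }
}

static void on_start_tag(const char *name, size_t len, void *user)
{
    (void)user;
    /* Compare against the raw bytes in place -- no copy, no conversion. */
    if (len == 5 && memcmp(name, "price", 5) == 0)
        puts("saw a price element");
}

int main(void)
{
    scan("<order><price>42</price></order>", on_start_tag, NULL);
    return 0;
}
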
Am I saying Binary XML is a bad idea? Not at all, though I've said that
I'm unconvinced that standardizing a single binary form of XML is the
right thing to do. I am saying that there's a lot of misinformation out
there about what really leads to good or bad performance either for
regular text XML or for particular binary flavors. You can take the
"best" (by whatever metric) binary XML in the world, force your
application through a sub-optimal API, and your performance may be
limited. You can easily obscure the true differences between the
approaches.
Actually, I believe that a careful, quantitative analysis will show that
particular binary forms are indeed much faster for certain applications,
especially if the APIs are tuned right. There's no question that, for
example, checking end tags is slower than not having to check end tags.
The fact that alignment in XML is variable tends to slow things down
relative to formats in which counts are sent as naturally aligned
integers (especially if you luck out and the sender and receiver agree on
byte order). That's because almost every modern processor is much faster
at loading an aligned number than at working through unaligned
characters; it has to do with how the memory and cache hierarchies are
built. It's also true that binary formats, even those that aren't
schema-aware, tend to be able to use string pools and string handles:
comparing integer handles is almost always much faster than doing string
compares. The sad fact is that most XML implementations, and for that
matter many binary XML implementations, are so suboptimal at this point
that those factors are hidden by other unnecessary overhead. The
resulting comparisons between XML and binary are noisy at best.
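
To illustrate the string-handle point, here is a generic interning sketch
in C (not the mechanism of any particular binary format; a real
implementation would hash rather than scan): once every distinct name has
been given a small integer handle, an equality test is a single integer
compare instead of a byte-by-byte string compare.

#include <stdio.h>
#include <string.h>

/* Minimal string pool: each distinct name is stored once and identified
 * by its index.  Linear search keeps the sketch short. */
#define POOL_MAX 256
static const char *pool[POOL_MAX];
static int pool_len;

static int intern(const char *s)
{
    for (int i = 0; i < pool_len; i++)
        if (strcmp(pool[i], s) == 0)
            return i;                 /* name already has a handle */
    if (pool_len == POOL_MAX)
        return -1;                    /* pool full */
    pool[pool_len] = s;               /* assumes s outlives the pool */
    return pool_len++;
}

int main(void)
{
    int h_price  = intern("price");
    int h_amount = intern("amount");
    int h_again  = intern("price");

    /* One integer compare replaces a byte-by-byte string compare. */
    printf("price == price?  %d\n", h_price == h_again);    /* prints 1 */
    printf("price == amount? %d\n", h_price == h_amount);   /* prints 0 */
    return 0;
}
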
Now, whether the true extra overhead of text XML is really significant
after you finish optimizing it well is a different question. Deploying a
good binary XML implementation onto lots of platforms will take lots of
work. Tuning XML implementations super-well will take lots of work. When
you're done, I do believe the binary will be somewhat, and occasionally
dramatically, faster for many purposes. Whether the difference will be
significant, given the overhead in the rest of the application,
particular choices of API, and so on, will depend on your application. I think
the answer will be "yes" in selected important applications, and "no" in
many others.
The main point of this note is to suggest that these questions need to be
considered quantitatively, and with the sort of low-level tests and
benchmarks that let you account for the instructions your processor is
executing. I'm somewhat tired of hearing about Java implementations of
XML (or binary) that are slow, but for which nobody can say whether the
JIT is doing a good job of inlining. In such cases, you don't know
whether you're measuring XML or a deficient Java optimizer. You may not
even know whether your JIT is doing the same job on both technologies,
because optimizers are notoriously sensitive to the details of particular
applications. To really know what you've got, you have to get into the
running code and see what machine code the JIT has produced (we actually
did that, but found it such a pain that we publicly reported mainly our
C-language results, for which checking the machine code is much easier).
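
For readers who want to do that kind of accounting themselves, one way on
Linux (sketched below; not necessarily the tooling we used) is to read
the hardware instruction counter around the code under test via
perf_event_open and divide by the number of input bytes. The 1 MiB buffer
and the byte-scan loop are placeholders for a real document and a real
parser call.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Count retired user-mode instructions around a piece of work and report
 * instructions per input byte (Linux only). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    static char buf[1 << 20];                 /* placeholder "XML" input */
    memset(buf, 'a', sizeof(buf));

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_HARDWARE;
    attr.size           = sizeof(attr);
    attr.config         = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled       = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv     = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* --- work under test: replace this loop with the real parser call --- */
    volatile unsigned long checksum = 0;
    for (size_t i = 0; i < sizeof(buf); i++)
        checksum += (unsigned char)buf[i];
    /* --------------------------------------------------------------------- */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t instructions = 0;
    if (read(fd, &instructions, sizeof(instructions)) != (ssize_t)sizeof(instructions))
        return 1;
    close(fd);

    printf("%.1f instructions per input byte (checksum %lu)\n",
           (double)instructions / sizeof(buf), (unsigned long)checksum);
    return 0;
}
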
Anyway, I hope the paper is of interest. We had fun doing the work.
Noah
[1] http://www2006.org/programme/item.php?id=5011
--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------