Re: [xml-dev] No XML Binaries? Buy Hardware

From: "derek denny-brown" <zuligag@gmail.com>
To: "noah_mendelsohn@us.ibm.com" <noah_mendelsohn@us.ibm.com>
Date: Fri, 23 Feb 2007 10:44:16 -0800
Noah's comments remind me of a test I did a while back.  I had a toy
object serializer that I was playing with and was doing some
performance measurements.  I wish I still had the code, but I found
that a surprisingly large portion of the runtime was string compares.
The System.Xml XmlTextReader already atomizes name strings, so I
added integer indices, and added local-name and namespace-uri
accessors that returned these indices.  It complicated the hydrator
code, but it sped it up noticeably.
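
As an illustration only (the original .NET code is lost), here is a
minimal Python sketch of the idea: names are atomized into small
integer indices once, and the hydrator then dispatches on integer
compares instead of string compares.  NameTable, hydrate, atomize and
the "urn:example:order" vocabulary are all hypothetical names invented
for this sketch, not part of System.Xml.

```python
class NameTable:
    """Hypothetical atomizer: maps each distinct string to a small integer."""

    def __init__(self):
        self._index_of = {}   # string -> index
        self._strings = []    # index -> string

    def add(self, s):
        """Return the index for s, assigning a new one on first sight."""
        idx = self._index_of.get(s)
        if idx is None:
            idx = len(self._strings)
            self._index_of[s] = idx
            self._strings.append(s)
        return idx

    def get(self, idx):
        """Return the string that was atomized to idx."""
        return self._strings[idx]


names = NameTable()

# The hydrator registers the names it cares about once, up front.
NS_ORDER = names.add("urn:example:order")
LN_ID = names.add("id")
LN_QTY = names.add("quantity")


def atomize(raw_events):
    """Stand-in for the reader: turn name strings into indices once."""
    return [(names.add(ns), names.add(ln), text) for ns, ln, text in raw_events]


def hydrate(events):
    """Build an object from (namespace_index, local_name_index, text) tuples.

    Every name test below is an integer comparison, not a string compare.
    """
    order = {}
    for ns_idx, ln_idx, text in events:
        if ns_idx != NS_ORDER:
            continue  # element from a namespace we do not handle
        if ln_idx == LN_ID:
            order["id"] = text
        elif ln_idx == LN_QTY:
            order["quantity"] = int(text)
    return order


raw = [
    ("urn:example:order", "id", "A17"),
    ("urn:example:order", "quantity", "3"),
    ("urn:example:other", "id", "ignored"),
]
result = hydrate(atomize(raw))
```

The sketch shows the structure, not the speedup: in Python the
difference is small, whereas in the measurements described above the
saving came from avoiding repeated string compares in the hydrator.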

Another common activity was parsing an element/attribute value as an
int. I added a fast-path for parsing element content as an integer.
The theory was to bail out to a simpler path if the value contained
character/entity references or other markup.  It also provided
noticeable improvements when the hydrator was parsing lots of
integers.  Attributes are harder because they are parsed and normalized
before the client code has a chance to indicate how to parse the
values, so I punted on that aspect.
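
A rough Python sketch of that kind of fast path, under stated
assumptions: the raw bytes of the element content are available before
any decoding, and anything other than an optional sign followed by
ASCII digits triggers the bail-out.  The function names are
hypothetical, and html.unescape is only a convenient stand-in for real
XML reference resolution; a real fallback would also have to cope with
CDATA sections, comments and entity declarations.

```python
import html


def parse_int_content(raw):
    """Parse element content (bytes up to the end tag) as an integer.

    Fast path: optional sign followed by ASCII digits only, converted
    directly from the bytes without building an intermediate string.
    """
    n = len(raw)
    i = 0
    negative = False
    if i < n and raw[i] in b"+-":
        negative = raw[i] == ord("-")
        i += 1
    if i == n:
        return _general_path(raw)  # empty content or a lone sign
    value = 0
    while i < n:
        digit = raw[i] - ord("0")
        if digit < 0 or digit > 9:
            # '&', '<', whitespace or anything unexpected: bail out.
            return _general_path(raw)
        value = value * 10 + digit
        i += 1
    return -value if negative else value


def _general_path(raw):
    """Fallback: decode, resolve character references, then parse."""
    text = html.unescape(raw.decode("utf-8"))
    return int(text.strip())  # raises ValueError if it is not an integer
```

For example, parse_int_content(b"42") stays on the fast path, while
parse_int_content(b"&#52;2") contains a character reference and goes
through the fallback; both return 42.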

I would love to see a StAX-like parsing API that provided such
conveniences.  I was envisioning being able to support scenarios like
that when we designed the ReadElementContentAs* methods added to
System.Xml.XmlReader in .NET V2.


On 2/23/07, noah_mendelsohn@us.ibm.com <noah_mendelsohn@us.ibm.com> wrote:
> Some readers of this thread may be interested in the paper my Research
> group published last year at XML 2006, titled:  "XML Screamer: An
> Integrated Approach to High Performance XML Parsing, Validation and
> Deserialization" (Full paper available online at [1]).  It makes the case
> that there are certain factors that should be considered >quantitatively<
> when making statements about which technologies are fast or slow.
>
> For example, if CPU time is the primary concern (as opposed to, say, size
> and/or transmission time), then you really need to ask yourself questions
> like:  how many CPU instructions per input byte are being executed by the
> implementation I have in mind, and is that number in some sense
> reasonable?    What we found in the case of XML was that a lot of people
> were running around making statements like "regular text based XML is too
> slow for application X", and then you'd ask them questions like, "what
> parser are you using and with what API?" They might answer:  Xerces with
> SAX feeding some Web Services deserializer.  Well, when you look at what
> a processor like that is doing, the answer is that it's executing hundreds
> of instructions, on average, per input byte.  Now you ask why, and you
> find out that some of that overhead is inherent in what seem to us the
> best possible approaches (e.g. it seems essential to do at least some form
> of comparison on each byte of input if you are to check well formedness),
> but much of the overhead comes from things like doing UTF-8 to UTF-16
> conversion of tags, many of which are just again string compared (in their
> long UTF-16 form!) after SAX hands them up to the application or
> deserializer.  With better APIs, you can extract the necessary information
> from text XML much, much faster.  On the other hand, for other
> applications you may really need SAX or DOM.  The point is to measure both
> binary and text against the particular applications of interest, and using
> APIs representative of what you'd deploy to optimize each in that context.
>
> The point of the above example was that it was not just XML itself that
> was causing the overhead:  it was XML along with the particular choice of
> APIs and processing layers, perhaps in some cases aggravated by
> implementations that just weren't as careful as they might have been.
> Indeed, one of the main reasons that Expat is faster, though still not as
> fast as we managed to go in our experiments, is that it passes strings
> around in the native encoding of the input document.  Also:  I've got
> nothing against SAX.   It's fine as a standard for interoperation at
> medium speed.  The fact that it's so much faster than most DOMs has led
> to the misapprehension that it's not a performance bottleneck relative to
> what XML can do.  In many contexts, it is a bottleneck.
>
> Am I saying Binary XML is a bad idea? Not at all, though I've said that
> I'm unconvinced that standardizing a single binary form of XML is the
> right thing to do.  I am saying that there's a lot of misinformation out
> there about what really leads to good or bad performance either for
> regular text XML or for particular binary flavors.  You can take the
> "best" (by whatever metric) binary XML in the world, force your
> application through a sub-optimal API, and your performance may be
> limited.  You can easily obscure the true differences between the
> approaches.
>
> Actually, I believe that a careful, quantitative analysis will show that
> particular binary forms are indeed much faster for certain applications,
> especially if the APIs are tuned right.  There's no question that, for
> example, checking end tags is slower than not having to check end tags.
> The fact that alignments in XML are variable tends to slow things relative
> to formats in which counts are sent as naturally aligned integers
> (especially if you luck out and sender and receiver agree on byte order.)
> That's because almost every modern processor is much faster at loading an
> aligned number than at working through unaligned characters.  It has to do
> with how the memory and cache hierarchies are built.  It's also true that
> binary formats, even those that aren't schema aware, tend to be able to
> use string pools and string handles:  comparing integer handles is almost
> always much faster than doing string compares.  The sad fact is that most
> XML implementations, and for that matter many binary XML implementations,
> are so suboptimal at this point that those factors are being hidden by
> other unnecessary overhead.  The resulting comparisons between XML and
> Binary are noisy at best.
>
> Now, whether the true extra overhead of text XML is really significant
> after you finish optimizing it well is a different question.  Deploying a
> good binary XML implementation onto lots of platforms will take lots of
> work.  Tuning XML implementations super-well will take lots of work.  When
> you're done, I do believe the binary will be somewhat, occasionally
> dramatically faster for many purposes.  Whether the difference will be
> significant given the overhead in the rest of the application, given
> particular choices of API, etc. will depend on your application.  I think
> the answer will be "yes" in selected important applications, and "no" in
> many others.
>
> The main point of this note is to suggest that these questions need to be
> considered quantitatively, and with the sort of low level tests and
> benchmarks that allow you to account for the instructions your processor
> is executing.  I'm somewhat tired of hearing about Java implementations of
> XML (or binary) that are slow, but for which nobody can say whether the
> JIT is doing a good job of inlining.  In such cases, you don't know
> whether you're measuring XML or a deficient Java optimizer.  You may not
> know whether your JIT is doing the same job on both technologies, because
> optimizers are notoriously sensitive to details of particular applications.
> To really know what you've got, you've got to get into the running code and
> see what machine code the JIT has produced (we actually did that, but we
> found it to be such a pain that we publicly reported mainly our C language
> results, for which checking the machine code is much easier.)
>
> Anyway, I hope the paper is of interest.  We had fun doing the work.
>
> Noah
>
> [1] http://www2006.org/programme/item.php?id=5011
>
> --------------------------------------
> Noah Mendelsohn
> IBM Corporation
> One Rogers Street
> Cambridge, MA 02142
> 1-617-693-4036
> --------------------------------------