- To: Michael Kay <mike@saxonica.com>, 'Rick Jelliffe' <rjelliffe@allette.com.au>, xml-dev@lists.xml.org
- Subject: RE: [xml-dev] XML Performance in a Transaction
- From: Tatu Saloranta <cowtowncoder@yahoo.com>
- Date: Thu, 23 Mar 2006 10:40:19 -0800 (PST)
--- Michael Kay <mike@saxonica.com> wrote:
> > My expectation is that XML parsing can be
> significantly sped up with ...
>
> I think that UTF-8 decoding is often the bottleneck
> and the obvious way to
> speed that up is to write the whole thing in
> assembler. I suspect the only
I think this depends highly on the content, and on the definition
of "bottleneck": for western European (ASCII/ISO-Latin-1) subsets,
the difference I have observed is 15-20% between equivalent (7-bit
ASCII) content declared as UTF-8 versus ISO-8859-1.
Since ISO-8859-1 decoding is trivially easy, the highest possible
speed-up would be in the 15-20% range (unless decoding and parsing
were tightly coupled -- an option I am planning to explore in the
future).
I would expect the overhead to be more significant for content with
a high ratio of non-ASCII characters, however.
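For illustration, here is a minimal sketch of the kind of comparison
I mean (class name, document content and size are all made up, and it
just uses the stock JAXP SAX parser): the payload bytes are identical
7-bit ASCII, so any timing difference is pure decoder overhead.

  import java.io.ByteArrayInputStream;
  import javax.xml.parsers.SAXParser;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.helpers.DefaultHandler;

  public class EncodingBench {
      public static void main(String[] args) throws Exception {
          // Same 7-bit ASCII payload; only the declared encoding differs.
          StringBuilder body = new StringBuilder("<root>");
          for (int i = 0; i < 100000; ++i) {
              body.append("<item>plain ascii text</item>");
          }
          body.append("</root>");
          SAXParserFactory factory = SAXParserFactory.newInstance();
          for (String enc : new String[] { "ISO-8859-1", "UTF-8" }) {
              byte[] doc = ("<?xml version='1.0' encoding='" + enc + "'?>"
                      + body).getBytes(enc);
              long start = System.nanoTime();
              for (int round = 0; round < 10; ++round) {
                  SAXParser parser = factory.newSAXParser();
                  parser.parse(new ByteArrayInputStream(doc),
                          new DefaultHandler());
              }
              System.out.println(enc + ": "
                      + ((System.nanoTime() - start) / 1000000) + " ms");
          }
      }
  }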
> way of getting a significant improvement (i.e. more
> than a doubling) in
> parser speed is to get closer to the hardware. I'm
> surprised no-one has done
> it. Perhaps no-one knows how to write assembler any
> more (or perhaps, like
> me, they just don't enjoy it).
I think the big reason is that the pay-off just does not seem THAT
high. For C/C++, hand-coded assembly seldom yields a particularly
good return (on commodity hardware); and even going from something
like Java to native code is just an incremental improvement (if
any), with associated drawbacks.
Besides, writing a truly compliant XML parser is tedious (and
extensive) work. ;-)
Writing specialized parsers for subsets (as in the case of what SOAP
requires) is easier; yet the performance boosts hopeful coders
promise seem elusive when one compares apples to apples.
The problem with XML parsing by hardware is that it Just Does Not
Pay Off: if you get, say, a 20% boost (usually sacrificing full XML
compatibility as well), but pay 20%+ overhead on memory transfer
from the card to main memory (after all, I/O is the major overhead
component of parsing nowadays), there is little point in going
through the trouble.
And this is exactly what happened with at least one vendor
(according to comments by an engineer who worked at one of these
companies): they started looking into more lucrative areas once
their "xml accelerator" lost any edge at the Linux driver level.
The thing is: performance improvements for XML will need to be found
above the tokenization/low-level parsing level. There is very little
left to gain at the raw parser level: raw throughput already falls
between 100 and 1000 Mbps switched Ethernet speeds (the 40 MBps rate
mentioned earlier equals roughly a 400 Mbps Ethernet bit-stream --
close to or above practical maximum transfer rates over gigabit
Ethernet). Yet at the higher processing level people quote figures
like 20 tps for SOAP (just one figure I recently saw attributed to
Axis 1.x; with 4 KB messages and replies that comes to about
0.16 MBps). Clearly the problems lie somewhere between the
application code and the parser.
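To make that back-of-the-envelope arithmetic explicit (the message
and reply sizes are assumed, as above, and the class name is made
up):

  public class BackOfEnvelope {
      public static void main(String[] args) {
          double rawParseMBps = 40.0;              // raw parser rate from above
          double rawParseMbps = rawParseMBps * 10; // ~10 wire bits/byte, rough
          double soapTps = 20.0;                   // figure attributed to Axis 1.x
          double kbPerTx = 4.0 + 4.0;              // assumed 4 KB message + 4 KB reply
          double soapMBps = soapTps * kbPerTx / 1024;
          System.out.println("raw parse:  ~" + rawParseMbps + " Mbps");
          System.out.println("SOAP stack: ~" + soapMBps + " MBps"); // ~0.16
      }
  }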
Pure parsing performance does not degrade with megabyte-sized input:
I have no problem parsing my 500-megabyte product description data
dump and processing it entry by entry (the result set size grows
logarithmically or less).
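That kind of entry-by-entry processing is plain streaming; here is a
minimal StAX sketch, where the element name "entry" and the
file-as-argument setup are hypothetical:

  import java.io.FileInputStream;
  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.XMLStreamConstants;
  import javax.xml.stream.XMLStreamReader;

  public class StreamingScan {
      public static void main(String[] args) throws Exception {
          XMLInputFactory factory = XMLInputFactory.newInstance();
          XMLStreamReader reader =
                  factory.createXMLStreamReader(new FileInputStream(args[0]));
          int entries = 0;
          while (reader.hasNext()) {
              // Handle one entry at a time; nothing accumulates, so
              // memory use stays flat no matter how large the input is.
              if (reader.next() == XMLStreamConstants.START_ELEMENT
                      && "entry".equals(reader.getLocalName())) {
                  ++entries;
              }
          }
          reader.close();
          System.out.println(entries + " entries");
      }
  }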
Doing a full in-memory general-purpose transformation does degrade
that way, however, for obvious (memory locality) and perhaps other
reasons.
-+ Tatu +-