- To: xml-dev@lists.xml.org
- Subject: Re: [xml-dev] XML Performance in a Transacation
- From: Rick Jelliffe <rjelliffe@allette.com.au>
- Date: Fri, 24 Mar 2006 14:41:08 +1100
- In-reply-to: <200603231325.k2NDPBG8029517@modelo.allette.com.au>
- References: <200603231325.k2NDPBG8029517@modelo.allette.com.au>
- User-agent: Mozilla Thunderbird 0.6 (X11/20040502)
Michael Kay wrote:
>>My expectation is that XML parsing can be significantly sped up with ...
>>
>>
>
>I think that UTF-8 decoding is often the bottleneck and the obvious way to
>speed that up is to write the whole thing in assembler. I suspect the only
>way of getting a significant improvement (i.e. more than a doubling) in
>parser speed is to get closer to the hardware. I'm surprised no-one has done
>it. Perhaps no-one knows how to write assembler any more (or perhaps, like
>me, they just don't enjoy it).
>
>
Yes.* The technique I described in my blog (URL in previous post), which uses
C++ intrinsics (assembler in disguise), gives a *four to five* times speed
increase over fairly tight C++ code for the libxml UTF-8 to UTF-16
transcoder, on ASCII-valued data.
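The blog code itself is not reproduced here. The following is a minimal sketch of the kind of ASCII fast path involved, written with SSE2 intrinsics. When 16 input bytes all have their high bit clear, the sketch zero-extends them to 16 UTF-16 code units with two stores. Any other input falls through to a scalar decoder. The function name `utf8_to_utf16` is hypothetical and is not libxml's API. The sketch assumes x86 with SSE2 and well-formed UTF-8 input, and it does not validate.

```cpp
// Sketch only: SSE2 ASCII fast path for UTF-8 -> UTF-16.
// Assumptions: x86/x86-64 with SSE2, well-formed UTF-8 (no validation;
// a truncated sequence at the end would be read past). Hypothetical name.
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

std::vector<char16_t> utf8_to_utf16(const std::string& in) {
    std::vector<char16_t> out;
    out.reserve(in.size());
    const auto* p = reinterpret_cast<const unsigned char*>(in.data());
    const std::size_t n = in.size();
    const __m128i zero = _mm_setzero_si128();
    std::size_t i = 0;
    while (i < n) {
        // Fast path: take 16 bytes at once if none has its high bit set.
        if (n - i >= 16) {
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            if (_mm_movemask_epi8(chunk) == 0) {
                // ASCII maps 1:1, so zero-extend each byte to 16 bits
                // (interleaving with zero bytes; x86 is little-endian).
                const std::size_t old = out.size();
                out.resize(old + 16);
                _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data() + old),
                                 _mm_unpacklo_epi8(chunk, zero));
                _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data() + old + 8),
                                 _mm_unpackhi_epi8(chunk, zero));
                i += 16;
                continue;
            }
        }
        // Scalar path: decode one code point from its lead byte.
        std::uint32_t c = p[i];
        int len = 1;
        if (c >= 0xF0)      { c &= 0x07; len = 4; }
        else if (c >= 0xE0) { c &= 0x0F; len = 3; }
        else if (c >= 0xC0) { c &= 0x1F; len = 2; }
        for (int k = 1; k < len; ++k) c = (c << 6) | (p[i + k] & 0x3F);
        i += len;
        if (c >= 0x10000) {  // outside the BMP: emit a surrogate pair
            c -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(c));
        }
    }
    return out;
}
```

On pure-ASCII markup, nearly every byte takes the 16-byte path. The scalar loop only handles the non-ASCII characters and the final short tail.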
The fact that this gives such a large increase in what is sometimes
deemed to be a bottleneck augurs well for XML speed-ups generally, *if*
the dynamic for optimization were there. Unfortunately, I expect that
(MS excepted) the people most active in internationalization
libraries (like Mark Davis' wonderful ICU project at IBM, which feeds
into Java for example, or libxml) are keener on cross-platform
portability (in C++ and Java) than on exploiting the fast
instructions of particular CPUs.
With modern CPUs and features like SSE2, even the best C++ (without
intrinsics) or Java code simply cannot compete with C++ using intrinsics
or with assembler. Java has certainly caught up with plain C++, but
intrinsics have moved the goalposts for C++ far beyond what Java can
cope with. The simple reason is that you need particular data structures
(e.g. 16-byte arrays) and particular (non-stalling) algorithms, and you
would need some kind of pragma or tag to tell the compiler that your
code is SSE-friendly: that is just too hard, hence the rise of
intrinsics and the competitive fall of Java/C# etc.
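The following hypothetical helper illustrates the "16-byte arrays and non-stalling algorithms" point. It is not taken from any XML library. It finds the first `<` or `&` in a run of character data. Each step compares 16 bytes against both delimiters at once and collapses the result into a bitmask, so there is no per-byte branch. The sketch assumes SSE2. It also uses `__builtin_ctz`, which is a GCC/Clang builtin; MSVC offers `_BitScanForward` instead.

```cpp
// Sketch only: find the first '<' or '&' (end of an XML character-data run).
// Assumes SSE2; __builtin_ctz is a GCC/Clang builtin. Hypothetical helper.
#include <emmintrin.h>
#include <cstddef>

std::size_t find_markup(const char* data, std::size_t n) {
    const __m128i lt = _mm_set1_epi8('<');
    const __m128i amp = _mm_set1_epi8('&');
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // Compare all 16 bytes against both delimiters with no per-byte branches.
        __m128i hits = _mm_or_si128(_mm_cmpeq_epi8(chunk, lt),
                                    _mm_cmpeq_epi8(chunk, amp));
        int mask = _mm_movemask_epi8(hits);  // bit k set <=> byte k matched
        if (mask != 0)
            return i + static_cast<std::size_t>(__builtin_ctz(mask));
    }
    for (; i < n; ++i)  // scalar tail for the last < 16 bytes
        if (data[i] == '<' || data[i] == '&') return i;
    return n;  // no delimiter found
}
```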
Because open source XML software is generally either Java or
cross-platform C++, there hasn't been the dynamic to add
processor-dependent optimizations or algorithms. I hope that will
change. But the change will happen faster with user demand: in
particular, from the organizations with high transaction needs who are
currently doing such a convincing impersonation of passive beached
whales. They need to be more proactive in stimulating open source
development of more efficient XML; it directly addresses their business
requirements for high transaction rates. Hence my suggestion of a
consortium offering a cash prize.
There are still a lot of CPUs without SSE out there. I also develop
digital audio filters for modular synthesizers as a hobby, and I am
surprised at the number of older machines still out there. But they are
disappearing fast.
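The post does not prescribe a way to handle machines without SSE2. One common approach, assumed here, is to keep a plain scalar routine and choose between it and the SSE2 version once, at run time. The following sketch does this for an "is this buffer all ASCII?" check. `__builtin_cpu_supports` and `__attribute__((target("sse2")))` are GCC/Clang extensions; on MSVC you would query `__cpuid` instead. SSE2 is always present on x86-64, so the check mainly matters for 32-bit x86 builds.

```cpp
// Sketch only: run-time choice between SSE2 and scalar code.
// __builtin_cpu_supports and the target attribute are GCC/Clang extensions.
#include <emmintrin.h>
#include <cstddef>

// Portable fallback: true if every byte is ASCII (< 0x80).
static bool all_ascii_scalar(const unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] & 0x80) return false;
    return true;
}

// SSE2 version. The target attribute lets it compile even when the rest of
// the program targets a baseline without SSE2 (e.g. 32-bit x86).
__attribute__((target("sse2")))
static bool all_ascii_sse2(const unsigned char* p, std::size_t n) {
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    // OR the blocks together and test the high bits once at the end:
    // no branch inside the loop.
    for (; i + 16 <= n; i += 16)
        acc = _mm_or_si128(acc, _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i)));
    if (_mm_movemask_epi8(acc) != 0) return false;
    return all_ascii_scalar(p + i, n - i);  // leftover bytes
}

bool all_ascii(const unsigned char* p, std::size_t n) {
    // Choose the implementation once, on the first call.
    static bool (*const impl)(const unsigned char*, std::size_t) =
        __builtin_cpu_supports("sse2") ? all_ascii_sse2 : all_ascii_scalar;
    return impl(p, n);
}
```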
Cheers
Rick Jelliffe
* Yes to getting closer to the hardware. But assembler is not necessary,
IMHO: compilers generate great code. The big performance potential
comes from using the pipelining/streaming instructions on small arrays,
in particular SSE2 on x86, and because C++ intrinsic functions are
available you don't need to resort to assembler for that.
I was surprised by them: I had thought C and C++ had stopped evolving
in any interesting way in the early 90s and that the action was in
Java/C#. But intrinsics utterly change the tradeoffs between Java and
C++, back in favour of C++. I (and I think many programmers of my
generation who switched to Java in the 90s) missed the advent of
intrinsics this decade. I think Java still wins when comparing
write-once-run-anywhere (WORA) Java with WORA C++, but for optimised
code it is no contest: Java simply cannot compete with C++ with
intrinsics, because Java can only use SSE2 instructions piecemeal, as
ordinary instructions, and has no way to exploit their pipelining or
virtual parallelism.
In DSP, compiling the same C++ code to use SSE instructions usually
gives a 30% increase in performance. Small, but good. C++ compilers
with "generate SSE" ticked, or Java compilers, can already get this kind
of advantage. But the real speed-up, the one I am talking about, comes
from using SSE(2) code for pipelining, which requires you to structure
your algorithm to suit the CPU. That is beyond what we can expect a Java
compiler to do unaided.
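The following sketch illustrates "structuring the algorithm to suit the CPU" in DSP terms. It is not the author's filter code. It uses the dot product at the heart of an FIR filter. The scalar loop keeps a single running sum, so each addition waits for the previous one. A compiler will normally not reassociate floating-point additions to vectorize such a loop unless it is told it may, for example with fast-math options. The SSE version keeps four independent partial sums, one per lane. Because it adds in a different order, its result can differ from the scalar version in the last bits.

```cpp
// Sketch only: scalar vs. SSE-restructured dot product (FIR inner loop).
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// One running sum: every addition depends on the previous one.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Four independent partial sums, one per SSE lane.
float dot_sse(const float* a, const float* b, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);  // spill the lanes, then add them together
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
    return sum;
}
```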
So please, no flames or pointers to benchmarks of Java versus C++ based
on code that is not written to utilize intrinsics: they are so 90s!
Java needs to add the equivalent of intrinsics (in the way that it has
the system-dependent System.arraycopy method) to catch up with C++.
There may be some way to make this slightly more CPU-independent (e.g.
make the array size of SIMD data such as __m128i a System constant).
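In C++ terms, the "array size as a constant" idea amounts to deriving the step size from the vector type instead of scattering the literal 16 through the code. The following sketch shows this. `count_byte` is a hypothetical helper that counts occurrences of one byte value, such as `<` in a document. Moving to wider registers would still mean changing the type and the intrinsics, so this only goes part of the way.

```cpp
// Sketch only: SIMD width as a named constant derived from the vector type.
#include <emmintrin.h>
#include <bitset>
#include <cstddef>

constexpr std::size_t kSimdBytes = sizeof(__m128i);
static_assert(kSimdBytes == 16, "an SSE2 register holds 16 bytes");

std::size_t count_byte(const unsigned char* p, std::size_t n, unsigned char c) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(c));
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + kSimdBytes <= n; i += kSimdBytes) {
        const __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        // One mask bit per matching byte; count the set bits.
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, needle));
        count += std::bitset<kSimdBytes>(static_cast<unsigned>(mask)).count();
    }
    for (; i < n; ++i)  // scalar tail
        if (p[i] == c) ++count;
    return count;
}
```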