xml-dev - RE: [xml-dev] Java Technology and XML : API benchmark

To: "'veillard@redhat.com'" <veillard@redhat.com>
Subject: RE: [xml-dev] Java Technology and XML : API benchmark
From: Nicolas LEHUEN <nicolas.lehuen@ubicco.com>
Date: Wed, 13 Mar 2002 15:20:11 +0100
Cc: "'xml-dev@lists.xml.org'" <xml-dev@lists.xml.org>

That was precisely the kind of pinch of salt I was thinking about. Don't
expect to find deep facts about reality from micro-benchmarks or informal
tests. I think benchmarking is rocket science, because it's very difficult
to design benchmarks that really mean something. But that shouldn't prevent
us from reading micro-benchmark reports.

Anyway, the document is worth reading :

- See for example the impact of validation. It's surprising to see that, due
to problems in parser implementations, a document with a schema reference
shows a "non-validation" overhead even if validation is not enabled.

I still don't understand why validation is not performed in a SAX filter
rather than in parsers. Parsers like Xerces have grown dramatically in size
and have performance problems because validation is built into the parser.

To me, parsing and validating are two different activities. They may have
been integrated for performance reasons of the parsing+validation pipeline,
but I'd still like to have a clean, high-performance parsing pipeline into
which I could plug any kind of validator (e.g. Sun MSV).

Take, for example, the bug that causes Xerces 2 to dereference and parse
schemas even if validation is disabled. I hope it has been fixed since, but if
Xerces 2 were not mixing a lot of APIs and technologies (parsing, validating,
building DTD-specific DOMs, etc.) in a monolithic way, the bug would never
have appeared in the first place.
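A minimal sketch of what "validation as a SAX filter" could look like, using
only the standard SAX classes. The AllowedElementsFilter class and its
"allowed element names" rule are hypothetical stand-ins for a real validator
such as MSV; the point is only that the check sits between a non-validating
parser and the application's handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical validator: records every element whose name is not in the allowed set.
class AllowedElementsFilter extends XMLFilterImpl {
    private final Set<String> allowed;
    final List<String> violations = new ArrayList<>();

    AllowedElementsFilter(XMLReader parent, Set<String> allowed) {
        super(parent);
        this.allowed = allowed;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!allowed.contains(qName)) {
            violations.add(qName);
        }
        // Pass the event on unchanged, so downstream handlers see a normal SAX stream.
        super.startElement(uri, localName, qName, atts);
    }
}

class FilterPipelineDemo {
    // Parses xml with validation switched off in the parser and returns what the filter flagged.
    static List<String> check(String xml, Set<String> allowed) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        AllowedElementsFilter filter = new AllowedElementsFilter(parser, allowed);
        filter.setContentHandler(new DefaultHandler()); // the application handler would go here
        filter.parse(new InputSource(new StringReader(xml)));
        return filter.violations;
    }
}
```

With this arrangement the parser itself stays small, and swapping or removing
the validator means changing one link in the filter chain.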

- The getElementsByTagName result is also interesting, because it clearly shows
that even if the functional behaviour of the DOM is defined by the API, its
performance behaviour is totally undefined. Either the code for this method
in Crimson is totally dumb, or there is a CPU/memory tradeoff in the other
DOMs (only an examination of the respective source code can tell).

I have always suspected that getElementsByTagName performed poorly (due to
the instantiation of immutable NodeList instances), so I never used it
anyway, but this definitely reinforces my view that getElementsByTagName is
inherently bad. It's in the top 10 of the "do not use this" list we have in
my team.

We use a finely tuned homegrown selector library, or XPath expressions with
Jaxen when the selectors are too complicated. Those selectors or XPath
expressions are built once and used many times without causing unnecessary
instantiations, and thus have fairly predictable performance.
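A minimal sketch of the "build once, use many times" pattern. It uses the
JDK's own javax.xml.xpath API as a stand-in, not Jaxen and not our selector
library; the CompiledSelector class and its method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical wrapper: the XPath string is compiled in the constructor and never again.
class CompiledSelector {
    private final XPathExpression expr;

    CompiledSelector(String xpath) throws Exception {
        this.expr = XPathFactory.newInstance().newXPath().compile(xpath);
    }

    // Evaluates the precompiled expression against a document and counts the matches.
    int count(Document doc) throws Exception {
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    // Small helper so the example is self-contained: parse a string into a DOM.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

A selector created as new CompiledSelector("/order/item") can then be applied
to any number of documents. Note that a JAXP XPathExpression is not guaranteed
to be thread-safe, so a shared instance should stay on one thread.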

- I found Figure 9 particularly interesting, though it's not related to XML
but to Java. Hotspot optimization can sometimes take a long time to kick
in, and the fun thing here is that it kicks in too late to have an
impact on the test. I have tried many times to benchmark a particular piece
of code and found it extremely difficult due to the variations the GC and
Hotspot can introduce in the system. Benchmarking Java code is truly not a
simple task :).
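A minimal sketch of one common way to reduce the Hotspot effect: run the code
a number of times untimed before measuring, then look at the median of several
timed runs rather than a single shot. The WarmupBench class and the run counts
are hypothetical, and this does nothing about GC pauses beyond smoothing them
out in the median.

```java
import java.util.Arrays;
import java.util.function.Supplier;

class WarmupBench {
    // Results are written here so the JIT cannot discard the work as dead code.
    static volatile Object sink;

    // Runs the task warmupRuns times untimed, then timedRuns times timed.
    // Returns the timed durations in nanoseconds, sorted ascending.
    static long[] measure(Supplier<?> task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            sink = task.get();
        }
        long[] nanos = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            sink = task.get();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos;
    }

    // Middle element of an already sorted array (upper median for even lengths).
    static long median(long[] sorted) {
        return sorted[sorted.length / 2];
    }
}
```

How many warm-up runs are enough depends on the JVM and the code under test,
which is exactly why the results remain hard to trust.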

Regards,
Nicolas

>-----Original Message-----
>From: Daniel Veillard [mailto:veillard@redhat.com]
>Sent: Wednesday, 13 March 2002 14:26
>To: Nicolas LEHUEN
>Cc: 'xml-dev@lists.xml.org'
>Subject: Re: [xml-dev] Java Technology and XML : API benchmark
>
>
>On Wed, Mar 13, 2002 at 01:52:48PM +0100, Nicolas LEHUEN wrote:
>> Like all benchmarks made by any given "vendor" (the quotes are here because
>> the different APIs are free), this should be taken with a pinch of salt. It
>> is still interesting to read, though.
>> 
>> http://developer.java.sun.com/developer/technicalArticles/xml/JavaTechandXML_part2/
>
>  Did they give the input for their tests? I don't think so. What would
>become really fun is to see the result of processing those data without
>having to run through the Java stuff, i.e. reporting side by side what
>the MSXML or libxml2/libxslt results would be. It's been a long time since
>any XSLTMark [1] benchmark was produced ...
>
>  Benchmarks are statistics, and hence show only a few facets of the
>real object. In this case the goal seems to be more to compare various
>processing costs in the Java environment than to make a roundup of
>the set of tools available, but still, releasing the sources would allow
>one to scope those results better and give more weight to their analysis.
>As they state, they are to "be considered as micro-benchmarks".
>
>  Also, any "single shot" run in a Java-based environment doesn't give
>good results (time to find the "Hot Spot" needing compilation); this is,
>interestingly, shown explicitly in their "Comparing Different
>JVM Versions" part.
>
>Daniel
>
>[1] http://www.datapower.com/XSLTMark/
>
>-- 
>Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
>veillard@redhat.com  | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
>http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
