RE: [xml-dev] Fast validating XML parser


From: noah_mendelsohn@us.ibm.com
To: "Michael Kay" <mike@saxonica.com>
Date: Mon, 22 Oct 2007 17:18:23 -0400

I agree with Mike's intuition.  Beyond that, you'd have to do more than 
say "fast".  If you said something like:  on a 2 GHz Xeon we need to parse 
and validate 1000 documents per second, of average size 10K bytes 
each, with moderately dense markup, and throwing SAX events as the API, 
then it's possible that someone would have an intuition as to whether off 
the shelf parsers such as Xerces can do it.  Of course, your mileage will 
vary according to the details, but saying "I need a fast parser" is a bit 
like saying "I need a fast car".  What you mean by fast may depend on 
whether you're driving NASCAR, Formula 1, or just trying to make good time 
on a vacation.
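
To make a requirement like that concrete, a rough measurement harness can help. What follows is only a sketch, assuming the JDK's built-in JAXP SAX parser (often a Xerces derivative). The class name ThroughputProbe, the sample document, and the iteration count are all made up for illustration, and a serious benchmark would need much more care about warm-up, I/O, and realistic data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: estimates validated documents per second for one in-memory document.
public class ThroughputProbe {

    public static double docsPerSecond(byte[] doc, int iterations) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);              // DTD validation
        SAXParser parser = factory.newSAXParser();
        // DefaultHandler discards SAX events and ignores validity errors,
        // which is acceptable here because only the timing is of interest.
        DefaultHandler handler = new DefaultHandler();

        // Warm-up pass so JIT compilation is less likely to dominate the timing.
        for (int i = 0; i < iterations; i++) {
            parser.parse(new ByteArrayInputStream(doc), handler);
            parser.reset();
        }

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parser.parse(new ByteArrayInputStream(doc), handler);
            parser.reset();
        }
        long elapsedNanos = Math.max(System.nanoTime() - start, 1L);
        return iterations / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample of roughly 10K bytes with an internal DTD subset.
        StringBuilder sb = new StringBuilder(
            "<!DOCTYPE msgs [<!ELEMENT msgs (msg*)><!ELEMENT msg (#PCDATA)>]><msgs>");
        while (sb.length() < 10_000) {
            sb.append("<msg>hello</msg>");
        }
        sb.append("</msgs>");
        byte[] doc = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.printf("%.0f documents/second%n", docsPerSecond(doc, 2000));
    }
}
```

A number from a harness like this is only meaningful alongside the hardware, document size, markup density, and API used, which is the point of stating the requirement in those terms.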

For what it's worth, my group published a paper on some experimental work 
we did on high performance validation a few years ago.  The parser we 
described was a prototype, and it remains difficult (as far as I know) to 
find off the shelf parsers that give quite the speed we reported. 
Nonetheless, the paper includes some benchmarks for then-current versions 
of Xerces doing validation.  Those are not official Apache or IBM 
benchmarks, but they were run with some care, and I expect that Xerces has 
probably improved a bit in speed since then.  So, you might want to check 
out the paper.  It also explains in great detail some of the factors that 
we found to be issues when trying to parse and validate at high speed. 
Copies are available online at [1].  Unless you have a strong preference 
for HTML, I suggest reading the PDF version; the formatting is much 
better.

Noah

[1]  http://www2006.org/programme/item.php?id=5011

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

"Michael Kay" <mike@saxonica.com>
10/22/2007 03:42 PM
 
        To:     "'Llacuna, Phillip V'" <phillip.v.llacuna@lmco.com>, 
<xml-dev@lists.xml.org>
        cc:     (bcc: Noah Mendelsohn/Cambridge/IBM)
        Subject:        RE: [xml-dev] Fast validating XML parser


I suspect that an off-the-shelf parser like Xerces is quite fast enough if 
your application invokes it intelligently. You might find parsers that are 
20% faster than that, but I think the order-of-magnitude improvement will 
come by changing your application architecture: in particular, change the 
driving code from JavaScript to Java.
 
Xerces has a fairly high start-up cost so it's worth reusing the parser 
for multiple documents. However, that's more of a factor when your files 
are 200 bytes rather than 50K bytes.
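
As a rough illustration of reusing one parser for many documents inside a single JVM, here is a minimal sketch using the standard JAXP SAX API. It assumes each file names its DTD in a DOCTYPE declaration. The class name BatchDtdValidator and the error-message format are invented for this example and are not taken from the original poster's setup.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical batch driver: one JVM, one parser, many files.
public class BatchDtdValidator {

    // DefaultHandler silently ignores validity errors, so record them instead.
    static final class CollectingHandler extends DefaultHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void error(SAXParseException e) {
            problems.add(e.getSystemId() + ":" + e.getLineNumber() + ": " + e.getMessage());
        }
    }

    public static List<String> validateAll(List<File> files) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);                // validate against the DTD
        SAXParser parser = factory.newSAXParser();  // created once, reused below
        CollectingHandler handler = new CollectingHandler();

        for (File file : files) {
            try {
                parser.parse(file, handler);
            } catch (SAXException | IOException e) {
                // Not well-formed or unreadable: note it and continue with the next file.
                handler.problems.add(file + ": " + e.getMessage());
            }
            parser.reset();                         // return the parser to its initial state
        }
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        List<File> files = new ArrayList<>();
        for (String arg : args) {
            files.add(new File(arg));
        }
        List<String> problems = validateAll(files);
        problems.forEach(System.err::println);
        System.out.println(files.size() + " files checked, " + problems.size() + " problems");
    }
}
```

Run once with all the file names as arguments (for example `java BatchDtdValidator main.xml part*.xml`), this pays the JVM and parser start-up cost a single time rather than once per file.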
 
Michael Kay
http://www.saxonica.com/

From: Llacuna, Phillip V [mailto:phillip.v.llacuna@lmco.com] 
Sent: 22 October 2007 19:32
To: xml-dev@lists.xml.org
Subject: [xml-dev] Fast validating XML parser

Hi:
 
We need a very fast validating XML parser and were wondering if anyone has 
any suggestions. Our project involves one main XML file with about 1200 
supporting XML files (each about 50KB or less). Our current environment 
calls on a Java script to validate each file against the DTD, but it is 
painfully slow to process the complete project. We suspect that the 
overhead in creating the Java environment each time the script is called 
is slowing down the process. I have searched (and am still searching) the 
web for a good alternative. Any suggestions?
 
Phillip Llacuna
Multi-media Design Engineer
Lockheed Martin
Ph:   (651) 456-7152
Fax: (651) 456-2643
