xml-dev - Re: [xml-dev] VTD-XML an open-source, high-performance and non-extractiv


To: Harry Xu <harry.hxu@gmail.com>
Subject: Re: [xml-dev] VTD-XML an open-source, high-performance and non-extractive XML processing API
From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Tue, 18 Oct 2005 05:30:18 -0400
Cc: xml-dev@lists.xml.org
In-reply-to: <bbe40a480510171950m6fb4db02u7d913de27e936d14@mail.gmail.com>
References: <bbe40a480510171950m6fb4db02u7d913de27e936d14@mail.gmail.com>
User-agent: Thunderbird 1.4 (Macintosh/20050908)

Harry Xu wrote:
> I am pleased to announce that both Java and C version 1.0 of
> VTD-XML -- an open-source, high-performance and non-extractive
> XML processing API -- is freely available on sourceforge.net 
> <http://sourceforge.net>.
> For source code, documentation, detailed description of API
> and code examples, please visit
> 
> http://vtd-xml.sourceforge.net 
> <javascript:ol('http://vtd-xml.sourceforge.net');>
> 

Interesting, but could you please not use JavaScript links in e-mail? 
For that matter, could you please not use JavaScript links at all? :-)

This appears to be an example of what Sam Wilmott calls "in situ 
parsing". In other words, rather than creating objects representing the 
content of an XML document, just pass pointers into the actual, real 
XML. In some cases you wouldn't even need to hold the document in 
memory. It could remain on disk. Many, though not all, use cases could 
see an order of magnitude speed-up or better from such an approach. 
Memory usage could improve too. Current tree models typically require at 
least 3 times the size of the actual document, and often more. Using a 
model based on indexes into one big array might allow these to reduce 
their requirements to twice the size of the original document or even 
less. Finally, this approach would make retrieving the actual original 
text of the document feasible, so you could finally tell whether a 
document used &amp; or &#x0026;. Most programs don't need this ability, 
but it would be very useful for XML editors and other programs that want 
to do better round-tripping.
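
To make the idea concrete, here is a minimal sketch of index-based tokens. 
It is not VTD-XML's record format and not a real XML tokenizer (it ignores 
comments, CDATA sections, and ">" inside attribute values); the names 
index_tokens and raw_text are hypothetical. It only shows that storing 
offsets and lengths keeps the original spelling of each reference 
recoverable.

```python
import re

# Toy tokenizer (assumption: no comments, CDATA, or '>' inside attributes).
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")

def index_tokens(doc):
    """Return (kind, offset, length) triples; no substrings are copied."""
    tokens = []
    for m in TOKEN_RE.finditer(doc):
        kind = "tag" if doc[m.start()] == "<" else "text"
        tokens.append((kind, m.start(), m.end() - m.start()))
    return tokens

def raw_text(doc, token):
    """Slice the original document only when the text is actually requested."""
    _, offset, length = token
    return doc[offset:offset + length]

doc = "<a>Q&amp;A</a><b>Q&#x0026;A</b>"
tokens = index_tokens(doc)
texts = [raw_text(doc, t) for t in tokens if t[0] == "text"]
print(texts)  # ['Q&amp;A', 'Q&#x0026;A']
```

A tree model would normally report both text nodes as "Q&A"; here the two 
spellings stay distinguishable because nothing was extracted or decoded.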

On VTD-XML itself, I read on the web site that "Currently it only 
supports built-in entity references(&quot; &amp; &apos; &gt; &lt;)." 
That means it's not an XML parser. Given this, the comparisons you make 
to other parsers are unfair and misleading. I've seen many products that 
outperform real XML parsers by subsetting XML and cutting out the hard 
parts. It's often the last 10% that kills the performance. :-(
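
A small illustration of the gap, using Python's standard library rather 
than VTD-XML: the document below declares an entity in its internal DTD 
subset. The expat-backed ElementTree parser expands it, while a 
hypothetical resolver that knows only the five predefined references 
leaves it untouched. The entity name "co" and the sample text are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

doc = '<!DOCTYPE r [<!ENTITY co "Example Corp">]><r>&co; &amp; friends</r>'

# A conforming parser expands the entity declared in the internal subset.
full = ET.fromstring(doc).text

# Hypothetical subset behaviour: unescape() already handles &amp; &lt; &gt;,
# and the mapping adds the other two predefined references.
PREDEFINED_EXTRA = {"&quot;": '"', "&apos;": "'"}
raw = doc[doc.index("<r>") + 3 : doc.index("</r>")]
subset = unescape(raw, PREDEFINED_EXTRA)

print(full)    # Example Corp & friends
print(subset)  # &co; & friends
```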

The other question I have for anything claiming these speed gains is 
whether it correctly implements well-formedness testing, including the 
internal DTD subset. Will VTD-XML correctly report all malformed 
documents as malformed? Has it been tested against the W3C XML 
conformance test suite?
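
For a sense of what such testing covers, here is a handful of malformed 
inputs checked against Python's expat-backed ElementTree. This is only a 
sketch of the kind of case involved, not a substitute for the conformance 
suite; the helper is_well_formed and the sample documents are my own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc):
    """True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Each of these must be reported as malformed by a conforming parser.
cases = {
    "mismatched tags": "<a><b></a></b>",
    "bare ampersand": "<a>AT&T</a>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "undeclared entity": "<a>&nope;</a>",
    "two root elements": "<a/><b/>",
    "broken internal subset": "<!DOCTYPE a [<!ELEMENT a>]><a/>",
}

for name, doc in cases.items():
    print(name, "->", "accepted" if is_well_formed(doc) else "rejected")
```

The last case is the interesting one: the element content is fine, and the 
error is entirely inside the internal DTD subset.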

Finally, even if everything works out once the holes are plugged, this 
seems like it would be slower than SAX/StAX for streaming use cases. 
VTD, like DOM, needs to read the entire document before it can work on 
any of it. (If that's not true, please correct me.) SAX/StAX can begin 
processing the beginning of a document before most of the document has 
even arrived from the network. This isn't relevant to all use cases, but 
it's very relevant for many of the cases where speed is most critical 
and most problematic.
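
The streaming point can be sketched with Python's incremental 
XMLPullParser standing in for SAX/StAX. The chunks list is a made-up 
stand-in for data arriving from a network socket; the element names are 
hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for network reads: the document arrives in three pieces.
chunks = ["<orders><order id='1'/>", "<order id='2'/>", "</orders>"]

parser = ET.XMLPullParser(events=("start",))
seen_after = []  # order ids that became visible after each chunk
for chunk in chunks:
    parser.feed(chunk)
    ids = [el.get("id") for _, el in parser.read_events() if el.tag == "order"]
    seen_after.append(ids)

print(seen_after)  # [['1'], ['2'], []]
```

The first order is available for processing after the first chunk, before 
the closing tag of the document has arrived. An API that must index the 
whole document first cannot start until the last chunk is in.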

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

