   Re: [xml-dev] VTD-XML an open-source, high-performance and non-extractiv

  • To: Elliotte Harold <elharo@metalab.unc.edu>, Harry Xu <harry.hxu@gmail.com>
  • Subject: Re: [xml-dev] VTD-XML an open-source, high-performance and non-extractive XML processing API
  • From: Jimmy Zhang <jzhang2004@sbcglobal.net>
  • Date: Tue, 18 Oct 2005 10:57:58 -0700 (PDT)
  • Cc: xml-dev@lists.xml.org
  • In-reply-to: <4354C0AA.2020502@metalab.unc.edu>

Elliotte, thanks for your comment.
As one of the original developers, the first question I have to ask is:
where is XML performance most worth the optimization effort?
SOAP is a good starting point, and it doesn't require a DTD.
So the approach of VTD-XML is to do things where it makes the
most sense. In other words, the last 10% is outside the scope of
optimization.
 
Coming from a hardware background, we designed VTD to overcome
some of the problems of text processing, namely, variable string length.
So creating an index into the XML doc is necessary, but not sufficient;
the representation of each string must be constant in length as well...
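
To illustrate the point about constant-length representation, here is a minimal sketch of packing a token's offset, length and type into a single 64-bit value. The class name `FixedRecord`, the method names and the bit layout are all assumptions made for this example; they are not taken from the VTD-XML documentation.

```java
import java.nio.charset.StandardCharsets;

public final class FixedRecord {
    // Hypothetical layout chosen for illustration only:
    // bits 0-31 = byte offset, bits 32-55 = length, bits 56-63 = token type.
    public static long pack(int offset, int length, int type) {
        return ((long) (type & 0xFF) << 56)
             | ((long) (length & 0xFFFFFF) << 32)
             | (offset & 0xFFFFFFFFL);
    }

    public static int offset(long rec) { return (int) (rec & 0xFFFFFFFFL); }
    public static int length(long rec) { return (int) ((rec >>> 32) & 0xFFFFFF); }
    public static int type(long rec)   { return (int) (rec >>> 56); }

    public static void main(String[] args) {
        byte[] doc = "<a>hello</a>".getBytes(StandardCharsets.US_ASCII);
        // Every token costs exactly one long, however long its text is.
        long rec = pack(3, 5, 1);
        // The text is only materialized on demand, from the original bytes.
        System.out.println(new String(doc, offset(rec), length(rec), StandardCharsets.US_ASCII));
    }
}
```

Because every record has the same size, an array of such values can be indexed directly, with no per-token string objects to allocate.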
 
VTD-XML is not a perfect solution; it will work for what it is designed for.
In a way, it is the byproduct of XML evolution ...
 
Thanks,
JZhang
 
  


Elliotte Harold <elharo@metalab.unc.edu> wrote:
Harry Xu wrote:
> I am pleased to announce that both the Java and C versions 1.0 of
> VTD-XML -- an open-source, high-performance and non-extractive
> XML processing API -- are freely available on sourceforge.net.
> For source code, documentation, detailed description of API
> and code examples, please visit
>
> http://vtd-xml.sourceforge.net
>
>

Interesting, but could you please not use JavaScript links in e-mail?
For that matter, could you please not use javascript links at all? :-)

This appears to be an example of what Sam Wilmot calls "in situ
parsing". In other words, rather than creating objects representing the
content of an XML document, just pass pointers into the actual, real
XML. In some cases you wouldn't even need to hold the document in
memory. It could remain on disk. Many, though not all, use cases could
see an order of magnitude speed-up or better from such an approach.
Memory usage could improve too. Current tree models typically require at
least 3 times the size of the actual document, more often more. Using a
model based on indexes into one big array might allow these to reduce
their requirements to twice the size of the original document or even
less. Finally, this approach would make retrieving the actual original
text of the document feasible, so you could finally tell whether a
document used &amp; or a character reference such as &#38;. Most programs don't need this ability,
but it would be very useful for XML editors and other programs that want
to do better round-tripping.
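
A rough sketch of the idea described above: tokens are stored as offset/length pairs into the unmodified document, so the original spelling of each reference stays recoverable. The names `InSituSpans`, `Span`, `referenceSpans` and `raw` are hypothetical, and the scan is deliberately simplistic (it is not a conforming XML tokenizer).

```java
import java.util.ArrayList;
import java.util.List;

public final class InSituSpans {
    /** A token expressed as a view onto the source text; no characters are copied. */
    public static final class Span {
        public final int offset;
        public final int length;
        public Span(int offset, int length) { this.offset = offset; this.length = length; }
    }

    /** Records the span of every "&...;" reference found in the document. */
    public static List<Span> referenceSpans(String doc) {
        List<Span> spans = new ArrayList<>();
        int start = doc.indexOf('&');
        while (start >= 0) {
            int end = doc.indexOf(';', start);
            if (end < 0) break; // unterminated reference; a real parser would report an error
            spans.add(new Span(start, end - start + 1));
            start = doc.indexOf('&', end + 1);
        }
        return spans;
    }

    /** Returns the exact original text covered by a span. */
    public static String raw(String doc, Span s) {
        return doc.substring(s.offset, s.offset + s.length);
    }

    public static void main(String[] args) {
        String doc = "<t>A &amp; B &#38; C</t>";
        for (Span s : referenceSpans(doc)) {
            System.out.println(s.offset + ": " + raw(doc, s));
        }
    }
}
```

Both references denote the same character, but because the spans point into the original text, an editor built this way can still write each one back exactly as the author typed it.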

On VTD-XML itself, I read on the web site that "Currently it only
supports built-in entity references (&quot; &amp; &apos; &gt; &lt;)."
That means it's not an XML parser. Given this, the comparisons you make
to other parsers are unfair and misleading. I've seen many products that
outperform real XML parsers by subsetting XML and cutting out the hard
parts. It's often the last 10% that kills the performance. :-(

The other question I have for anything claiming these speed gains is
whether it correctly implements well-formedness testing, including the
internal DTD subset. Will VTD-XML correctly report all malformed
documents as malformed? Has it been tested against the W3C XML
conformance test suite?

Finally, even if everything works out once the holes are plugged, this
seems like it would be slower than SAX/StAX for streaming use cases.
VTD, like DOM, needs to read the entire document before it can work on
any of it. (If that's not true, please correct me.) SAX/StAX can begin
processing the beginning of a document before most of the document has
even arrived from the network. This isn't relevant to all use cases, but
it's very relevant for many of the cases where speed is most critical
and most problematic.
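
For comparison, a small sketch of the streaming style mentioned above, using the standard StAX API (`javax.xml.stream`). The class and method names are made up for this example. A pull parser advances only when asked, so the caller can act on the first start tag without requesting anything beyond it.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public final class StreamingFirstElement {
    /** Returns the local name of the first start tag, or null if there is none. */
    public static String firstElementName(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                // Each call to next() pulls exactly one more event from the input.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<order><item>1</item></order>";
        System.out.println(firstElementName(new StringReader(doc)));
    }
}
```

With a network stream in place of the `StringReader`, the same loop could start work as soon as the opening tag is available, which is the advantage being claimed for SAX/StAX here.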

--
Elliotte Rusty Harold elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim
