On 10/18/05, Elliotte Harold <email@example.com> wrote:
> On VTD-XML itself, I read on the web site that "Currently it only
> supports built-in entity references (" & ' > <)."
> That means it's not an XML parser. Given this, the comparisons you make
> to other parsers are unfair and misleading. I've seen many products that
> outperform real XML parsers by subsetting XML and cutting out the hard
> parts. It's often the last 10% that kills the performance. :-(
Well, they do say right up front: "VTD-XML is a non-validating,
'non-extractive' XML processing software API implementing Virtual
Token Descriptor. Currently it only supports built-in entity
references (" & ' > <)." Arguably an XML
processing API doesn't have to be a real XML parser *if* the subset it
supports is clearly stated. I would have to agree that in principle
"XML" should be used to refer only to the full spec, but that battle
was lost years ago -- SOAP implicitly subsets XML, RSS is often not
well-formed (and thus not "XML"), but this distinction is lost on the
vast majority of XML technology users who do not subscribe to xml-dev.
As with most things in life, people need to just pick their poison.
Given the efficiency issues, which is better: to subset XML and
process something that looks a lot like real XML efficiently with
tools such as VTD-XML; to build a more fully conformant Efficient
XML Interchange (the sanitized term for what we used to call "binary
XML"); to lower customer expectations about performance/bandwidth
consumption; or something else entirely? None of these options is
palatable, but people have to choose which is least toxic to their own
situation.
> The other question I have for anything claiming these speed gains is
> whether it correctly implements well-formedness testing, including the
> internal DTD subset. Will VTD-XML correctly report all malformed
> documents as malformed?
> Finally, even if everything works out once the holes are plugged, this
> seems like it would be slower than SAX/StAX for streaming use cases.
> VTD, like DOM, needs to read the entire document before it can work on
> any of it.
I think the point is that the process that creates the XML can confirm
that it is well-formed / valid, and produce a VTD associated with a
document/message, then downstream processes that understand VTD can
exploit it. Those that do not understand VTD can simply use the XML
text. Yes this requires a level of trust in the producer that pure
XML text processing does not require. I've always seen this as
hitting a sweet spot (for *some* use cases!) between text XML and
binary XML where the designers of an application decide that the cost
of verifying that the producer got the XML right outweighs the
benefits of catching the errors. We can argue about how common those
scenarios are, of course, but at any point in the processing chain, a
specific component can ignore the VTD and parse the XML to verify
whatever needs to be verified.
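To make the idea concrete, here is a toy sketch (the names and record layout are my own invention, not the actual VTD-XML API): the producer indexes the document once into (kind, offset, length) records pointing back into the original bytes, and a trusting downstream consumer extracts tokens by slicing, without reparsing.

```python
import re

XML = b'<msg><to>alice</to><body>hi there</body></msg>'

def build_vtd(xml: bytes):
    """Toy indexer: record element names and text content as
    (kind, offset, length) triples into the original buffer.
    A real implementation would of course do full parsing and
    well-formedness checking here, once, on the producer side."""
    records = []
    for m in re.finditer(rb'<([a-zA-Z]+)>([^<]*)', xml):
        records.append(('elem', m.start(1), len(m.group(1))))
        if m.group(2):
            records.append(('text', m.start(2), len(m.group(2))))
    return records

def token(xml: bytes, rec):
    """Pull a token out of the raw bytes using only its record."""
    kind, off, length = rec
    return xml[off:off + length]

vtd = build_vtd(XML)
# Downstream code never reparses: it slices the original bytes.
names = [token(XML, r) for r in vtd if r[0] == 'elem']
texts = [token(XML, r) for r in vtd if r[0] == 'text']
```

The point of the offset/length representation is that the XML text itself is untouched and remains readable by any component that prefers to ignore the index and parse normally.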
Obviously VTD doesn't reduce the size of the XML transmitted, so it
doesn't meet the use cases that the W3C XBC / EXI folks are focused
on. On the other hand, it sounds promising for messaging scenarios
with multiple intermediaries that do routing, filtering, DSig
verification, and perhaps encryption -- raw XML parsing is quite
expensive, but could be accelerated by using the VTD to quickly find
the offsets in the message that a particular intermediary knows/cares
about. Of course, that approach doesn't work at all for unbounded streams of XML.
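For instance (a hypothetical sketch, with names of my own invention): a routing intermediary that trusts an index supplied alongside the message can look up one field by offset instead of parsing the whole document.

```python
MESSAGE = b'<env><route>nodeB</route><payload>...</payload></env>'

# In the VTD scenario this index would be produced upstream by a
# component that already verified the message was well-formed.
# It maps a field name to an (offset, length) pair into the raw bytes.
INDEX = {'route': (12, 5)}

def route(message: bytes, index: dict) -> bytes:
    """Return the routing destination by slicing the raw bytes --
    no XML parsing on the intermediary's hot path."""
    off, length = index['route']
    return message[off:off + length]
```

An intermediary that doesn't trust the index can always fall back to parsing the XML text itself; the index is an optimization, not a replacement for the message.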
Overall, my concern is that we as an industry should neither chase
magic fixes that solve all known efficiency problems (which, arguably,
the W3C is about to futilely attempt) nor reject approaches, e.g.
VTD, that pluck some low-hanging fruit but don't handle all use cases.