Hi Tatu,
Thanks for the comments! I'm glad to hear that other people have
suggested the same idea. That gives me encouragement that it actually
will be generally useful. :)
I should say a bit more about the experiences that led me to this
design. I've found that the single most common use for an XML parser is
just to read data from a file and copy it into an internal data structure.
This is inherently a one-pass operation for which a streaming parser is
well suited. In spite of that, programmers will frequently use a DOM
parser instead, just because it's easier (which is a perfectly legitimate
reason). Then someone throws a 30MB XML file at it, and their program
crashes with an OutOfMemoryError.
The problem is made worse by the "memory multiplier" effect. XML is
a wordy format for storing data, and DOM is an inefficient way of storing
XML content. So 500KB of data will turn into a 2MB XML file, which turns
into 10MB of DOM nodes when you parse it.
So I had two initial goals in mind. First, make a streaming parser
just as easy to use as an in-memory parser, so people would be more likely
to choose it as their "default" parser. And second, when someone finds
they've made a bad choice of parser for their application, make it easier
to switch.
> Actually, based on what you describe, this has been
> suggested a few times, and some PoCs exist. The last
> person I remember suggesting it (or, as he put it
> "obsoliting the need for StAX") was Raf Schietekat.
Well, my goals aren't quite as ambitious as that. :) For
applications where speed matters above all else, you'll never beat a
low-level API like the StAX cursor API (which, in fact, is what I used
to implement my proof of concept). But when ease of implementation is
more important than parsing speed, I think it can be improved on.
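For anyone who hasn't used it, the cursor style looks roughly like this
(just an illustration of the low-level approach, not part of my
proposal; the file name is a placeholder):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class CursorExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream("data.xml"));
            while (reader.hasNext()) {
                // One event at a time; the caller tracks all context itself.
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        System.out.println("start: " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            System.out.println("text: " + reader.getText());
                        }
                        break;
                }
            }
            reader.close();
        }
    }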
Did you read the document I linked to? I think it answers many of
your questions. Specifically:
> It tends to either converge to a deferred node construction (that Xerces
> already does, although its benefits have been debated a lot), or to just
> doing things the way they'd be done in streaming.
What I'm doing is quite different from deferred node construction
(which, according to the information on the Xerces website, actually
requires *more* memory than standard DOM, not less). It truly is a
streaming parser. When you ask for the next node, it reads one element
from the file, constructs an object to represent it, then throws away all
references to that object as soon as you move on to its next sibling (or
the next sibling of any of its parent nodes).
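To make that concrete, here's roughly what I picture the caller
writing (the names here are only illustrative, not the final API):

    // Hypothetical usage: the document looks like an ordinary tree to
    // the caller, but each child element is read from the stream on
    // demand and becomes garbage as soon as the loop moves past it.
    Document doc = StreamingParser.parse(new File("orders.xml"));
    Element root = doc.getRootElement();
    for (Element order : root.getChildElements()) {
        process(order);  // only this element and its ancestors are live
    }                    // advancing the loop drops the last reference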
> Possibly, but if you use it in convenient way, you
> tend to lose the potential performance benefits;
> converging towards tree models. And to get the
> benefits, you must limit yourself strictly to a subset
> of operations, but one that your API does (and can)
> not limit.
True, there are situations where having completely random access to
the content of an XML file is essential (or at least, makes things much
easier for you). But in my experience, those are the exceptions, not the
rule. And I tried to design the API in a way that supports as many common
use cases as possible without requiring you to distinguish between
streaming and in-memory parsers. See the documentation for details.
> Another concern is the mutability: tree models
> generally allow modifying of the tree, and that's one
> of the things that complicates full-blown tree models
> (adds some overhead, prevents some optimizations etc).
It's not in the current proof of concept, but mutability is one of
the next features I intend to add. My plan is that the basic interfaces
which define the API will not include mutability, but specific
implementations of them could. For example, the Element interface will
not have any way to add a child to it, but MutableElement (which
implements Element) will have an addChildNode(MutableNode) method.
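In other words, something along these lines (only Element,
MutableElement, and addChildNode come from what I said above; the rest
is illustrative):

    interface Node {
        String getName();                          // illustrative accessor
    }

    interface Element extends Node {
        Iterable<? extends Node> getChildNodes();  // read-only view;
        // nothing here can modify the element
    }

    interface MutableNode extends Node { }

    interface MutableElement extends Element, MutableNode {
        // mutation is available only on the mutable subtype
        void addChildNode(MutableNode child);
    }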
>> - Many utilities can be written once, then used with
>> either parser.
>
> Maybe you have examples of such use cases in mind?
First, there are all the standard utilities you might use with any XML
content: write it to disk, validate it against a DTD, evaluate XQuery
expressions, etc. But these utilities are not limited to content that was
generated by parsing an XML file. They can be used with any data model
that implements the correct interfaces. For example, you could write an
Element implementation which generates its children dynamically based on
some algorithm, then execute XQuery expressions against it!
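As a toy example of that last point (again using the illustrative
interfaces sketched above, plus java.util collections):

    // An Element whose children are computed on the fly rather than
    // parsed from a file. A utility that only depends on the Element
    // interface (an XQuery evaluator, a serializer, ...) can't tell
    // the difference.
    class CountingElement implements Element {
        public String getName() { return "numbers"; }

        public Iterable<? extends Node> getChildNodes() {
            List<Node> children = new ArrayList<Node>();
            for (int i = 1; i <= 10; i++) {
                // NumberElement would be another illustrative Element impl
                children.add(new NumberElement(i));
            }
            return children;
        }
    }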
Then there are application-specific functions. For example, you
might need to process data stored in an XML file, using an algorithm that
involves three passes through the data. You could check the file size,
then choose a streaming parser or in-memory parser accordingly. You would
then pass the resulting Document to the processing code, which wouldn't
know or care whether the file was actually getting parsed three times or
only once.
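For instance (names illustrative again, and the size threshold is just
an example):

    File input = new File("data.xml");
    // Small file: build the whole tree in memory. Large file: stream
    // it, accepting that a three-pass algorithm will re-read the file.
    Document doc = input.length() < 10_000_000
            ? InMemoryParser.parse(input)
            : StreamingParser.parse(input);
    runThreePassAlgorithm(doc);  // same Document interface either way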
Peter