   Re: [xml-dev] A new XML parser


Hi Tatu,

      Thanks for the comments!  I'm glad to hear that other people have 
suggested the same idea.  That gives me encouragement that it actually 
will be generally useful. :)
      I should say a bit more about the experiences that led me to this 
design.  I've found that the single most common use for an XML parser is 
just to read data from a file and copy it into an internal data structure. 
This is inherently a one pass operation for which a streaming parser is 
well suited.  In spite of that, programmers will frequently use a DOM 
parser instead, just because it's easier (which is a perfectly legitimate 
reason).  Then someone throws a 30MB XML file at it, and their program 
crashes with an OutOfMemoryError.
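
To make the "read once, copy into an internal structure" case concrete, here is a minimal sketch using the standard StAX cursor API (javax.xml.stream). The class name, the method name, and the <name> element are made up for illustration; this is not the new parser discussed in this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: one forward pass, no tree is ever built.
final class OnePassRead {

    // Collects the text of every <name> element (assumed to hold text only).
    static List<String> readNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "name".equals(reader.getLocalName())) {
                        // getElementText() consumes up to the matching end tag.
                        names.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return names;
    }
}
```

Memory use here is bounded by the size of the result list, not by the size of the input document.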
      The problem is made worse by the "memory multiplier" effect.  XML is 
a wordy format for storing data, and DOM is an inefficient way of storing 
XML content.  So 500KB of data will turn into a 2MB XML file, which turns 
into 10MB of DOM nodes when you parse it.
      So I had two initial goals in mind.  First, make a streaming parser 
just as easy to use as an in-memory parser, so people would be more likely 
to choose it as their "default" parser.  And second, when someone finds 
they've made a bad choice of parser for their application, make it easier 
to switch.

> Actually, based on what you describe, this has been
> suggested a few times, and some PoCs exist. The last
> person I remember suggesting it (or, as he put it
> "obsoliting the need for StAX") was Raf Schietekat.

      Well, my goals aren't quite as ambitious as that. :)  For 
applications where speed matters more than anything else, you'll never 
beat a low level API like the StAX cursor API (which, in fact, is what I 
used to implement my proof of concept).  But when ease of implementation 
is more important than parsing speed, I think it can be improved on.
      Did you read the document I linked to?  I think it answers many of 
your questions.  Specifically:

> It tends to either converge to a deferred node construction (that Xerces 
> already does, although its benefits have been debated a lot), or to just 
> doing things the way they'd be done in streaming.

     What I'm doing is quite different from deferred node construction 
(which, according to the information on the Xerces website, actually 
requires *more* memory than standard DOM, not less).  It truly is a 
streaming parser.  When you ask for the next node, it reads one element 
from the file, constructs an object to represent it, then throws away all 
references to that object as soon as you move on to its next sibling (or 
the next sibling of any of its parent nodes).
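
The "build one node, hand it out, drop it when you move to the next sibling" behaviour can be sketched on top of the StAX cursor API roughly as follows. This is only an illustration under simplifying assumptions, not the proof of concept itself: StreamingChildren and Item are hypothetical names, and each child is reduced to its element name plus the concatenated text of its subtree.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Iterates over the children of the root element, one at a time.
final class StreamingChildren implements Iterator<StreamingChildren.Item> {

    // Hypothetical stand-in for a node: name plus all text in the subtree.
    static final class Item {
        final String name;
        final String text;
        Item(String name, String text) { this.name = name; this.text = text; }
    }

    private final XMLStreamReader reader;
    private Item pending; // the only child this iterator holds on to

    StreamingChildren(String xml) {
        this.reader = open(xml);
        this.pending = readNextChild();
    }

    private static XMLStreamReader open(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // Skip ahead to the root start tag.
            while (r.next() != XMLStreamConstants.START_ELEMENT) { }
            return r;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
    }

    // Reads the next child of the root, or returns null once the root closes.
    private Item readNextChild() {
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.END_ELEMENT) return null;
                if (event != XMLStreamConstants.START_ELEMENT) continue;
                String name = reader.getLocalName();
                StringBuilder text = new StringBuilder();
                for (int depth = 1; depth > 0; ) {
                    int inner = reader.next();
                    if (inner == XMLStreamConstants.START_ELEMENT) depth++;
                    else if (inner == XMLStreamConstants.END_ELEMENT) depth--;
                    else if (inner == XMLStreamConstants.CHARACTERS) text.append(reader.getText());
                }
                return new Item(name, text.toString());
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
    }

    @Override public boolean hasNext() { return pending != null; }

    @Override public Item next() {
        if (pending == null) throw new NoSuchElementException();
        Item current = pending;
        pending = readNextChild(); // the previous child is no longer referenced here
        return current;
    }
}
```

For the input <list><a>1</a><b>x<c>y</c></b></list> this yields two items, a with text "1" and b with text "xy". Once the caller lets go of an item, nothing in the iterator keeps it alive.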

> Possibly, but if you use it in convenient way, you
> tend to lose the potential performance benefits;
> converging towards tree models. And to get the
> benefits, you must limit yourself strictly to a subset
> of operations, but one that your API does (and can)
> not limit.

      True, there are situations where having completely random access to 
the content of an XML file is essential (or at least, makes things much 
easier for you).  But in my experience, those are the exceptions, not the 
rule.  And I tried to design the API in a way that supports as many common 
use cases as possible without requiring you to distinguish between 
streaming and in-memory parsers.  See the documentation for details.

> Another concern is the mutability: tree models
> generally allow modifying of the tree, and that's one
> of the things that complicates full-blown tree models
> (adds some overhead, prevents some optimizations etc).

      It's not in the current proof of concept, but mutability is one of 
the next features I intend to add.  My plan is that the basic interfaces 
which define the API will not include mutability, but specific 
implementations of them could.  For example, the Element interface will 
not have any way to add a child to it, but MutableElement (which 
implements Element) will have an addChildNode(MutableNode) method.
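
A possible shape for that split, as a sketch only. Element, MutableElement, MutableNode and addChildNode(MutableNode) are the names mentioned above; Node, getName(), getChildNodes() and SimpleMutableElement are assumptions added to make the example compile, and MutableElement is modelled here as an interface extending Element.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Read-only view: nothing here can change the tree.
interface Node { }

interface Element extends Node {
    String getName();                       // assumed accessor
    List<? extends Node> getChildNodes();   // assumed accessor
}

// Mutable variants add operations on top of the read-only interfaces.
interface MutableNode extends Node { }

interface MutableElement extends Element, MutableNode {
    void addChildNode(MutableNode child);
}

// Hypothetical in-memory implementation.
final class SimpleMutableElement implements MutableElement {
    private final String name;
    private final List<MutableNode> children = new ArrayList<>();

    SimpleMutableElement(String name) { this.name = name; }

    @Override public String getName() { return name; }

    @Override public List<? extends Node> getChildNodes() {
        // Callers holding only an Element cannot modify the child list.
        return Collections.unmodifiableList(children);
    }

    @Override public void addChildNode(MutableNode child) { children.add(child); }
}
```

Code written against Element keeps working with both kinds of implementation, while only code that explicitly asks for a MutableElement can alter the tree.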

>> - Many utilities can be written once, then used with
>> either parser.
> Maybe you have examples of such use cases in mind?

      First, there are all the standard utilities you might use with any XML 
content: write it to disk, validate it against a DTD, evaluate XQuery 
expressions, etc.  But these utilities are not limited to content that was 
generated by parsing an XML file.  They can be used with any data model 
that implements the correct interfaces.  For example, you could write an 
Element implementation which generates its children dynamically based on 
some algorithm, then execute XQuery expressions against it!
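
As a small sketch of the "children generated by an algorithm" idea: the Element interface below is a cut-down assumption (getName() and childElements() are hypothetical method names), and GeneratedElement is an invented example class. No XQuery engine is shown; the point is only that a consumer written against the interface cannot tell the children were never parsed from a file.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Assumed minimal read-only interface for this example.
interface Element {
    String getName();
    Iterator<Element> childElements();
}

// An element whose children are computed on demand rather than stored.
final class GeneratedElement implements Element {
    private final String name;
    private final int childCount;

    GeneratedElement(String name, int childCount) {
        this.name = name;
        this.childCount = childCount;
    }

    @Override public String getName() { return name; }

    @Override public Iterator<Element> childElements() {
        return new Iterator<Element>() {
            private int index = 0;

            @Override public boolean hasNext() { return index < childCount; }

            @Override public Element next() {
                if (!hasNext()) throw new NoSuchElementException();
                // Each child is created only when it is asked for.
                return new GeneratedElement("item" + index++, 0);
            }
        };
    }
}
```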
      Then there are application-specific functions.  For example, you 
might need to process data stored in an XML file, using an algorithm that 
involves three passes through the data.  You could check the file size, 
then choose a streaming parser or in-memory parser accordingly.  You would 
then pass the resulting Document to the processing code, which wouldn't 
know or care whether the file was actually getting parsed three times or 
only once.
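
The three-pass scenario might look something like the sketch below. Everything in it is hypothetical: the Document interface is reduced to a single forEachElementName method, the 10MB threshold is arbitrary, and the "in-memory" variant just caches element names. It uses the standard StAX cursor API underneath.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Assumed common interface; processing code sees only this.
interface Document {
    void forEachElementName(Consumer<String> visitor);
}

// Re-parses the source on every pass; holds nothing in memory.
final class StreamingDocument implements Document {
    private final Supplier<Reader> source;
    int parseCount = 0; // exposed only to make the behaviour observable

    StreamingDocument(Supplier<Reader> source) { this.source = source; }

    @Override public void forEachElementName(Consumer<String> visitor) {
        parseCount++;
        Parsers.scan(source.get(), visitor);
    }
}

// Parses once up front, then replays from memory.
final class InMemoryDocument implements Document {
    private final List<String> names = new ArrayList<>();

    InMemoryDocument(Reader source) { Parsers.scan(source, names::add); }

    @Override public void forEachElementName(Consumer<String> visitor) {
        names.forEach(visitor);
    }
}

final class Parsers {
    static final long STREAMING_THRESHOLD_BYTES = 10L * 1024 * 1024; // arbitrary cut-off

    // Picks an implementation from the file size alone.
    static Document open(Supplier<Reader> source, long sizeInBytes) {
        return sizeInBytes > STREAMING_THRESHOLD_BYTES
                ? new StreamingDocument(source)
                : new InMemoryDocument(source.get());
    }

    // Reports the local name of every start tag, in document order.
    static void scan(Reader in, Consumer<String> visitor) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT) {
                        visitor.accept(r.getLocalName());
                    }
                }
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
    }

    // Stand-in for a three-pass algorithm; it neither knows nor cares which Document it got.
    static int threePassCount(Document doc) {
        int[] total = {0};
        for (int pass = 0; pass < 3; pass++) {
            doc.forEachElementName(name -> total[0]++);
        }
        return total[0];
    }
}
```

With the streaming variant the source is read three times; with the in-memory variant it is read once. The result of threePassCount is the same either way.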




Copyright 2001 XML.org. This site is hosted by OASIS