--- peastman@drizzle.stanford.edu wrote:
> I'm working on developing a new style of XML parser.
Actually, based on what you describe, this has been
suggested a few times, and some PoCs exist. The last
person I remember suggesting it (or, as he put it,
"obsoleting the need for StAX") was Raf Schietekat.
So while not exactly new, I guess there is some merit
to the idea, since it keeps getting brought up.
Some notes/comments though:
...
> My idea is to create a single high level, DOM-like
> API which is suitable
> for both streaming and in-memory parsers. I believe
I am not convinced this is a good idea. It tends
either to converge on deferred node construction (which
Xerces already does, although its benefits have been
debated a lot), or to end up doing things the way
they'd be done in streaming. That is, I see it as a
Swiss Army knife combining two very different tools.
To me it's much more natural to layer things, so that
the tree builder sits strictly on top of the streaming
parser. That's how most current systems do it (XOM,
JDOM/Dom4j, even Xerces SAX+DOM).
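A minimal sketch of that layering, assuming StAX as the
streaming layer. LayeredTreeBuilder and Node are made-up
names for illustration, not taken from XOM, JDOM or
Xerces, and the sketch ignores attributes, namespaces
and CDATA sections:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class LayeredTreeBuilder {
    /** Hypothetical minimal tree node; not part of any real library. */
    public static final class Node {
        public final String name;
        public final List<Node> children = new ArrayList<>();
        public final StringBuilder text = new StringBuilder();
        Node(String name) { this.name = name; }
    }

    /** Builds a tree purely by consuming events from the streaming parser. */
    public static Node build(String xml) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            Deque<Node> open = new ArrayDeque<>(); // currently open elements
            Node root = null;
            while (in.hasNext()) {
                switch (in.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        Node n = new Node(in.getLocalName());
                        if (open.isEmpty()) {
                            root = n;
                        } else {
                            open.peek().children.add(n);
                        }
                        open.push(n);
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        if (!open.isEmpty()) {
                            open.peek().text.append(in.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        open.pop();
                        break;
                    default:
                        break; // comments, PIs etc. are skipped in this sketch
                }
            }
            in.close();
            return root;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        Node root = build("<order><item>tea</item><item>milk</item></order>");
        System.out.println(root.name + " has " + root.children.size() + " children");
    }
}
```

The point is only that the tree code never touches the
input directly; code that wants streaming can use the
reader as-is and skip the builder entirely.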
> this design has
> several advantages over existing parsers:
>
> - It's much easier to use than other streaming
> parsers like SAX or StAX,
> since you get to work with a high level, object
> oriented representation of
Possibly, but if you use it in the convenient way, you
tend to lose the potential performance benefits and
converge towards tree models. And to get the
benefits, you must limit yourself strictly to a subset
of operations, a subset that your API does not (and
cannot) enforce.
> the XML content. It's very similar to existing DOM
> APIs. The only
> restriction is that, if you're using a streaming
> parser, you're required
> to access the nodes in the order they appear in the
> file.
To me, however, this is the main problem: you pretty
much MUST build the tree, even if calling code _seems
to_ access things in order. Unless you force that code
to indicate something like "I promise to process them
in order, all the time", there's nothing you can do to
avoid buffering all the data. And that means
eager/deferred node construction. On the other hand,
if you do require some kind of hint, it's not
exactly a single API any more. It's a dualistic API
with two very different operational modes, and it's
questionable whether it's any easier than 2 clearly
separate APIs.
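To make the trade-off concrete, here is a toy sketch.
It is not a real parser: the iterator of strings stands
in for parser events, and ChildList and the
inOrderPromised flag are hypothetical names invented
for illustration. Without the promise, every child has
to be kept; with it, earlier children are discarded and
going back fails:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OrderedAccessDemo {
    /** Hypothetical DOM-like child list backed by a forward-only event source. */
    public static final class ChildList {
        private final Iterator<String> events;     // stand-in for the parser
        private final boolean inOrderPromised;     // the "hint" from the caller
        private final List<String> buffer = new ArrayList<>();
        private String current;
        private int consumed = 0;

        ChildList(Iterator<String> events, boolean inOrderPromised) {
            this.events = events;
            this.inOrderPromised = inOrderPromised;
        }

        public String get(int index) {
            if (index < 0) {
                throw new IndexOutOfBoundsException("negative index " + index);
            }
            // Pull events until the requested child has been read.
            while (consumed <= index) {
                if (!events.hasNext()) {
                    throw new IndexOutOfBoundsException("no child " + index);
                }
                current = events.next();
                consumed++;
                if (!inOrderPromised) {
                    buffer.add(current); // no promise: must keep everything
                }
            }
            if (!inOrderPromised) {
                return buffer.get(index);
            }
            if (index == consumed - 1) {
                return current;
            }
            throw new IllegalStateException("child " + index + " was already discarded");
        }

        /** Number of children currently held in memory by the buffer. */
        public int buffered() {
            return buffer.size();
        }
    }

    public static ChildList over(boolean inOrderPromised, String... names) {
        return new ChildList(Arrays.asList(names).iterator(), inOrderPromised);
    }

    public static void main(String[] args) {
        ChildList tree = over(false, "a", "b", "c");
        System.out.println(tree.get(2) + " then " + tree.get(0)
                + ", buffered=" + tree.buffered());

        ChildList stream = over(true, "a", "b", "c");
        System.out.println(stream.get(0) + " then " + stream.get(1)
                + ", buffered=" + stream.buffered());
        try {
            stream.get(0); // going backwards breaks the promise
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

Same method signatures in both modes, but the second
mode only works if the caller sticks to the contract,
which is the "same API, different usage" problem.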
Another concern is mutability: tree models
generally allow modifying the tree, and that's one
of the things that complicates full-blown tree models
(adds some overhead, prevents some optimizations, etc.).
Streaming models allow very limited mutability: in SAX
you can easily modify the current event; in StAX you
essentially have separate components (parser,
serializer). In both cases you modify the stream
serially. You could make the API read-only, but then
it'd be much more limited than existing options.
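For the StAX case, serial modification looks roughly
like the sketch below: read an event, optionally change
it, write it out. SerialRename and rename() are made-up
names for illustration, and the sketch assumes no
namespaces and drops comments and processing
instructions:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class SerialRename {
    /** Copies xml event by event, renaming elements called 'from' to 'to'. */
    public static String rename(String xml, String from, String to) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(out);
            while (in.hasNext()) {
                switch (in.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String name = in.getLocalName();
                        // The only "mutation": swap the name of the current event.
                        w.writeStartElement(name.equals(from) ? to : name);
                        for (int i = 0; i < in.getAttributeCount(); i++) {
                            w.writeAttribute(in.getAttributeLocalName(i),
                                    in.getAttributeValue(i));
                        }
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        w.writeCharacters(in.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        // The writer tracks open elements, so no name is needed here.
                        w.writeEndElement();
                        break;
                    default:
                        break; // other event types are dropped in this sketch
                }
            }
            w.flush();
            w.close();
            in.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not process input: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rename("<a><b>x</b></a>", "b", "c"));
    }
}
```

Nothing is ever held beyond the current event, which is
why the changes you can make this way are so limited
compared to a mutable tree.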
> - Switching from an in-memory parser to a streaming
> parser (or vice versa)
> is much easier than it would be with any other two
> parsers, because both
> of them use exactly the same API. You can even
Note, though, that using the same API and using the
API the same way (usage patterns) are not the same thing.
...
> - Many utilities can be written once, then used with
> either parser.
Maybe you have examples of such use cases in mind?
Having said all of the above, good luck with your
proposal; it can be fun developing new ways to deal
with old problems. ;-)
-+ Tatu +-