7/11/2002 7:32:19 PM, firstname.lastname@example.org wrote:
> 2. Most people mention SAX can handle files larger
>than memory, but I am thinking, is this really the case,
>because files are read into the kernel buffer, so large
>files still have to be read into the memory, just not in
>user space. Am I right?
DOM builders generally load the entire document into a tree structure.
SAX operates at parse time; it can call a user-defined function
for each element, attribute list, entity reference, etc. The application
can choose to either process the XML data and throw it away (meaning
that the total size of the document is independent of the memory
usage) or build another data structure, store the data in a DBMS,
or whatever. This provides the usual tradeoff -- more work for the
application programmer but more control over resource usage.
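As a minimal sketch of that process-and-discard style, here is a SAX handler using Python's standard xml.sax module (the document string and the CountHandler class are just illustrations, not part of any particular application):

```python
import xml.sax
from io import StringIO

class CountHandler(xml.sax.ContentHandler):
    """Tally element names as they stream past; nothing else is retained."""
    def __init__(self):
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per element at parse time; the parser keeps no tree.
        self.counts[name] = self.counts.get(name, 0) + 1

doc = "<orders><order id='1'/><order id='2'/></orders>"
handler = CountHandler()
xml.sax.parse(StringIO(doc), handler)
print(handler.counts)  # {'orders': 1, 'order': 2}
```

Memory use here depends only on what the handler chooses to keep (a small dict of tallies), not on how many `<order>` elements stream past.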
> 3. DOM is memory-thirsty, according to most articles I
>read. So DOM's performance lags, does anyone run any type
>of profiling, and I am interested in why it is memory
>hungry, and poor in terms of performance.
It is quite true that if one simply defines classes that
directly implement all the DOM interfaces, each Node will be
fairly large because of all the properties and methods defined
on the basic Node interface. The DOM exposes several
different models of an XML document: a tree with parents,
children, and siblings; lists of nodes containing lists of
nodes; a more OO conception of Document, Element, Attribute, etc.
objects; and a more abstract model where the document is
traversed via iterators.
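To make the tree-navigation model concrete, here is a small sketch using Python's stdlib xml.dom.minidom (the `<book>` document is invented for illustration):

```python
from xml.dom.minidom import parseString

doc = parseString("<book><title>SAX vs DOM</title><price>10</price></book>")
root = doc.documentElement      # Element view of the document
title = root.firstChild         # navigate via parent/child/sibling links
price = title.nextSibling
print(price.parentNode is root) # True -- every node carries links to its relatives
print(price.firstChild.data)    # 10
```

All of those cross-links (parent, siblings, children) exist on every node at once, which is part of why a straightforward DOM implementation is heavy per node.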
Still, this is an implementation issue, not intrinsic to the DOM
API. There are some DOM implementations that are "lazy", i.e.,
only build actual objects implementing the DOM interface when
a specific part of the document is accessed. There are also
persistent DOMs, where the parser essentially loads a database
that is then navigated and queried on demand. Both these
techniques would be less memory hungry than a straightforward
implementation of the spec.
> 4. What do people think of pull type parsers and DOM
>SAX hybrids? Are these popular and stable?
There's been a lot written on this, but you'll probably have
to sort it out for yourself. A simple Googling for
"xml pull parser performance" yields quite a number of
articles. It's probably something to consider if you have
lots of data and relatively constrained processors, but
a well-defined application. I'd say in general that
the more flexibility you need, the more you need a DOM-like
API; the more you can constrain EXACTLY what the application will
do with each bit of markup, the more you can exploit a streaming API.
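As one illustration of the pull style, Python's standard xml.etree.ElementTree.XMLPullParser lets the application ask for parse events when it is ready for them, rather than being called back SAX-style (the `<log>` chunks below are invented):

```python
from xml.etree.ElementTree import XMLPullParser

parser = XMLPullParser(events=("end",))
seen = []
# Feed the document incrementally, as it might arrive from a socket or file.
for chunk in ("<log><entry>a</entry>", "<entry>b</entry></log>"):
    parser.feed(chunk)
    for event, elem in parser.read_events():  # pull: we decide when to consume
        if elem.tag == "entry":
            seen.append(elem.text)
            elem.clear()                      # discard processed data
print(seen)  # ['a', 'b']
```

The application drives the loop, so it controls both when parsing advances and how much parsed data stays in memory.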
> 5. Is it possible for SAX to support XSLT?
Well, several (most?) XSLT implementations can use a SAX parser
to build the tree for transformation. Strictly speaking, however,
you don't get the unbounded-document-size / memory-efficiency
advantages of SAX here: a conformant XSLT implementation
must keep the entire document around, because the stylesheet can
refer to arbitrary pieces of it.
There are extensions to SAXON, I believe, that support more
efficient use of memory by having the user tell the XSLT
engine which sections of the document to look at ... see the
<saxon:preview> extension element. There are also occasional
discussions of "streaming XSLT" processors (I don't know if
any actually exist in a stable, available form) but they would
have to operate on a subset of XSLT. I should probably shut up
and let someone who knows what they're talking about explain the details.