OASIS Mailing List Archives


Re: [Question] How to do incremental parsing?

From: <Tony.Coates@reuters.com>
> On 04/07/2001 01:27:28 "Xu, Mousheng  (SEA)" wrote:
> >A problem of all the current XML parsers is that they at least read the
> >whole XML document into the input stream, which can consume a lot of
> >memory when the XML is big (e.g. 1 GB).
>
> So, "use SAX or a persistent DOM" for large XML files/streams is what I
> would suggest.

I agree with David and Tony that both direct SAX and persistent DOMs can be
good solutions.

One alternative you might find useful is to use a document object model to
parse your large document, but in a 'pruning mode'. Massive documents
(e.g. 1 GB) are often database-generated and contain many 'rows'
(document fragments) which can be processed individually, without
requiring the entire document in memory at once. e.g.

<products name="foo">
    <product id="1"/>
    <product id="2"/>
    <!-- many more rows -->
    <product id="10000000"/>
</products>

For example, the dom4j project has an event-based callback mechanism, like
SAX, which can be used to process the 'rows' of a massive document one at a
time; each row can then be pruned from the tree when you are finished with
it and garbage collected.


The neat thing about this is that you are called back with a complete, valid
Document object containing only one row (<product>) at a time, and you can
still use dom4j's XPath support on all aspects of that Document, as well as
using XSLT.
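dom4j itself is a Java library, but the same row-by-row pruning idea can be
sketched with the `iterparse` function in Python's standard library, if that
helps make the pattern concrete (the sample document and element names here
are made up for illustration; a real input would be a large file on disk):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory sample; in practice this would be a huge file.
BIG_XML = io.BytesIO(b"""<products name="foo">
    <product id="1"><price>10</price></product>
    <product id="2"><price>20</price></product>
    <product id="3"><price>30</price></product>
</products>""")

total = 0
for event, elem in ET.iterparse(BIG_XML, events=("end",)):
    if elem.tag == "product":
        # The <product> subtree is complete at this point, so path-style
        # queries work on it just as they would on a whole document.
        price = elem.find("price")
        total += int(price.text)
        # Prune the processed row: clear() discards its children and
        # attributes so the memory can be reclaimed.
        elem.clear()

print(total)
```

Only one row's subtree is fully built at any moment, so memory use stays
roughly constant regardless of how many rows the document contains.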

There's an example in the FAQ here:-



