RE: [Question] How to do incremental parsing?



For large documents where:
	a) the target data is sparse (small pieces scattered throughout a large
document), and
	b) the location of the data is known (which branch of the tree it hangs
off),
you might get the best performance, in both speed and memory, from a
pull-based parser like kXML (www.enhydra.org). With a pull-based parser, you
skip cheaply over the nodes you're not interested in until you reach a node
whose content you want.
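
To make the idea concrete, here is a minimal sketch of the pull style. It is
not kXML code: it uses Python's standard xml.dom.pulldom module as a stand-in,
and the function name find_texts, the SAMPLE document, and the tag names in it
are all made up for illustration. The loop pulls one event at a time, ignores
everything except the start of the element it wants, and only then asks the
parser to build a small DOM subtree for that one element.

```python
import io
from xml.dom import pulldom

# Hypothetical sample document: two <price> values buried among nodes
# the application does not care about.
SAMPLE = (
    b"<catalog>"
    b"<junk><a/><b>skip</b></junk>"
    b"<item><price>42</price></item>"
    b"<junk/>"
    b"<item><price>7</price></item>"
    b"</catalog>"
)


def find_texts(stream, target_tag):
    """Yield the direct text content of every <target_tag> element.

    Events are pulled one at a time; nodes that do not match are skipped,
    and a DOM subtree is built only for the matching elements.
    """
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == target_tag:
            # Read ahead to the matching end tag and attach the children
            # to this node only.
            events.expandNode(node)
            yield "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )


if __name__ == "__main__":
    for text in find_texts(io.BytesIO(SAMPLE), "price"):
        print(text)
```

Because find_texts is a generator, the caller can stop early, and the rest of
the stream is then never read. Whether this actually beats SAX or a full DOM
on a very large file would still need to be measured.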

Caveats: 
1) I don't know whether anyone has actually done performance testing to
verify the above claim, and 
2) kXML, at least, has some limitations, quote:
- kXML does not support user defined (external) entities. 
- The doctype declaration is not parsed. However, a corresponding "legacy
event" is generated by the parser, so application programmers are able to
parse the doctype declaration themselves.

> -----Original Message-----
> From: Xu, Mousheng (SEA) [mailto:Mousheng.Xu@sea.celltechgroup.com]
> Sent: Tuesday, July 03, 2001 5:27 PM
> To: 'xml-dev@lists.xml.org'
> Subject: [Question] How to do incremental parsing?
> 
> 
> Dear all,
> 
> A problem with all the current XML parsers is that they at least read the
> whole XML document into the input stream, which can consume a lot of memory
> when the XML is big (e.g. 1 GB).
> 
> One way to get around the problem would be to read the XML file into memory
> gradually, as needed. I would like to build such a DOM parser, but I am
> not familiar with the design of the Xerces XML parsers. Could someone give
> me a suggestion on how to tackle the problem? The most critical part
> would be the method to parse an element. If reading the whole document into
> memory is inevitable, then I would like to borrow the method that parses the
> input stream to get the next element.
> 
> Your help is highly appreciated.
> 
> Thanks in advance.
> 
> -- Mousheng Xu 
> 
> 