On Thursday 29 April 2004 3:28 pm, you wrote:
> > >At the very least I need to be able to sequentially process a large
> > >document and extract an identified sub-tree (ideally denoted by an
> > >XPath expression) for run-of-the-mill tools to manipulate. I assume
> > >such a beast would need to be based on a SAX parser.
> > I did exactly that in Python. I considered building an engine that
> > could filter SAX events to those that match a limited version of
> > XPath, but ran out of gas. I ended up with just a regular SAX
> > application.
> Interesting - I always thought such a thing would be useful, but haven't
> come across an implementation.
The main problem is obviously getting a good range of expression types to
evaluate correctly and at high performance; it's a hard problem. A good
starting point for research in this area is http://xmltk.sourceforge.net/.
The software there is somewhat behind in functional terms, but as a free and
easy solution for manipulating large documents it's good value.
At the 200-300MB level I would not rule out XSLT as a solution, although you
would have to set up your environment carefully, in particular the available
memory and the choice of XSLT processor.
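
For what it's worth, the "regular SAX application" route mentioned above can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not anyone's actual code from this thread: the names SubtreeExtractor and extract are made up for the example, and the "path" it accepts is only a plain absolute child path such as /library/book (no predicates, wildcards or other axes). Only the matched sub-trees are ever built in memory; everything else is discarded as the events stream past.

```python
import io
import xml.sax
from xml.etree import ElementTree as ET


class SubtreeExtractor(xml.sax.ContentHandler):
    """Hand each subtree whose element path equals `path` to `on_match`.

    `path` is a plain absolute child path such as "/library/book";
    this is an assumption of the sketch, not a real XPath engine.
    """

    def __init__(self, path, on_match):
        super().__init__()
        self.target = path.strip("/").split("/")
        self.on_match = on_match
        self.stack = []      # names of the currently open elements
        self.builder = None  # TreeBuilder, active only inside a match
        self.depth = 0       # nesting depth inside the matched subtree

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.builder is None and self.stack == self.target:
            self.builder = ET.TreeBuilder()
        if self.builder is not None:
            self.builder.start(name, dict(attrs.items()))
            self.depth += 1

    def characters(self, content):
        if self.builder is not None:
            self.builder.data(content)

    def endElement(self, name):
        if self.builder is not None:
            self.builder.end(name)
            self.depth -= 1
            if self.depth == 0:
                # Subtree complete: pass it on and drop our reference.
                self.on_match(self.builder.close())
                self.builder = None
        self.stack.pop()


def extract(source, path):
    """Convenience wrapper that collects all matches in a list."""
    found = []
    xml.sax.parse(source, SubtreeExtractor(path, found.append))
    return found


if __name__ == "__main__":
    doc = (b"<library><book id='1'><title>A</title></book>"
           b"<misc><book id='x'/></misc>"
           b"<book id='2'><title>B</title></book></library>")
    for book in extract(io.BytesIO(doc), "/library/book"):
        print(book.get("id"), book.find("title").text)
```

For a really large document you would pass your own callback to SubtreeExtractor rather than use extract(), so each sub-tree can be processed and released instead of accumulating in a list.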