xml-dev - Re: [xml-dev] Handling very large instance docs

To: "Karl Waclawek" <karl@waclawek.net>,<xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Handling very large instance docs
From: Kevin Jones <kjouk@yahoo.co.uk>
Date: Thu, 29 Apr 2004 17:03:48 +0100
In-reply-to: <005101c42df6$3fd78470$9e539696@citkwaclaww2k>
References: <p06020415bcb6ad5e7762@[89.10.0.19]> <p05200000bcb6b91b64d3@[128.253.109.55]> <005101c42df6$3fd78470$9e539696@citkwaclaww2k>

On Thursday 29 April 2004 3:28 pm, you wrote:
> > >At the very least I need to be able to sequentially process a large
> > >document and extract an identified sub-tree (ideally denoted by an
> > >XPath expression) for run-of-the-mill tools to manipulate. I assume
> > >such a beast would need to be based on a SAX parser.
> >
> > I did exactly that in Python.  I considered building an engine that
> > could filter SAX events to those that match a limited version of
> > XPath, but ran out of gas.  I ended up with just a regular SAX
> > application.
>
> Interesting - I always thought such a thing is useful, but haven't
> come across an implementation.
>

The main problem is obviously getting a good range of expression types to 
evaluate correctly and at high performance; it's a hard problem. A good 
starting point for research in this area is http://xmltk.sourceforge.net/. 
The software there is somewhat behind in functional terms, but as a free and 
easy solution for manipulating large documents it's good value.
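
To make the idea concrete, here is a minimal sketch in Python of the simplest 
case: a SAX handler that pulls out every sub-tree matching a plain absolute 
child path such as /catalog/book and hands it to ordinary ElementTree code. 
It is not how xmltk works, and it understands no predicates, wildcards or 
axes; the names SubtreeExtractor and extract are made up for illustration.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET


class SubtreeExtractor(xml.sax.ContentHandler):
    """Hand each element whose path from the root equals `path` to a callback.

    Assumption: only plain absolute child paths like "/catalog/book" are
    supported. This is a tiny subset of XPath, not an XPath engine.
    """

    def __init__(self, path, on_match):
        super().__init__()
        self.target = path.strip("/").split("/")
        self.on_match = on_match   # called with each completed sub-tree
        self.stack = []            # names of the currently open elements
        self.builder = None        # TreeBuilder, set only inside a match
        self.depth = 0             # nesting depth inside the current match

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.builder is None and self.stack == self.target:
            self.builder = ET.TreeBuilder()
        if self.builder is not None:
            self.depth += 1
            self.builder.start(name, dict(attrs.items()))

    def characters(self, content):
        if self.builder is not None:
            self.builder.data(content)

    def endElement(self, name):
        if self.builder is not None:
            self.builder.end(name)
            self.depth -= 1
            if self.depth == 0:
                # The matched element is complete: pass it on and forget it,
                # so memory use is bounded by the largest single sub-tree.
                self.on_match(self.builder.close())
                self.builder = None
        self.stack.pop()


def extract(source, path):
    """Convenience wrapper that collects all matches in a list.

    For a really large document you would pass a callback that processes
    each sub-tree immediately instead of keeping them all.
    """
    found = []
    xml.sax.parse(source, SubtreeExtractor(path, found.append))
    return found


doc = (b"<catalog><book id='1'><title>A</title></book><cd/>"
       b"<book id='2'><title>B</title></book></catalog>")
books = extract(io.BytesIO(doc), "/catalog/book")
```

Each entry in books is a normal ElementTree element, so run-of-the-mill tools 
can take over from there. The hard part mentioned above starts as soon as you 
want predicates or descendant steps, because a match can then no longer be 
decided from the stack of open element names alone.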

At the 200-300 MB level I would not rule out XSLT as a solution, although you 
would have to set up your environment carefully, in particular the available 
memory and the choice of XSLT processor.

Kev.

References:
- Handling very large instance docs
  - From: Andy Greener <andy@gid.co.uk>
- Re: [xml-dev] Handling very large instance docs
  - From: Joel Bender <jjb5@cornell.edu>
- Re: [xml-dev] Handling very large instance docs
  - From: "Karl Waclawek" <karl@waclawek.net>
