Re: [xml-dev] I have implemented SAX based XPath Engine
From: Philippe Poulard <philippe.poulard@sophia.inria.fr>
To: Michael Kay <mike@saxonica.com>
Date: Fri, 20 Feb 2009 17:25:07 +0100
Hi,
A few years ago I implemented such a tool; the overall design is described here:
http://reflex.gforge.inria.fr/saxPatterns.html
It is designed to filter SAX streams with XPath-based patterns, but it also has useful filters for processing text (non-XML) inputs:
http://reflex.gforge.inria.fr/tutorial-pipelinesAndFilters.html
XPath is rather well supported: position() and last() work, but the preceding:: and preceding-sibling:: axes are not supported except in a few circumstances; for those cases I propose a workaround when filtering a SAX stream, which consists of juggling with a local DOM subtree when necessary.
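For illustration, here is a rough, self-contained sketch of what that kind of workaround can look like (this is not RefleX's actual code; the trigger element name "record" and the XPath expression are placeholders): a SAX filter captures the subtree of the element that needs a backward axis into a small DOM through a TransformerHandler, evaluates XPath on that local tree once the subtree is complete, and lets the events flow downstream untouched.

import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class LocalSubtreeFilter extends XMLFilterImpl {
    private TransformerHandler buffer;  // captures the current subtree, if any
    private DOMResult subtree;
    private int depth;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (buffer == null && "record".equals(local)) {   // placeholder trigger element
            try {
                buffer = ((SAXTransformerFactory) SAXTransformerFactory.newInstance())
                        .newTransformerHandler();
                subtree = new DOMResult();
                buffer.setResult(subtree);
                buffer.startDocument();
            } catch (Exception e) {
                throw new SAXException(e);
            }
        }
        if (buffer != null) {
            buffer.startElement(uri, local, qName, atts);
            depth++;
        }
        super.startElement(uri, local, qName, atts);  // events still flow downstream
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (buffer != null) buffer.characters(ch, start, len);
        super.characters(ch, start, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (buffer != null) {
            buffer.endElement(uri, local, qName);
            if (--depth == 0) {              // the captured subtree is complete
                buffer.endDocument();
                try {
                    // backward axes are now affordable on the small in-memory tree
                    String name = XPathFactory.newInstance().newXPath().evaluate(
                            "name((//*)[last()]/preceding-sibling::*[1])", subtree.getNode());
                    // ... use the result to decide whether the pattern matches ...
                } catch (Exception e) {
                    throw new SAXException(e);
                } finally {
                    buffer = null;
                }
            }
        }
        super.endElement(uri, local, qName);
    }
}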
To answer Michael's question about how predicates are evaluated when reading forward is required: the engine uses a lookahead buffer and keeps reading until the predicate becomes evaluable. For that purpose, as explained in the article, the engine uses coroutines implemented with threads (one to evaluate the predicate, the other to hold the position in the call stack of the current startElement() event). I think that a finite state machine based on a pull parser would be much more efficient: although this works reasonably well, I have noticed that it runs slowly when I use many XPath patterns in a pipeline made of many filters, and that can be an issue when reading gigabytes of XML.
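To give a more concrete idea of the thread-based coroutine handoff, here is a minimal sketch of the general technique (assumed for illustration, not the engine's actual code; the predicate [count(child) > n] and all the names are made up): the SAX side stays parked in the startElement() of the context element and hands each read-ahead child name to an evaluator thread through synchronous queues, and the evaluator answers as soon as the predicate becomes decidable.

import java.util.concurrent.SynchronousQueue;

public class LookaheadPredicate {
    public enum Verdict { UNDECIDED, MATCHES, FAILS }

    private final SynchronousQueue<String> events = new SynchronousQueue<>();   // read-ahead element names
    private final SynchronousQueue<Verdict> answers = new SynchronousQueue<>();

    /** Evaluates the hypothetical predicate [count(childName) > moreThan]. */
    public LookaheadPredicate(String childName, int moreThan) {
        Thread evaluator = new Thread(() -> {
            try {
                int count = 0;
                while (true) {
                    String name = events.take();        // wait for the next read-ahead event
                    if ("#end".equals(name)) {          // context element closed: verdict is forced
                        answers.put(count > moreThan ? Verdict.MATCHES : Verdict.FAILS);
                        return;
                    }
                    if (childName.equals(name) && ++count > moreThan) {
                        answers.put(Verdict.MATCHES);   // decidable early: stop reading ahead
                        return;
                    }
                    answers.put(Verdict.UNDECIDED);     // still needs more lookahead
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        evaluator.setDaemon(true);
        evaluator.start();
    }

    /** Called from the SAX side, typically while it is still "parked" in the
     *  startElement() of the context element: hand over one read-ahead child
     *  name (or "#end" when the context element closes) and block until the
     *  evaluator thread has answered. */
    public Verdict offer(String elementName) throws InterruptedException {
        events.put(elementName);
        return answers.take();
    }
}

In the real engine the events would come from the lookahead buffer and be replayed downstream once the verdict is known; plain element names stand in here for full SAX events, and the caller must stop offering events once it gets a decided verdict.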
I also know that things here and there have to be optimized. For example, instead of evaluating the entire set of XPath patterns on each event, I could recognize that a subset is irrelevant for a given branch and discard it in that branch (currently it doesn't work like that). There is also room for improvement in partial evaluation, specifically when comparison operators are involved: for example, [count(foo)>9] should exit as soon as the 10th <foo> is met rather than after the 1,000,000th specimen has been read. I have imagined a strategy where the count() function would return something other than a plain number: a numeric object, evaluable several times by the operator, that could fetch more data on demand until the NumberThatIsAtLeast object reaches (or not) the expected value. Lots of work in sight.
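To make that idea concrete, here is a tiny sketch (the class name NumberThatIsAtLeast comes from the paragraph above; everything else is invented for illustration): the comparison operator drives an object that pulls just enough matches to answer, instead of forcing a full count.

import java.util.Iterator;

public class NumberThatIsAtLeast {
    private final Iterator<?> matches;   // lazily produced nodes matching "foo"
    private long seen;                   // matches pulled so far

    public NumberThatIsAtLeast(Iterator<?> matches) {
        this.matches = matches;
    }

    /** Answers count() > threshold by pulling just enough input: it stops at the
     *  (threshold + 1)-th match instead of exhausting the stream. */
    public boolean greaterThan(long threshold) {
        while (seen <= threshold && matches.hasNext()) {
            matches.next();
            seen++;
        }
        return seen > threshold;
    }
}

With such an object, [count(foo)>9] would amount to new NumberThatIsAtLeast(foos).greaterThan(9), which returns true as soon as the 10th <foo> is met and never reads further than needed.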
Of course I will have a look at Santhosh's work :)
--
Best regards,
///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !