Re: [xml-dev] I have implemented SAX based XPath Engine
- From: Philippe Poulard <philippe.poulard@sophia.inria.fr>
- To: Michael Kay <mike@saxonica.com>
- Date: Fri, 20 Feb 2009 18:14:08 +0100
Michael Kay wrote:
>> A few years ago I implemented such a tool; the global
>> design is described here:
>> http://reflex.gforge.inria.fr/saxPatterns.html
>
> Thanks, I had seen this article before but it was worth rereading.
>
> I'm puzzled by the concept of keeping exactly four counters. For an
> expression like
>
> child::p[@a=17][5]
You're right, and grouping is not supported either:
this one works:
/foo/*/*[last()]
but this one doesn't :(
(/foo/*/*)[last()]
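For the record, the difference between the two is easy to check with the
standard javax.xml.xpath API (a DOM-based check, just to illustrate the
semantics, nothing to do with the streaming engine):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class GroupingDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<foo><a><x/><y/></a><b><z/></b></foo>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPathFactory xpf = XPathFactory.newInstance();

        // last child of *each* parent: selects <y/> and <z/>
        NodeList perParent = (NodeList) xpf.newXPath()
                .evaluate("/foo/*/*[last()]", doc, XPathConstants.NODESET);

        // last node of the whole grouped node-set: selects only <z/>
        NodeList grouped = (NodeList) xpf.newXPath()
                .evaluate("(/foo/*/*)[last()]", doc, XPathConstants.NODESET);

        System.out.println(perParent.getLength()); // 2
        System.out.println(grouped.getLength());   // 1
    }
}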
> you need a counter of how many nodes have satisfied the @a=17 predicate. I
> would have thought it was better to generalize this so that a counter is
> always relative to a predicate, for example p[17] is treated as
> child::node()[self::p][17] and the counter is relative to the predicate
> [self::p].
Great idea!
I will correct my documentation about those unsupported cases.
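Something like this, I guess (a rough standalone sketch with invented names,
not what the engine does today): a counter that is relative to the [@a='17']
predicate, for child::p[@a='17'][5]:

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CounterPerPredicate extends DefaultHandler {
    // one counter per open parent: it counts only the children
    // that already satisfied the non-positional predicate
    private final Deque<int[]> counters = new ArrayDeque<>();

    @Override
    public void startDocument() {
        counters.push(new int[1]);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // evaluate [self::p][@a='17'] first (string comparison, for the sketch)...
        if ("p".equals(local) && "17".equals(atts.getValue("a"))) {
            int[] counter = counters.peek();
            counter[0]++;              // ...the counter is relative to that predicate
            if (counter[0] == 5) {
                System.out.println("matched child::p[@a='17'][5]");
            }
        }
        counters.push(new int[1]);     // fresh counter for this element's own children
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        counters.pop();
    }
}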
> Doing this using threads does look like a recipe for adding performance
> overheads. (Saxon's streaming sometimes uses two threads, but never more).
I use 2 threads for each filter, so in a pipeline of n filters there
will be n+1 threads in total.
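The handoff between the two threads of a filter is a blocking rendezvous,
roughly like this very simplified sketch (not the actual RefleX code): one
side holds the SAX call stack while the other evaluates the predicate.

import java.util.concurrent.SynchronousQueue;

public class CoroutinePair {
    private final SynchronousQueue<String> toEvaluator = new SynchronousQueue<>();
    private final SynchronousQueue<Boolean> toSax = new SynchronousQueue<>();

    public CoroutinePair() {
        Thread evaluator = new Thread(() -> {
            try {
                int count = 0;
                while (true) {
                    String event = toEvaluator.take(); // suspended until an event arrives
                    if ("foo".equals(event)) {
                        count++;
                    }
                    // pretend the predicate is [count(foo) > 2]
                    toSax.put(count > 2);              // resume the SAX side
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        evaluator.setDaemon(true);
        evaluator.start();
    }

    // called from startElement(): blocks the SAX call stack while the evaluator runs
    public boolean onStartElement(String localName) throws InterruptedException {
        toEvaluator.put(localName);
        return toSax.take();
    }

    public static void main(String[] args) throws InterruptedException {
        CoroutinePair pair = new CoroutinePair();
        for (String e : new String[] { "foo", "bar", "foo", "foo" }) {
            System.out.println(e + " -> predicate satisfied: " + pair.onStartElement(e));
        }
    }
}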
> I
> would have thought the great advantage of using a push model is that you can
> push the same events to multiple destinations without the use of multiple
> threads. Saxon does this when validating instance documents against XML
> Schema identity constraints, where any number of constraints can be active
> at any time. I agree with your comment, however, that the ideal would be to
> compile some kind of finite state machine. That would work, of course,
> regardless of whether the evaluation is in push or pull mode.
>
> The XSL WG has been doing work on streaming over the last year or so (hence
> my interest). There's nothing available to publish yet, but we've made some
> good progress and have built up quite a substantial collection of use cases.
> We've been trying to define constructs that are amenable to "pure streaming"
> without any buffering or lookahead, but it's difficult to draw the
> boundaries.
Very interesting, I look forward to reading all that!
Best regards,
Philippe
>
> Michael Kay
> http://www.saxonica.com/
>
>> It is designed to filter SAX streams with XPath-based
>> patterns, but it also has useful filters for processing text
>> (non-XML) inputs:
>> http://reflex.gforge.inria.fr/tutorial-pipelinesAndFilters.html
>>
>> XPath is rather well supported: position() and last() are
>> handled, but the preceding:: and preceding-sibling:: axes are
>> not supported except in a few circumstances; I also propose a
>> workaround when filtering a SAX stream (juggling with a local
>> DOM subtree when necessary).
>>
>> To answer Michael about how predicates are evaluated when
>> reading forward is required, the engine uses a lookahead
>> buffer and goes on reading until the actual predicate becomes
>> evaluable; for that purpose, as explained in the article, the
>> engine uses coroutines that are implemented using threads
>> (one to evaluate the predicate, the other to hold the
>> position in the call stack of the current startElement()
>> event); I think that a finite state machine based on a pull
>> parser would be much more efficient: although it works
>> reasonably well, I have noticed that it runs slowly when I use
>> lots of XPath patterns in a pipeline made of lots of filters,
>> and it can be an issue when reading GB of XML. I also know
>> that things here and there have to be optimized: for example,
>> instead of evaluating the entire set of XPath patterns on
>> each event, I could recognize that a subset is irrelevant for
>> a given branch and discard it in that branch (but currently
>> it doesn't work like that); partial evaluation could also be
>> done better when comparison operators are involved, for
>> example [count(foo)>9] should exit as soon as 10 <foo>s have
>> been met rather than after the 1,000,000th one has been read.
>> I have imagined a strategy where the count() function would
>> return something other than a plain number: a numeric object,
>> evaluable several times by the operator, that could fetch
>> more data on demand until the NumberThatIsAtLeast object
>> reaches (or not) the expected value. Lots of work in sight.
>>
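To make that last point a little more concrete, the "numeric object
evaluable several times" could look like this very rough sketch (invented
names, not actual code from the engine):

import java.util.Iterator;

public class LazyCount {
    private final Iterator<String> events; // stand-in for the lookahead buffer
    private final String name;
    private int seen = 0;

    public LazyCount(Iterator<String> events, String name) {
        this.events = events;
        this.name = name;
    }

    // true as soon as the count is known to exceed n; reads only as far as needed
    public boolean greaterThan(int n) {
        while (seen <= n && events.hasNext()) {
            if (events.next().equals(name)) {
                seen++;
            }
        }
        return seen > n;
    }

    public static void main(String[] args) {
        Iterator<String> stream = java.util.Collections.nCopies(1000000, "foo").iterator();
        LazyCount count = new LazyCount(stream, "foo");
        System.out.println(count.greaterThan(9)); // true, after consuming only 10 events
    }
}

so that [count(foo)>9] stops pulling events after the 10th <foo>.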
>> Of course I will have a look at Santhosh's work :)
>>
>
--
Regards,
///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !