Re: [xml-dev] I have implemented SAX based XPath Engine
- From: Philippe Poulard <philippe.poulard@sophia.inria.fr>
- To: Michael Kay <mike@saxonica.com>
- Date: Fri, 20 Feb 2009 18:14:08 +0100
Michael Kay wrote:
>> A few years ago I implemented such a tool; the global
>> design is described here:
>> http://reflex.gforge.inria.fr/saxPatterns.html
>
> Thanks, I had seen this article before but it was worth rereading.
>
> I'm puzzled by the concept of keeping exactly four counters. For an
> expression like
>
> child::p[@a=17][5]
You're right, and grouping is not supported either:
this one works:
/foo/*/*[last()]
but this one doesn't :(
(/foo/*/*)[last()]
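For the record, the difference between the two is easy to check with the
standard javax.xml.xpath API (a DOM-based check, just to illustrate the
semantics, nothing to do with the streaming engine):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class GroupingDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<foo><a><x/><y/></a><b><z/></b></foo>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPathFactory xpf = XPathFactory.newInstance();

        // last child of *each* parent: selects <y/> and <z/>
        NodeList perParent = (NodeList) xpf.newXPath()
                .evaluate("/foo/*/*[last()]", doc, XPathConstants.NODESET);

        // last node of the whole grouped node-set: selects only <z/>
        NodeList grouped = (NodeList) xpf.newXPath()
                .evaluate("(/foo/*/*)[last()]", doc, XPathConstants.NODESET);

        System.out.println(perParent.getLength()); // 2
        System.out.println(grouped.getLength());   // 1
    }
}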
> you need a counter of how many nodes have satisfied the @a=17 predicate. I
> would have thought it was better to generalize this so that a counter is
> always relative to a predicate, for example p[17] is treated as
> child::node()[self::p][17] and the counter is relative to the predicate
> [self::p].
Great idea!
I will correct my documentation about those unsupported cases.
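Something like this, I guess (a rough standalone sketch with invented names,
not what the engine does today): a counter that is relative to the [@a='17']
predicate, for child::p[@a='17'][5]:

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CounterPerPredicate extends DefaultHandler {
    // one counter per open parent: it counts only the children
    // that already satisfied the non-positional predicate
    private final Deque<int[]> counters = new ArrayDeque<>();

    @Override
    public void startDocument() {
        counters.push(new int[1]);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // evaluate [self::p][@a='17'] first (string comparison, for the sketch)...
        if ("p".equals(local) && "17".equals(atts.getValue("a"))) {
            int[] counter = counters.peek();
            counter[0]++;              // ...the counter is relative to that predicate
            if (counter[0] == 5) {
                System.out.println("matched child::p[@a='17'][5]");
            }
        }
        counters.push(new int[1]);     // fresh counter for this element's own children
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        counters.pop();
    }
}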
> Doing this using threads does look like a recipe for adding performance
> overheads. (Saxon's streaming sometimes uses two threads, but never more).
I use 2 threads for each filter, so in a pipeline of n filters there
will be n+1 threads in total.
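The handoff between the two threads of a filter is a blocking rendezvous,
roughly like this very simplified sketch (not the actual RefleX code): one
side holds the SAX call stack while the other evaluates the predicate.

import java.util.concurrent.SynchronousQueue;

public class CoroutinePair {
    private final SynchronousQueue<String> toEvaluator = new SynchronousQueue<>();
    private final SynchronousQueue<Boolean> toSax = new SynchronousQueue<>();

    public CoroutinePair() {
        Thread evaluator = new Thread(() -> {
            try {
                int count = 0;
                while (true) {
                    String event = toEvaluator.take(); // suspended until an event arrives
                    if ("foo".equals(event)) {
                        count++;
                    }
                    // pretend the predicate is [count(foo) > 2]
                    toSax.put(count > 2);              // resume the SAX side
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        evaluator.setDaemon(true);
        evaluator.start();
    }

    // called from startElement(): blocks the SAX call stack while the evaluator runs
    public boolean onStartElement(String localName) throws InterruptedException {
        toEvaluator.put(localName);
        return toSax.take();
    }

    public static void main(String[] args) throws InterruptedException {
        CoroutinePair pair = new CoroutinePair();
        for (String e : new String[] { "foo", "bar", "foo", "foo" }) {
            System.out.println(e + " -> predicate satisfied: " + pair.onStartElement(e));
        }
    }
}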
> I
> would have thought the great advantage of using a push model is that you can
> push the same events to multiple destinations without the use of multiple
> threads. Saxon does this when validating instance documents against XML
> Schema identity constraints, where any number of constraints can be active
> at any time. I agree with your comment, however, that the ideal would be to
> compile some kind of finite state machine. That would work, of course,
> regardless of whether the evaluation is in push or pull mode.
>
> The XSL WG has been doing work on streaming over the last year or so (hence
> my interest). There's nothing available to publish yet, but we've made some
> good progress and have built up quite a substantial collection of use cases.
> We've been trying to define constructs that are amenable to "pure streaming"
> without any buffering or lookahead, but it's difficult to draw the
> boundaries.
Very interesting, I look forward to reading all that!
Best regards,
Philippe
>
> Michael Kay
> http://www.saxonica.com/
>
>> It is designed to filter SAX streams with XPath-based
>> patterns, but it also has useful filters for processing text
>> (non-XML) inputs:
>> http://reflex.gforge.inria.fr/tutorial-pipelinesAndFilters.html
>>
>> XPath is rather well supported: position() and last() are
>> handled, but the preceding:: and preceding-sibling:: axes are
>> not supported except in a few circumstances; I also propose a
>> workaround when filtering a SAX stream (juggling with a local
>> DOM subtree when necessary).
>>
>> To answer Michael about how predicates are evaluated when
>> reading forward is required, the engine uses a lookahead
>> buffer and goes on reading until the actual predicate becomes
>> evaluable; for that purpose, as explained in the article, the
>> engine uses coroutines that are implemented using threads
>> (one to evaluate the predicate, the other to hold the
>> position in the call stack of the current startElement()
>> event); I think that a finite state machine based on a pull
>> parser would be much more efficient: although it works
>> reasonably well, I have noticed that it runs slowly when I use
>> lots of XPath patterns in a pipeline made of lots of filters,
>> and it can be an issue when reading GB of XML. I also know
>> that things here and there have to be optimized: for example,
>> instead of evaluating the entire set of XPath patterns on
>> each event, I could recognize that a subset is irrelevant for
>> a given branch and discard it in that branch (but currently
>> it doesn't work like that); partial evaluation could also be
>> done better when comparison operators are involved, for
>> example [count(foo)>9] should exit as soon as 10 <foo>s have
>> been met rather than after the 1,000,000th one has been read.
>> I have imagined a strategy where the count() function would
>> return something other than a plain number: a numeric object,
>> evaluable several times by the operator, that could fetch
>> more data on demand until the NumberThatIsAtLeast object
>> reaches (or not) the expected value. Lots of work in sight.
>>
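To make that last point a little more concrete, the "numeric object
evaluable several times" could look like this very rough sketch (invented
names, not actual code from the engine):

import java.util.Iterator;

public class LazyCount {
    private final Iterator<String> events; // stand-in for the lookahead buffer
    private final String name;
    private int seen = 0;

    public LazyCount(Iterator<String> events, String name) {
        this.events = events;
        this.name = name;
    }

    // true as soon as the count is known to exceed n; reads only as far as needed
    public boolean greaterThan(int n) {
        while (seen <= n && events.hasNext()) {
            if (events.next().equals(name)) {
                seen++;
            }
        }
        return seen > n;
    }

    public static void main(String[] args) {
        Iterator<String> stream = java.util.Collections.nCopies(1000000, "foo").iterator();
        LazyCount count = new LazyCount(stream, "foo");
        System.out.println(count.greaterThan(9)); // true, after consuming only 10 events
    }
}

so that [count(foo)>9] stops pulling events after the 10th <foo>.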
>> Of course I will have a look at Santhosh's work :)
>>
>
--
Regards,
///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !