Re: We need an XPath API

From: Charles Reitzel <creitzel@mediaone.net>
To: xml-dev@lists.xml.org
Date: Mon, 05 Mar 2001 10:47:39 -0500 (EST)
On Sat, 03 Mar 2001 David Megginson wrote:
>Charles Reitzel writes:
> > Proposal: let's give XPath the SAX treatment.
>
>I'd actually recommend giving XPath the DOM treatment.  

Agree.  I actually meant "SAX Treatment" in the process sense, rather than
from an API design POV.

On Sat, 03 Mar 2001 Thomas B. Passin wrote:
>We need some requirements engineering here.  Especially, 
>what would the API be used for?

There have been a bunch of interesting responses to this idea.  Let me
briefly respond to see if I have captured concepts accurately.  I've also
added some comments about the DOM and SAX helpers.

I'll take a deeper pass on the interfaces in the next couple days, starting
w/ a cross-reference between CSS2 selectors and conditions to XPath expressions.

Please let me know if I have misunderstood or otherwise misrepresented
anyone's intent.

take it easy,
Charles Reitzel

============================================================


Feedback:


1) Give it the DOM treatment, rather than SAX treatment

On Sat, 03 Mar 2001, David Megginson wrote:
>I'd actually recommend giving XPath the DOM treatment.  
>Well, not really DOM, but maybe a cleaner, in-memory tree.
>XPaths (even hairy ones) are extremely small, and the 
>same path object is likely to be reused many times, so I 
>see no need to force the pain of an event-based interface 
>on users (unless someone thinks we're going to
>be seeing gigabyte-long XPath expressions).

I don't see the need for callbacks at all at the expression level.  Better
just to slurp up a string and be done with it.  XPath expressions appear as
either attribute or element values. So even dealing with an InputSource
seems unnecessary in an early version.

This relates to SAC (see below) as well, in that SAC allows for registering
handlers, etc.  Rather, I see XPathExpr as objects that could be
instantiated in SAX event handlers or from a DOM element node and used to
look up the datatype in the schema.  I.e. these are lightweight, possibly
transient objects.

Note this applies only at the expression level and not to DOM or SAX
helpers, per se.
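
To make the "slurp up a string" idea concrete, here is a minimal sketch in
Python.  The names XPathExpr and Step are hypothetical, and it only handles
plain child and attribute steps separated by "/", nowhere near full XPath.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    axis: str   # "child" or "attribute" (the only axes this sketch knows)
    name: str   # element or attribute name


class XPathExpr:
    """Hypothetical lightweight expression object built from a plain string."""

    def __init__(self, text: str):
        self.text = text.strip()
        self.absolute = self.text.startswith("/")
        # Split on "/" and classify each part; no callbacks, no InputSource.
        self.steps: List[Step] = [
            self._parse_step(part)
            for part in self.text.strip("/").split("/")
            if part
        ]

    @staticmethod
    def _parse_step(part: str) -> Step:
        if part.startswith("@"):
            return Step("attribute", part[1:])
        return Step("child", part)


# Built on the spot, e.g. from an attribute value seen in a SAX handler.
expr = XPathExpr("/order/item/@sku")
```

The object is cheap enough to construct wherever the expression string
turns up and to throw away afterwards.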


2) SAC as prior art from Simon St. Laurent and Robin Berjon

On Sat, 03 Mar 2001, Simon St. Laurent writes:
>Just for prior art, there's a Simple API for CSS.  Don't 
>know if there'd be any overlap at all, but you never know:
>http://www.xmlhack.com/read.php?item=685
>http://www.w3.org/TR/SAC/

Yes, clearly there is some conceptual overlap.  My reading of CSS1 and CSS2
shows no references to XPath, however.  So, we have two independent W3C
XPath-like syntaxes (hmmph).  Perhaps the big difference is the HTML legacy
baggage.  It also seems that CSS2 is not used much compared to XSLT+XPath.

I think SAC should be scanned for condition and selector types.  It is
probably worthwhile to list CSS2 <-> XPath1.0 equivalents.  Also, the
representation of XPath expression parts in my posting was clearly weak.
Emulating the SAC Selectors, Conditions and their respective factories looks
good.  Perhaps we could even use these interfaces directly.  I'm wary of
unwanted dependencies, however.

To be clear, this is for XPath, not CSS.

On Sun, 04 Mar 2001, Robin Berjon wrote:
>I've been down the "XPath OM" path before by hacking an 
>interface onto Matt Sergeant's XML::XPath module. It can 
>be very useful, but not as useful in some contexts as a 
>builder callback style interface. Converting an object 
>model into another is often harder than simply handling 
>builder events. That's why CSS has SAC and DOM2 interfaces. 
>Both are useful, but anyone wishing, say, to build a custom
>selector object to get elements out of his own type of tree 
>will probably use SAC.

I think we may have a different use case here for CSS that is unlikely to
apply to XPath.  When pulling in an entire CSS stylesheet, I can see the
sense of the callback approach.  But I don't know if anyone will parse
documents consisting of XPath expressions only.  They are typically found as
attribute or element contents within XML.  So parse the document however you
will and, when you encounter an XPath expression, construct an XPathExpr
object and work with it.  

Unlike a DOM, these objects should be small enough that there is little, if
any, wasted effort.  Also, lazy evaluation is always an option.


3) DOM3 issues from Mike Champion

This starts getting interesting.  I didn't get a chance to digest all of the
issues in detail.  But I think a good guiding principle is, perhaps, "This
is XPath not DOM, CSS, et al."  

(non sequitur aside: anyone know "This is Boston, not L.A.?")

 3a) Namespaces

On Sat, 03 Mar 2001 17:58:47, Mike Champion wrote:
>- A disagreement whether to do something minimal (a la
>Microsoft selectNodes) or a fully-functional XPath API.  
>Not surprisingly, the minimal solution falls afoul of
>namespaces in all sorts of nasty ways; MS has some 
>workarounds, but they are not terribly elegant.  A fully
>functional solution requires some mapping between the 
>XPath and DOM conceptions of a namespace declaration.

I don't know if this helps, but the QName doesn't need resolving until the
XPath expression is actually evaluated.  I.e. you can parse the
expression, which would probably only include NS prefixes (or not).  At
evaluation time, the NS URI for the prefix is a moving target.  You can't
forget about the default NS, either.  The exact handling of this evaluation
will, of course, be different when looking in a DOM or responding to a SAX
callback.

The Apache SOAP NSStack idea might be helpful here.
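
A rough sketch of what I mean by a namespace stack, in Python.  This is not
the Apache SOAP class; NSStack, its method names and the URNs are all made
up for illustration.  Scopes are pushed and popped as elements open and
close, and a QName from the expression is resolved only when asked.  XPath
1.0 treats an unprefixed name test as having no namespace, so applying the
default NS is an explicit flag that is off by default.

```python
class NSStack:
    """Hypothetical stack of in-scope namespace declarations."""

    def __init__(self):
        self._scopes = [{}]

    def push(self, declarations=None):
        # Call when an element opens; pass its xmlns declarations, if any.
        self._scopes.append(dict(declarations or {}))

    def pop(self):
        # Call when the element closes.
        if len(self._scopes) == 1:
            raise IndexError("no scope to pop")
        self._scopes.pop()

    def uri_for(self, prefix):
        # Innermost declaration wins; "" stands for the default namespace.
        for scope in reversed(self._scopes):
            if prefix in scope:
                return scope[prefix]
        return None

    def resolve(self, qname, use_default=False):
        """Return (namespace_uri, local_name) for a QName, at evaluation time."""
        prefix, sep, local = qname.partition(":")
        if not sep:
            uri = self.uri_for("") if use_default else None
            return (uri, qname)
        uri = self.uri_for(prefix)
        if uri is None:
            raise KeyError(f"unbound prefix: {prefix}")
        return (uri, local)


ns = NSStack()
ns.push({"": "urn:example:default", "po": "urn:example:orders"})
outer = ns.resolve("po:item")
ns.push({"po": "urn:example:other"})   # same prefix, rebound deeper down
inner = ns.resolve("po:item")
ns.pop()
```

The same parsed expression gives a different answer for "po:item" depending
on where in the document it is evaluated, which is the moving target above.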

 3b) XPath vs. DOM Data Model

>The wretched inconsistency between the DOM data model and 
>the XPath data model.  DOM "trees" can have CDATA nodes, 
>adjacent Text nodes, entity reference nodes (and maybe some 
>other rot) that is transparent to XPath. So, an XPath 
>expression can point at something that is not neatly aligned 
>on DOM Node boundaries ... so what should a NodeList or
>NodeIterator returned by an XPath expression do?

I'd say dish up the XPath data model when returning nodes in a DOM matching
an XPath expression.  Combine nodes as needed.  If the original nodes are
needed, get them via DOM calls.

I don't understand yet how an XPath expression can point to something "not
neatly aligned on DOM Node boundaries".  To hazard a guess, is it related
to unexpanded external entities? In which case, "you can't get there from
here" may be a reasonable answer from the library.
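
Here is a small sketch of "combine nodes as needed", using Python's
xml.dom.minidom.  The function name xpath_text_nodes is hypothetical.  It
groups each run of adjacent Text and CDATA children into one logical text
node and keeps the original DOM nodes alongside, so they can still be
reached via DOM calls.

```python
from xml.dom import Node
from xml.dom.minidom import Document


def xpath_text_nodes(element):
    """Group runs of adjacent Text/CDATA children into logical text nodes.

    Returns a list of (string_value, [original DOM nodes]) pairs.
    """
    groups = []
    run = []
    for child in element.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            run.append(child)
        elif run:
            # Any other node kind ends the current run of text.
            groups.append(("".join(n.data for n in run), run))
            run = []
    if run:
        groups.append(("".join(n.data for n in run), run))
    return groups


# Build a tree by hand so the adjacent Text/CDATA nodes are explicit.
doc = Document()
root = doc.createElement("a")
doc.appendChild(root)
root.appendChild(doc.createTextNode("foo"))
root.appendChild(doc.createCDATASection("<bar>"))
root.appendChild(doc.createTextNode("baz"))
root.appendChild(doc.createElement("b"))
root.appendChild(doc.createTextNode("qux"))

merged = xpath_text_nodes(root)
```

Three DOM nodes before the b element come back as a single text value, and
the caller can still get at the three originals if it cares.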

 3c) Live NodeLists

>The obvious thing for something like selectNodes to 
>return would be a NodeList, but keeping this in synch 
>with the XPath expression as the underlying DOM tree is
>edited is non-trivial.  NodeIterators are probably a 
>better idea, but they are less familiar and less widely
>supported, and still have some "liveness" semantics that
>might be problematic here ... not sure.

Just use iterators.  I can't think of an implementation language that
doesn't support some form of iterator (Java, C, C++, Perl, JavaScript,
Visual Basic, Python?).  E.g. a list or vector in most script languages is
just a list of references to the live object.

Iterator staleness is a problem w/ all query result sets.  I.e. the database
row can get deleted out from under the cursor.  A set member can be deleted,
leaving a dangling reference in an iterator.  There are no perfect solutions
to this problem and developers all learn about it after they stub their
toes a few times.

One small step further, a Java implementation should use the Java 2
Collections.  They are compatible w/ Java 1.1.8 (available in a separate JAR
file) and provide much improved synchronization options (including stale
iterator detection when used in a synchronized block - aka fail-fast).
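
The stale iterator detection idea is language-neutral.  Below is a minimal
Python sketch of it, assuming a modification counter on the result set that
each iterator checks before handing out the next node.  NodeSet and
StaleIteratorError are hypothetical names, not part of any existing API.

```python
class StaleIteratorError(RuntimeError):
    """Raised when a result set changes while it is being iterated."""


class NodeSet:
    """Hypothetical result set whose iterators detect later modification."""

    def __init__(self, nodes=()):
        self._nodes = list(nodes)
        self._mod_count = 0

    def add(self, node):
        self._nodes.append(node)
        self._mod_count += 1

    def remove(self, node):
        self._nodes.remove(node)
        self._mod_count += 1

    def __iter__(self):
        expected = self._mod_count   # snapshot taken when iteration starts
        index = 0
        while True:
            # Check before every step, including the one that would end the loop.
            if self._mod_count != expected:
                raise StaleIteratorError("node set changed during iteration")
            if index >= len(self._nodes):
                return
            yield self._nodes[index]
            index += 1


results = NodeSet(["a", "b", "c"])
seen = []
stale = False
try:
    for node in results:
        seen.append(node)
        if node == "a":
            results.remove("b")   # edit the set out from under the iterator
except StaleIteratorError:
    stale = True
```

This doesn't solve staleness, it just makes the toe-stubbing loud and
immediate instead of silent.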


4) "Just use SAX" from Eric van der Vlist and Sean McGrath

In theory, you could generate an XML equivalent to the XPath expression and
parse that.  Question, how do you generate the XML?  I think you'll need to
parse the XPath first.  Better to define an internal representation of the
XPath expression and parse/emit any supported syntax.  If that syntax uses
XML, then SAX is a great implementation strategy, but not a useful API for
XPath expressions by itself.

Certainly, an XML syntax is not the highest priority.  If it gets at all
controversial, better to scrap it.  (Ducking pie thrown by Jonathan Robie).

5) Need XPointer Support

On Sat, 03 Mar 2001 Thomas B. Passin wrote:
>1) Parse and process the XPointer syntax. This would be 
>   useful for developers to create XPointer applications 
>   and toolkits.
>
>2) Return node-sets.  This is more like a query capability, 
>   and would be more useful for application writers.
>
>3) Construct XPointer expressions based on some existing 
>   tree (fragment).
>
>4) Construct XPointer expressions based on a schema
>   (fragment?)

Doesn't XPointer just use XPath?  In which case, the lib should be able to
do these things.  I guess this starts getting into XSLT and
XPointer-specific extensions to XPath.  This probably calls for a couple
SAX-style extension identifier URNs.  So an app can say "I need XSLT 1.1
XPath extensions" and the parser can say yes or no.
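
A sketch of the yes-or-no negotiation, modelled loosely on SAX feature
flags.  Everything here is hypothetical: the URNs are placeholders I made
up, and XPathParser is not an existing class.

```python
class FeatureNotSupported(Exception):
    """Raised when the application asks for an extension set we lack."""


# Placeholder identifiers, invented for this sketch only.
XSLT_EXTENSIONS = "urn:example:xpath:extensions:xslt"
XPOINTER_EXTENSIONS = "urn:example:xpath:extensions:xpointer"


class XPathParser:
    """Hypothetical parser that knows the XSLT additions but not XPointer's."""

    SUPPORTED = frozenset({XSLT_EXTENSIONS})

    def __init__(self):
        self._enabled = set()

    def supports(self, feature):
        return feature in self.SUPPORTED

    def set_feature(self, feature, enabled=True):
        # The app states what it needs; the parser refuses what it can't do.
        if feature not in self.SUPPORTED:
            raise FeatureNotSupported(feature)
        if enabled:
            self._enabled.add(feature)
        else:
            self._enabled.discard(feature)

    def get_feature(self, feature):
        if feature not in self.SUPPORTED:
            raise FeatureNotSupported(feature)
        return feature in self._enabled


parser = XPathParser()
parser.set_feature(XSLT_EXTENSIONS)
try:
    parser.set_feature(XPOINTER_EXTENSIONS)
    xpointer_ok = True
except FeatureNotSupported:
    xpointer_ok = False
```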


6) DOM/SAX Wrappers

I have also used Matt Sergeant's XML::XPath module - with excellent
results.  It's a real nice module.  It is also what triggered my original
posting.  Specifically, I could relate directly to Joe English's comment
about mismatched data structures making it tricky to combine modules.

Parsing XPath expressions shouldn't be terribly difficult, but you've got to
get it right.  It is worth putting in a module by itself.  Once you have it,
you need to be able to use it in different contexts, such as accessing a DOM
or extracting data from a stream.

For applications that build their own objects from an XML document, a
SAX-based approach seems best.  How to use XPath in this circumstance?  My
idea was to register a set of XPath expressions that identify objects of
interest.  When the XPathFilter encounters elements and attributes that
match one of the expressions, it will make the appropriate ContentHandler
call. The only difference from a vanilla ContentHandler is that it seems
necessary to pass back to the application the expression that was matched,
so the app has a clue what to do with it.
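
Roughly what I have in mind, sketched with Python's xml.sax.  XPathFilter,
register and the on_match callback are hypothetical names.  The matching is
deliberately naive: it only understands absolute paths of element names (or
"*") and compares them against the stack of open elements.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class XPathFilter(ContentHandler):
    """Hypothetical filter that reports elements matching registered paths."""

    def __init__(self, on_match):
        super().__init__()
        self._on_match = on_match
        self._exprs = {}   # expression string -> list of step names
        self._path = []    # names of the currently open elements

    def register(self, expr):
        self._exprs[expr] = expr.strip("/").split("/")

    def startElement(self, name, attrs):
        self._path.append(name)
        for expr, steps in self._exprs.items():
            if len(steps) == len(self._path) and all(
                step in ("*", actual)
                for step, actual in zip(steps, self._path)
            ):
                # Pass the matched expression back along with the event.
                self._on_match(expr, name, dict(attrs.items()))

    def endElement(self, name):
        self._path.pop()


matches = []
flt = XPathFilter(lambda expr, name, attrs: matches.append((expr, name, attrs)))
flt.register("/order/item")
flt.register("/order/*")
xml.sax.parseString(
    b'<order><item sku="A1"/><note/><item sku="B2"/></order>', flt
)
```

An element can match more than one registered expression, so the app gets
one call per match and uses the expression to decide what to build.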

For applications that use the DOM, I think XPath makes a highly useful
extension to the existing traversal functions. It is, if you will, a baby,
unoptimized query language.  It is an open question, however, if the
existing DOM traversal functions are sufficient to resolve all XPath
expressions.
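
For the simplest expressions, plain DOM traversal is clearly enough.  The
sketch below uses Python's xml.dom.minidom and nothing but childNodes.  The
select function is hypothetical and handles only child steps (names or
"*"); it says nothing about whether predicates, other axes and the rest
can be done the same way.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def select(context, path):
    """Evaluate a slash-separated child path using only childNodes."""
    current = [context]
    for step in path.strip("/").split("/"):
        matched = []
        for node in current:
            for child in node.childNodes:
                if child.nodeType == Node.ELEMENT_NODE and step in (
                    "*",
                    child.tagName,
                ):
                    matched.append(child)
        current = matched
    return current


doc = parseString('<order><item sku="A1"/><note/><item sku="B2"/></order>')
items = select(doc, "/order/item")
skus = [item.getAttribute("sku") for item in items]
```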