xml-dev - A syntax for locators (WAS Re: more QName madness)

To: xml-dev@lists.xml.org
Subject: A syntax for locators (WAS Re: more QName madness)
From: Joe English <jenglish@flightlab.com>
Date: Thu, 21 Nov 2002 13:57:21 -0800
In-reply-to: <200211191232.HAA28920@mail2.reutershealth.com>
References: <200211191232.HAA28920@mail2.reutershealth.com>

John Cowan wrote:

> Joe English scripsit:
>
> > Sarcasm aside, I could devise one.  So could you, and so could any of
> > the individual members of the Linking WG.  Not trivially easy,
> > but a good deal simpler than the proposed framework and with
> > most of the expressive power.
>
> No sarcasm intended.  I would be most interested in such a proposal
> or even a sketch.

Here's a method I've been using on a few internal projects.

(Caveat: following the YAGNI principle [1] I've only implemented
the bits of this that I've actually needed to date, but implementing
the full thing looks to be pretty straightforward.)

Syntax:

locator ::= /* empty */
	|   locator '/' step
	|   locator '//' step
	;

step	::= selector
	|   NCName '(' selector ')'
	;

selector::= Ordinal		/*  = [1-9][0-9]*, interpreted as an integer */
	|   '@' NCName '=' Literal
	;

NCName	::= /* ... the usual */ ;
Literal	::= /* ... the usual */ ;
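
As a rough illustration of how little machinery this grammar needs, here is a minimal Python sketch of a regexp-based parser. It is not the implementation mentioned above. The function names, the tuple representation of steps, and the ASCII-only approximations of NCName and Literal are all assumptions made for the example.

```python
import re

NCNAME = r"[A-Za-z_][A-Za-z0-9_.-]*"      # ASCII approximation of NCName (assumption)
LITERAL = r"'[^']*'|\"[^\"]*\""             # quoted string, no escapes (assumption)
SELECTOR = rf"[1-9][0-9]*|@{NCNAME}=(?:{LITERAL})"
# One step: '/' or '//', then either NCName '(' selector ')' or a bare selector.
STEP = re.compile(rf"(//?)(?:({NCNAME})\(({SELECTOR})\)|({SELECTOR}))")


def parse_selector(text):
    """Return ('ordinal', n) or ('attr', name, value)."""
    if text[0] == "@":
        name, _, literal = text[1:].partition("=")
        return ("attr", name, literal[1:-1])   # drop the surrounding quotes
    return ("ordinal", int(text))


def parse_locator(text):
    """Split a locator into a list of (axis, element_name_or_None, selector)."""
    steps, pos = [], 0
    while pos < len(text):
        m = STEP.match(text, pos)
        if m is None:
            raise ValueError(f"bad locator at offset {pos}: {text[pos:]!r}")
        axis, name, inner, bare = m.groups()
        steps.append((axis, name, parse_selector(inner or bare)))
        pos = m.end()
    return steps
```

With this representation the empty locator parses to an empty list of steps, and a step without an element name carries None in the name slot.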

Semantics:

A _locator_ takes as input a single XML node and returns
at most one XML node.  A _step_ takes a list of nodes
and returns at most one element node.

The base case, an empty locator, returns the input node
(typically the document root).  "loc / step" evaluates _loc_,
then applies _step_ to the list of child element nodes of the 
result.  "loc // step" applies _step_ to the list of proper 
descendants (in document order).  In the latter two cases, 
if _loc_ fails then the expression as a whole fails.

"NCName(selector)" selects only those element nodes in
the input list which have a matching local-name, then
applies _selector_ to the filtered list.

An ordinal number _n_ returns the _n_th node in the
input list (starting from 1); it fails if the list has
fewer than _n_ elements.  "@name='value'" selects
the first element in the input list with an attribute
having a local-name of _name_ and a matching value, failing
if there is no such element.

That's about it.
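
The semantics above can be sketched in a few lines of Python on top of xml.etree.ElementTree. This is only an illustration under stated assumptions: steps are taken to be already parsed into hypothetical (axis, name, selector) tuples, and since ElementTree has no document node, a synthetic wrapper element (document_node, also hypothetical) stands in for the document root.

```python
import xml.etree.ElementTree as ET


def local_part(tag):
    """Strip ElementTree's '{namespace-uri}' prefix, leaving the local-name."""
    return tag.rpartition("}")[2]


def document_node(root):
    """Wrap the root element so that '/name(1)' can select the root itself."""
    doc = ET.Element("#document")   # placeholder tag, never serialized
    doc.append(root)
    return doc


def apply_step(nodes, name, selector):
    """Apply one step to a list of element nodes; return one node or None."""
    if name is not None:
        nodes = [n for n in nodes if local_part(n.tag) == name]
    if selector[0] == "ordinal":
        n = selector[1]
        return nodes[n - 1] if n <= len(nodes) else None
    _, attr, value = selector
    for node in nodes:
        for key, val in node.attrib.items():
            if local_part(key) == attr and val == value:
                return node          # first match in list order wins
    return None


def locate(steps, node):
    """Evaluate steps left to right; fail (None) as soon as one step fails."""
    for axis, name, selector in steps:
        if node is None:
            return None
        if axis == "/":
            candidates = list(node)              # child elements
        else:
            candidates = list(node.iter())[1:]   # proper descendants, document order
        node = apply_step(candidates, name, selector)
    return node
```

An empty list of steps returns the input node unchanged, which corresponds to the base case.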

Notes:

The syntax is simple enough that it can be parsed with regexps,
and it can be implemented with a streaming processor (e.g., a
SAX Filter) without lookahead or backtracking.  The target
element can be identified as soon as its start tag is seen.
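
To make the streaming claim concrete, here is a minimal sketch using Python's xml.sax. It is one possible arrangement, not the filter described above. The class name LocatorHandler and the (axis, name, selector) step tuples are assumptions for the example. The handler keeps only a depth counter, the index of the next step, and a match count, and it decides on each start tag alone.

```python
import xml.sax


class LocatorHandler(xml.sax.ContentHandler):
    """Track one locator through a SAX stream; steps are (axis, name, selector)."""

    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.index = 0           # next step still to be satisfied
        self.context_depth = 0   # depth of the node matched so far (0 = document)
        self.count = 0           # candidates seen so far for an ordinal selector
        self.depth = 0
        self.failed = False
        self.found = None        # (qname, attributes) of the target element

    def startElement(self, qname, attrs):
        self.depth += 1
        if self.failed or self.index == len(self.steps):
            return
        axis, name, selector = self.steps[self.index]
        # '/' only looks at children of the context node; '//' accepts any
        # element opened while the context node is still open.
        if axis == "/" and self.depth != self.context_depth + 1:
            return
        if name is not None and qname.rpartition(":")[2] != name:
            return
        if selector[0] == "ordinal":
            self.count += 1
            hit = self.count == selector[1]
        else:
            _, attr, value = selector
            hit = any(k.rpartition(":")[2] == attr and v == value
                      for k, v in attrs.items())
        if hit:
            self.index += 1
            self.context_depth = self.depth
            self.count = 0
            if self.index == len(self.steps):
                self.found = (qname, dict(attrs.items()))

    def endElement(self, qname):
        # The context node closed before the remaining steps matched: fail.
        if self.depth == self.context_depth and self.index < len(self.steps):
            self.failed = True
        self.depth -= 1
```

A handler like this could be driven with xml.sax.parseString(data, handler). For the empty locator, found stays None because the target is the document itself rather than an element.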

The notation covers a broad range of use cases.  It can address
any element in the tree using only ordinal selectors
and the "/" operator (like the XPointer "element" scheme
or HyTime treelocs).  The "NCName(ordinal)" form allows
for more human-readable and human-writable locators,
e.g., "/document(1)/chapter(2)/section(1)".

"//@name=value" can be used to locate elements by ID
(XPointer "shorthand pointer" or HyTime "nameloc").
Since the (local-)name of the ID-bearing attribute is specified
in the locator itself, the consumer doesn't need to know
about schema-determined, DTD-determined, or externally-determined
IDs.  I haven't come up with a use case where the producer
of a link (a) knows the ID of the desired element but (b)
doesn't know the name of the ID-bearing attribute, so
(with the exception of a few namespace-related pathologies)
there is no loss of expressivity.

The "@name=value" form can also be used for attributes that
have ID- or key-like semantics but aren't defined as IDs in a schema
or DTD.  For example, in HTML two <input>s in different <form>s can 
have the same @name, so the name attribute has declared value CDATA.
These are addressable with locators like:

    //form(3)//input(@name='credit_card_number')

[ Hm... two <input>s in the _same_ form can also have the
  same name...  this scheme won't work to locate those. ]

Lastly, this allows you to write very compact (but of course
very fragile!) locators: "//N" selects the Nth element in
document order.

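Locators have a nice associative property: if _loc1_ and _loc2_
are locators and _node_ is the input node, then:

    locate (loc2, locate(loc1, node)) = locate (loc1 ++ loc2, node)

where ++ is string concatenation.  That is, evaluating _loc1_ and
then applying _loc2_ to the result gives the same node as evaluating
the concatenated locator.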
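
A small Python sketch can check this property on the ordinal-only, "/"-only subset of the notation (the treeloc-like form). The locate function here is a deliberately reduced stand-in written for this example. It evaluates the left-hand part of a concatenated locator first and the right-hand part on its result.

```python
import xml.etree.ElementTree as ET


def locate(locator, node):
    """Ordinal-only subset: '/2/1' means second child, then its first child."""
    for part in locator.split("/")[1:]:   # '' yields no steps at all
        if node is None:
            return None                   # an earlier step already failed
        children = list(node)
        n = int(part)
        node = children[n - 1] if n <= len(children) else None
    return node


root = ET.fromstring("<a><b/><c><d/><e/></c></a>")
loc1, loc2 = "/2", "/2"

# Resolving loc1 and then loc2 reaches the same node as the concatenation.
same = locate(loc2, locate(loc1, root)) is locate(loc1 + loc2, root)
```

Failure propagates the same way on both sides: if loc1 fails, the nested call receives None and the concatenated locator also yields None.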

Locators only return element nodes, so they don't meet
all the XPointer requirements [2].  They can be used as
a prefix in a more general pointer scheme though,
something like:

pointer	::= locator
	|   locator '/' '@' Name	/* select an attribute */
	|   locator '/' '$' ...something...	/* select text nodes */
	|   locator '/' '?' ...something...	/* select PIs */
	|   ... other stuff ...
	;

(I've implemented the first one, but haven't needed the
others yet so haven't given them much thought -- YAGNI again.)
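
For the attribute case, one possible way to separate the locator prefix from a trailing '/@Name' is sketched below in Python. The function name split_pointer and the ASCII-only name pattern are assumptions for the example. A trailing attribute selector cannot be confused with a '@name=value' step because the latter always ends in a quoted literal.

```python
import re

# '/@' followed by a name at the very end of the pointer (ASCII approximation).
ATTR_SUFFIX = re.compile(r"/@([A-Za-z_][A-Za-z0-9_.-]*)\Z")


def split_pointer(pointer):
    """Return (locator, attribute_name), with attribute_name None if absent."""
    m = ATTR_SUFFIX.search(pointer)
    if m:
        return pointer[:m.start()], m.group(1)
    return pointer, None
```

The locator part can then be resolved to an element as usual, and the attribute looked up on that element.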

I can think of a good reason _not_ to support ranges though.
A good way to implement bidirectional links is to annotate
each node with a list of all the locators that point to it,
so it's easy to traverse back and forth across arcs.
Things get hairier if a locator can point to a range
of nodes or to a character span.

On the QName problem:

The most radical (over-?)simplification is that matching only
examines the local-name, not the expanded name or QName.
Putting namespace names in locators is way too verbose,
and using QNames leads to the usual problem of how to
determine the namespace context.

The solution I like best would be to use the namespace
context of the _target_ document.  Amy Lewis makes a
compelling argument for this approach in [3].  The only
real drawback is that it's only reliable if the target
document is sane [4]; otherwise you can end up counting
more elements than you intended (neurosis) or skipping
ones that should be counted (borderline).  Further, the
possibility of psychosis means that you have to do the
full-blown QName-to-expanded name processing and compare
URIs instead of doing a simple lexical comparison against
the original QName.

Given all that, the marginal benefit of being able to
match expanded-names instead of just local-names didn't
seem worth the added effort.

The syntax is intentionally incompatible with XPath
(parentheses instead of square brackets) because the
semantics are different.  The main differences are
that 'foo' matches 'pfx:foo' in locators but not
in XPath, and '//foo[@name="value"]' can match multiple
elements in XPath whereas '//foo(@name="value")' only
matches the first one.  (The main reason though is that
square brackets are magic characters in my Language of
Choice for XML processing, and parentheses don't
need to be escaped).
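
The two differences can be seen side by side in a short Python sketch. It uses ElementTree's limited XPath support as the XPath side, and a hand-written local-name scan as a stand-in for the locator side. The sample document and variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<doc xmlns:pfx="urn:example">'
    '<pfx:foo name="value"/><foo name="value"/><foo name="value"/>'
    '</doc>'
)

# XPath-style: 'foo' means the element with no namespace, and every
# matching element is returned.
xpath_hits = root.findall(".//foo[@name='value']")

# Locator-style '//foo(@name="value")': compare local-names only and stop
# at the first match in document order.
locator_hit = next(
    (e for e in list(root.iter())[1:]
     if e.tag.rpartition("}")[2] == "foo" and e.get("name") == "value"),
    None,
)
```

Here the XPath query skips the prefixed element and returns both unprefixed ones, while the local-name scan stops at the prefixed element because it comes first in document order.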

[1] YAGNI: <URL: http://www.xprogramming.com/Practices/PracNotNeed.html >

[2] XPointer requirements: <URL: http://www.w3.org/TR/NOTE-xptr-req >

[3] compelling argument: <20021114182802.GA6480@talsever.com>,
    <URL: http://lists.xml.org/archives/xml-dev/200211/msg00549.html >

[4] sane: <URL: http://www.flightlab.com/~joe/sgml/sanity.txt >

--Joe English

  jenglish@flightlab.com
