Hi Jonathan,
> I think that optimization of // is a more compelling way to use
> knowledge of complex types. Suppose you have a pattern like this:
>
> //address
<niggle>
I think you mean 'expression' rather than 'pattern'. If you had a
pattern like that, the processor could optimise it to 'address',
because // never adds any information at the start of a pattern.
</niggle>
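(To illustrate the niggle: either of these templates matches exactly the
same set of elements as the other, so a processor can simply drop the
leading // from the first.)

  <xsl:template match="//address"><xsl:copy-of select="."/></xsl:template>

  <xsl:template match="address"><xsl:copy-of select="."/></xsl:template>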
> Without knowledge of the complex types involved, this requires
> examination of all elements in the document to see if they are
> "address" elements. Looking at the schema for a particular invoice
> document, it is easy to see that the above pattern can only match
> shipping or billing addresses found in customers. The optimizer can
> rewrite the above pattern as follows:
>
> /customer/billing/address | /customer/shipping/address
>
> In at least some environments, this will be much more efficient to
> execute. Incidentally, the user does not see whether an
> implementation does this rewrite, the user only sees the increase in
> speed. Implementations should feel free to do whatever static
> optimizations they can, but are not required to. Vendors will want to
> make their implementations fast.
Yes, indeed; people writing stylesheets want them to be fast as well.
Mostly, though, people who understand XPath understand that using the
expression //address is a surefire way to make your stylesheet take
ages. Using the step-by-step paths makes the stylesheet quicker and
makes it easier for someone else maintaining it to understand what
addresses are actually being processed. Encouraging people to write
the "easy path" means that when they come to writing a stylesheet for
a markup language with no schema, or move to a processor that doesn't
support this particular optimisation, they'll create stylesheets that
are very slow. I'd rather have processors warn users when they spot
expressions like these than have them rewrite them silently, however
effectively.
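To put that concretely, given the invoice schema Jonathan describes I'd
much rather write (and maintain) something along these lines -- a
minimal sketch, where the <addresses> wrapper and the copy-through
template are just placeholders of mine:

  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- select the addresses explicitly rather than with //address -->
    <xsl:template match="/">
      <addresses>
        <xsl:apply-templates
          select="customer/billing/address | customer/shipping/address"/>
      </addresses>
    </xsl:template>

    <!-- copy each address through unchanged -->
    <xsl:template match="address">
      <xsl:copy-of select="."/>
    </xsl:template>

  </xsl:stylesheet>

Anyone reading that can see at a glance which addresses are being
processed, whether or not their processor does clever schema-based
rewriting.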
> The user is only affected when it comes to correctness in the use of
> complex types. Let's segue to query for a second. Suppose we have
> the same schema mentioned above, and the user writes the following
> function:
>
> define function customer-address(element customer $c)
>   returns element address
> {
>   $c/address
> }
>
> Static type checking will report that $c/address evaluates to an
> empty sequence, because the address element is always found in a
> billing or shipping element within customer. Static type checking is
> optional, but if the user asks for it, the system tells the user
> what is wrong with this query.
>
> If the user did not do static type checking, this would be
> discovered at run time, not during static analysis.
Right -- which, as we've discussed, comes at the same time for most
XSLT transformations (the stylesheet is usually compiled and run in one
go), so this isn't a particular advantage of static type checking in
XSLT's case.
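In XSLT terms the equivalent would be a template fragment something like
this (using Jonathan's invoice schema again):

  <xsl:template match="customer">
    <xsl:value-of select="address"/>
  </xsl:template>

which simply produces nothing, however clearly the schema says that
address only ever appears inside billing or shipping.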
Currently, of course, XSLT processors will happily process such a
stylesheet, returning an empty node set whenever an XPath doesn't locate
any nodes, even though, logically, they could access a DTD to tell them
what nodes can validly be present. There's a debate
here about whether it's better to produce an unexpected result or to
produce an error. XSLT has previously fallen on the side of producing
an unexpected result for tests that involve the particular markup
language, as opposed to the fixed types of functions or operators. I
think that's because if you protect people from one error at the
markup language level, they might think that you're protecting them
from every error. For example, if your address type contained a
postcode child element that was defined, in the schema, to match the
pattern:
"[A-Z]{2}[1-9] [1-9][A-Z]{2}"
then doing:
address[starts-with(postcode, 'W1')]
could logically also give you an error, since no string matching that
pattern can begin with 'W1'. A user might ask why this doesn't raise an
error when other assertions within a schema do.
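(For reference, the kind of schema declaration I have in mind is
something like this -- the nested simple type is my guess at how it
would be declared, assuming the usual xs prefix for XML Schema:

  <xs:element name="postcode">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z]{2}[1-9] [1-9][A-Z]{2}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
)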
The "we can tell this stylesheet will produce valid html from docbook"
fallacy is just the kind of misconception that arises when you think
that static type checking means you know everything about validity.
There are several aspects of HTML that can't be validated by
grammar-based schema languages, such as the fact that form elements
shouldn't occur within other form elements at any level (a constraint
Schematron can model nicely, of course -- see the sketch below). And
even if, for simple languages, you could guarantee that you produce a
valid document, that doesn't mean the document *makes sense*
semantically.
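The kind of Schematron rule I have in mind for the nested-forms
constraint is as simple as this (a sketch, assuming un-namespaced form
elements and the usual sch prefix):

  <sch:pattern name="no-nested-forms">
    <sch:rule context="form">
      <sch:report test="ancestor::form">A form element must not appear
      inside another form element.</sch:report>
    </sch:rule>
  </sch:pattern>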
>>Especially as there are lots of *disadvantages*, such as the added
>>complexity in the processors and in the spec to deal with all the
>>different kinds of casting and validating of complex types.
>
> I would like to see more information on the added complexity people
> anticipate in processors. Since static analysis is optional, it does
> not add overhead if omitted. Optimization based on static analysis
> is also optional, and nobody should implement an optimization that
> is not more optimal.
What I'm arguing is that there is an overhead for users and
implementers of XPath 2.0 whether or not processors implement
optimisations. Implementers have always been free to carry out
whatever optimisations they want to, and with that freedom have
provided quite a lot (though despite over two years of reasonably
competitive development, none that I know of take the trouble of
examining a DTD to provide precisely the kind of information that you
claim would save so much time).
In particular, support for the cast, treat, assert, and validate
expressions over complex types, which require support for the
technicalities in the XQuery Semantics, is a major implementation
effort and an overhead in a running system. As far as I can tell,
implementers can't use information from the PSVI (i.e. the output of an
existing XML Schema validator) here, but have to write their own
validators in
order to perform both the static and dynamic checks that are required
to evaluate these expressions. As well as recording the name and type
derivation of the top-level types in an XML Schema, they have to
resolve the content models so that they have something against which
they can check the content of the elements that they generate or
query. They have to implement the analysis described in the XQuery
Semantics so that they can tell whether one type is the subtype of
another type, and, naturally, be able to validate an element against
one of these types, again (I think) using the XQuery Semantics rather
than an existing XML Schema processor. That is the added complexity I
am concerned about, but heck, I'm not an implementer -- maybe this is
child's play.
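To give a flavour of what I mean: even a little expression like the
following (draft validate syntax; the element content is invented,
though the postcode does match the pattern I used earlier)

  (: the processor must check this element against the address type itself :)
  validate {
    <address>
      <street>10 Downing Street</street>
      <postcode>SW1 2AA</postcode>
    </address>
  }

means the processor has to have digested the address content model and
the postcode facet for itself, so that it can check the constructed
element at evaluation time rather than handing it to an off-the-shelf
schema validator.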
From the user perspective, we have to *understand* all this stuff so
that we can work out what we have to do to make an XPath that isn't
working work. Having read it several times, I still find it hard to
grasp the difference between 'treat' and 'assert' (though the text is a
vast improvement over the last version), and I can't imagine new users
will find it much easier.
I'm not against XQuery processors having their own validation model,
and from the little I've seen of it the complex type checking that's
provided by XQuery looks really neat. I just seriously doubt that the
extra implementation and user effort is worthwhile for XPath 2.0.
Cheers,
Jeni
---
Jeni Tennison
http://www.jenitennison.com/