   Re: [xml-dev] XPath 1.5? (was RE: [xml-dev] typing and markup)


At 12:05 AM 5/9/2002 +0100, Jeni Tennison wrote:

>Yes, indeed; people writing stylesheets want them to be fast as well.
>Mostly, though, people who understand XPath understand that using the
>expression //address is a surefire way to make your stylesheet take
>ages. Using the step-by-step paths makes the stylesheet quicker and
>makes it easier for someone else maintaining it to understand what
>addresses are actually being processed. Encouraging people to write
>the "easy path" means that when they come to writing a stylesheet for
>a markup language with no schema, or move to a processor that doesn't
>support this particular optimisation, they'll create stylesheets that
>are very slow. I'd rather have processors warn users when they spot
>expressions like these than have them rewrite them silently, however
>effectively.

Hi Jeni,

It's clearly true that //address is easier, and requires less precise 
knowledge of the structure of the data. Calling it "the easy path" implies 
that it is not the right way to go, but for data that is governed by a DTD 
or schema, and for stylesheets that are compiled, I think the main reason 
not to use // is that the tools currently do not exploit schema and DTD 
information for optimization.
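
For example (the element names here are invented for illustration), 
the difference is between

   //address
   /order/customer/address

The first form asks the processor to visit every element in the 
document; the second follows a known path. A processor that could see 
from the DTD that address occurs only inside customer could, in 
principle, rewrite the first expression into the second.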

Since DTDs do change, and schemas are combined, I often prefer to write 
queries that do not depend on absolute paths in documents. The query 
"//author" can find author elements, in a variety of structures, at many 
places in many kinds of documents. I think that the ability to write 
expressions based on partial knowledge of document structures is very 
useful. I would like that to be reasonably fast, which requires 
schema-based optimization. Naturally, this will *not* be fast when no 
schema or DTD is present.
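
As a sketch (both structures are invented), the single expression 
//author selects the author element in each of these documents, even 
though it sits at a different depth in each:

   <book>
     <author>Jane Doe</author>
   </book>

   <anthology>
     <book>
       <chapter>
         <author>Jane Doe</author>
       </chapter>
     </book>
   </anthology>

An absolute path such as /book/author only works for the first.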

As you point out, for stylesheets that are not precompiled, the 
optimization step takes time each time the stylesheet is run. If a 
stylesheet is meant to be run more than once (which most stylesheets are), 
perhaps a utility that rewrites it to use optimized patterns would be a 
generally useful tool.

> > If the user did not do static type checking, this would be
> > discovered at run time, not during static analysis.
>
>Right -- which as we've discussed, is at the same time for most XSLT
>transformations, so this isn't a particular advantage of static type
>checking in XSLT's case.

I'm not so sure. I continue to encounter errors in widely used XSLT 
stylesheets, including those for the XMLSpec DTD, that result in invalid 
HTML when I write a document with a structure that has apparently not yet 
been tested. These are often quite straightforward structural errors which 
I believe *could* be caught by static analysis.

Perhaps you don't want this every time you run XSLT; it might be more 
useful as a standalone 'lint' utility. This 'lint' utility might even be 
part of the same tool that optimizes your patterns based on a schema.
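
As a rough illustration of what such a 'lint' might flag (the element 
names and the mistake are invented), suppose the DTD says that address 
contains a postcode element, and nothing called zipcode:

   <xsl:stylesheet version="1.0"
                   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:template match="address">
       <p><xsl:value-of select="zipcode"/></p>
     </xsl:template>
   </xsl:stylesheet>

Checked against the DTD, the select expression can never return a node, 
and a lint tool could say so up front, instead of the processor silently 
emitting an empty paragraph at run time.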

>Currently, of course, XSLT processors will happily process such a
>stylesheet, returning an empty node set if an XPath doesn't locate any
>nodes even if, logically, they could access a DTD to provide them with
>information about what nodes can validly be present. There's a debate
>here about whether it's better to produce an unexpected result or to
>produce an error. XSLT has previously fallen on the side of producing
>an unexpected result for tests that involve the particular markup
>language, as opposed to the fixed types of functions or operators.

Which means that somebody has to read the output carefully to see if there 
are errors. It's pretty easy, in my experience, to be unaware of errors in 
the output of an XSLT stylesheet.

>I think that's because if you protect people from one error at the
>markup language level, they might think that you're protecting them
>from every error. For example, if your address type contained a
>postcode child element that was defined, in the schema, to match the
>pattern:
>
>   "[A-Z]{2}[1-9] [1-9][A-Z]{2}"
>
>then doing:
>
>   address[starts-with(postcode, 'W1')]
>
>could logically also give you an error. A user might ask why this
>doesn't raise an error, when other assertions within a schema do.

This feels like all-or-nothing thinking to me. We should be clear with our 
users that we don't catch all errors. No query or programming language 
does. But most do catch some errors. Catching more errors, rather than 
fewer, is a good thing. If the goal were to make it plain to the user that 
no errors will be caught until the relevant code is invoked on data that 
exposes the bug, then XSLT already does far too much error checking.

I am always happy to remove a bug from a program even if there may still be 
another bug.

>The "we can tell this stylesheet will produce valid html from docbook"
>fallacy is just the kind of misconception that arises when you think
>that static type checking means you know everything about validity.
>There are several aspects of HTML that can't be validated by
>grammar-based schema languages, such as the fact that form elements
>shouldn't occur within other form elements at any level (a constraint
>Schematron can model nicely, of course). And even if, for simple
>languages, you could guarantee that you produce a valid document does
>not mean that document *makes sense* semantically.

Yes, there are clearly constraints that will not be caught by static 
analysis. Further tools for constraint checking are useful and important.

> >>Especially as there are lots of *disadvantages*, such as the added
> >>complexity in the processors and in the spec to deal with all the
> >>different kinds of casting and validating of complex types.
> >
> > I would like to see more information on the added complexity people
> > anticipate in processors. Since static analysis is optional, it does
> > not give overhead if omitted. Optimization based on static analysis
> > is also optional, and nobody should implement an optimization that
> > is not more optimal.
>
>What I'm arguing is that there is an overhead for users and
>implementers of XPath 2.0 whether or not processors implement
>optimisations. Implementers have always been free to carry out
>whatever optimisations they want to, and with that freedom have
>provided quite a lot (though despite over two years of reasonably
>competitive development, none that I know of take the trouble of
>examining a DTD to provide precisely the kind of information that you
>claim would save so much time).

Our specs do not prescribe any optimizations whatsoever. Implementors have 
this freedom.

I do know of XPath implementations that perform DTD based optimization. I 
don't want to name names, but these are systems that use XPath as a 
standalone language for querying persistent data. I don't know whether any 
XSLT processors do this.

>In particular, support for the cast, treat, assert, and validate
>expressions over complex types, which require support for the
>technicalities in the XQuery Semantics, is a major implementation
>effort and an overhead in a running system.

These *do* add a lot of complexity, and in the context of XSLT, I also 
wonder how much bang for the buck they give us. XQuery clearly needs them.

This is, of course, a matter for use cases to sort out ;->
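
To make the flavor of the distinction concrete (this is my reading of 
the current drafts, so take it as a sketch rather than gospel):

   $x cast as xs:integer     constructs a new value of the target type
   $x treat as xs:integer    leaves $x alone, but raises an error at
                             run time if it does not already have that
                             type

Roughly: 'cast' converts, 'treat' merely promises.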

>As far as I can tell,
>implementers can't use information from the PSVI (i.e. an existing XML
>Schema validator) here; but have to write their own validators in
>order to perform both the static and dynamic checks that are required
>to evaluate these expressions.

At least some schema validators do make the PSVI information available (via 
regrettably proprietary interfaces), so I don't see why this information 
can't be exploited. Again, it might make more sense to use a separate 
"lint-and-optimizing-rewrite" tool to check and optimize a stylesheet 
rather than do this every time a stylesheet is executed.

>As well as recording the name and type
>derivation of the top-level types in an XML Schema, they have to
>resolve the content models so that they have something against which
>they can check the content of the elements that they generate or
>query. They have to implement the analysis described in the XQuery
>Semantics so that they can tell whether one type is the subtype of
>another type, and, naturally, be able to validate an element against
>one of these types again (I think) using the XQuery Semantics rather
>than an existing XML Schema processor. That is the added complexity I
>am concerned about, but heck, I'm not an implementer -- maybe this is
>child's play.

The information needed to support types is, I believe, completely available 
in the PSVI. A non-optimized implementation can probably implement validate 
{ } by small modifications to an existing schema processor.
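
A minimal sketch of what I mean (the element and its content are 
invented):

   validate { <address><postcode>SW1 2AA</postcode></address> }

An implementation could hand the constructed element to an existing 
schema processor and use the PSVI it gets back to annotate the value 
with its type, rather than re-implementing validation from scratch.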

Of course, our current approach to typing is significantly different from 
previous drafts, so we are really waiting for implementation feedback on this.

> From the user perspective, we have to *understand* all this stuff so
>that we can work out what we have to do to make an XPath that isn't
>working work. Having read it several times, it's still hard for me to
>grasp what the difference is between 'treat' and 'assert' (though
>there's a vast improvement over the text in the last version), and I
>can't imagine the problems for new users will be that much better.

Thunk.

Yes, this is an issue. And if it's hard for Jeni Tennison, it's gonna be 
hard for a lot of people.

>I'm not against XQuery processors having their own validation model,
>and from the little I've seen of it the complex type checking that's
>provided by XQuery looks really neat. I just seriously doubt that the
>extra implementation and user effort is worthwhile for XPath 2.0.

Would you really suggest using *none* of the type operators, or are there 
some that you think would be worthwhile if they were easy to implement? I 
suspect that any XSLT processor that has access to the PSVI would find 
'treat' and 'cast' reasonably easy to implement - 'cast' requires facet 
checking, but this amounts to about 10 relatively simple functions.
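
(For reference, the constraining facets defined in XML Schema Part 2 are 
length, minLength, maxLength, pattern, enumeration, whiteSpace, 
maxInclusive, maxExclusive, minInclusive, minExclusive, totalDigits, and 
fractionDigits -- roughly one small checking function apiece.)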

Jonathan





 
