OASIS Mailing List Archives

 



   Re: [xml-dev] XPath 1.5? (was RE: [xml-dev] typing and markup)


Hi Jonathan,

> It's clearly true that //address is easier, and requires less
> precise knowledge of the structure of the data. Calling it "the easy
> path" implies that it is not the right way to go, but for data that
> is governed by a DTD or schema, for stylesheets that are compiled, I
> think that the main reason not to use // is that the tools currently
> do not exploit schema and DTD information for optimization.

As I said, I think that step-by-step paths (//author is an absolute
path -- it starts at the document node) are also easier for people to
follow. I spend a lot of time going through other people's
stylesheets, having little knowledge of the markup language that
they're using, and paths that use // can make it very hard to
understand how the document is structured, what that means in terms of
what's generated, and where changes might be required should the
markup language or the desired output change in the future.
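
A minimal sketch of the difference, in Python's standard library
rather than XSLT, and assuming a made-up document (the order, customer
and supplier element names are invented for illustration). The
descendant search picks up every address, while the step-by-step path
records which structure the author actually had in mind:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are invented for illustration.
doc = ET.fromstring("""
<order>
  <customer>
    <address>1 High Street</address>
  </customer>
  <supplier>
    <address>9 Dock Road</address>
  </supplier>
</order>
""")

# Descendant search (like //address): finds every address, wherever it
# happens to occur in the document.
anywhere = [a.text for a in doc.findall(".//address")]

# Step-by-step path: spells out the expected structure and matches only
# the customer's address.
stepwise = [a.text for a in doc.findall("./customer/address")]

print(anywhere)
print(stepwise)
```

If a supplier address is added to the markup language later on, the
first query silently starts returning it; the second does not change
its result until someone edits the path.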

> Since DTDs do change, and schemas are combined, I often prefer to
> write queries that do not depend on absolute paths in documents. The
> query "//author" can find authors with a variety of structures found
> at many places in various document structures.

I would have thought that, in a real life version of the example that
we're talking about, it would be just as likely that you wouldn't want
that new address to be displayed as that you would. And that being the
case, you'll generally have to rewrite your paths anyway.

> I'm not so sure. I continue to encounter errors in widely used XSLT
> stylesheets, including the XMLSpec DTD, that result in invalid HTML
> when I write a document with a structure that has apparently not yet
> been tested. This often involves quite straightforward structural
> errors which I believe *could* be caught by static analysis.
>
> Perhaps you don't want this when you run XSLT, it might be more
> useful as a standalone 'lint' utility. This 'lint' utility might
> even be part of the same tool that optimizes your patterns based on
> a schema.

Absolutely. This kind of analysis and assistance to XSLT authors is
great at *authoring time*. It shouldn't have to be embedded in the
XSLT processors, which should instead be lean and mean and focus on
the job of transformation rather than schema analysis and validation.

>>I think that's because if you protect people from one error at the
>>markup language level, they might think that you're protecting them
>>from every error. For example, if your address type contained a
>>postcode child element that was defined, in the schema, to match the
>>pattern:
>>
>>   "[A-Z]{2}[1-9] [1-9][A-Z]{2}"
>>
>>then doing:
>>
>>   address[starts-with(postcode, 'W1')]
>>
>>could logically also give you an error. A user might ask why this
>>doesn't raise an error, when other assertions within a schema do.
>
> This feels like all or nothing thinking to me. We should be clear
> with our users that we don't catch all errors. No query or
> programming language does. But most do catch some errors. Catching
> more errors, rather than fewer, is a good thing. If we want to make
> it plain to the user that no errors will be caught until the
> relevant code is invoked on data that exposes the bug, I think XSLT
> does far too much error checking already.
>
> I am always happy to remove a bug from a program even if there may
> still be another bug.

You're right, it was rhetoric. I'm finding it hard to express why this
static typing stuff makes me feel so unnerved. It comes down to the
fact that I don't want to have to jump through hoops to create
stylesheets if those hoops don't give me a tangible benefit, and I
don't want the processors I use to be jumping through those hoops
either.
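
For what it's worth, the postcode example quoted above can be checked
mechanically. This is only a sketch in Python, assuming that the
schema pattern is matched against the whole value (as XML Schema
patterns are) and using a few sample postcodes made up for the
purpose:

```python
import re

# The pattern from the quoted example. XML Schema patterns must match
# the whole value, hence fullmatch() below.
POSTCODE = re.compile(r"[A-Z]{2}[1-9] [1-9][A-Z]{2}")


def is_valid(postcode):
    """Return True if the postcode matches the schema pattern."""
    return POSTCODE.fullmatch(postcode) is not None


# Sample values invented for illustration.
samples = ["WC1 2AB", "W1 2AB", "SW1 9XY", "W12 3CD"]

valid = [p for p in samples if is_valid(p)]

# The pattern requires two letters first, so no valid postcode can
# start with "W1": the starts-with() test can never succeed.
valid_w1 = [p for p in valid if p.startswith("W1")]

print(valid)
print(valid_w1)
```

A static checker that understood the pattern facet could in principle
flag starts-with(postcode, 'W1') as always false, which is the kind of
error a user might expect to be caught once some errors are.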

The argument that I'm hearing is that the benefit of jumping through
the strong typing hoops is the predictability of the transformation
result and the optimisability of the XPath queries. What I was trying
to say above is that the former benefit is not nearly so great as the
designers of XPath seem to think. That doesn't mean there's no
benefit, just as I agree that there is some benefit in optimising
XPath queries. I just don't think that there's sufficient benefit to
justify the cost in terms of implementation and user effort.

> I do know of XPath implementations that perform DTD based
> optimization. I don't want to name names, but these are systems that
> use XPath as a standalone language for querying persistent data. I
> don't know whether any XSLT processors do this.

As we've discussed, that's a radically different situation from the
majority of XSLT transformations, or indeed other uses of XPath, such
as XPointer. I expect that XPath implementations that perform
DTD-based optimisation will become XQuery implementations rather than
XPath 2.0 implementations.

>>In particular, support for the cast, treat, assert, and validate
>>expressions over complex types, which require support for the
>>technicalities in the XQuery Semantics, is a major implementation
>>effort and an overhead in a running system.
>
> These *do* add a lot of complexity, and in the context of XSLT, I
> also wonder how much bang for the buck they give us. XQuery clearly
> needs them.
>
> This is, of course, a matter for use cases to sort out ;->

The closest that XPath 2.0 has to use cases is a bunch of
requirements. I can't see anything in them that indicates that cast,
treat, assert or validate is required for XPath 2.0, although there
might be technical reasons that I haven't seen. Part of the point of
discussing this is to learn what makes the WGs think that these are
required.

>>As far as I can tell, implementers can't use information from the
>>PSVI (i.e. an existing XML Schema validator) here; but have to write
>>their own validators in order to perform both the static and dynamic
>>checks that are required to evaluate these expressions.
>
> At least some schema validators do make the PSVI information
> available (via regrettably proprietary interfaces), so I don't see
> why this information can't be exploited. Again, it might make more
> sense to use a separate "lint-and-optimizing-rewrite" tool to check
> and optimize a stylesheet rather than do this every time a
> stylesheet is executed.

I (or the XQuery Semantics WD) might be behind the times, but the
current WD indicates that "XML Schema is based on named typing, while
the XQuery type system is based on structural typing." The definition
of a "subtype" in the XQuery Semantics WD is not the same as a
"derived type" in XML Schema. That's why I say that implementers need
to implement this validation themselves rather than reuse the code of
XML Schema validators.

I agree that a separate stage of linting and optimisation would be
more useful.

> Would you really suggest using *none* of the type operators, or are
> there some that you think would be worthwhile if they were easy to
> implement? I suspect that any XSLT processor that has access to the
> PSVI would find 'treat' and 'cast' reasonably easy to implement -
> 'cast' requires facet checking, but this amounts to about 10
> relatively simple functions.

Let me see if I can describe what I think each of these expressions
is supposed to do; after all, I might be misinterpreting what they're
supposed to be useful for.

First, "instance of". I think that "instance of" meets the requirement
of being able to select elements based on their type, which is
something listed in the XPath 2.0 Requirements document. So I think
that this is worthwhile. On the other hand, I think it should be based
on named typing rather than structural typing, firstly because I think
that's simpler for implementers (they can look at the PSVI to work out
whether one type is derived from another) and secondly because I think
it would be frustrating for users if they don't have control over what
types count as subtypes of each other. For example, if I have the
following two element declarations:

<xs:element name="address" type="addressType" />
<xs:complexType name="addressType">
  <xs:sequence>
    <xs:element name="line" type="xs:string" />
  </xs:sequence>
</xs:complexType>

<xs:element name="poem" type="poemType" />
<xs:complexType name="poemType">
  <xs:sequence>
    <xs:element name="line" type="xs:string" />
  </xs:sequence>
</xs:complexType>

then if I do:

<xsl:template match="*[. instance of element of type addressType]">
  ...
</xsl:template>

then I want to select those elements of the addressType, not the
poemType. The structure of addressType and poemType might be the same,
but they do not have the same semantics.
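
The distinction can be sketched outside XML Schema altogether. The
following Python is an analogy only, not how any XPath processor
works: the two classes stand in for addressType and poemType, and
structurally_same() is a hypothetical helper that compares field names
and types the way a structural type system would.

```python
from dataclasses import dataclass, fields


@dataclass
class AddressType:
    line: str


@dataclass
class PoemType:
    line: str


def structurally_same(a, b):
    """Structural comparison: identical field names and field types."""
    shape_a = [(f.name, f.type) for f in fields(a)]
    shape_b = [(f.name, f.type) for f in fields(b)]
    return shape_a == shape_b


value = PoemType(line="Shall I compare thee to a summer's day?")

# Named typing: a poem is not an address, whatever it contains.
named_match = isinstance(value, AddressType)

# Structural typing: the two types are indistinguishable.
structural_match = structurally_same(value, AddressType)

print(named_match, structural_match)
```

Under the named check the poem is rejected; under the structural check
it would be treated as an address, which is the behaviour I'd want to
avoid in the template above.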

Second, the "cast" operator. From what I can tell, cast is used to
cast one simple type to another simple type. In that way, it's similar
to the XPath 1.0 functions string(), number() and boolean(). Now
XPath 1.0, and XPath 2.0 in XSLT, have a flexible type exception
policy, so most of the time explicit casting from one type to another
isn't required. It's fairly rare to need to cast in XPath 1.0; the
times when it's necessary are:

  - when you want to test whether an element has a string value, as
    opposed to whether the element exists (i.e. test="string(foo)"
    rather than test="foo")

  - when you want to test whether the value of a node is numeric (i.e.
    test="number(foo)" rather than test="foo")
    
  - when you want to use the numeric value of a node within a
    predicate (i.e. select="foo[number(bar)]" rather than
    select="foo[bar]")

  - when you want to sort a bunch of nodes based on whether they have
    a particular characteristic or not (i.e. in xsl:sort,
    select="boolean(foo)" rather than select="foo")

The first is really a shorthand for test="foo != ''". The last is only
an issue because you can't have data-type="boolean" in XSLT 1.0; that
isn't an issue in XSLT 2.0 because you could use
data-type="xs:boolean". In XPath 2.0, the second should be done with
test="foo instance of element of type xs:decimal" instead, I think.
The third can't be done in any other way.
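
To make the first two cases concrete, here is a rough emulation of the
XPath 1.0 string(), number() and boolean() conversions in Python. It
is a sketch under stated assumptions, not a conforming implementation:
the xpath_* helper names and the sample item document are invented,
and whitespace handling is approximated.

```python
import math
import re
import xml.etree.ElementTree as ET

# Approximation of the XPath 1.0 Number syntax: optional whitespace,
# optional minus sign, digits with an optional fraction, no exponent.
_NUMBER = re.compile(r"\s*-?(\d+(\.\d*)?|\.\d+)\s*")


def xpath_string(nodes):
    """string(node-set): string value of the first node, or ''."""
    return "".join(nodes[0].itertext()) if nodes else ""


def xpath_number(nodes):
    """number(node-set): numeric value of the string value, or NaN."""
    text = xpath_string(nodes)
    return float(text) if _NUMBER.fullmatch(text) else math.nan


def xpath_boolean(value):
    """boolean(): node-sets and strings are true if non-empty;
    numbers are true if neither zero nor NaN."""
    if isinstance(value, float):
        return value != 0 and not math.isnan(value)
    return len(value) > 0


# Hypothetical input, invented for illustration.
item = ET.fromstring(
    "<item><foo/><bar>0</bar><baz>abc</baz><qux>42</qux></item>")

results = {
    'test="foo"': xpath_boolean(item.findall("foo")),
    'test="string(foo)"': xpath_boolean(xpath_string(item.findall("foo"))),
    'test="number(bar)"': xpath_boolean(xpath_number(item.findall("bar"))),
    'test="number(baz)"': xpath_boolean(xpath_number(item.findall("baz"))),
    'test="number(qux)"': xpath_boolean(xpath_number(item.findall("qux"))),
}

for test, outcome in results.items():
    print(test, outcome)
```

Note that, under these rules, test="number(bar)" is false when bar
holds 0 as well as when it holds something non-numeric, so the numeric
test is only a rough one.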

The question is whether we'll ever need to explicitly convert a value
to other kinds of values. I can think of two potential use cases, but
neither is compelling:

  - to print out the canonical representation of a particular data
    type (but then there should be format-number() and format-date()
    etc. functions for them)

  - to test whether a node is of a particular type (but then there's
    the "instance of" expression for that)

I haven't yet seen a good use case, and unless there is one I think
that cast should be omitted.
    
Now onto the difficult ones, "treat" and "assert". From what I can
tell, "treat" states that the type of a node or value is a supertype
of a given type, whereas "assert" states that the type of a node or
value is a subtype of a given type. The only benefits that I can see
from these are that the processor might reject certain things during
compilation (and as above I think that this should be done by a
separate tool), and that explicitly casting one complex type to
another enables optimisation.

I looked for use cases for these in the XQuery Use Cases document to
try to see where they might be helpful. There isn't a use case that
involves "assert"; the use case for "treat" seems to be that it
prevents the processor from complaining when you try to access a child
node that, according to the schema, shouldn't be present for a node of
a particular supertype. Since I don't think XSLT processors should be
raising errors in those kinds of situations anyway, I don't see the
point of supporting either of these expressions.

Finally, "validate", which takes the result of an expression and
validates it, usually in some context. Again there's no use case in
the XQuery Use Cases document, so it's hard to tell how the WGs are
imagining this will be useful. The only thing that I can think of is
that this is a way of adding default values to elements and attributes
that you generate; but then, if you're generating those nodes, surely
you can indicate what type they are when you generate them rather than
taking an extra step to do so? So again, I don't see any reason for
validate to be present in XSLT, but there might be one that I'm not
aware of.

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/





 


Copyright 2001 XML.org. This site is hosted by OASIS