From: "Jonathan Robie" <email@example.com>
> Are you still startled? If so, I'm still listening...
Probably. I am a baby bunny transfixed by your sudden headlights :-)
When people talk about Schematron, they often seem to think it is just a
matter of simply evaluating a single XPath expression to boolean in some
way. For example, Examplotron provides an assertion mechanism, but
I don't consider it remotely similar to Schematron just because of that.
XML Schemas uses simplified XPaths for key-checking, but it is
no Schematron. Francis and Eddie's Schematron-embedded-in-XML-Schemas
allows some kinds of simple Schematron schemas to be embedded, but even
it is certainly not full Schematron. Merely having an XPath that
returns empty on failure does not even meet the minimum requirement.
A Schematron schema has four parts:
1) Phases
2) Patterns and rules
3) Assertions (inside rules)
4) Diagnostics
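To make the four parts concrete, here is a minimal sketch of a schema
that uses all of them; the element names being checked ("chapter",
"title") are hypothetical:

```xml
<!-- A minimal sketch of a schema with all four parts; the element
     names being checked are hypothetical. -->
<schema xmlns="http://www.ascc.net/xml/schematron">
  <phase id="structural">
    <active pattern="chapters"/>
  </phase>
  <pattern id="chapters" name="Chapter checks">
    <rule context="chapter">
      <assert test="title" diagnostics="no-title"
        >A chapter must have a title</assert>
    </rule>
  </pattern>
  <diagnostics>
    <diagnostic id="no-title">The chapter beginning
      "<value-of select="substring(., 1, 20)"/>" has no title.</diagnostic>
  </diagnostics>
</schema>
```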
For the phases, there is no direct impact from XPath. What you are
saying is that there are only Pre-Schema-Validation InfoSets and
Post-Schema-Validation InfoSets, and that therefore all my Schematron
schemas should be either for the former or for the latter. So phases
can provide a way to cope with this, because I could, presumably,
make one phase for use on a PreSVI and one for use on a PostSVI.
However, what about when someone constructs a document
by adding a branch from a PostSVI to a PreSVI, perhaps
using some XInclude implementation? Does the whole tree
become Pre or Post? The Schematron user could
just make a phase to cope with some BastardSVI happily,
but not if the XQuery required only a PreSVI or PostSVI.
Or is it that there can never be a hybrid infoset? Or do
we have to re-schema-validate any branch added (including
all key/keyref checking)?
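A sketch of what declaring one phase per infoset flavour might look
like (the pattern names here are hypothetical):

```xml
<!-- Sketch: one phase per infoset flavour; pattern names hypothetical. -->
<phase id="pre-svi">
  <active pattern="raw-structure"/>
</phase>
<phase id="post-svi">
  <active pattern="raw-structure"/>
  <active pattern="defaulted-values"/>
</phase>
```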
For Patterns and rules, there is a definite impact. Currently,
each pattern would typically be implemented as a separate
pass through the document. Perhaps a really nice XSLT
implementation might just use one pass, and evaluate the
rules (and assertions) during that single pass, but I doubt it:
having a more optimizable XPath might result in better
performance in this regard.
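Roughly, an XSLT implementation gives each pattern its own mode and
fires one traversal per mode; a sketch of the idea (fragment only, with
hypothetical names):

```xml
<!-- Sketch of one-pass-per-pattern in XSLT: each pattern gets its own
     mode. Fragment only; element names are hypothetical. -->
<xsl:template match="/">
  <xsl:apply-templates select="/" mode="pattern-1"/>
  <xsl:apply-templates select="/" mode="pattern-2"/>
</xsl:template>
<!-- each rule of the first pattern becomes a template in its mode -->
<xsl:template match="chapter" mode="pattern-1">
  <xsl:if test="not(title)">A chapter must have a title</xsl:if>
  <xsl:apply-templates mode="pattern-1"/>
</xsl:template>
```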
But within patterns, rules are evaluated lexically, with the first rule
whose context matches the current node's context being the
one used. One kind of rule that can be used is a guard rule,
where we first test whether some bad case has happened,
so that the subsequent assertions are safe. It is an important
aspect in the design of rule-based languages to make
case statements implicit in various ways (either by
lexical ordering, or by assigning priorities).
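A guard rule might look like this (a sketch; the element names are
hypothetical):

```xml
<!-- Sketch of a guard rule: rules fire lexically and the first match
     wins, so the bad case placed first shields the later rule. -->
<pattern name="Row checks">
  <rule context="row[not(parent::table)]">
    <assert test="false()">A row may only appear inside a table</assert>
  </rule>
  <rule context="row">
    <assert test="cell">A row must contain at least one cell</assert>
  </rule>
</pattern>
```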
When we test for the bad case, we really want to test it,
not have some other system tell us it is impossible.
For example, let's consider the following case:
<assert test="count(//*[@id = current()/@idref]) = 1"
>A caseRef should reference one element</assert>
<assert test="//*[@id = current()/@idref][self::case]"
>A caseRef should only reference a case</assert>
If the query system believes that there is no way there can
be an @id attribute on some element, it will not actually
test it. For example, if the schema validation failed
for a branch, can we expect that an optimizer will be
smart enough to say "oh, they still need access to it,
I shouldn't optimize away their query in that regard"?
Within assertions, each assertion is evaluated without regard
to the others. However, assertions can just as easily be
negative as positive: as well as
<assert test="x">A <name/> must have an x</assert>
you can have
<report test="x">A <name/> should not have an x</report>
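The two forms are duals of each other; for instance, these are
equivalent:

```xml
<!-- A report is simply the dual of an assert. -->
<report test="x">A <name/> should not have an x</report>
<assert test="not(x)">A <name/> should not have an x</assert>
```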
On the one hand it would be nice if assertions that could be
guaranteed to fail were never tested, but for a validation
language that completely violates a fundamental principle
of validation: you don't accept the judgement of another
component when you can test yourself. For example, you may
want to check for errors introduced by schema
validation: has an attribute value on a local element
been defaulted correctly?
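Such a check is easy to write; a sketch, with hypothetical element and
attribute names:

```xml
<!-- Sketch: testing the defaulting directly rather than trusting it.
     The element and attribute names are hypothetical. -->
<rule context="para">
  <assert test="@status">Schema validation should have supplied the
  defaulted @status attribute on para, but it is absent</assert>
</rule>
```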
It is an utter database-ism to say "well, the schema
says it must have defaulted correctly, because no
other value can have been loaded." Documents are
obviously not like that. The fundamental purpose
of a validation language is to detect errors for whatever
reason, not limited to only errors that assume that
every other link in a chain has worked properly.
Finally, there is the diagnostics part of Schematron. This is
a really essential part of Schematron: the ability to
generate useful messages dynamically, reporting
on what has been found. If XPath2 always returns
null from a PSVI because a path I am asking for
is supposedly impossible, it renders XPath2 useless for this.
As I said, I completely disagree with
the idea that Schematron should only validate constraints
that have not been validated by a previous stage:
that would mean I can have rules that are never
checked, and the results of validating could be quite
misleading. The schema says one thing, the
implementation tests another.
Second finally, it is entirely possible that a PSVI can have been
constructed using one schema, and a query run
using a different version of the schema. Unless there
is some mechanism for guaranteeing that the same
schema has been used (*not* the same namespace,
*not* the same resource or URL, but the same
constructed schema), general purpose validation tools
need to be able to test whether "impossible" things have in fact occurred.
And finally, I do not believe that optimizing a query
by default is in fact in accord with the XML Schemas
recommendation. As Part 1 of the Schema Spec says
"schema validity is not a binary predicate."
For a start, [validity] and [validation-attempted] are
properties of nodes. A PSVI does not only include
valid elements, it also includes invalid elements.
For XPath2 to make available PSVI augmentations
is one thing, but you seem to be requiring that
[validation-attempted] is always full and [validity]
is always valid.
Document-level [validation-attempted] will not be "full"
if any component is not retrieved. We may want
to validate a PSVI some time after it has been created,
when there are no validation messages left, and we want
to gauge the scope of a problem or of conformance.
So I think XPath2 implementations will need to report
whether they are set to (screw things up by) optimization,
and whether they allow this to be changed. I understand
that it is reasonable not to overload a DBMS by asking for
impossible paths, but a validation tool is interested in
exactly those paths. QA is concerned with testing what is,
not with accepting what someone else claims is.
Whenever XPath2 is optimized to use Schema information to
cull "impossible" paths, the application is prevented from
rationally handling errors or assessment failures. For example, if a component
of a schema is to be accessed by URL (by a fully conforming
implementation), and becomes unavailable, I would like to be able
to have a fallback plan, to handle the element (which would be marked
[validation-attempted]=no) generically. That element
is still in the PSVI. I still need to access the invalid
elements in the PSVI to handle them rationally. Any system that strips
out "impossible" elements during schema validation is non-conforming
and should be excluded from any consideration in XPath (and XQuery).
A PSVI is not constrained to only have valid information items, and
all W3C tools which need to access the PSVI must not assume
validity. This is particularly true of XPath2. If, in contrast, XQuery
is not really concerned with XML documents but only with valid
PSVIs, it should clearly state that, and the Query group should be
very careful not to let DBMS assumptions (such as PSVI
validity or data/reference integrity) that are good for queries
colour what XPath2 and its clients have to deal with.