A Call for Rapprochement between W3C XSD and ISO DSDL: A
Non-Intrusive Extension Framework for XSD 1.1 to Support Schematron
and Beyond
Rick Jelliffe
2006-03-14
This note is a contribution to discussions on adding various kinds
of constraint checking to XSD 1.1. The bottom line is a call for a
rapprochement between W3C XSD and ISO DSDL: ISO DSDL is not a
stalking horse for RELAX NG but should be considered a valuable and
primary resource for little languages and approaches for evolving
XSD in a positive direction.
ISO 19757 Document Schema Definition Languages (DSDL) is a
multi-part standard for standardizing small, narrow-focus schema
languages. It is often portrayed as some kind of attempted
competitor to XSD, due to OASIS RELAX NG being one part, but it ain't
necessarily so. If you consider ISO DSDL parts 3 and on as a series
of small schema languages designed to complement any grammar-based
schema language without adding to monolithic complexity, then XSD is
clearly the primary potential adopter of ISO DSDL.
The current state of affairs with the W3C XSD WG reminds me of the
SGML working group at ISO when we were enhancing IS 8879 SGML to encompass
XML which we had recently developed through W3C. One group felt that
we just needed to parameterize SGML more, to complicate it, in order
to cope with the variations required by XML; another group, which I
belonged to, instead felt that layering was the answer: at a certain
point it becomes positively harmful to add complications to a base
specification. So we added an “additional constraints”
link (“SEEALSO”), which allowed the SGML constraints to
be extended by an external document: SGML validity remained
determinate and unitary, and the additional constraints could be
validated as a further, different type of validity (i.e. XML
well-formedness.)
What is the relevance to XSD? XSD is in the same position as SGML
was a decade ago: large, stiflingly difficult to implement, and with
a strong requirement not to weaken determinate validity. Similarly, this
requirement not to weaken validity is mistakenly opposed to ideas of
layering. In fact the reverse is true: layered systems are easier to
test, implement and reason about.
I write not only as the developer of Schematron, and former member
of both the W3C XSD WG and the ISO DSDL WG, but also as a commercial
developer of schema-related products including XSD. It has long been
obvious to me that the XSD-inspired dazedness would eventually clear
and that calls for schema capabilities beyond or above that of simple
grammars would then have their season: the intent of ISO DSDL has
been to collect such little languages for early adopters and for
general community benefit as the fog/panic/exploration of XSD
clears.
Let me be frank, but certainly with no disrespect intended. When XSD
was developed, perhaps a majority of the then XSD WG had never
actually written a serious DTD or other schema for XML (or SGML.)
This perhaps ultimately showed itself in a certain obsessional
intricacy in some minor areas (nillability, the lack of integration
of keys and uniqueness into the type system, extension by suffixation
only, elementFormDefault, etc.) at the expense of other major areas
(patterns on mixed content for example.) I suspect that probably the
majority of the current XSD WG may have never written a
constraint-based schema for XML, e.g. with Xlinkit, XCSL, Schematron
or even XSLT. One cannot after all be a specialist in everything! The
tendency of the XSD WG may therefore be to be less aware than grassroots
users of the advantages, characteristics and opportunities afforded
by Schematron, and therefore to relegate it to a few convenient
categories such as “co-occurrence constraint.”
Schematron was developed in 1999, and has continued in its current
popularity solely because of its general-purpose utility, not because
of any hype or party spirit: XSD users favour it equally with RELAX
NG and DTD users. It is used for detecting conflicting flight plans
over Belgium, checking software architecture rules in the USA, checking
local government forms in Japan, and the conformance of documents to
house rules by big three publishers including here in Australia.
Executive Summary
1) Support for simple co-occurrence constraints is better done
by allowing attributes as particles in content models rather than by
using path expressions. Recommend adopting the mechanism used
successfully by ISO DSDL Part 2 (RELAX NG).
2) XSD needs an extension mechanism which will allow embedded
little languages with constraints required for extended validity.
Recommend new PSVI properties such as [extended validity] to support
extensibility. Raid and support the ISO DSDL effort for appropriate
extension languages.
3) XSD needs a constraint language, regardless of the support
for 1) above. It should use the extension mechanism in 2) above.
Recommend ISO DSDL Part 3 (Schematron) as a required (or strongly
recommended) extension.
4) XSD Keys and Uniqueness probably should be moved out of Part
1 and re-cast as an extension. I don't suppose this is feasible, for
fear of scaring the horses. But it is an example of exactly the
kind of schema language that should be an extension. Because key and
uniqueness constraints are embedded in the Structures specification
currently, there is no XSD-compliant way in which developers can
experiment and evolve new schema languages. The danger of this is
that the XSD WG is forced to do armchair speculative development of
enhancements (“yeah that sounds good”): a recipe for
perpetual premature standardization, inadequate testing, and a sure
way to gather dead wood.
Attributes in Content Models
Support for simple co-occurrence constraints is better done by
allowing attributes as particles in content models rather than by
using path expressions. The approach used by ISO RELAX NG should be
adopted: it has proved to be straightforward to implement, easy for
users to understand, declarative, and streamable. This would require,
I believe, no changes to the PSVI.
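As a minimal sketch of what attributes as particles look like in RELAX NG (XML syntax), the pattern below says that a link carries either an href attribute, or an idref attribute together with a type attribute, but not both forms. The element and attribute names (link, href, idref, type) are hypothetical and chosen only for illustration.

```xml
<!-- Hypothetical vocabulary: "link", "href", "idref" and "type" are illustrative names -->
<element name="link" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- The co-occurrence constraint is an ordinary choice between attribute patterns -->
  <choice>
    <attribute name="href"/>
    <group>
      <attribute name="idref"/>
      <attribute name="type"/>
    </group>
  </choice>
  <text/>
</element>
```

Because the attributes sit in the same choice and group constructs as elements, no path expression is needed to state the constraint.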
Adopting this does not entail any notion of somehow saying “RELAX
NG was right and we were wrong”; on the contrary, interested
users would rejoice that XSD was adopting proven technology. RELAX NG
adopted this feature late: it was not obvious to James Clark,
Murata Makoto and others that it was the correct feature to adopt. (I
believe I was one of the first to suggest it.) But it has proven
itself. I hope the XSD WG relentlessly refuses to take part in any
childish NIH-ism in this: XSD's earliest development was guided by
the proven experience with various deployed schema languages. I
strongly recommend that the XSD WG discipline itself to adopt proven
features from existing deployed schema languages when they are
available, as in this case.
Even though I am obviously a fan of XPath, introducing some
reduced XPath-based syntax is, I believe, the wrong approach here:
while technically feasible it plays to all XSD's weaknesses. It
complicates understanding while only addressing one small area:
exactly the kind of “not enough bangs per buck” that XSD
is notorious for.
In particular, one of Paul Biron's useful suggestions, to use the
streaming subset of XPath to identify a node whose presence provides
the condition for the occurrence of an attribute or element, is, I
think on reflection, the wrong way to go. First because the content
model enhancement above is simpler, cleaner and roughly equivalent.
Second because it uses paths for what they are OK at, but does not
use them for where they shine: with value-based predicates and with
random access. Third because the idea that just providing the most
basic occurrence constraints will actually satisfy user requirements
is wrong-headed: a tokenistic path language will merely temporarily
shift the boundary at which user frustration with XSD's power sets
in.
Extension Framework
XSD needs an extension mechanism which will allow embedded little
languages with constraints required for extended validity.
XSD is notoriously under-layered and complicated to reason about.
That even vendors so freely admit the difficulties they have faced in
implementing XSD properly should be uppermost in the mind of the XSD
WG, I believe; the failure of implementability in XSD must not be
quibbled out of, especially given that XSD had a long gestation
period that, one would expect, would have made early implementations
of higher quality than they proved to be.
So the WG needs to adopt a very different mindset: I am not
talking about changing the PSVI approach or type derivation, I am
talking of the futility of “valid means valid everywhere in all
implementations” when combined with a monolithic architecture.
The failure of XSD implementations to provide consistent validity
results is to some extent attributable to this monolithic
architecture. I suggest that the problem is not with “valid
means valid everywhere in all implementations” but in the lack
of extensibility in XSD: Appinfo is not enough.
My suggestion for an extension mechanism is below. The focus of
ISO DSDL (Document Schema Definition Languages) is now to provide a
standard library of little languages, suitable for XSD to include or
allow by reference. These include ISO DSDL Part 5: DTLL (Datatype
Library Language) and ISO DSDL Part 7: CRDL (Character Repertoire
Description Language). In case there is any feeling that these are
somehow “anti W3C” technologies, I perhaps should note
that DTLL was developed by Jeni Tennison, invited expert to the W3C
XSLT WG, while CRDL was developed by Martin Duerst, long time head of
W3C Internationalization. Indeed, CRDL is based on a technical note
at the W3C.
Extension Framework Details
Every place where <xsd:appinfo> is allowed, and also the top level
of a schema, should allow a new element <xsd:extension>, which may
contain any element in a non-XSD namespace. The following PSVI
properties are added:
Attributes:
[Extended validation attempted] (Assessment Outcome (Attribute))
[Extended validity] (Assessment Outcome (Attribute))
[Extended validity diagnostics] (Assessment Outcome (Attribute))
Elements:
[Extended validation attempted] (Assessment Outcome (Element))
[Extended validity] (Assessment Outcome (Element))
[Extended validity diagnostics] (Assessment Outcome (Element))
Document Root:
[Extended validation attempted] (Assessment Outcome (Document Root))
[Extended validity] (Assessment Outcome (Document Root))
[Extended validity diagnostics] (Assessment Outcome (Document Root))
[Extended validity] is defined as [validity] plus successful use
of all elements in relevant extension elements. Importantly, this
allows an on-ramp for implementations to keep their current notions
of validity: they can allow but ignore all extensions. However, for
extended validity (which should become the new default for
implementations to support), extended validation fails if either
there is an element in an unknown namespace (i.e. one which the
schema implementation does not support) or validation with those
constraints fails. This satisfies the
important objection to “optional” validation: extended
validity always means extended validity.
Note that it is a design requirement in XSD that [Validity] can be
assessed in a single-pass streaming fashion. It is not a design
requirement that [Extended Validity] can be assessed in this manner.
This split commends a layered approach.
Note that the extended validity of the Document Root refers to
outcomes of validating extensions defined on the document root, and
should not be confused with “the validity of the document”.
[Extended validation attempted] gives a list of the namespaces of
the children of the relevant extension elements, which provide keys
for different kinds of extended validation.
[Extended validity diagnostics] are lists of [namespace, text]
pairs, which provide the namespace of the extension coupled with a
human-readable text message, for example as generated dynamically by
Schematron. (Note: the PSVI extensions do not limit the ability of an
API to report other information from schemas for various uses, or to
perform different kinds of non-standard validations.)
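To make the three properties concrete, here is a purely illustrative dump of what a validator might report for one element. The PSVI has no standard XML serialization, so the urn:example:psvi-dump namespace and every element and attribute name in it are assumptions invented for this sketch, as is the diagnostic message.

```xml
<!-- Hypothetical PSVI dump: the namespace and all names below are invented for illustration -->
<psvi:element xmlns:psvi="urn:example:psvi-dump"
              name="book"
              validity="valid"
              extended-validity="invalid">
  <!-- [Extended validation attempted]: namespaces of the extension children that were processed -->
  <psvi:extended-validation-attempted>
    <psvi:namespace>http://purl.oclc.org/dsdl/schematron</psvi:namespace>
  </psvi:extended-validation-attempted>
  <!-- [Extended validity diagnostics]: a list of [namespace, text] pairs -->
  <psvi:extended-validity-diagnostics>
    <psvi:diagnostic namespace="http://purl.oclc.org/dsdl/schematron">firstPage (120) should not be greater than lastPage (95).</psvi:diagnostic>
  </psvi:extended-validity-diagnostics>
</psvi:element>
```

In this sketch the element remains [valid] in the existing sense while failing [extended validity], which is the layering the proposal relies on.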
The presence of these extra PSVI items is the key to
extensibility. I don't believe any “required-extension”
mechanism is needed or warranted.
Schematron as a Required Extension
XSD 1.1 should define ISO Schematron as a required or strongly
recommended extension.
To some extent, attempting to cover all important bases with
exhaustive declarative enhancements to XSD becomes an exercise in
tail-chasing: even if XSD is extended with a dozen new co-occurrence
constraint elements, there will still be the need for a general
purpose constraint language. And, indeed, the best way to determine
which constraints should be generalized into some first-class
property in XSD is to first provide a general purpose constraint
language like Schematron to gather information and increase user and
WG expertise.
The subset of Schematron used conforms to ISO 19757-3
Information technology — Document Schema Definition Languages
(DSDL) — Part 3: Rule-based validation — Schematron
(2006) Annex F: Use of Schematron as a Vocabulary. The namespace used
is http://purl.oclc.org/dsdl/schematron
The following effective DTD gives the required elements and
attributes of the subset. ISO Schematron defines other elements: it
is an error for them to be present. ISO Schematron defines other
attributes: it is not an error for these to be present; they may be
ignored.
<!ELEMENT sch:rule (sch:let*, (sch:assert | sch:report)+)>
<!ATTLIST sch:rule
  context (.) #FIXED '.'
  id CDATA #IMPLIED>
<!ELEMENT sch:let EMPTY>
<!ATTLIST sch:let
  name CDATA #REQUIRED
  value CDATA #REQUIRED>
<!ELEMENT sch:assert (#PCDATA | sch:span | sch:emph | sch:dir | sch:name | sch:value-of)*>
<!ATTLIST sch:assert
  test CDATA #REQUIRED>
<!ELEMENT sch:report (#PCDATA | sch:span | sch:emph | sch:dir | sch:name | sch:value-of)*>
<!ATTLIST sch:report
  test CDATA #REQUIRED>
<!ELEMENT sch:span (#PCDATA)>
<!ELEMENT sch:emph (#PCDATA)>
<!ELEMENT sch:dir (#PCDATA)>
<!ELEMENT sch:name EMPTY>
<!ATTLIST sch:name select CDATA #IMPLIED>
<!ELEMENT sch:value-of EMPTY>
<!ATTLIST sch:value-of select CDATA #IMPLIED>
Note that in this subset:
The context attribute is restricted to be “.”.
In the case of an <extension> element that appears at the
top-level of a schema rather than in a content model, this is “/”
or the document root node (not the root element). For example, this
allows constraining the top-level element of any document to a
certain range.
No special presentation processing is required for the text
of elements span, emph and dir in the PSVI.
The element sch:name should be resolved to the qname of the
local element or attribute (or type if that is ever possible.)
Phases, diagnostics, patterns, abstract rules and abstract
patterns are not part of the subset defined.
The path expression in the test attribute is interpreted as a
boolean expression; it may not resolve to a particular type or node.
For simple co-occurrence constraints, use the extended content
models above.
The path expressions are interpreted as if they are
type-aware XPath 2 path expressions. If an implementation can only
handle some simpler subset, such as XPath 1, the implementation
fails with an error at run time.
The path expressions may require more than streaming access.
This is one issue which sets apart simple [validity] from [extended
validity].
For other semantics, see the ISO Schematron spec, e.g. at
http://www.schematron.com/
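The subset described above might be used as in the following sketch. It assumes the <xsd:extension> element proposed in this note, which is not part of any released XSD and is shown here both at the top level and inside <xsd:annotation> (where <xsd:appinfo> is allowed). The book element and its firstPage and lastPage attributes are hypothetical names used only for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://purl.oclc.org/dsdl/schematron">

  <!-- Proposed syntax, not current XSD: a top-level extension.
       Here the fixed context "." stands for the document root node. -->
  <xsd:extension>
    <sch:rule context=".">
      <sch:assert test="book">The top-level element should be "book".</sch:assert>
    </sch:rule>
  </xsd:extension>

  <xsd:element name="book">
    <xsd:annotation>
      <!-- Proposed syntax: an extension local to this element declaration.
           The context "." is the element being validated. -->
      <xsd:extension>
        <sch:rule context=".">
          <sch:let name="last" value="@lastPage"/>
          <sch:assert test="@firstPage &lt;= $last">In <sch:name/>, firstPage
            (<sch:value-of select="@firstPage"/>) should not be greater than lastPage.</sch:assert>
        </sch:rule>
      </xsd:extension>
    </xsd:annotation>
    <xsd:complexType>
      <xsd:attribute name="firstPage" type="xsd:integer" use="required"/>
      <xsd:attribute name="lastPage" type="xsd:integer" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

An implementation that does not recognize the Schematron namespace could still report plain [validity] for such a schema, but under the proposal it would have to report failure for [extended validity].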
I would like to stress that the provision of Schematron in
extension elements reduces lock-in. At some future stage, some
bright people unknown could come up with some better system as yet
undreamed of. At that time, the XSD WG can then adopt the new
constraint system as the required extension, and obsolete
Schematron. Compare this with the difficulty in, say, adding a new
facet or changing the key and uniqueness constraints in monolithic
XSD 1.0.
The provision of Schematron simplifies the task of XSD
enhancement, because it gives a plausible workaround for rejected
requirements to users. For example, a user who wants to specify that
the top-level element must be “book”