The XSD validator which I wrote in XSLT and described at Markup UK 2018
is still sitting on an internal shelf and hasn't seen the light of day in public, though it reached the point where it was passing something like 95% of the tests.
This was a "back end" schema validator only; it relied on Saxon's Java schema compiler to process the raw XSD documents, including generation of finite state automata for the complex types. But I don't think that doing the front end in XSLT would be particularly difficult (in fact, most of the difficulties are in the back end). Verifying subsumption of restricted types is probably the hardest part.
There are a few issues described in the paper which Rick's note doesn't address:
* assertions would be straightforward if they used untyped XPath. But they don't; they work on semi-validated data (validated against everything except the assertions), and constructing semi-validated data in (non-schema-aware?) XSLT poses a challenge. For example, in an assertion, "@discount lt @price" compares the typed values of the two attributes, not their untyped values (see the sketch after this list).
* XSD's rules for equality of atomic values (used, for example, in uniqueness constraints) aren't the same as XPath's equality rules (e.g. timezone handling is different)
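To make the typed-value point concrete, here is a minimal XSD 1.1 sketch (the product element and its attributes are invented for illustration, with the usual xs prefix bound to the XML Schema namespace):

    <xs:element name="product">
      <xs:complexType>
        <xs:attribute name="price" type="xs:decimal"/>
        <xs:attribute name="discount" type="xs:decimal"/>
        <!-- Compares the typed xs:decimal values, so discount="9"
             price="10" passes. Against untyped attributes, XPath's
             "lt" falls back to string comparison, where "9" sorts
             after "10" and the same test would fail. -->
        <xs:assert test="@discount lt @price"/>
      </xs:complexType>
    </xs:element>

The equality mismatch in the second bullet is just as concrete: as I understand it, XSD never treats an xs:dateTime with a timezone as equal to one without, whereas XPath's "eq" substitutes the processor's implicit timezone before comparing, so the two languages can disagree about whether a uniqueness constraint is violated.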
Yes, working with the XSD specification is a nightmare; it's the toughest spec I've ever had to work with other than Algol 68, and unlike Algol 68, some of the apparent formality turns out to be spurious; when it gets to tricky things that ought to be formal, like whether two types are identical, the spec bails out. Perhaps I'm a masochist, but for me, that's a fun engineering challenge.
I've considered the approach of validating complex types by turning them into regular expressions against a string and using a regex engine. The main reason I decided against it is that regex engines produce no useful diagnostics; they just tell you the string doesn't match. Perhaps the answer to that would be to write a regex engine with better diagnostics - I can see that being useful!
Michael Kay
Saxonica
People interested in doing this should feel free to grab code from https://github.com/Schematron/schematron/tree/master/trunk/xsd2sch (or even update it!)
In about 2008, JSTOR sponsored an R&D project to implement the reasonably large subset of XSD 1.0 that they used, to run as Schematron: this was not only to advance the state of the art, but because they were (I gather) finding that the XSD validators of the time just spewed out standard messages and numbers, which were as unhelpful as the Voynich manuscript to editors and so on. (Perhaps they also wanted to use apps and pipelines that did not support XSD? Phases/progressive validation could also open up some extra workflow possibilities.)
The coverage is approximately:
- simple datatypes: believed to be 100%
- list and union datatypes: not supported
- structural constraints on elements and attributes: supported (~)
- multiple namespaces, import and include: supported (~)
- identity constraints: not supported
- dynamic constraints (xsi:type, xsi:nil): not supported
- tricky prefixes (elementFormDefault): not supported
Obviously implementing identity constraints and xsd:assert would be a doddle. (There is a page on identity constraints at the link above to give the idea.) It needs much more testing to be ready for commercial use, but it is good enough for targeted use or cannibalization.
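To give the idea of why identity constraints map naturally onto Schematron, here is a minimal sketch (element names invented, sch bound to the ISO Schematron namespace) of the kind of rule one might generate for an xs:unique constraint over item/@id; note Michael's caveat above that XSD's equality rules are not quite XPath's:

    <sch:pattern>
      <sch:rule context="item">
        <!-- Fails when any earlier item carries the same @id; this
             approximates xs:unique, ignoring the selector scoping a
             real translation would need. -->
        <sch:assert test="not(preceding::item/@id = @id)">
          The id "<sch:value-of select="@id"/>" is already used by an
          earlier item; ids must be unique.
        </sch:assert>
      </sch:rule>
    </sch:pattern>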
The main difficulty of the project was retaining technical staff, if I recall: they absolutely hated having to deal with the XSD specification and found the technology had too many edge cases to be tractable, which meant that the project had to be organized in small discrete chunks, not for Scrum reasons but simply to limit mental fatigue. (These were not dummies: one was working through his PhD, another ended up in Redmond.)
I guess the main surprise to come out of it was that we could validate content models using XPath 2. Originally we started with just pairwise validation for element content: x/y can only be followed by z, and so on. But it dawned on me that we could make a string listing the names of the child elements in sequence, separated by spaces (e.g. "head body"), and test whether that matched a regex generated from the content model, which took care of cardinality constraints too. (Which meant that Schematron was strictly more powerful than XSD 1.0.)
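For concreteness, here is a minimal sketch of that trick as a Schematron rule (the record element and the content model (A, B*, C?) are invented; queryBinding="xslt2" is assumed so that matches() is available):

    <sch:rule context="record">
      <!-- Build a space-separated signature of the child names in
           document order, e.g. "A B B C". -->
      <sch:let name="sig"
        value="string-join(for $c in * return local-name($c), ' ')"/>
      <!-- Test the signature against a regex compiled from the
           content model (A, B*, C?); the regex quantifiers carry
           the cardinality constraints. -->
      <sch:assert test="matches($sig, '^A( B)*( C)?$')">
        The children of record (<sch:value-of select="$sig"/>) do not
        match the content model (A, B*, C?).
      </sch:assert>
    </sch:rule>

The same sketch also shows the diagnostics limitation described next: the regex either matches or it doesn't, so the message can only report the whole signature rather than the first offending child.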
The joy at finding we could do content-model grammar validation was tempered by the realization that we could not give much better validation diagnostics: the messages always had to be in terms of where the error was detected rather than what caused it. E.g. if the content model was (A, (B, Z, X) | Z) and the instance had A, Z, X, it would say "we found unexpected X here instead of Z" rather than, say, "After A, B is missing, so you cannot have the Z followed by an X." Presumably some extra smarts could be added for this, and perhaps the XSD could have some annotations to help.
The larger issue was that Schematron allows semantic assertions and diagnostics: you can express a constraint in natural language, in the terms the target user understands, and give feedback in those terms. (A real example: I was working on a pipeline system where the edited documents were translated into several intermediate XML vocabularies and structures before being output and validated. The company employed devops people to look at the validation logs, trace back to the original authoring format, and then decide whether it was a programming error or a markup error.) So merely converting an XSD to Schematron did not confer the advantage of efficient, specific, targeted feedback.
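As a sketch of the difference (names invented): a generated grammar check can only report that a required title element failed to match, while a hand-written Schematron assertion can carry the intention:

    <sch:rule context="chapter">
      <!-- The message speaks the author's language and explains why
           the constraint exists, not where the grammar match failed. -->
      <sch:assert test="title">
        Every chapter needs a title so that it can appear in the table
        of contents; add one before the first paragraph.
      </sch:assert>
    </sch:rule>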
(It goes deeper than the names. Grammar-based schemas have no way of capturing and transmitting intention: if an attribute or element is required, why is it required? If a content model is super-complicated, what simpler pattern is actually being modelled, albeit clumsily?)
I would not want to implement this again using XSLT 2. Maybe XSLT 3 is better (?), but I think doing at least some of the stages in a general-purpose language (Java, etc.) that allowed decoratable objects would have reduced the mental complexity a lot: immutability just sucks sometimes.
Cheers
Rick
Hi Folks,

The Schematron processor that I use is an XSLT program that takes a Schematron schema as input and transforms it into an XSLT program specific to that schema:

    Schematron schema --> XSLT --> XSLT for the particular Schematron schema

The “XSLT for the particular Schematron schema” is then run with the XML document to be validated as its input, and its output is the validation results:

    XML doc to be validated --> XSLT for the particular Schematron schema --> validation results

Rick et al. chose to implement Schematron validation by generating a stylesheet for the particular Schematron schema. An alternative strategy would have been to create a universal stylesheet that directly performs Schematron validation on the XML doc to be validated:

    XML doc to be validated --> universal stylesheet --> validation results
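As an illustrative sketch of that two-stage pipeline (the file names rules.sch and doc.xml are invented; iso_svrl_for_xslt2.xsl is the XSLT 2 skeleton from the Schematron repository), both stages can be driven from one XSLT 3 stylesheet via fn:transform():

    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template name="xsl:initial-template">
        <!-- Stage 1: compile the Schematron schema into a
             schema-specific validating stylesheet. -->
        <xsl:variable name="compiled" select="
          transform(map {
            'source-node'         : doc('rules.sch'),
            'stylesheet-location' : 'iso_svrl_for_xslt2.xsl'
          })?output"/>
        <!-- Stage 2: run the generated stylesheet over the instance;
             the principal result is an SVRL validation report. -->
        <xsl:sequence select="
          transform(map {
            'source-node'     : doc('doc.xml'),
            'stylesheet-node' : $compiled
          })?output"/>
      </xsl:template>
    </xsl:stylesheet>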
Interestingly, Michael Kay has a blog post (https://dev.saxonica.com/blog/mike/2018/02/could-we-write-an-xsd-schema-processor-in-xslt.html) in which he discusses the idea of using XSLT to build an XML Schema validator. He explores whether to write an XSLT program that generates another XSLT program (as Schematron does) or whether to write a universal XSLT program. At the end of his blog, Michael writes:

    I still have an open mind about whether a universal stylesheet should be used, or a generated stylesheet for a particular schema.

A fascinating parallel, I think.

/Roger