Re: [xml-dev] Write an XSLT program that generates an XSLT program orwri

People interested in doing this should feel free to grab code from https://github.com/Schematron/schematron/tree/master/trunk/xsd2sch (or even update it!)

In about 2008, JSTOR sponsored an R&D project to implement the reasonably large subset of XSD 1.0 that they used, to run as Schematron: this was not only to advance the state of the art, but because they were (I gather) finding XSD validators of the time just spewed out standard messages and numbers, which were as unhelpful as Voynich to editors and so on. (Perhaps they wanted to use apps and pipelines that did not support XSD too? Phases/progressive validation could also open up some extra workflow possibilities.)

The coverage is approximately:

simple datatypes: believed to be 100%

list and union datatypes: not supported

structural constraints on elements and attributes: supported (~)

multiple namespaces, import and include: supported (~)

identity constraints: not supported

dynamic constraints: (xsi:type, xsi:nil) not supported

tricky prefixes: (elementFormDefault) not supported

Obviously, implementing identity constraints and xsd:assert would be a doddle. (There is a page on identity constraints at the link below to give the idea.) The code needs much more testing to be ready for commercial use, but it is good enough for targeted use or cannibalization.

The main difficulty of the project was retaining technical staff, if I recall: they absolutely hated having to deal with the XSD specification and found the technology had too many edge cases to be tractable, which meant that the project had to be organized in small discrete chunks, not for Scrum reasons but just to limit mental fatigue. (These were not dummies: one was working through his PhD, another ended up in Redmond.)

Anyway, the code is there, and descriptions of the approaches (originally on O'Reilly's blog) are at Schematron.com (find "Converting XML Schemas to Schematron" for background) with details at https://schematron.com/document/2974.html

I guess the main surprise to come out of it was that we could validate content models using XPath 2. Originally we started with just pairwise validation for element content types: x/y can only be followed by z, etc but it dawned on me that we could make a string listing the names of child elements in sequence, separated by spaces (e.g. "head body"), and test if that matched a regex generated from the content model, which took care of cardinality constraints too. (Which meant that Schematron was strictly more powerful than XSD 1.0.)
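To make the idea concrete, here is a minimal sketch in Python rather than XPath 2 (it is not the xsd2sch code). The tuple-based content-model representation and the names to_regex and is_valid are hypothetical, invented for this illustration. It also assumes each child name is followed by a space (so "head body " rather than "head body"), which keeps the generated regex simple and prevents one name from matching as a prefix of another. Namespaces are ignored.

```python
import re

def to_regex(particle):
    """Translate a tiny content-model tree into a regex fragment.

    Hypothetical representation, for illustration only:
      ("elem", name[, occurs])
      ("seq", [particles][, occurs])
      ("choice", [particles][, occurs])
    where occurs is "", "?", "*", "+" or a range such as "{2,5}".
    """
    kind = particle[0]
    if kind == "elem":
        # Each element name is matched together with its trailing space.
        body = re.escape(particle[1]) + " "
    elif kind == "seq":
        body = "".join(to_regex(p) for p in particle[1])
    elif kind == "choice":
        body = "|".join(to_regex(p) for p in particle[1])
    else:
        raise ValueError("unknown particle kind: %r" % (kind,))
    occurs = particle[2] if len(particle) > 2 else ""
    return "(?:" + body + ")" + occurs

def is_valid(child_names, model):
    """Check a sequence of child element names against the content model."""
    names = "".join(name + " " for name in child_names)
    return re.fullmatch(to_regex(model), names) is not None

# (head, body)
html = ("seq", [("elem", "head"), ("elem", "body")])
print(is_valid(["head", "body"], html))   # True
print(is_valid(["body", "head"], html))   # False

# (title, para+): the cardinality constraint comes along for free
section = ("seq", [("elem", "title"), ("elem", "para", "+")])
print(is_valid(["title", "para", "para"], section))  # True
print(is_valid(["title"], section))                  # False
```

In the real converter the regex would be generated once from the XSD and embedded in a Schematron assertion; the sketch only shows that a single regex match covers both ordering and cardinality.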

The joy at finding we could do content model grammar validation was tempered by the realization that we could not give much better validation diagnostics: the messages always had to be in terms of where the error was detected rather than what caused it. E.g. if the content model was ( A, ( B, Z, X) | Z) and the instance had A, Z, X it would say "we found unexpected X here instead of Z" rather than e.g. "After A, B is missing, so you cannot have the Z followed by an X." Presumably some extra smarts could be added for this, and perhaps the XSD could have some annotations to help.

The larger issue was that Schematron allows semantic assertions and diagnostics: you can express a constraint in natural language in the terms that the target user understands, and give feedback to them. (A real example: I was working on a pipeline system where the edited documents were translated into several intermediate XML vocabs and structures before being output and validated. The company employed devops people to look at the validation logs, then trace back to the original authoring format, then decide whether it was a programming error or a markup error.) So merely converting an XSD to Schematron did not deliver the advantage of efficient, specific, targeted feedback.

(It goes deeper than the names. The grammar-based schemas have no capability of capturing and transmitting intention: if an attribute or element is required, why is it required? If a content model is super-complicated, what simpler pattern is actually being modelled, albeit clumsily?)

I would not want to implement this again using XSLT 2. Maybe 3 is better (?) but I think doing at least some of the stages in some general-purpose language (Java, etc) that allowed decoratable objects would have reduced the mental complexity a lot: immutability just sucks sometimes.

Cheers

Rick

On Mon, 9 May 2022, 21:16 Roger L Costello, <costello@mitre.org> wrote:

Hi Folks,

The Schematron processor that I use is an XSLT program that takes as input a Schematron schema and the XSLT program transforms the Schematron schema into an XSLT program that is specific to the Schematron schema:

Schematron schema --> XSLT --> XSLT for the particular Schematron schema

Then the “XSLT for the particular Schematron schema” is run and it inputs the XML document to be validated. The output is the validation results:

XML doc to be validated --> XSLT for the particular Schematron schema --> validation results

Rick et al chose to implement Schematron validation by generating a stylesheet for the particular Schematron schema.

An alternative strategy would have been to create a universal stylesheet that directly performs Schematron validation on the XML doc to be validated:

XML doc to be validated --> universal stylesheet --> validation results
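The two strategies can be contrasted with a toy sketch in Python rather than XSLT. Everything here is hypothetical and greatly simplified: a "schema" is just a list of (context element, required child, message) rules instead of real Schematron with XPath, and the names validate_universal and generate_validator_source are invented for this illustration. The universal validator interprets the rules at validation time; the generator emits the source of a validator specialised to one rule set, which is then run on the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified rule format: (context element, required child, message)
RULES = [
    ("section", "title", "A section must have a title."),
]

def validate_universal(rules, root):
    """Universal strategy: one program that reads the rules while validating."""
    messages = []
    for context, required_child, message in rules:
        for node in root.iter(context):
            if node.find(required_child) is None:
                messages.append(message)
    return messages

def generate_validator_source(rules):
    """Generator strategy: emit source code for a validator specific to these rules."""
    lines = ["def validate_generated(root):", "    messages = []"]
    for context, required_child, message in rules:
        lines.append(f"    for node in root.iter({context!r}):")
        lines.append(f"        if node.find({required_child!r}) is None:")
        lines.append(f"            messages.append({message!r})")
    lines.append("    return messages")
    return "\n".join(lines)

# Stage 1: rules --> generated program (analogous to schema --> XSLT).
namespace = {}
exec(generate_validator_source(RULES), namespace)
validate_generated = namespace["validate_generated"]

# Stage 2: document --> generated program --> validation results.
doc = ET.fromstring("<doc><section><title>One</title></section><section/></doc>")
print(validate_generated(doc))         # ['A section must have a title.']
print(validate_universal(RULES, doc))  # ['A section must have a title.']
```

Both routes report the same result; the difference is whether the rules are consulted on every run or compiled once into a program specific to the schema.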

Interestingly, Michael Kay has a blog post (https://dev.saxonica.com/blog/mike/2018/02/could-we-write-an-xsd-schema-processor-in-xslt.html) in which he discusses the idea of using XSLT to build an XML Schema validator. He explores the idea of whether to write an XSLT program that generates another XSLT program (as Schematron does) or whether to write a universal XSLT program. At the end of his blog, Michael writes:

I still have an open mind about whether a universal stylesheet should be used, or a generated stylesheet for a particular schema.

A fascinating parallel, I think.

/Roger