Re: [xml-dev] Write an XSLT program that generates an XSLT program or write a universal XSLT program
People interested in doing this should feel free to grab code from
https://github.com/Schematron/schematron/tree/master/trunk/xsd2sch (or even update it!)
In about 2008, JSTOR sponsored an R&D project to implement the reasonably large subset of XSD 1.0 that they used, to run as Schematron: this was not only to advance the state of the art, but because they were (I gather) finding XSD validators of the time just spewed out standard messages and numbers, which were as unhelpful as Voynich to editors and so on. (Perhaps they wanted to use apps and pipelines that did not support XSD too? Phases/progressive validation could also open up some extra workflow possibilities.)
The coverage is approximately:
- simple datatypes: believed to be 100%
- list and union datatypes: not supported
- structural constraints on elements and attributes: supported (~)
- multiple namespaces, import and include: supported (~)
- identity constraints: not supported
- dynamic constraints: (xsi:type, xsi:nil) not supported
- tricky prefixes: (elementFormDefault) not supported
Obviously implementing identity constraints and xsd:assert would be a doddle. (There is a page on identity constraints at the link below to give the idea.) It needs much more testing to be ready for commercial use, but is good enough for targeted use or cannibalization.
The main difficulty of the project was retaining technical staff, if I recall: they absolutely hated having to deal with the XSD specification and found the technology had too many edge cases to be tractable, which meant that the project had to be organized in small discrete chunks: not for Scrum reasons, but simply to limit mental fatigue. (These were not dummies: one was working through his PhD, another ended up in Redmond.)
Anyway, the code is there, and descriptions of the approaches (originally on O'Reilly's blog) are at Schematron.com (find "Converting XML Schemas to Schematron" for background), with details at
https://schematron.com/document/2974.html
I guess the main surprise to come out of it was that we could validate content models using XPath 2. Originally we started with just pairwise validation for element content types (x/y can only be followed by z, etc.), but it dawned on me that we could make a string listing the names of child elements in sequence, separated by spaces (e.g. "head body"), and test if that matched a regex generated from the content model, which took care of cardinality constraints too. (Which meant that Schematron was strictly more powerful than XSD 1.0.)
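A minimal sketch of the regex idea, written in Python rather than XPath 2 and not taken from the xsd2sch code. The DTD-style model syntax, the function names model_to_regex and is_valid, and the sample models are all assumptions made for illustration. It also puts a space after every name, including the last, instead of separating names with spaces, which keeps the generated regex simpler. In XPath 2 the same test would presumably be built from matches() over a string-join() of the child names.

```python
import re

# Hypothetical, simplified element-name syntax (no namespace prefixes).
NAME = r"[A-Za-z_][\w.\-]*"

def model_to_regex(model):
    """Turn a DTD-style content model, e.g. "(title, para+, (note | warning)*)",
    into a regex over a string of child names, each followed by one space."""
    parts = []
    for tok in re.findall(NAME + r"|[()|?*+,]", model):
        if tok == ",":
            continue                      # sequence is plain concatenation
        if tok == "(":
            parts.append("(?:")           # non-capturing group
        elif tok in (")", "|", "?", "*", "+"):
            parts.append(tok)             # same meaning in regex syntax
        else:
            # An element name: must be followed by its terminating space,
            # so "para" can never match the start of "paragraph".
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def is_valid(children, model):
    """True if the sequence of child element names satisfies the model."""
    names = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(names) is not None

if __name__ == "__main__":
    model = "(title, para+, (note | warning)*)"
    print(is_valid(["title", "para", "para", "note"], model))
    print(is_valid(["title"], model))
```

Cardinality comes for free because ?, * and + carry over unchanged from the content model into the regex. Note that a failed match only answers yes or no: it says nothing about which child caused the failure.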
The joy at finding we could do content model grammar validation was tempered by the realization that we could not give much better validation diagnostics: the messages always had to be in terms of where the error was detected rather than what caused it. E.g. if the content model was ( A, ( B, Z, X) | Z) and the instance had A, Z, X it would say "we found unexpected X here instead of Z" rather than e.g. "After A, B is missing, so you cannot have the Z followed by an X." Presumably some extra smarts could be added for this, and perhaps the XSD could have some annotations to help.
The larger issue was that Schematron allows semantic assertions and diagnostics: you can express a constraint in natural language, in terms that the target user understands, and give feedback to them. (A real example: I was working on a pipeline system where the edited documents were translated into several intermediate XML vocabs and structures before being output and validated. The company employed devops people to look at the validation logs, then trace back to the original authoring format, then decide whether it was a programming error or a markup error.) So merely converting an XSD to Schematron did not give the advantage of efficient, specific, targeted feedback.
(It goes deeper than the names. The grammar-based schemas have no capability of capturing and transmitting intention: if an attribute or element is required, why is it required? If a content model is super-complicated, what simpler pattern is actually being modelled, albeit clumsily?)
I would not want to implement this again using XSLT 2. Maybe 3 is better (?) but I think doing at least some of the stages in some general-purpose language (Java, etc) that allowed decoratable objects would have reduced the mental complexity a lot: immutability just sucks sometimes.
Cheers
Rick