RE: [xml-dev] RE: ANN: 10 Laws of Schematron Design

Thanks a lot Rick.

Here are some valuable lessons that I have learned:

- When creating a design take into consideration its performance on large data sets.

- A design may work beautifully on small-scale data sets but fail miserably on large-scale data sets.

- Test your design against large data sets (at least 100,000 records) before stamping it finished!

Personally, I am going to apply my new insight to XML Schema design as well: an XML Schema design may use all sorts of nice organizational structures (abstract types, substitution groups, all, etc.) but if schema-validation of large XML instances takes 100 seconds (or more) then that design is useless.

/Roger

From: Rick Jelliffe [mailto:rjelliffe@allette.com.au]
Sent: Sunday, November 09, 2014 8:06 PM
To: Costello, Roger L.
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] RE: ANN: 10 Laws of Schematron Design

Quality XSLT engines like Saxon and Microsoft's do all sorts of optimizations.

But having one assertion per rule and one rule per pattern will tax even them. In the usual Schematron implementation, it is equivalent to saying "I want to make a separate complete pass over the document for each assertion", and "I want to use 'if' but no 'elses' to find contexts" and "I want to perform all the tests to find the context each time." Slow. (That approach is fine sometimes, as in CAM where you want complete isolation of each test case. But it is not helpful for performance.)

Perhaps I should say it in the negative. What I have found about Schematron is

1. Developers don't like the report element. They will use the assert element for "I found something bad" statements with complex upside-down logic, rather than use the report element, which is just what they want.

2. Developers don't like the assert element. They will put the assertion test into a predicate on the rule/@context, then just have an assert/@test with false().

3. Developers don't like the rule element. They don't take advantage of rules as cases in a big case statement without fall-through, and end up having XPaths with extra tests merely to distinguish items in similar contexts.

4. Developers don't like let or xsl:key. If something is worth calculating, it is worth calculating as many times as you need it.

5. Developers love patterns. They must, otherwise they would not try to have a separate rule per pattern. Ay caramba. Being able to separate rules into patterns is the thing that makes the case-statement nature of rules effective: if there were only one pattern, then each rule would have to be very specific.

6. Developers don't like XPath: if they can break out of XPath into Java they will try it.

7. Developers love adopting a blue sky methodology based on pure thought and elegance. For example, they have a performance-critical application, but they adopt structuring conventions without benchmarking, and then when they do benchmark they shoot themselves in the foot by using a slow XSLT processor.
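To make points 1 to 4 concrete, here is a minimal sketch of the elements used as intended. The Book vocabulary, the draft attribute, the limits, and the message texts are all hypothetical and are not taken from Roger's schema; an XSLT 2 query binding is assumed.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

  <sch:pattern id="catalog-items">

    <!-- Rules in one pattern act like a case statement without fall-through:
         a node is handled by the first rule whose context matches it. -->
    <sch:rule context="Book[@draft = 'true']">
      <!-- report fires when its test is TRUE: "I found something bad" -->
      <sch:report test="ISBN">A draft Book should not have an ISBN yet.</sch:report>
    </sch:rule>

    <!-- Draft books never reach this rule, so the context needs no
         extra [not(@draft = 'true')] predicate. -->
    <sch:rule context="Book">
      <!-- Computed once per Book and reused by two assertions below -->
      <sch:let name="authorCount" value="count(Author)"/>

      <!-- assert fires when its test is FALSE. The test lives here, not in
           context="Book[not(ISBN)]" paired with test="false()". -->
      <sch:assert test="ISBN">A Book must have an ISBN.</sch:assert>
      <sch:assert test="$authorCount >= 1">A Book must have at least one Author.</sch:assert>
      <sch:assert test="$authorCount &lt;= 10">A Book may list at most 10 Authors.</sch:assert>
    </sch:rule>

  </sch:pattern>
</sch:schema>
```

The ordering of the two rules matters: the more specific context comes first, and the general one picks up everything else.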

If your assertion language didn't have patterns and rules and variables and keys, just assertions, then you would have very long and convoluted test XPaths, and you would require the XSLT engine to do kinds of optimizations that I don't think they currently do (perhaps "XPath fragment common expression folding" expresses it). If you use patterns to group rules, and rules to group assertions, and variables or keys to avoid recalculation, then your schema will usually be in a more efficient state, where the kinds of optimizations that schema engines do can have an effect.
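As an illustration of using a key to avoid recalculation, the sketch below checks ISBN uniqueness through an index instead of rescanning the document for every Book (as a test like not(preceding::Book/ISBN = ISBN) would). It assumes an XSLT-based Schematron implementation that accepts xsl:key before the patterns; the element names and key name are hypothetical.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            queryBinding="xslt2">

  <!-- Index every Book by its ISBN; the processor can build this once -->
  <xsl:key name="book-by-isbn" match="Book" use="ISBN"/>

  <sch:pattern id="unique-isbn">
    <sch:rule context="Book[ISBN]">
      <!-- A keyed lookup replaces a fresh scan of all other Books -->
      <sch:assert test="count(key('book-by-isbn', ISBN)) = 1">
        The ISBN <sch:value-of select="ISBN"/> is used by more than one Book.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```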

If it turns out that there are some patterns or rules or assertions that make aggregate performance too slow then consider if you should put in a weaker faster test that allows some false negatives first, then test the failed documents only with the more specific test. Like a Bloom filter. (You can model these as phases.) A coarse cheap net to catch most good documents, then a fine-grained expensive net to catch the remaining good documents. A 'red flag' system.
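A minimal sketch of that two-stage idea modeled as phases, under invented assumptions: the business rule (Books priced above 100 need an approval from a listed Approver), the element names, and the phase and pattern ids are all hypothetical. A document that passes the coarse phase cannot fail the fine one; a document that fails the coarse phase is only red-flagged and is then run through the fine phase.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

  <!-- Phase 1: cheap screen, run on every document -->
  <sch:phase id="coarse">
    <sch:active pattern="cheap-screen"/>
  </sch:phase>

  <!-- Phase 2: expensive check, run only on documents flagged by phase 1 -->
  <sch:phase id="fine">
    <sch:active pattern="approval-check"/>
  </sch:phase>

  <sch:pattern id="cheap-screen">
    <sch:rule context="/Catalog">
      <!-- Fails for some good documents too; that is acceptable here -->
      <sch:assert test="not(Book/Price > 100)">
        Red flag: at least one Book is priced above 100 and needs the approval check.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <sch:pattern id="approval-check">
    <sch:rule context="Book[Price > 100]">
      <!-- Cross-reference lookup, only evaluated for the flagged documents -->
      <sch:assert test="Approval/@by = /Catalog/Approvers/Approver/@id">
        A Book priced above 100 must carry an Approval from a listed Approver.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

How a phase is selected at validation time depends on the Schematron implementation, so that step is left out of the sketch.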

But, if all else fails, then use your Schematron schema as the specification and unit test for a C# or Scala etc. implementation.

Cheers
Rick

On 09/11/2014 4:30 AM, "Costello, Roger L." <costello@mitre.org> wrote:

Hi Folks,

Rick Jelliffe provided good reasons for why the below design approach is not good.

I just found another reason: performance is terrible with the approach!

Using the approach described below I implemented a bunch of Schematron rules and evaluated the rules on an XML document that has over 110,000 items. The execution time was: 100 seconds.

The approach described below results in creating lots of Schematron <sch:rule> elements, each containing only one <sch:assert>. Well, lots of <sch:rule> elements makes for terrible performance. So, I consolidated the <sch:rule> elements into just three, each containing a bunch of <sch:assert> elements. The execution time dropped to 5 seconds! Wow! That is a huge difference.

Bottom line: when designing Schematron, use few <sch:rule> elements.
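A minimal sketch of the consolidated form, assuming a Book context and an XSLT 2 query binding. Only rule Books-ID-00015 comes from the original announcement; its test expression, and the whole of Books-ID-00016, are hypothetical stand-ins. Each assert keeps its own id, so traceability to a business rule is not lost by merging the rule elements.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

  <sch:pattern id="Books">
    <!-- One rule: the Book context is matched once and every assert
         is evaluated against that same context node. -->
    <sch:rule context="Book">

      <sch:assert id="Books-ID-00015"
                  test="Title and Author and Date and ISBN and Publisher">
        [Books-ID-00015][ERROR] Each Book must contain Title, Author,
        Date, ISBN, and Publisher information.
      </sch:assert>

      <!-- Hypothetical second business rule, shown only to illustrate grouping -->
      <sch:assert id="Books-ID-00016"
                  test="not(Title) or string-length(normalize-space(Title)) > 0">
        [Books-ID-00016][ERROR] A Book Title must not be empty.
      </sch:assert>

    </sch:rule>
  </sch:pattern>

</sch:schema>
```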

/Roger

From: Costello, Roger L.
Sent: Wednesday, October 08, 2014 2:04 PM
To: xml-dev@lists.xml.org
Subject: ANN: 10 Laws of Schematron Design

Hi Folks,

I think the below “10 Laws of Schematron Design” are good. I distilled them from examination of this [1]. I welcome your additions to these “Laws”.

/Roger

-----------------------------------------------------------------------------------------------

1. Assign each business rule a unique identifier. For example, assign this rule:

               Each Book must contain Title, Author, Date, ISBN, and Publisher
               information.

the ID: Books-ID-00015.

2. In each file place just one pattern element (i.e., the file doesn’t have the schema element).

3. Give each pattern a unique identifier, e.g., <sch:pattern id="Books-ID-00015"

4. In each pattern element place one rule element and in the rule element place one assert element. Give the assert element the same ID as the pattern element, e.g., <sch:assert id="Books-ID-00015"

5. In the body of the assert element is a natural language description of the rule. Start that description with the ID within brackets followed by [ERROR] followed by the natural language description, e.g.,

                <sch:assert test="…" id="Books-ID-00015">
                     [Books-ID-00015][ERROR] Each Book must contain Title, Author,
                      Date, ISBN, and Publisher information.
                </sch:assert>

6. Give the filename the same name as the ID, e.g. Books-ID-00015.sch

7. Identifiers are all of the format Books-ID-XXXXX, with rule files named Books_ID_XXXXX.sch.

Rationale for design guidelines 1-7: By following the guidelines there is complete traceability. Each business rule can be traced to exactly one Schematron implementation. Conversely, each Schematron implementation can be traced to exactly one business rule. Also, it is easy to change an implementation of a rule: simply replace the file containing the rule with another file that implements the same rule.
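Put together, guidelines 1-7 would give a rule file along the lines of the sketch below. The context and test expressions are assumptions made for illustration (the announcement leaves the test elided), and the element names are taken from the wording of the business rule.

```xml
<!-- Rules/Books/Books_ID_00015.sch: one pattern, no schema element -->
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             id="Books-ID-00015">
  <!-- One rule holding one assert; the assert id matches the pattern id -->
  <sch:rule context="Book">
    <!-- Hypothetical test: true when all five children are present -->
    <sch:assert id="Books-ID-00015"
                test="Title and Author and Date and ISBN and Publisher">
      [Books-ID-00015][ERROR] Each Book must contain Title, Author,
      Date, ISBN, and Publisher information.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```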

8. Create a Book Schematron Guide which describes every file, e.g.,

2.1 - Rules/Books/Books_ID_00015.sch

Rule Description: Books-ID-00015 Each Book must contain Title, Author, Date, ISBN, and Publisher information.

Code Description: If the context item is a Book, then …

Schematron Code: -- show the Schematron code here --

9. Create a top-level Schematron file. This file has no patterns or rules. It has the schema root element. It has all namespace declarations. It includes all of the other files. It defines any variables that will be needed in any of the files. It defines any (XSLT) functions that will be needed in any of the files. Validate instance documents against this file; the instance document will be validated against all the rules defined in all the files.

Rationale for design guideline 9: No need to perform multiple Schematron validations; simply validate instances against this top-level Schematron file. There is a single, central place where namespaces and variables and functions are defined, so it is easy to adjust them, when needed.
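A minimal sketch of such a top-level file. The namespace URI, the variable, the title, and the second included file are hypothetical; only the Rules/Books/Books_ID_00015.sch path appears in the guidelines.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

  <sch:title>Book rules: top-level file</sch:title>

  <!-- All namespace declarations live here (hypothetical prefix and URI) -->
  <sch:ns prefix="bk" uri="http://example.org/books"/>

  <!-- Variables shared by the included files (hypothetical) -->
  <sch:let name="currentYear" value="year-from-date(current-date())"/>

  <!-- No patterns or rules of its own: each include pulls in
       one single-pattern file. -->
  <sch:include href="Rules/Books/Books_ID_00015.sch"/>
  <sch:include href="Rules/Books/Books_ID_00016.sch"/>

</sch:schema>
```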

10. Identify common information patterns in the business rules; define the pattern once (use Schematron abstract patterns and abstract rules); reuse the pattern wherever possible. Place the files that contain abstract patterns/rules in a “Lib” folder. Place the files that contain concrete (non-abstract) patterns/rules in a “Rules” folder and within there create subfolders, e.g., Books, Magazines, Articles.
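A minimal sketch of an abstract pattern and one concrete use of it. Both are shown in a single schema so the example is self-contained; under guideline 10 the abstract pattern would sit in a file under Lib and the concrete one under Rules/Books. The pattern ids, parameter names, and the ID Books-ID-00020 are hypothetical.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

  <!-- Reusable information pattern: "a parent must contain a given child".
       $parent and $child are placeholders filled in by each concrete pattern. -->
  <sch:pattern abstract="true" id="required-child">
    <sch:rule context="$parent">
      <sch:assert test="$child">
        [ERROR] <sch:name/> is missing a required child element.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <!-- Concrete pattern: instantiates the abstract one for Book and ISBN -->
  <sch:pattern is-a="required-child" id="Books-ID-00020">
    <sch:param name="parent" value="Book"/>
    <sch:param name="child" value="ISBN"/>
  </sch:pattern>

</sch:schema>
```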

[1] http://www.dni.gov/files/documents/CIO/ICEA/Foxtrot/ISM_V10.zip