Thanks a lot, Rick. Here are some valuable lessons that I have learned:

- When creating a design, take into consideration its performance on large data sets.
- A design may work beautifully on small-scale data sets but fail miserably on large-scale data sets.
- Test your design against large data sets (at least 100,000 records) before stamping it finished!

Personally, I am going to apply my new insight to XML Schema design as well: an XML Schema design may use all sorts of nice organizational structures (abstract types, substitution groups, all, etc.), but if schema validation of large XML instances takes 100 seconds (or more), then that design is useless.
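
A minimal sketch of that kind of scale test, in Python. The element names (records, record, value), the helper names, and the stand-in check() function are all hypothetical; in practice check() would be replaced by the real validation step, and the timings depend entirely on the machine and the validator.

```python
import time
import xml.etree.ElementTree as ET


def make_instance(n_records):
    """Build a synthetic XML instance with n_records <record> elements (hypothetical layout)."""
    root = ET.Element("records")
    for i in range(n_records):
        rec = ET.SubElement(root, "record", id=str(i))
        ET.SubElement(rec, "value").text = str(i % 97)
    return ET.tostring(root, encoding="unicode")


def check(xml_text):
    """Stand-in for the real validation step: count records that lack a <value> child."""
    root = ET.fromstring(xml_text)
    return sum(1 for rec in root.iter("record") if rec.find("value") is None)


def time_check(n_records):
    """Return (failure count, elapsed seconds) for one run; instance generation is not timed."""
    xml_text = make_instance(n_records)
    start = time.perf_counter()
    failures = check(xml_text)
    return failures, time.perf_counter() - start


# Compare a small instance with a large one before calling the design finished.
for n in (100, 100_000):
    failures, seconds = time_check(n)
    print(f"{n:>7} records: {failures} failures, {seconds:.3f} s")
```

Running the same check at both sizes shows whether the cost grows roughly in proportion to the data or much faster.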
/Roger

From: Rick Jelliffe [mailto:rjelliffe@allette.com.au]
Quality XSLT engines like Saxon and MS's do all sorts of optimizations. But having one assertion per rule and one rule per pattern will tax even them. In the usual Schematron implementation, it is equivalent to saying "I want to make a separate complete pass over the document for each assertion", and "I want to use 'if' but no 'elses' to find contexts" and "I want to perform all the tests to find the context each time." Slow. (That approach is fine sometimes, as in CAM where you want complete isolation of each test case. But it is not helpful for performance.)

Perhaps I should say it in the negative. What I have found about Schematron is:

1. Developers don't like the report element. They will use the assert element for "I found something bad" statements with complex upside-down logic, rather than use the report element, which is just what they want.
2. Developers don't like the assert element. They will put the assertion test into a predicate on the rule/@context, then just have an assert/@test with false().
3. Developers don't like the rule element. They don't take advantage of rules as cases in a big case statement without fall-through, and end up having XPaths with extra tests merely to distinguish items in similar contexts.
4. Developers don't like let or xsl:key. If something is worth calculating, it is worth calculating as many times as you need it.
5. Developers love patterns. They must, otherwise they would not try to have a separate rule per pattern. Ay caramba. Being able to separate rules into patterns is the thing that makes the case-statement nature of rules effective: if there were only one pattern, then each rule would have to be very specific.
6. Developers don't like XPath: if they can break out of XPath into Java they will try it.
7. Developers love adopting a blue-sky methodology based on pure thought and elegance. For example, they have a performance-critical application, but they adopt structuring conventions without benchmarking, and then when they do benchmark they shoot themselves in the foot by using a slow XSLT processor.

If your assertion language didn't have patterns and rules and variables and keys, just assertions, then you would have very long and convoluted test XPaths, and you would require the XSLT engine to do kinds of optimizations that
I don't think they currently do (perhaps "XPath fragment common expression folding" expresses it).

If you use patterns to group rules, and rules to group assertions, and variables or keys to avoid recalculation, then your schema will usually be in a more efficient state, where the kinds of optimizations that schema engines do can have an effect.

If it turns out that there are some patterns or rules or assertions that make aggregate performance too slow, then consider whether you should first put in a weaker, faster test that allows some false negatives, then test only the failed documents with the more specific test. Like a Bloom filter. (You can model these as phases.) A coarse cheap net to catch most good documents, then a fine-grained expensive net to catch the remaining good documents. A 'red flag' system.

But, if all else fails, then use your Schematron schema as the specification and unit test for a C# or Scala etc. implementation.

Cheers

On 09/11/2014 4:30 AM, "Costello, Roger L." <costello@mitre.org> wrote:
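
The coarse-net, fine-net idea in Rick's message above can be sketched outside Schematron as a two-stage check. This is only an illustration in Python, not a Schematron phase definition: the element names (item, ref), the constraint being checked (every ref/@target must match some item/@id), and the function names are all assumptions made for the example. The coarse test is deliberately conservative: when it passes, the document is certainly good; when it fails, that only raises a red flag and the exact test decides.

```python
import xml.etree.ElementTree as ET


def coarse_check(root):
    """Cheap, conservative test. True means the document is certainly valid;
    False is only a red flag (a good document may still land here)."""
    # A document with no <ref> descendants cannot contain a dangling reference.
    return root.find(".//ref") is None


def fine_check(root):
    """Expensive, exact test: every ref/@target must match some item/@id."""
    ids = {item.get("id") for item in root.iter("item")}
    return all(ref.get("target") in ids for ref in root.iter("ref"))


def validate(xml_text):
    """Run the cheap test first; run the exact test only on red-flagged documents."""
    root = ET.fromstring(xml_text)
    if coarse_check(root):
        return "valid (coarse)"
    return "valid (fine)" if fine_check(root) else "invalid"


# Hypothetical sample documents: no refs, a resolvable ref, a dangling ref.
samples = [
    "<doc><item id='a'/></doc>",
    "<doc><item id='a'/><ref target='a'/></doc>",
    "<doc><item id='a'/><ref target='b'/></doc>",
]
for text in samples:
    print(validate(text))
```

The saving depends on how many documents the coarse test clears: if most documents are red-flagged, the first stage is pure overhead.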