Re: [xml-dev] XML spec and XSD


From: Rick Jelliffe <rjelliffe@allette.com.au>
Date: Thu, 19 Nov 2009 18:12:53 +1100

Andrew Welch wrote:
> I mentioned the streaming aspect for 2 reasons:
>
> 1) If validation performance is an issue, can RNG + Schematron still
> be considered when XSD validation is so fast?
>   
I think Andrew is identifying one of the basic flaws of XSD evolution:  
an optimization controls the design.  

But I don't believe the claim that XSD validation is necessarily faster 
than RNG + Schematron, even at the poor state of optimization we have at 
the moment.

1) There are many kinds of constraints that Schematron can do that XSD 
1.1 assertions etc cannot do. XSD cannot be considered faster at things 
that are completely out of its reach!  On these, I can state that RELAX 
NG and Schematron are absolutely faster than XSD  :-) 

2) An XSD assertion requires that a local XDM trimmed branch be 
constructed. So the worst case for XSD is where all elements have an 
assertion: the same amount of tree building would have to go on as for 
building a full XDM tree for the typical non-streaming implementation of 
Schematron. There would be a space saving, but not necessarily a speed 
saving.  (In fact, I think only the non leaf elements would need 
assertions to get to this same point.)
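
The effect described in point 2 can be sketched roughly as follows. This is 
only an illustration, not how any particular XSD processor works: the 
function name count_retained and the assertions map (element name to a 
predicate) are hypothetical, and Python's ElementTree stands in for an XDM 
branch builder. It counts how many elements have to be kept in memory 
because some enclosing element carries an assertion.

```python
import io
import xml.etree.ElementTree as ET


def count_retained(xml_bytes, assertions):
    """Stream a document, keeping subtrees only where an assertion needs them.

    `assertions` is a hypothetical map of element name -> predicate(element).
    Returns (failed_tags, retained_elements, total_elements).
    """
    open_asserted = 0  # currently open elements that carry an assertion
    retained = total = 0
    failed = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            if elem.tag in assertions:
                open_asserted += 1
            continue
        total += 1
        has_assertion = elem.tag in assertions
        if has_assertion:
            open_asserted -= 1
            if not assertions[elem.tag](elem):
                failed.append(elem.tag)
        if has_assertion or open_asserted > 0:
            retained += 1  # part of some assertion's local branch
        if open_asserted == 0:
            elem.clear()  # no enclosing assertion still needs this content
    return failed, retained, total


DOC = b"<order><item><qty>2</qty></item><item><qty>3</qty></item><note/></order>"

# Assertion on a low-level element: only the item branches are kept.
print(count_retained(DOC, {"item": lambda e: int(e.findtext("qty")) > 0}))

# Assertion on the root: every element has to be kept until the end.
print(count_retained(DOC, {"order": lambda e: len(e.findall("item")) == 2}))
```

With an assertion on the root element, the retained count equals the total 
element count, which is the "same amount of tree building" case.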

3) There are Schematron implementations with a terminate on fail 
construction. (Ken Holman contributed this IIRC.) So where pass/fail 
testing is required,  these can be very fast in the amount of work they 
need to do. Combine them with a streaming implementation or even a 
lazily constructed DOM, and they certainly could be faster than an XSD 
implementation that attempts to run over a whole large file.

3a)  A RELAX NG implementation that just provides a validation result 
does much less work than an XSD implementation that produces a full 
PSVI, too.
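
As a rough illustration of the terminate-on-fail idea in point 3, here is a 
minimal sketch of a pass/fail check over a stream of parse events. It is not 
Schematron and not any real implementation's API: first_failure and the 
rules map (element name to a predicate) are hypothetical names, and 
ElementTree's iterparse stands in for a streaming or lazily built tree.

```python
import io
import xml.etree.ElementTree as ET


def first_failure(xml_bytes, rules):
    """Check elements as they complete and stop at the first failure.

    `rules` is a hypothetical map of element name -> predicate(element).
    Returns (failing_tag_or_None, elements_checked).
    """
    checked = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        checked += 1
        rule = rules.get(elem.tag)
        if rule is not None and not rule(elem):
            return elem.tag, checked  # terminate on fail: skip the rest
    return None, checked


RULES = {"amount": lambda e: float(e.text) >= 0}

# The very first amount is invalid, followed by 999 valid ones.
BAD = b"<batch><amount>-1</amount>" + b"<amount>5</amount>" * 999 + b"</batch>"
GOOD = b"<batch>" + b"<amount>5</amount>" * 1000 + b"</batch>"

print(first_failure(BAD, RULES))
print(first_failure(GOOD, RULES))
```

On the failing document only one element is checked before the function 
returns, while the valid document has to be read to the end. Note that 
iterparse reads its input in chunks, so on a small document the saving is in 
rule evaluation rather than in parsing.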

4) XSD implementations are not necessarily streaming, but may be random 
access. For example, my implementation of XSD by converting it to 
Schematron would use whatever the Schematron implementation used. Another 
example would be a validator that runs directly over data in a database 
without serializing it first.

5) Where the application that uses the XML requires a tree, the tree 
needs to be built even if you have streaming validation: so you aren't 
actually saving any tree construction time or space. In fact, since the 
PSVI has no standard XML form or standard streaming API form, I actually 
imagine that most uses of XSD actually result in a tree being built (or 
the data being entered into a DBMS): the point of the PSVI is to make 
extra information available for systems which are typically random 
access, keyed access or object trees (anything except streaming!)   

6) Where a document is not large, it is not certain that a streaming 
implementation of a validator using a modern language with automatic 
garbage collection will actually allocate or use fewer objects compared 
to a tree-building implementation.  And object-allocation-avoidance 
strategies such as a cross-thread pool of DOM objects can also benefit 
in-memory implementations just as much as streaming implementations. The 
size of documents limits the number of simultaneous processes more in the 
case of the tree-building implementation, but not necessarily the number 
of objects allocated.  (In fact, if the system is a validator, it may be 
that the event stream needs to be queued until validation has 
finished before passing it on to the application: this will limit 
opportunities for speed-ups due to less object allocation from pooling 
or singleton strategies, for example.)

The exception might be XSD validation used for firewalls. But when you 
look at, for example, the Lloyds London Market system, they validate 
incoming data using XSD for coarse-grain validation, then Schematron for 
fine-grain validation: non-streaming is not a bar for their documents.

7)  Where there is a resource constraint like a real-time constraint,  
benchmarking is ultimately the most objective way of determining 
performance. White-box knowledge of algorithms and implementation details 
may certainly give hints about behaviour, but they are just armchair 
hints that may vary with different implementations, schemas and input 
documents: an algorithm that has low overhead but scales explosively may 
give better performance than an algorithm that scales well but has high 
overhead. 

(For example, a system that takes 10 + n^2 has better performance than a 
system that takes 110 + 10n for n up to 16.  We know about XSLT engines 
that the slowest is at least 24 times slower than the fastest even for 
basic transformations, for example, so the constants could plausibly 
swamp the exponents.) 
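
The arithmetic behind that example can be checked with a few lines. The two 
cost formulas are the ones quoted above; the function names are just labels 
chosen for this sketch.

```python
def cost_explosive(n):
    """Low fixed overhead, quadratic growth: 10 + n^2."""
    return 10 + n ** 2


def cost_steady(n):
    """High fixed overhead, linear growth: 110 + 10n."""
    return 110 + 10 * n


def crossover(limit=1000):
    """Smallest n at which the quadratic system stops being cheaper."""
    for n in range(limit + 1):
        if cost_explosive(n) >= cost_steady(n):
            return n
    return None


n = crossover()
print(n, cost_explosive(n - 1), cost_steady(n - 1), cost_explosive(n), cost_steady(n))
```

For integer n the quadratic system stays cheaper through n = 16 (266 against 
270) and loses from n = 17 onwards (299 against 280), so for small inputs 
the constant term dominates the comparison.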

8) XSD schemas can be very large and verbose, with multiple files, and 
many internal checks of the schema components such as derivation by 
restriction and UPA. I see no reason to expect that  loading a large XSD 
schema with all that extra work would necessarily be more efficient than 
the effort in loading a RNC or Schematron schema.  Indeed, in XSD it is 
quite common that the schema is larger than the instance: even when 
there is a streaming validation, most of the processing is taken up with 
creating persistent objects for the schema.

Putting all these together, I certainly concede the following. Suppose 
you have a large document, a small memory, a small schema that is 
compiled and pre-loaded and has only a few assertions, and constraints 
that are only local. Suppose also that the document is thrown away after 
validation (with no PSVI, tree or stream passed on), that it is parsed 
from XML rather than coming in as a DOM, and that you want the 
validation to produce as complete an outcome as possible. Then, in the 
absence of benchmarking, you might reasonably suspect that a streaming 
implementation (whether XSD or RNG + Schematron) would be faster at going 
through an entire document to confirm that no errors exist than an 
in-memory implementation made with the same attention to memory issues. 

Added to that, I think there is tremendous scope for optimization of 
XPath and XSLT, and consequently Schematron. Michael Kay's optimization 
work in XSLT and XPath is interesting. There are a lot of fun 
possibilities for Schematron-specific optimization based on getting fast 
results (e.g. http://www.topologi.com/public/SchematronHeuristic.pdf) or 
optimization on tries and feature sets 
(http://broadcast.oreilly.com/2009/06/validation-using-tries-and-fea.html)

> 2) Isn't it the case that some of the complexities of XSD are that way
> to allow for that validation speed?
>   
Do you have an example?  (I imagine it causes some simplifications as 
well as some complexities.)

Cheers
Rick Jelliffe
