Re: [xml-dev] XML spec and XSD
- From: Rick Jelliffe <rjelliffe@allette.com.au>
- Date: Thu, 19 Nov 2009 18:12:53 +1100
Andrew Welch wrote:
> I mentioned the streaming aspect for 2 reasons:
>
> 1) If validation performance is an issue, can RNG + Schematron still
> be considered when XSD validation is so fast?
>
I think Andrew is identifying one of the basic flaws of XSD evolution:
an optimization controls the design.
But I don't believe the claim that XSD validation is necessarily faster
than RNG + Schematron, even at the poor state of optimization we have at
the moment.
1) There are many kinds of constraints that Schematron can do that XSD
1.1 assertions etc cannot do. XSD cannot be considered faster at things
that are completely out of its reach! On these, I can state that RELAX
NG and Schematron are absolutely faster than XSD :-)
2) An XSD assertion requires that a local XDM trimmed branch be
constructed. So the worst case for XSD is where all elements have an
assertion: the same amount of tree building would have to go on as for
building a full XDM tree for the typical non-streaming implementation of
Schematron. There would be a space saving, but not necessarily a speed
saving. (In fact, I think only the non-leaf elements would need
assertions to reach this same point.)
3) There are Schematron implementations with a terminate-on-fail
construct. (Ken Holman contributed this IIRC.) So where pass/fail
testing is required, these can be very fast in the amount of work they
need to do. Combine them with a streaming implementation or even a
lazily constructed DOM, and they certainly could be faster than an XSD
implementation that attempts to run over a whole large file.
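A fail-fast streaming check of this kind is easy to sketch. The following is a minimal illustration in Python, not a real Schematron implementation: `rules` and `validate_fail_fast` are hypothetical names, and the rule table stands in for compiled assertions.

```python
import io
import xml.etree.ElementTree as ET

def validate_fail_fast(xml_file, rules):
    """Stream through a document and stop at the first failed rule.

    `rules` maps an element tag to a predicate over that element; it is
    a stand-in here for compiled Schematron assertions.
    """
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        check = rules.get(elem.tag)
        if check is not None and not check(elem):
            return False   # terminate on first failure: no further work done
        elem.clear()       # discard the subtree: no full tree is retained
    return True

# Hypothetical rule: every <price> element must carry a currency attribute.
rules = {"price": lambda e: "currency" in e.attrib}

doc = io.StringIO('<order><price currency="AUD">10</price><price>20</price></order>')
print(validate_fail_fast(doc, rules))   # False: the second <price> fails
```

The point is only that a fail-fast streaming pass touches each node at most once and keeps no tree, so the work done is bounded by the position of the first failure.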
3a) A RELAX NG implementation that just provides a validation result
does much less work than an XSD implementation that produces a full
PSVI, too.
4) XSD implementations are not necessarily streaming, but may be random
access. For example, my implementation of XSD by converting it to
Schematron would use whatever the Schematron implementation used. Or a
validator that ran over data in a database directly without pickling it
first.
5) Where the application that uses the XML requires a tree, the tree
needs to be built even if you have streaming validation: so you aren't
actually saving any tree construction time or space. In fact, since the
PSVI has no standard XML form or standard streaming API form, I actually
imagine that most uses of XSD actually result in a tree being built (or
the data being entered into a DBMS): the point of the PSVI is to make
extra information available for systems which are typically random
access, keyed access or object trees (anything except streaming!).
6) Where a document is not large, it is not certain that a streaming
implementation of a validator, written in a modern language with
automatic garbage collection, will actually allocate or use fewer
objects than a tree-building implementation. And
object-allocation-avoidance strategies such as a cross-thread pool of
DOM objects can benefit in-memory implementations just as much as
streaming implementations. The size of documents limits the number of
simultaneous processes more in the case of the tree-building
implementation, but not necessarily the number of objects allocated.
(In fact, if the system is a validator, the event stream may need to be
queued until validation has finished before it is passed on to the
application: this limits the opportunities for speed-ups from reduced
object allocation via pooling or singleton strategies, for example.)
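That buffering point can be made concrete. Here is a minimal Python sketch, with a hypothetical per-event check (`event_ok`) standing in for a real streaming validator: because the application may only see events from documents that validate, every event gets queued until the end anyway.

```python
import io
import xml.etree.ElementTree as ET

def validate_then_deliver(xml_file, event_ok):
    """Queue streaming events until validation finishes, then hand them on.

    `event_ok` is a stand-in for a per-event validation check. Since the
    application must not see events from an invalid document, the whole
    stream is buffered, eroding the allocation savings of streaming.
    """
    queue = []
    for event, elem in ET.iterparse(xml_file, events=("start", "end")):
        if not event_ok(event, elem):
            return None        # invalid: the application sees no events
        queue.append((event, elem.tag))
    return queue               # valid: the buffered events can be replayed

doc = io.StringIO("<a><b/><c/></a>")
print(validate_then_deliver(doc, lambda ev, el: True))
```

The queue here grows in proportion to the document, just as a tree would, which is the trade-off described above.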
The exception might be XSD validation used for firewalls. But when you
look at, for example, the Lloyds London Market system, they validate
incoming data using XSD for coarse-grain validation, then Schematron for
fine-grain validation: non-streaming is not a bar for their documents.
7) Where there is a resource constraint such as a real-time constraint,
benchmarking is ultimately the most objective way of determining
performance. Whitebox knowledge of algorithms and implementation details
may certainly give hints about behaviour, but they are just armchair
hints that may vary with different implementations, schemas and input
documents: an algorithm with worse asymptotic growth but small constants
may well outperform an algorithm with better asymptotic growth but large
constants.
(For example, a system that takes 10 + n^2 has better performance than a
system that takes 110 + 10n for all n up to 16. We also know that among
XSLT engines the slowest is at least 24 times slower than the fastest,
even for basic transformations, so the constants could plausibly swamp
the growth rates.)
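The arithmetic in that cost example can be checked directly; a tiny Python sketch (the two cost functions are the ones above, the names are illustrative):

```python
def quadratic_cost(n):   # small constant, worse growth: 10 + n^2
    return 10 + n * n

def linear_cost(n):      # large constant, better growth: 110 + 10n
    return 110 + 10 * n

# First n at which the asymptotically worse system becomes more expensive.
crossover = next(n for n in range(1, 1000) if quadratic_cost(n) >= linear_cost(n))
print(crossover)   # 17: the quadratic system is cheaper for all n up to 16
```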
8) XSD schemas can be very large and verbose, with multiple files, and
many internal checks of the schema components such as derivation by
restriction and UPA. I see no reason to expect that loading a large XSD
schema with all that extra work would necessarily be more efficient than
the effort in loading a RNC or Schematron schema. Indeed, with XSD it is
quite common that the schema is larger than the instance: even when
there is streaming validation, most of the process is taken up with
creating persistent objects for the schema.
Putting all these together, I certainly concede that if you have a large
document, a small memory budget, a compiled and pre-loaded schema, a
small schema with only a few assertions, constraints that are only
local, a document that is thrown away after validation (with the PSVI,
tree or stream not passed on), input parsed from XML rather than coming
in as a DOM, and a requirement that validation report every error rather
than fail fast, then, in the absence of benchmarking, you might
reasonably suspect that a streaming implementation (whether XSD or RNG +
Schematron) would get through an entire document to confirm that no
errors exist faster than an in-memory implementation made with the same
attention to memory issues.
Added to that, I think there is tremendous scope for optimization of
XPaths and XSLT, and consequently Schematron. Michael Kay's optimization
work on XSLT and XPath is interesting. There are a lot of fun
possibilities for Schematron-specific optimization based on getting fast
results (e.g. http://www.topologi.com/public/SchematronHeuristic.pdf) or
optimization based on tries and feature sets
(http://broadcast.oreilly.com/2009/06/validation-using-tries-and-fea.html).
> 2) Isn't it the case that some of the complexities of XSD are that way
> to allow for that validation speed?
>
Do you have an example? (I imagine it causes some simplifications as
well as some complexities.)
Cheers
Rick Jelliffe