Thanks Michael. And congrats on Saxon doing so well: Saxon HE being just as fast as, or faster than, libxslt (ignoring JVM startup time) was one of the surprising results for me.
I agree that benchmarking the usual expected cases is just as important as benchmarking edge cases prone to blowouts. But I am not sure why you think we cannot add the numbers up (or, rather, take the difference between successive tests to estimate the time spent in each stage), given that the engines all seem to be single-threaded for XSLT 1.0 stylesheets and all of them run over completed DOMs rather than interleaved SAX events. I think it shows that for large documents, the cost of XML parsing is utterly dwarfed by the cost of the in-memory data structures and algorithms used for processing.
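Something like the following is what I have in mind (a rough sketch using lxml, which wraps libxslt; the file names are made up and this is not my actual harness). Because the stages run back to back in a single thread over a completed tree, timing each stage directly and differencing successive cumulative runs come to the same thing:

# Sketch only: illustrates why stage times can be separated by subtraction.
# Assumes lxml (libxslt) and hypothetical file names.
import time
from lxml import etree

def timed(label, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Stage 1: parse the source document into an in-memory tree.
doc = timed("parse", etree.parse, "large-input.xml")

# Stage 2: run the transformation over the completed tree. A "parse only"
# run and a "parse + transform" run differ by exactly this figure, since
# nothing overlaps or runs concurrently.
transform = etree.XSLT(etree.parse("stylesheet.xsl"))
result = timed("transform", transform, doc)

# Stage 3: serialise the result; again the next cumulative test differs
# from the previous one by this amount.
output = timed("serialise", etree.tostring, result)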
One of my motivations was having seen a company switch away from XSLT to another technology for efficiency reasons, as a result of benchmarking XSLT against the vendor's ETL product. But I understand they used Xalan-J for the benchmarks, and I suspected the vendor had gamed them to produce bad results.
I guess it is a fallacy of composition: Xalan-J does worse in our benchmarks than our product; Xalan-J is an XSLT engine; therefore all XSLT engines do worse in our benchmarks than our product. That is only correct if all XSLT engines have the same order of performance, and my little benchmarks demonstrate that they do not: so much so that one cannot evaluate "XSLT" performance at all, only the performance of particular engines.
(The same evaluation process also claimed that it was impossible to write rules in XSLT for unexpected paths, as if wildcards did not exist! Maybe there was more to it than was relayed to me, but on the face of it, it is rubbish.)
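For example, a catch-all rule is trivial. Here is a hypothetical sketch (again with lxml, stylesheet inlined; not taken from that evaluation) in which the wildcard template picks up any element the specific rules do not cover:

# Hypothetical sketch: XSLT 1.0 wildcard rules handling "unexpected paths".
from lxml import etree

stylesheet = etree.XML(b"""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Specific rule for an element we expect. -->
  <xsl:template match="order">
    <handled><xsl:value-of select="@id"/></handled>
  </xsl:template>
  <!-- Wildcard rule: any element not matched above still gets processed. -->
  <xsl:template match="*">
    <unexpected name="{name()}"><xsl:apply-templates/></unexpected>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
print(transform(etree.XML(b"<orders><order id='1'/><surprise/></orders>")))

In a real stylesheet the wildcard rule could log, copy through, or route the unexpected element; the point is that it is one template, not an impossibility.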