An interesting (to me, anyway) question is whether SIMD vector operations change the performance trade-offs of the various performance nasties: transcoding, normalization, good-name-character detection. (Clearly SIMD can help with transcoding between the UTF-*s; there are now dozens of solid methods.)
To consider one: Intel added some parallel range-checking string operations (PCMPISTRI) starting with SSE4.2 (in AVX too), specifically, it seems, to help with UTF-8 XML. The sweet spot seems to be sections of documents with lots of tokens and other rules.
In particular, these instructions allow you to read 16 bytes into a 128-bit vector, load 4 byte ranges into another, and run a comparison over all of them in one instruction, to see whether any bytes fall in any of those ranges.** (And then locate the index of the first one in, or out of, the ranges.) Or you can check for the locations of 8 byte values.
For example, suppose we are at the name of a start-tag in our UTF-8 stream:
* Using a PCMPISTRI instruction, we check that every character is ASCII or whitespace or a delimiter, except &, with as fine a granularity as possible given only 4 ranges.
* If so, then we bypass disallowed-character detection, normalization, reference substitution (we allow references anywhere), and name checking; we use another PCMPISTRI instruction, looking for >\s\t\r\d: , to find whether the name/prefix token ends within that 16-byte vector, and where.
* If not, we process up to the index of the first strange character, and then switch to the big-banana parsing. Actually, you shift in the new bytes and then apply more SIMD instructions starting there.
So it seems to me that some extended checks may not be complete performance killers, the more that any of the following hold:
1) they can be folded into other checks (e.g. folding different range checks into the 1-, 2-, and 3-byte conversions of UTF-8 to UTF-16);
2) they have broad tests to avoid checks in common runs known to be good;
3) they can be implemented with a single lookup into small tables only;
4) they can be folded into a state machine operating over vectors, synchronized starting at some milestone (such as an element start delimiter).
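As a sketch of 3, byte classification can be a single indexed load into a 256-entry table; the class codes below are invented for illustration:

```c
// Strategy 3 sketch: classify bytes with one table lookup rather than
// chained comparisons. Class codes are invented for illustration.
#include <stdint.h>

enum { C_OTHER = 0, C_NAME = 1, C_WS = 2, C_DELIM = 3 };

static uint8_t byte_class[256];  /* zero-initialized: everything C_OTHER */

static void init_classes(void)
{
    for (int c = 'a'; c <= 'z'; c++) byte_class[c] = C_NAME;
    for (int c = 'A'; c <= 'Z'; c++) byte_class[c] = C_NAME;
    for (int c = '0'; c <= '9'; c++) byte_class[c] = C_NAME;
    byte_class['-'] = byte_class['_'] = byte_class['.'] = C_NAME;
    byte_class[' '] = byte_class['\t'] = C_WS;
    byte_class['\r'] = byte_class['\n'] = C_WS;
    byte_class['<'] = byte_class['>'] = C_DELIM;
    byte_class['/'] = byte_class['='] = C_DELIM;
}
```

The inner loop then reads byte_class[(uint8_t)*p] once per byte, with no branching on the character value itself.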
For an example of 4, consider this kind of C-oid pattern (this uses jumping; you could use tables too, of course, or any other critical-loop approach):
reg int parse-position = 0;
reg int i;
state1: // e.g. check for reasonable characters
    vector V = load-vector(text-buffer[parse-position]);
    if (i = parallel-contains-ranges(V, range1, range2, range3, range4))
    {
        do-stuff(text-buffer, parse-position);
        parse-position += i;
        goto state2;
    }
    else if (i = parallel-contains-ranges(V, range5, range6, range7, range8)) // a different set of ranges
    {
        do-other-stuff(text-buffer, parse-position);
        parse-position += i;
        goto state3;
    }
state2: ... // e.g. look for delimiters and token end
These changes are also interesting because they expose a logical trap, or vicious cycle,**** that I think has afflicted much of XML for the last 15 years.
(In XML's case, as well as the external competitors of JSON etc., it has had an internal competitor in the cargo cult of XML+XSD: the future is XSD, nothing else matters. "You cannot do anything without types" was how it was expressed to me (by someone I do respect, please note): which is what we would now call cancelling, and perhaps gaslighting ;-) )
Regards
Rick
HYPERFOOTNOTES:
* It seems it is not a priority for Intel/AMD to implement them as fast as they could, however. I think this is because the uptake was poor, and I think the uptake was poor because XML FOSS decision-makers are complacent. For example, Rob Cameron's team reimplemented Xerces in SIMD with tremendous results: how much has flowed back into Xerces?
And XML FOSS decision-makers are complacent presumably because, in their minds, performance is an issue for JSON (and EXI?) only: they want to run out their investment in the XML code, not improve it; or perhaps they don't want to adopt it because they won't have capable resources to maintain it***: not a silly reason at all, but regrettable.
This might improve if, for example, the C-based FOSS projects adopt OPM, or when Java adds better vectors (work is currently underway). People who don't like intrinsic functions (i.e. SIMD assembler calls exposed almost directly as functions in the language) may find annotations and language features more acceptable.
** These PCMPESTRI operations are not the best choice if you are not using all 4 ranges, or you are not using bytes, or you need to look at chunks as large as possible; in those cases other SIMD instructions are better. And it depends on the CPU design too. But they should still be better than doing it using non-SIMD instructions.
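For instance, when you only need to find one target byte, a plain SSE2 compare-and-movemask is the usual alternative (a sketch assuming GCC/Clang for __builtin_ctz):

```c
// Alternative to PCMPxSTRx for a single byte value: SSE2 compare plus
// movemask, which also scales up to wider AVX2/AVX-512 registers.
#include <emmintrin.h>  /* SSE2 */

/* Returns the index of the first '<' in the 16-byte chunk, or 16 if none.
   __builtin_ctz (count trailing zeros) is GCC/Clang-specific. */
static int find_lt(const char *chunk)
{
    const __m128i needle = _mm_set1_epi8('<');
    const __m128i data = _mm_loadu_si128((const __m128i *)chunk);
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(data, needle));
    return mask ? __builtin_ctz(mask) : 16;
}
```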
*** For example, when a SIMD-optimization PR was offered to picohttpparser, the maintainer's first reason for declining was:
"I am afraid I would not be capable of maintaining the code. The approach used in the PR is a big leap from the string operations that the non-SIMD version of picohttpparser and the SIMD-optimized version based on PCMPxSTRx instructions use. And personally I don't have any practical issue with the level of performance achieved by the current implementation of picohttpparser."
But the ultimate official reason**** was:
"We do not have a plan to merge this. HTTP/1 parsing performance is
becoming less issue, as high volume traffic moves to HTTP/2 and /3. We
think that the cost of maintaining the added complexity is not worth the
performance benefit."
**** So the reasons above*** are a logical trap: we don't speed up our technology because there are other technologies that are faster, but the other technologies are faster because we don't speed up our technology. It is exactly what we hear of XML now, and it should be a canary in the coal mine. It is a version of the DevOps trap: software under development has momentum to get optimized; software in release has inertia against change. Which kinda means the idea of "get it working, then optimize" is, in some DevOps-ish environments, completely bogus: maybe you need to design for a small amount of immediate optimization, perhaps just one round, and then that is it. If the hype cycle doesn't get you, the DevOps cycle may. (I think this reactionary dynamic is not strong where there is an organization with just one product: e.g. Saxonica.)