Hi Rick,
- I took the time to write down a little post
Neat paper!
Ta.
- It would allow in-place parsing. This is perhaps the
big thing that prevents various kinds of optimized XML parsers
What is “in-place parsing”?
It is where, once you have read your text into a memory buffer, you do not need to allocate extra buffers in order to parse it in strict sequence and present it to applications whole.
For example, say we have this entity declaration:
<!ENTITY XXX "<X>hello</X> world" >
and our document has
<Aaa>two: &XXX;!!! &XXX;!!!</Aaa>
A parser reads all this into a buffer (remember, we are before any SAX kind of API). Our application wants to count how many "world!!!" strings are in Aaa using a regex matcher. So somewhere between the incoming text block and the regular expression matcher, the entity references need to be resolved and substituted.
In a conventional parser, we would read the document, transfer the characters over to a new buffer, and substitute the entities on the fly. (It may try various optimisations, such as pre-scanning for entity references first and, if there are none, passing back a pointer to a slice of the original buffer.) (It may punt and provide several text sections, as SAX implementations may: but then the substitution will need to be done by some later stage.)
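To make the contrast concrete, here is a minimal sketch in C of the conventional approach: every character crosses into a second buffer so the entity text can be spliced in. (The lookup_entity() helper is hypothetical, and I assume every reference found is a known one.)

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: returns the replacement text for an entity
       name of the given length (assumed here to always succeed). */
    const char *lookup_entity(const char *name, size_t len);

    /* Conventional approach: copy everything into a new buffer, because
       general entity text may be longer than the reference it replaces. */
    char *expand_entities(const char *in, size_t n) {
        char *out = malloc(n * 8 + 1);       /* crude worst-case guess */
        size_t o = 0;
        for (size_t i = 0; i < n; ) {
            if (in[i] == '&') {
                size_t j = i + 1;
                while (j < n && in[j] != ';') j++;
                const char *rep = lookup_entity(in + i + 1, j - i - 1);
                size_t rlen = strlen(rep);
                memcpy(out + o, rep, rlen);  /* splice in the entity text */
                o += rlen;
                i = j + 1;
            } else {
                out[o++] = in[i++];          /* every other byte is copied too */
            }
        }
        out[o] = '\0';
        return out;
    }

The point is the copy: even a document with no references at all pays for the second buffer, unless the parser adds a pre-scan.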
- For several decades I have dabbled with methods to speed up parsing UTF-8 and XML using SIMD and parallel parsing: my conclusion is that the approach I am suggesting here is the only feasible way for XML to not be sidelined as slow and complex. I think the lack of papers and experience demonstrating otherwise indicates it too.
SIMD = Single Instruction, Multiple Data?
This website [1] defines SIMD this way:
Single Instruction, Multiple Data (SIMD) units refer to hardware components that perform the same operation on multiple data operands concurrently. Typically, a SIMD unit receives as input two vectors (each one with a set of operands),
performs the same operation on both sets of operands (one operand from each vector), and outputs a vector with the results.
Would you provide a description of how to process an XML document in parallel? The website says "perform the same operation on multiple operands concurrently". When would you want to do this with XML? Would you provide an example, please?
25 years ago, CPUs with multiple cores, and cores which support SIMD operations, were the exception. For 15 years now they have been the rule. Indeed, the CPU makers have attempted to prolong Moore's law by this method. In fact, the processing power available to a single-threaded, non-SIMD binary has often gone down: individual threads need to run slower because of heat.
The issue of how to take advantage of parallelism has been the dominant factor in computer design for the last 20 years. There are 4 basic approaches:
drop into assembler code, perhaps with some syntactic sugar (C and its "intrinsic functions");
figure out how to make existing compilers generate more parallel code (Java, C#, XSLT);
provide explicit statements for parallelism in the language (often as annotations, notably OpenMP);
adopt a different paradigm altogether (e.g. GPUs such as NVIDIA's).
All of these work best when you have a continuous run of data that you can plough through avoiding branching decisions. There is a lot of work on finding branchless versions of algorithms.
So there are basically 3 different kinds of parallelism exploitable in CPUs:
multicore/multithread, where you divide your application into separate parts that can be run on different cores at the same time;
data-parallel like GPUs, where you run hundreds of copies of exactly the same program with each copy starting at a different point in a data array;
SIMD, where the CPU lets you take a small slice of an array of data and run one operation on all of it at once: e.g. instead of 4 sequential operations to add 1 to 4 successive integers (read, add, write = 3 operations each, x 4 = 12 operations in total), the core reads in all 4 integers at once and performs the add on all of them (read, add, write all 4 = 3 operations in total). A sketch of this case in C follows.
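To illustrate that last case concretely, here is the add-1 example written with the SSE2 intrinsics available from C (a minimal sketch; a good compiler will often auto-vectorize the scalar loop by itself):

    #include <emmintrin.h>   /* SSE2 intrinsics */

    void add_one_scalar(int a[4]) {
        for (int i = 0; i < 4; i++)    /* 4 x (read, add, write) = 12 ops */
            a[i] += 1;
    }

    void add_one_simd(int a[4]) {
        __m128i v = _mm_loadu_si128((__m128i *)a);   /* read all 4 ints */
        v = _mm_add_epi32(v, _mm_set1_epi32(1));     /* add 1 to all 4 at once */
        _mm_storeu_si128((__m128i *)a, v);           /* write all 4 back */
    }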
Furthermore, there is the constraint that modern CPUs rely so heavily on caching and pre-fetching that "data locality" is a big issue for performance. So older parsers/processors not built for locality will likely not take advantage of that performance (and, indeed, may perform worse than they used to.) (For an example of this, note how SAXON's recent TinyTree implementation avoids "arrays of structs" in favour of "structs of arrays".) It has been noted that in some cases data locality can be as important for high performance as choosing a great algorithm.
What might a parallel parser for XML look like? (There have been many papers on this: from IBM and Simon Fraser University, for example.)
We read our document into a buffer. We run, say, 8 parallel processes at the same time, each on 1/8 of the document. Each process reads 128 bytes at a time using SIMD and checks all of them for the only delimiters we care about: < and & and =, and perhaps for bytes or words with illegal control characters. We memoize the positions in simple dynamically-resizing arrays. By counting them, we know the maximum bound of the arrays needed for our DOM, which we can then allocate.
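Here is a minimal sketch of that delimiter scan in C, using 16-byte SSE2 registers for brevity (the 128-bytes-at-a-time figure would come from wider registers or an unrolled loop; the push() growable-array helper is hypothetical):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Hypothetical helper: append a position to a growable array. */
    void push(size_t *arr, size_t *len, size_t pos);

    /* One worker's scan over its 1/8 slice of the buffer: find every
       '<', '&' and '=' without parsing anything, 16 bytes per step. */
    void scan(const char *buf, size_t n,
              size_t *lt,  size_t *lt_n,    /* positions of '<' */
              size_t *amp, size_t *amp_n,   /* positions of '&' */
              size_t *eq,  size_t *eq_n) {  /* positions of '=' */
        const __m128i LT  = _mm_set1_epi8('<');
        const __m128i AMP = _mm_set1_epi8('&');
        const __m128i EQ  = _mm_set1_epi8('=');
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
            /* one compare per delimiter; each set bit in a mask marks a hit */
            int mlt  = _mm_movemask_epi8(_mm_cmpeq_epi8(v, LT));
            int mamp = _mm_movemask_epi8(_mm_cmpeq_epi8(v, AMP));
            int meq  = _mm_movemask_epi8(_mm_cmpeq_epi8(v, EQ));
            if ((mlt | mamp | meq) == 0)
                continue;                   /* the common case: nothing here */
            for (int b = 0; b < 16; b++) {
                if (mlt  & (1 << b)) push(lt,  lt_n,  i + b);
                if (mamp & (1 << b)) push(amp, amp_n, i + b);
                if (meq  & (1 << b)) push(eq,  eq_n,  i + b);
            }
        }
        /* the last n % 16 bytes are checked with a plain scalar loop, elided */
    }

Most blocks produce all-zero masks, so the inner bit loop is rarely entered: no branching on XML context, just a straight ride through the data.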
Then we can parse the document a little further: we again take our 8 worker processes, each of which processes 1/8 of the < array and looks at the subsequent character for /, ? or !
(When any neighbouring subsequences are finished, you tidy up any mess or confusion from the division.) Determining these lets us build our basic DOM structure in the arrays, whose indexes take much less space than pointers (and so are faster to iterate through.)
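A sketch of that second pass, in the same vein (it consumes the lt array produced by the scan above; the kind codes are made up for illustration):

    #include <stddef.h>

    /* Classify each '<' by the single byte that follows it. Each of the
       8 workers runs this over its own 1/8 slice of the lt[] array. */
    enum TagKind { START_TAG, END_TAG, PI, BANG /* comment, etc. */ };

    void classify(const char *buf, const size_t *lt, size_t lt_n,
                  unsigned char *kind) {   /* one entry per '<' */
        for (size_t i = 0; i < lt_n; i++) {
            switch (buf[lt[i] + 1]) {
            case '/': kind[i] = END_TAG;   break;
            case '?': kind[i] = PI;        break;
            case '!': kind[i] = BANG;      break;
            default:  kind[i] = START_TAG; break;
            }
        }
    }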
Note that now we are only looking at delimiters we know are tags, not at any characters in between, and without switching to a different parsing mode (e.g. for CDATA sections). Because < is never used in character data in this version of XML (no CDATA sections, and none in attribute values), there is never any context to cause branching in the parsing (e.g. an entity causing branches in the parse, or loss of data locality).
At this stage, we have a workable DOM, but it is a lazy DOM: we have not actually parsed the data or attributes or namespaces.
(But because, say, we don't allow namespace defaulting or redefinition, we actually have effective namespacing already: if element names are different, they are different names: no need for contextual xmlns: attributes in order to allocate names to namespaces. In fact, we can launch some background thread to go through the document and complete namespace processing, ignoring the fact that the other attributes have not even been parsed: it can iterate over the = array, so no parsing is involved except finding any preceding xmlns etc.)
So this DOM has lazy parsing (which I believe MSXML used to an extent): when you want to access some text of an element or attribute, you know from the & array whether it has any entity reference in it (because a character position in the raw data is always the character position in the document). If it has, you substitute the entity text in place (and pad with zeroes or shuffle the remaining text.) Because the entity values are never longer than the references, there is no need to allocate extra buffer space for the resolved text.
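A minimal sketch of that in-place substitution, assuming only the five predefined entities (whose values are always shorter than their references) and shuffling the tail rather than padding:

    #include <string.h>

    /* In-place substitution: because each value is shorter than its
       reference, the text only ever shrinks. We shift the tail left
       with memmove and return the new length; a real implementation
       might pad with filler instead of shifting. */
    size_t substitute_in_place(char *text, size_t n) {
        static const struct { const char *ref; size_t rlen; char val; } tab[] = {
            { "&amp;", 5, '&' }, { "&lt;", 4, '<' }, { "&gt;", 4, '>' },
            { "&quot;", 6, '"' }, { "&apos;", 6, '\'' },
        };
        for (size_t i = 0; i < n; i++) {
            if (text[i] != '&') continue;
            for (size_t t = 0; t < 5; t++) {
                size_t rlen = tab[t].rlen;
                if (i + rlen <= n && memcmp(text + i, tab[t].ref, rlen) == 0) {
                    text[i] = tab[t].val;   /* drop the value in place */
                    memmove(text + i + 1, text + i + rlen, n - (i + rlen));
                    n -= rlen - 1;
                    break;
                }
            }
        }
        return n;
    }

Note there is no second buffer anywhere: the raw input buffer is the document.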
Now at this stage, if your input data was UTF-8 and your API expected UTF-16, you would indeed have to copy over the buffer (and perhaps allocate a new one). But the job of transcoding from UTF-8 to UTF-16 can use SIMD methods. About 18 years ago, on my old ORVILLE blog, I showed a trivial SIMD speed-up to skip ASCII substrings in chunks of, I think, 16 bytes at a time. And there are now a few state-machine-based SIMD approaches that are lightning fast too.
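For the ASCII fast path, a minimal SSE2 sketch (little-endian UTF-16 is assumed; the scalar fallback for multi-byte sequences is elided):

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Widen runs of ASCII from UTF-8 to UTF-16, 16 bytes per step.
       _mm_movemask_epi8 gathers the high bit of each byte: a zero mask
       means all 16 bytes are ASCII and need no decoding at all. */
    size_t transcode_ascii_runs(const uint8_t *in, size_t n, uint16_t *out) {
        size_t i = 0, o = 0;
        while (i + 16 <= n) {
            __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
            if (_mm_movemask_epi8(v) != 0)
                break;                      /* non-ASCII byte: fall back */
            __m128i zero = _mm_setzero_si128();
            /* interleave with zero bytes to widen to UTF-16 code units */
            _mm_storeu_si128((__m128i *)(out + o),     _mm_unpacklo_epi8(v, zero));
            _mm_storeu_si128((__m128i *)(out + o + 8), _mm_unpackhi_epi8(v, zero));
            i += 16; o += 16;
        }
        /* scalar UTF-8 decoding for the remainder, elided */
        return o;
    }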
This is just one approach: there are many others, and all the stages may have alternatives or optimizations. I am just trying to give a concrete example. Some of these stages could be done in current XML in theory, but the reality is that it is too hard or complex.
So let's take this and compare it to a current worst-case XML parse to DOM, with UTF-8 input and an API using 16-bit characters. Our application is a simple one: we want to find whether the last element in the document has the value "Continue", meaning our application should ask for a follow-up document or whatever. Say we have a million-byte ASCII file.
In the current XML, we may transcode all the UTF-8 into UTF-16 first. We then go over every character in sequence, parsing and checking each one. As we go we probably have to reallocate the arrays for our DOM several times, or perhaps allocate hundreds of thousands of discrete objects. Finally we navigate down the tree.
Contrast this with the thought-parser: the delimiter scan takes as little as 1/8 x 1/128th of the time of doing it sequentially. We parse into our element tree without requiring multiple buffer allocations (and with sequential access that fits in with pre-fetch behaviour.) We use the DOM to find that last element, and only then see if it has entity (or numeric character, same deal) references and convert it to UTF-16. We have not needed to parse or dereference any attributes or any other text.
(And whatever well-formedness checking we retain from XML that has not been done on the fly, we can do in a background process, and so make it available before any final commits by the application, if needed.)
(Now, I am aware that with virtual servers, getting guaranteed access to multiple threads is not automatic. And that sometimes you care more about the total time that processing a batch takes than about how fast a single document from a queue can be dealt with.)
So, to summarize: if we start with a highly parallelized system like this in mind, and work our way back to what XML needs to ditch to make it tractable, we get a completely different set of features to ditch than minimalist "elegance" would suggest.
There is no need to remove entity references where the entity text is no longer than the reference, and the implementation is trivial (given that they are predefined, so you don't need a DOCTYPE, and that you are already looking for numeric character references.) Similarly, there is no real cost in supporting comments or PIs, since we already look at what comes after < anyway.
Instead, to get the parsing speed, we turn "state into streams".
I hope this gives some idea of the whats, hows and whys. Obviously the kind of system above is something you would use modern C, C++, or Rust to implement, as the parallel operations are hit-and-miss in the Java/C# ecosystems, I gather.
Regards
Rick
[1] https://www.sciencedirect.com/topics/computer-science/single-instruction-multiple-data
"For several decades I have dabbled with methods to speed up parsing UTF-8 and XML using SIMD and parallel parsing: my conclusion is that the approach I am suggesting here is the
only feasible way for XML to not be sidelined as slow and complex. I think the lack of papers and experience demonstrating otherwise indicates it too.)"
(Here in Sydney we are in lockdown again: after an isolated year of almost no cases, Delta broke through, and we are trying to eliminate it. Taiwan successfully eliminated it this month, so maybe we will: elimination is a feasible strategy on islands, rather than just suppression. I get my 2nd vaccine tomorrow.)