Yes, I am very familiar with the XML Screamer paper, a big fan, and it is one of the primary studies that informed many of the points I am making.
Let's note what they leave out:
XML Features Not Supported
- 1. DTD external or internal subsets. Note that external subsets are optional in the XML Recommendation, but internal subsets are required [21]. In this respect XML Screamer is non-conforming.
- 2. Support for encodings other than UTF-8 or UTF-16. Our architecture is in principle capable of supporting other encodings, but because our parsers are hand crafted to optimize for the characteristics of particular encodings, the work involved is significant.
- 3. Very large instance documents, i.e. those too large to fit in a contiguous memory buffer.
XML Schema Features Not Supported
- 4. Facets on simple types (these are accepted but not checked; among the facets not checked is the pattern facet.) [Note 5]
- 5. Non-deterministic content models (such as certain models with nested numeric occurrence constraints.) [Note 6]
- 6. Identity constraints (accepted but not checked)
- 7. Validity checking of types other than anySimpleType, date, integer, decimal, nonNegativeInteger, boolean, positiveInteger, negativeInteger, nonPositiveInteger, and string (all types are accepted, but validation of lexical forms and conversion to binary is available only for the listed types.)
In previous decades, as I read these kinds of papers and reached these omission sections, my reaction was to doubt that the method was in fact practical. (Benchmarking against Xerces, the slowest parser, did not help either: how it compares to MSXML would be more compelling.)
I understand that a research paper has finite resources, but it seemed to me that the omissions were often not arbitrary but reflected genuine pain points. And that there was a pattern to them.
And that made me think: is this actually a good rational basis for enhancing XML? Instead of seeing the omissions (in this and most other papers) as implementation flaws, if not academic lapses, are they really "telling us" that those features are roadblocks which prevent or dilute many different implementation approaches? Not drowning, waving.
Hence my starting proposal also follows items 1 and 2, and it also (I think) draws from item 3 the idea that we want to avoid anything that prevents in-place parsing of a contiguous buffer. (It also only validates some simple primitive types.)
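
To give the flavour of that restriction, here is a minimal sketch, in C, of what "validating some simple primitive types in place" could look like. The function name parse_xs_integer is hypothetical, not from the paper or my proposal; it checks the lexical form of an xs:integer and converts it to binary directly from a slice of the parse buffer, with no copying and no allocation:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: validate and convert an xs:integer lexical form
 * straight out of a contiguous parse buffer. Assumes whitespace has
 * already been collapsed by the caller. */
static bool parse_xs_integer(const char *p, size_t len, int64_t *out)
{
    size_t i = 0;
    bool neg = false;
    if (i < len && (p[i] == '+' || p[i] == '-'))
        neg = (p[i++] == '-');
    if (i == len)
        return false;              /* a bare sign is not a valid integer */
    int64_t v = 0;
    for (; i < len; i++) {
        if (p[i] < '0' || p[i] > '9')
            return false;          /* reject anything non-numeric */
        v = v * 10 + (p[i] - '0'); /* overflow handling omitted for brevity */
    }
    *out = neg ? -v : v;
    return true;
}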
However, the problem of entity references forcing extra buffer allocations, etc., isn't necessarily so: if a parse method can support Numeric Character References, then it can also support general entity references with the same characteristics, i.e. ones that contain no tags or references (CDATA entities) and that do not expand to more characters than the reference itself... e.g. the standard entity sets.
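
To make that concrete, here is a minimal sketch, again in C and again with a hypothetical name (expand_predefined, covering only the five predefined XML entities): because each replacement is no longer than its reference, a read cursor and a write cursor over the same buffer suffice, exactly as for NCRs.

#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: expand the five predefined XML entities in place.
 * Each replacement ("<", ">", "&", "'", "\"") is shorter than its
 * reference ("&lt;" etc.), so the expanded text never outgrows the
 * buffer and no extra allocation is needed -- the property NCRs have. */
static size_t expand_predefined(char *buf, size_t len)
{
    static const struct { const char *ref; char ch; } tab[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&apos;", '\'' }, { "&quot;", '"' },
    };
    size_t r = 0, w = 0;                 /* read and write cursors */
    while (r < len) {
        if (buf[r] == '&') {
            int matched = 0;
            for (size_t i = 0; i < sizeof tab / sizeof tab[0]; i++) {
                size_t n = strlen(tab[i].ref);
                if (r + n <= len && memcmp(buf + r, tab[i].ref, n) == 0) {
                    buf[w++] = tab[i].ch; /* write shrinks, never grows */
                    r += n;
                    matched = 1;
                    break;
                }
            }
            if (matched) continue;
        }
        buf[w++] = buf[r++];
    }
    return w;                            /* new (shorter or equal) length */
}

int main(void)
{
    char text[] = "a &lt; b &amp;&amp; b &gt; c";
    size_t n = expand_predefined(text, strlen(text));
    printf("%.*s\n", (int)n, text);      /* prints: a < b && b > c */
    return 0;
}

An NCR decoder would slot into the same loop, since in UTF-8 a reference like &#x10FFFF; is always at least as long as its expansion; and the same holds for any CDATA entity whose replacement text is no longer than its reference.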
Hoping you are well,
Rick