Re: ***SPAM*** [xml-dev] Re: The Goals of XML at 25, and the onething th

Hi Liam too!

On Thu, 22 Jul. 2021, 07:42 Liam R. E. Quin, <liam@fromoldbooks.org> wrote:

On Wed, 2021-07-21 at 14:29 +1000, Rick Jelliffe wrote:
> If I were to make up some scope goal for an evolution of XML I would
> say:

This isn't the right question. Since what you are describing isn't XML,
you need to ask instead: what could we design to replace XML as an
offering in places where XML isn't today the best choice?

Fair enough.

Many people on this list (and elsewhere) have changes they'd like to
see. Get rid of CDATA sections, remove the XML Declaration & allow
multiple top-level elements; remove entities; remove mixed content, or
mark it syntactically; allow overlap; much more.

Yes.

If I can put it in too-simplistic Kraftwerkian terms:

The design philosophy of SGML (and XSD!) was maximal maximalism: if anyone important needs a feature, add it.

The design philosophy of XML was minimal maximalism: if many people may need a feature, keep it but ditch everything else.

The design philosophy of the various simplified XMLs was minimal minimalism: if even one person doesn't absolutely need a feature, remove it.

The appropriate design philosophy I am suggesting wrt XML features is maximal minimalism: ONLY remove features if they compromise the technical criteria (e.g. the parallelizability). (But also add new features, because merely giving an alternate syntax for JSON is a step backwards, not forward: noise not signal.)

One reason I've found that people say XML databases are slow is that
they believe the database parses the entire XML database from disk for
each query. I ask them if they think a relational database does the
same with comma-separated value files for each table, at which point
enlightenment usually dawns.

Sure.

But after acknowledging that, we could ask what extra features might be needed in a fairly efficient-to-load schema-less load/dump text-format for a semi-structured DBMS. For example, would building in xml:id reduce the gap? Would defining a PI for indexes, or for alternate root locations, etc, be helpful? Would allowing multiple roots help? Would an implied top root (implied by the parser) with attributes from the MIME header be useful for conveying storage metadata?
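
To make two of those questions concrete, here is a rough Python sketch of an implied top root plus built-in xml:id handling. It is only an illustration under stated assumptions: the wrapper name "implied-root", the parse_multi_root helper and the metadata mapping (standing in for MIME header fields) are all hypothetical, and a real dialect would have to define them. It also assumes the dump carries no XML declaration or DOCTYPE, since those could not be wrapped this way.

```python
import xml.etree.ElementTree as ET

# Attribute key that ElementTree reports for xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def parse_multi_root(text, metadata=None, wrapper="implied-root"):
    """Parse a dump with several top-level elements under a synthetic root.

    Hypothetical helper: the wrapper name and the metadata mapping are
    assumptions made for this sketch, not part of any specification.
    """
    root = ET.fromstring("<{0}>{1}</{0}>".format(wrapper, text))
    # Storage metadata (e.g. taken from a MIME header) becomes attributes
    # on the implied root.
    for name, value in (metadata or {}).items():
        root.set(name, value)
    return root


def build_id_index(root):
    """Map each xml:id value to its element, as a loader might on import."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}


dump = '<rec xml:id="a1">first</rec><rec xml:id="a2">second</rec>'
root = parse_multi_root(dump, {"content-type": "application/xml"})
index = build_id_index(root)
print(len(root), index["a2"].text)
```

The point of the sketch is that the loader gets its ID index in the same pass as the load, without a schema, which is the kind of gap-narrowing asked about above.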

Calling XML a Document Interchange Format might be a win in that
regard.

If you just want faster parsing, look at the work done at Intel on
parallel parsing, and also at reduced-entropy compressed parse-event
streams (EXI).

Yes, I know Intel's work, it is one of the bases of my opinion.

Let's look at https://software.intel.com/content/www/us/en/develop/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4.html for example. They give SIMD rules for checking data and attribute content in parallel, because range checking takes 60% of the loop. They do the checking before entity expansion, but find a complexity because of different modes:

Their terminology is wrong, but their comment is useful:

"XML documents are made up of storage units called entities, like Character Data, Element, Comment, CDATA Section, etc. Each type of entity has its own well-formed definition that is a serious of character range rules.[1] The main work of Intel XML parsing is to recognize these entities and their logic structures."

How do they handle these checks to take advantage of SIMD? By simplifying them: I *think* they perform only one check, on only the leading byte of any UTF-8 sequence.

So this provides some concrete information on what the shape of a simplification of rules might be: reduce the number of different modes of allowed characters in different contexts. Or some other conclusion.
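
To show what such a simplification gives up, here is a rough Python sketch comparing a full per-character check with a leading-byte-only check. It is not Intel's code and it is scalar, not SIMD. The particular leading-byte test (reject C0 controls other than tab, LF and CR) is my assumption about what a single check could look like; only is_xml_char follows a real rule, the XML 1.0 Char production.

```python
def is_xml_char(cp):
    """XML 1.0 Char production, tested on one code point."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def full_check(data):
    """Decode the UTF-8 bytes and range-check every character."""
    return all(is_xml_char(ord(ch)) for ch in data.decode("utf-8"))


def leading_byte_check(data):
    """Assumed simplification: one test, on leading bytes only.

    Continuation bytes (10xxxxxx) are skipped, so no decoding is needed.
    """
    for b in data:
        if b & 0xC0 == 0x80:
            continue  # continuation byte, never inspected
        if b < 0x20 and b not in (0x09, 0x0A, 0x0D):
            return False  # disallowed control character
    return True


plain = "caf\u00e9\n".encode("utf-8")
control = b"a\x01b"
nonchar = "\uffff".encode("utf-8")  # U+FFFF is outside Char

for sample in (plain, control, nonchar):
    print(sample, full_check(sample), leading_byte_check(sample))
```

The two checks agree on the first two samples and disagree on the third: U+FFFF passes the leading-byte test because its lead byte is 0xEF. That is the sort of rule a new dialect would have to either drop or accept as a slower path.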

Now you may be surprised, as no-one, I think, was more vociferous about keeping XML as a text format (no invisible characters) and against non-human-readable names, though I ultimately thought that XML 1.1's simplifications (do I have my specifications right?) were prudent, except for the fatal IBM space. Humans being able to read it is critical for XML.

So why am I suggesting it here, letting my new machine masters have their soulless way with us? Because we already have XML: character ranges should be a new decision in a new dialect in order to open new doors, to occupy current sweet spots and to find new niches.

(Note: this paper kinda suggests that banning literal use of delimiters in modes, such as by removing CDATA sections and allowing references in PIs, comments etc., could eliminate modal checks for inappropriate milestone-delimiter-character use, as well as the need for the modal parsing for milestone-delimiter-characters that I have mentioned before.)

If you read http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html you can see their method only makes sense if there is no CDATA section containing a <. It is not one of their exception cases. I would use them as an example: if you read most of the parallelization papers, they come unstuck on that case.

Now, sure, there could be exception handling or better handling, but e.g. having to restart parsing is extra coding, and fights against the point.
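
To make the CDATA problem concrete, here is a rough Python sketch. It assumes a chunk scanner that treats every < as a candidate markup start, which is my simplification and not the algorithm from the paper; both function names are hypothetical. For brevity it also ignores comments and PIs, which can contain a literal < in the same way.

```python
def naive_markup_starts(text):
    """Offsets of every '<', as a context-free chunk scanner would see them."""
    return [i for i, ch in enumerate(text) if ch == "<"]


def modal_markup_starts(text):
    """Offsets of '<' that really begin markup, skipping CDATA content.

    This needs to know whether it is inside a CDATA section, so it cannot
    start cold in the middle of a document the way the naive scan can.
    """
    starts = []
    i = 0
    while i < len(text):
        if text.startswith("<![CDATA[", i):
            starts.append(i)
            end = text.find("]]>", i)
            if end == -1:
                break  # unterminated section: nothing more to trust
            i = end + 3
        elif text[i] == "<":
            starts.append(i)
            i += 1
        else:
            i += 1
    return starts


doc = "<a><![CDATA[1 < 2]]></a>"
naive = naive_markup_starts(doc)
modal = modal_markup_starts(doc)
false_starts = sorted(set(naive) - set(modal))
print(naive, modal, false_starts)
```

The naive scan reports one extra start, the < inside the CDATA section. A worker handed a chunk beginning there has no way to know it is looking at character data, which is where the restart or exception handling comes in.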

(some specific comments below)

I'll comment separately.

Hoping you are well,

Rick