Re: [xml-dev] Re: Is XML only half finished? The X Refactor

On Mon, Feb 12, 2018 at 11:02 PM, Cecil New <cecil.new@gmail.com> wrote:

Should make this a Google doc so everyone can view and comment on it!

On Sun, Feb 11, 2018, 10:35 PM Rick Jelliffe <rjelliffe@allette.com.au> wrote:
Ooops, format got lost. Try again.

Where is XML thriving?

* Industrial document production using semantic markup: the traditional SGML market of high complexity, multi-publishing, and long life span.
* Desktop office formats using XML-in-ZIP: ODF and OOXML
* Data that has mixed content
* Data that absolutely needs schema validation as it goes through a network of processing: such as HL7

Where is XML not thriving?
* SGML on the Web
* Ephemeral data interchange (JSON wins)
* Configuration files (property files/JSON/YAML/etc)
* Minimal markup (markdown wins)
* Where the complexity of XSD etc does not provide a targeted enough bang per buck.

What have been the pain points for XML over the years?
* Inefficiency. Processing pipelines often need to reparse the document. All processing is done post-parse, and consequently XML pipelines are unnecessarily heavyweight.
* It is not minimal. An even tinier version might be faster, easier to implement, neater.
* Maths users need named entities.
* Domain-specific syntaxes show that terseness should not be dismissed as being of minimal importance.
* DTDs were simplified and made optional on the expectation that a downstream process could then do the validation: however, only Schematron has a standard format for reporting types (linked by XPath) and errors. There is no standard XML format for the XSD PSVI.
* Namespaces are crippled because they are not versioned, and processing software does not support wildcarding or remapping. Consequently, changing the version of the namespace breaks software.
* XML element or attribute names, PIs and comment strings that contain non-ASCII characters may be corrupted by transcoding out of Unicode.
* A bit too strict on syntax errors.
* XSD tried to elaborate every possible use of parameter entities and tease them out into separate facilities. It did not reconstruct several major ones, notably conditional sections. This has the consequence of reducing XML's efficiency as a formatting language.
* XInclude only partly reconstructed the functionality of general entities.
* The XML specification sections on how to represent "<" in entity declarations give me a headache.
* Little domain-specific languages have not gone away: we have dates, TeX maths, URLs, JSON, CSV, and so on.
* XSLT is becoming more complex and full-featured, with the result that there must be fewer complete implementations. Because there is nowhere else to go, it has needed to add support for JSON, streaming and database-query-influenced XPaths.

So...

Is there a way to address these pain points and evolve XML? I think there is, and a way to claw back many features lost from XML while keeping a neat, simple pipeline that causes the least disruption to current APIs.
Here is what I am thinking. XML is evolved into a notional pipeline of up to five steps: XML Macro Processor, Fully Resolved XML Processor, Notation Expander, Validation Processing, and Decorating Post-Processor. Let's call it "The X Refactor".

1) "XML Macro Processor" Full-featured macro processor, taking the features of M4: text substitution, file insertion, conditional text. Just before the advent of XML, Dave Peterson had proposed to the ISO committee enhancing the marked section mechanism with better conditional logic (not, and, or, etc), so this is not a left-field idea. (This is an enhanced standalone version of what SGML calls the "Entity Manager".)

Suggestion: Input: bytes. Output: fully resolved XML Unicode text.
* Read the XML header and handle transcoding to Unicode.
* Parse for <![ and subsequent ]> (or ]]>) on a stack, and perform macro expansion and interpretation: strip the DOCTYPE declaration, perform inclusions, don't pass on ignored sections, and convert data in CDATA sections or CDATA entities to text with numeric character references. The values of the variables of marked sections (which look like PE references, i.e. %aaaaa;) are not taken from definitions in the prolog but must be provided out of band, i.e. as an invocation configuration. (This is a "hygienic" macro processor: macros cannot be defined in the document, which avoids the risk of complicated meta-macro-hacking.)
* Expand general entity references to direct Unicode characters. Entity references (which look like general entity references, i.e. &aaaa;) are not defined in the prolog but must be provided out of band in some Entity Declaration Document. The standard ISO/MathML entity sets are predefined.
Benefits:
* Allows major simplification of the XML processor.
* Supports lightweight customizable documents, without having to load a whole document tree.
* Reconstructs SGML's marked section mechanism.
* Removes the vexed issue of people who want to use XSD and named character references.
* Optionally supports ";" omissibility on entity and numeric character references, a la SGML.
* Documents can be transcoded without corrupting non-ASCII characters in names, PIs and comments.
* The macro processor removes the need for parameter entities, because it can be used on a schema or other XML document. And it provides a way of customizing schemas using a general mechanism.

Incompatibilities:
* Entity and numeric character references will be recognized where they currently are not.
* Edge cases will exist, such as where an attribute value contains <![ and it is recognized as markup.
* Marked sections are not defined as synchronised with element tags, which could allow various hacking problems. (Implementations are not required to support marked sections that are in attributes or that are asynchronous to the element tagging; such markup is deprecated and unsafe.)
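
To make stage 1 concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not a proposed implementation: the function name expand, the INCLUDE/IGNORE keyword values and the two lookup tables are hypothetical stand-ins for the out-of-band invocation configuration and the Entity Declaration Document. It resolves conditional sections inside-out and then expands named entities; it does not handle transcoding, DOCTYPE stripping, CDATA or file inclusion.

```python
import re

# A marked section whose body contains no nested "<![" (so it is innermost).
MARKED = re.compile(r"<!\[\s*%(\w+);\s*\[((?:(?!<!\[).)*?)\]\]>", re.DOTALL)
ENTITY = re.compile(r"&(\w+);")


def expand(text, variables, entities):
    """Resolve marked sections and named entities from out-of-band tables.

    variables: hypothetical invocation configuration, e.g. {"draft": "IGNORE"}
    entities:  hypothetical entity table, e.g. {"alpha": "\u03b1"}
    """
    def section(match):
        # Unknown variables are treated as IGNORE (an assumption of this sketch).
        keyword = variables.get(match.group(1), "IGNORE")
        return match.group(2) if keyword == "INCLUDE" else ""

    # Resolve innermost sections first, repeating until none are left.
    count = 1
    while count:
        text, count = MARKED.subn(section, text)

    # Expand known entity names; unknown references are left untouched
    # so that a later stage can report them.
    return ENTITY.sub(lambda m: entities.get(m.group(1), m.group(0)), text)


doc = "<p>&alpha; <![%draft;[DRAFT ]]><![%final;[FINAL]]></p>"
result = expand(doc, {"draft": "IGNORE", "final": "INCLUDE"}, {"alpha": "\u03b1"})
```

Because the variables and entities come in as arguments rather than from the document, the document itself cannot redefine them, which is the "hygienic" property described above.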

2) Fully Resolved XML Processor Stripped back XML processor without encoding handling, DOCTYPE declaration, CDATA sections, entity references, numeric character references.

Suggestion: Input: Unicode text. Output: XML event stream.

* Recognize start-tags, end-tags, comments and PIs.
* As error-handling, may allow STAGC omission like SGML and HTML: <p<b>
* As error-handling, may allow start- and end-tag implication, using an Error Handling Specification document, like SGML and HTML.
* An entity reference would be reported as an error (undeclared entity).
* A numeric character reference would be accepted but generate a warning.

Benefits:
The input is the ultra minimal XML that some have been calling for. Rather than "simplifying XML" by abandoning docheads, we refactor XML to support both docheads and people wanting a minimal XML.
Conforming subset of current XML
Compatible with SAX

Incompatibilities:
Allowing minimization and tag implication may be an incompatibility, but it would be an error-handling feature that does not need to be enabled.
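
As a rough illustration of how small the stage 2 processor could be, here is a sketch in Python that turns fully resolved text into a stream of events. The tuple-shaped events and the function name events are assumptions of this sketch (a real implementation would more likely drive SAX callbacks). It does not check that tags balance, does not implement the optional error recovery, and assumes attribute values do not contain ">".

```python
import re

TOKEN = re.compile(
    r"<!--(?P<comment>.*?)-->"
    r"|<\?(?P<pi>.*?)\?>"
    r"|</(?P<end>[^\s>]+)\s*>"
    r"|<(?P<start>[^\s<>/!?][^\s<>/]*)(?P<attrs>[^>]*?)(?P<empty>/?)>"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)
ATTR = re.compile(r"""([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def events(text):
    """Yield (kind, ...) tuples for tags, comments, PIs and text."""
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() != pos:  # something between tokens was not recognized
            raise ValueError(f"unrecognized markup at offset {pos}")
        pos = m.end()
        if m.group("comment") is not None:
            yield ("comment", m.group("comment"))
        elif m.group("pi") is not None:
            yield ("pi", m.group("pi"))
        elif m.group("end") is not None:
            yield ("end", m.group("end"))
        elif m.group("start") is not None:
            attrs = {
                a.group(1): a.group(2) if a.group(2) is not None else a.group(3)
                for a in ATTR.finditer(m.group("attrs"))
            }
            yield ("start", m.group("start"), attrs)
            if m.group("empty"):  # <x/> becomes a start event plus an end event
                yield ("end", m.group("start"))
        else:
            data = m.group("text")
            if re.search(r"&(?!#)", data):  # entity references are not allowed here
                raise ValueError("entity reference in fully resolved XML")
            yield ("text", data)
    if pos != len(text):
        raise ValueError(f"unrecognized markup at offset {pos}")


stream = list(events('<a x="1"><!--c--><b/>hi</a>'))
```

There is no DOCTYPE, CDATA or encoding logic anywhere in the sketch, which is the point of pushing those concerns into the macro processor.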

3) Notation Expander Process the contents of some element and replace delimiters with tags. The processor uses a Notation Definition Specification, which uses regular expressions and reuses the same tag implication fixup as the Error Handling Specification of the Fully Resolved XML processor above. The elements generated are synchronized with the containing element. Element markup inside the notation is allowed or rejected (as a kind of validation).

Specialist notation processors are also possible: namely for JSON, for QuickFixes (Schematron parse and fixup), and to reconstruct the SGML SHORTREF mechanism. Stretching it a bit, HTML5-style element hoisting might go in this stage too.

Input: XML Event Stream. Output: XML Event Stream.

Benefits:
This is to reconstruct the idea of the SHORT-REF->ENTITY-REF->MARKUP mechanism in XML, where in a context you can define that a character like * should be shorthand for entity reference &XXX; and that this entity could contain a start tag <XXX> which would then be closed off by implication, or explicitly, or by some other shortreffed character.

Short refs had three kinds of use cases:
* The first was for repetitive tabular data, such as CSV, where the newline and , or | characters could be expanded and recognized. This use case would be supported.
* The second was for embedded little languages, for example for mathematical notation. However, the absence of a mechanism to declare infix shortrefs meant that this was crippled. This use case would be supported.
* The third was for markdown-style markup. This is not a supported use case, as there is a thriving markdown ecosystem and community doing fine without it, and because of the issue of double delimiting.

Other benefits:
* Supports some simple parsing tasks that otherwise might require heavyweight XSLT, but does it within a more targeted regex framework.
* Compatible with SAX
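
The tabular use case can be sketched as an event-stream filter. The following Python is a hedged illustration only: the tuple events, the function name expand_tabular, and the row/cell element names are all assumptions, and the two regular expressions stand in for what a Notation Definition Specification would declare. Tag implication and validation of embedded markup are not shown.

```python
import re


def expand_tabular(events, notation_element, row_delim=r"\n", cell_delim=r"[,|]",
                   row="row", cell="cell"):
    """Replace delimiters in text inside notation_element with row/cell tags."""
    depth = 0  # greater than zero while inside the notation element
    for ev in events:
        kind = ev[0]
        if kind == "start" and ev[1] == notation_element:
            depth += 1
            yield ev
        elif kind == "end" and ev[1] == notation_element:
            depth -= 1
            yield ev
        elif kind == "text" and depth > 0:
            data = ev[1].strip()
            if not data:
                continue  # drop whitespace-only text inside the notation
            for line in re.split(row_delim, data):
                yield ("start", row, {})
                for field in re.split(cell_delim, line):
                    yield ("start", cell, {})
                    yield ("text", field.strip())
                    yield ("end", cell)
                yield ("end", row)
        else:
            yield ev  # everything outside the notation passes through unchanged


source = [("start", "table", {}), ("text", "a,b\nc|d"), ("end", "table")]
expanded = list(expand_tabular(source, "table"))
```

The generated row and cell elements open and close entirely within the text node they came from, so they stay synchronized with the containing element.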

4) Validation Processing
Input: XML event stream. Output: enhanced XML event stream (PSVI), or [XML input stream, XML validation report language].
This can use any subsequent DTD stage, or XSD, or a combination of the DSDL family (RELAX NG, Schematron, CRDL for character range validation, NVDL for namespace remapping, and so on.)
Benefits:
* The technology for this part of the tool chain is available.
* Except that there needs to be an "XML" output from validation. Consequently, what is needed is either a type-enhanced standard SAX (for a Post Schema Validation Infoset), or a dual stream of the input plus an event stream of the validation report linking properties and errors to the original document (i.e. ISO SVRL).

5) Decorating Post Processor This would perform simple transformations: streamable insertions into the event stream. (It could also be run before validation if needed.)
Suggestion: Input: (enhanced) XML event stream. Output: (enhanced) XML event stream.
Benefits:
* Supports attribute defaulting, taking over from DTD. RELAX NG and Schematron per se do not alter the document stream.
* Reconstructs the LINK feature of SGML, which allows bulk addition of attributes (such as formatter properties), reducing the attributes needed to be marked up or in the schema. Allows process-dependent attributes to be added on the fly.
* Supports feature extraction and markup. For example, a Schematron processor could be made that injects into the event stream extra attributes based on the new sch:property capability of Schematron 2015.
* Supports some simple decoration tasks that otherwise might require heavyweight XSLT.
* Compatible with SAX
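
Attribute defaulting as a SAX stage might look something like the following Python sketch, which uses the standard library's xml.sax filter classes. The class name AttributeDefaulter, the helper decorate, and the shape of the defaults table are assumptions for illustration; no standard language for declaring the defaults exists yet.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class AttributeDefaulter(XMLFilterBase):
    """SAX filter that adds default attributes to start-element events."""

    def __init__(self, parent, defaults):
        super().__init__(parent)
        # Hypothetical table: {element name: {attribute name: default value}}
        self._defaults = defaults

    def startElement(self, name, attrs):
        merged = dict(self._defaults.get(name, {}))
        merged.update(attrs.items())  # attributes in the document win
        super().startElement(name, AttributesImpl(merged))


def decorate(xml_text, defaults):
    """Run the filter over a document and serialize the decorated stream."""
    out = io.StringIO()
    stage = AttributeDefaulter(xml.sax.make_parser(), defaults)
    stage.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    stage.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


decorated = decorate('<doc><p/><p align="left"/></doc>',
                     {"p": {"align": "justify"}})
```

The filter touches only start-element events and keeps no document tree, so it streams, and it can be chained before or after a validation stage.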

What would it take?

1) Split apart an XML Processor into two parts. Dump DOCTYPE processing. Define marked section logic expressions (AND | OR | etc) and add them to the Macro processor. Implement as a text pipe or as an InputStream. Add the error recovery. (An existing XML processor will accept Fully Resolved XML as is.)

2) Make some generic notation processor (annotated BNF + tag implication). A standard language should be adopted. Make specialist processors for math and XML Quick Fix. Allow invocation either by a PI as the first child of the parent to flag the notation, or by some config file. Implement as text pipeline or SAX stream processor.

3) Validation technology exists. But how to sequence it is an open question (that DSDL punted): please not XProc. But does SAX support the PSVI?

4) A simple streaming substitution language would be trivial to define and implement as a SAX Stream. It would be a processing decision to add this, but there is no harm in notating this with a PI. A standard language should be adopted.

So I don't see this as very disruptive at the API level.

Afterthought:
20 years ago, when we were chopping up SGML to formulate XML, the thought was that we could afford to remove much useful functionality either because (such as with schemas) it could be upgraded into a different stage in the pipeline or (such as with conditional marked sections) because it was a back-end task suited inside servers rather than the wire format (SGML-on-the-Web.)

We left the job unfinished: the pipeline is incomplete, and the back-end uses turned out to be the main use case and have been neglected. The aim is not to reconstruct all of SGML, and certainly not to make a monolithic system with lots of feedback: we don't need an SGML Declaration 2.0! But I suggest that filling out the pipeline would support many use cases.
