OASIS Mailing List Archives



Is XML only half finished? The X Refactor

Where is XML thriving? 
* Industrial document production using semantic markup: the traditional SGML market of high complexity, multi-publishing, and long life span.
* Desktop office formats using XML-in-ZIP: ODF and OOXML
* Data that has mixed content 
* Data that absolutely needs schema validation as it goes through a network of processing: such as HL7
Where is XML not thriving?
* SGML on the Web 
* Ephemeral data interchange (JSON wins)
* Configuration files (property files/JSON/YAML/etc)
* Minimal markup (markdown wins)
* Where the complexity of XSD etc. does not provide a targeted enough bang per buck.
What have been the pain points for XML over the years?
* Inefficiency.  Processing pipelines often need to reparse the document. All processing is done post-parse, and consequently XML pipelines are unnecessarily heavyweight.
* It is not minimal. An even tinier version might be faster, easier to implement, neater. 
* Maths users need named entities.  
* Domain Specific syntaxes show that terseness should not be dismissed as being of minimal importance.
* DTDs were simplified and made optional on the expectation that a downstream process could then do the validation: however, only Schematron has a standard format for reporting types (linked by XPath) and errors. There is no standard XML format for the XSD PSVI. 
* Namespaces are crippled because they are not versioned, and processing software does not support wildcarding or remapping. Consequently, changing the version of the namespace breaks software. 
* XML element or attribute names, PIs and comment strings that contain non-ASCII characters may be corrupted by transcoding out of Unicode.
* A bit too strict on syntax errors.
* XSD tried to elaborate every possible use of parameter entities and tease them out into separate facilities.  It did not reconstruct several major ones, notably conditional sections. This has the consequence of reducing XML's efficiency as a formatting language.
* XInclude only partly reconstructed the functionality of general entities. 
* The XML specification sections on how to represent "<" in entity declarations gives me a headache.
* Little domain specific languages have not gone away:  we have dates, TeX maths, URLs, JSON, CSV, and so on.
* XSLT is becoming more complex and full-featured, with the result that there must be fewer complete implementations. Because there is nowhere else to go, it has needed to add support for JSON, streaming and database-query-influenced XPaths. 
Is there a way to address these pain points and evolve XML?  I think there is, and a way to claw back many features lost from XML while keeping a neat, simple pipeline that causes the least disruption to current APIs.
Here is what I am thinking. XML is evolved into a notional pipeline of up to five steps:  XML Macro Processor, Fully Resolved XML Processor, Notation Expander, Validation Processing, and Decorating Post-Processor.  Let's call it "The X Refactor".

1) "XML Macro Processor" Full-featured macro processor, taking the features of M4:  text substitution, file insertion, conditional text.  Just before the advent of XML, Dave Peterson had proposed to the ISO committee enhancing the marked section mechanism with better conditional logic (not, and, or, etc.), so this is not a left-field idea.  (This is an enhanced standalone version of what SGML calls the "Entity Manager".)

    Suggestion:  Input: bytes. Output: fully resolved XML Unicode text.  
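
A minimal sketch of what stage 1 might look like, assuming a made-up conditional section syntax `<![ expr [ ... ]]>` where expr combines flag names with and/or/not and parentheses. The names evaluate() and resolve(), the flag set and the entity table are all hypothetical; nested sections, CDATA sections and file insertion are left out.

```python
import re

# Hypothetical syntax: <![ draft and not final [ ...text... ]]>
TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z_][\w.-]*)")
SECTION = re.compile(r"<!\[\s*([^\[\]]*?)\s*\[(.*?)\]\]>", re.DOTALL)
ENTITY = re.compile(r"&([A-Za-z_][\w.-]*);")


def evaluate(expr, flags):
    """Evaluate a boolean expression over flag names (recursive descent)."""
    tokens = TOKEN.findall(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "or":
            take()
            rhs = parse_and()      # always parse, so tokens are consumed
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "and":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "not":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            value = parse_or()
            take()                 # closing parenthesis
            return value
        return take() in flags     # a bare name is true if the flag is set

    return parse_or()


def resolve(data, flags, entities, encoding="utf-8"):
    """Bytes in, fully resolved Unicode text out (assumed stage 1 contract)."""
    text = data.decode(encoding)
    # Keep or drop each conditional section according to its expression.
    text = SECTION.sub(
        lambda m: m.group(2) if evaluate(m.group(1), flags) else "", text)
    # Substitute known entities; unknown references are left untouched.
    return ENTITY.sub(lambda m: entities.get(m.group(1), m.group(0)), text)


source = (b"<doc><![ draft and not final [<note>&co;</note>]]>"
          b"<![ final [<stamp/>]]></doc>")
resolved = resolve(source, {"draft"}, {"co": "Example Corp"})
```

With only the "draft" flag set, the first section is kept and the second is dropped, so nothing conditional survives into the text handed to stage 2.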

2) Fully Resolved XML Processor  Stripped back XML processor without encoding handling, DOCTYPE declaration, CDATA sections, entity references, numeric character references.

Suggestion: Input: Unicode text.  Output: XML event stream. 
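
To give a feel for how little is left once encodings, DOCTYPE, CDATA and references are gone, here is a rough sketch of a tokenizer for stage 2. It is an illustration under heavy assumptions, not a conforming processor: the tuple event shapes are invented, only elements, double-quoted attributes and text are handled, attribute values must not contain ">", and no well-formedness checking or error recovery is done.

```python
import re

# One alternative for tags, one for runs of text.
TOKEN = re.compile(r"<(/?)([^\s<>/]+)([^<>]*?)(/?)>|([^<]+)")
ATTR = re.compile(r'([^\s=]+)\s*=\s*"([^"]*)"')


def events(text):
    """Yield hypothetical ('start' | 'text' | 'end', ...) tuples."""
    for match in TOKEN.finditer(text):
        closing, name, rest, empty, chars = match.groups()
        if chars is not None:
            # No references exist at this stage, so text is passed as is.
            yield ("text", chars)
        elif closing:
            yield ("end", name)
        else:
            yield ("start", name, dict(ATTR.findall(rest)))
            if empty:
                # <b/> is reported as a start immediately followed by an end.
                yield ("end", name)


stream = list(events('<a x="1">hi<b/></a>'))
```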



3) Notation Expander   Process the contents of some element and replace delimiters with tags. The processor uses a Notation Definition Specification, which uses regular expressions and reuses the same tag implication fixup as the Error Handling Specification of the Fully Resolved XML processor above.  The elements generated are synchronized with the containing element. Element markup inside the notation is allowed or rejected (as a kind of validation).

     Specialist notation processors are also possible: namely for JSON, for the QuickFixes (Schematron parse and fixup), and to reconstruct SGML's SHORT REF mechanism.  Stretching it a bit, an HTML 5 style element hoisting might go in this stage too. 

    Input:  XML Event Stream.  Output: XML Event Stream.

   Benefits: This is to reconstruct the idea of the SHORT-REF->ENTITY-REF->MARKUP mechanism in XML, where in a context you can define that a character like * should be shorthand for entity reference &XXX; and that this entity could contain a start tag <XXX> which would then be closed off by implication or explicitly or by some other shortreffed character. 
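
A sketch of a generic notation expander written as a SAX filter, using Python's standard xml.sax. The class name NotationExpander, the expand() helper and the one-regex "notation definition" are assumptions for illustration; there is no tag implication here, so an unpaired delimiter simply stays as text, and text inside descendants of the chosen element is expanded too.

```python
import io
import re
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class NotationExpander(XMLFilterBase):
    """Replace delimited runs inside one element with a generated element."""

    def __init__(self, parent, element, pattern, tag):
        super().__init__(parent)
        self.element = element              # element whose text uses the notation
        self.pattern = re.compile(pattern)  # one capture group: the wrapped text
        self.tag = tag                      # element generated for each match
        self.depth = 0                      # > 0 while inside `element`
        self.pending = []                   # character chunks not yet emitted

    def flush(self):
        # Parsers may split text into chunks, so expand only at tag boundaries.
        text = "".join(self.pending)
        self.pending = []
        pos = 0
        for match in self.pattern.finditer(text):
            if match.start() > pos:
                super().characters(text[pos:match.start()])
            super().startElement(self.tag, AttributesImpl({}))
            super().characters(match.group(1))
            super().endElement(self.tag)
            pos = match.end()
        if pos < len(text):
            super().characters(text[pos:])

    def startElement(self, name, attrs):
        self.flush()
        if name == self.element:
            self.depth += 1
        super().startElement(name, attrs)

    def endElement(self, name):
        self.flush()
        if name == self.element:
            self.depth -= 1
        super().endElement(name)

    def characters(self, content):
        if self.depth:
            self.pending.append(content)
        else:
            super().characters(content)


def expand(xml_bytes, element, pattern, tag):
    """Event stream in, event stream out; serialized here only for display."""
    out = io.StringIO()
    expander = NotationExpander(xml.sax.make_parser(), element, pattern, tag)
    expander.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    expander.parse(io.BytesIO(xml_bytes))
    return out.getvalue()


result = expand(b"<doc><p>a *b* c</p><q>*x*</q></doc>", "p", r"\*(.+?)\*", "em")
```

Only the text inside p is expanded; the asterisks in q pass through unchanged because the notation is not in force there.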

4) Validation Processing
Input: XML stream  Output: Enhanced XML Event Stream (PSVI), or [XML input stream, XML validation report language]
This can use any subsequent DTD stage, or XSD, or a combination of the DSDL family (RELAX NG, Schematron, CREPDL for character range validation, NVRL for namespace remapping, and so on.) 
   * The technology for this part of the tool chain is available
   * Except that there needs to be an "XML" output from validation. Consequently, either a type-enhanced standard SAX (for a Post Schema Validation Infoset), or a dual stream of the input plus an event stream of the validation report linking properties and errors to the original document (i.e. ISO SVRL).
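
To illustrate the second option (a report stream linked back to the input by XPath), here is a toy sketch. It is not SVRL and not a real validator: the class ReportingChecker, the single "required attribute" rule and the (xpath, message) tuples are all made up to show the linking idea only.

```python
import xml.sax


class ReportingChecker(xml.sax.ContentHandler):
    """Check a toy rule and record each failure against a positional XPath."""

    def __init__(self, required):
        super().__init__()
        self.required = required  # {element name: attribute that must be present}
        self.path = []            # XPath steps of the currently open elements
        self.counts = [{}]        # per open element: child name -> count so far
        self.report = []          # (xpath, message) pairs, in document order

    def startElement(self, name, attrs):
        siblings = self.counts[-1]
        siblings[name] = siblings.get(name, 0) + 1
        self.path.append(f"{name}[{siblings[name]}]")
        self.counts.append({})
        attr = self.required.get(name)
        if attr is not None and attr not in attrs.getNames():
            xpath = "/" + "/".join(self.path)
            self.report.append((xpath, f"missing attribute '{attr}'"))

    def endElement(self, name):
        self.path.pop()
        self.counts.pop()


def check(xml_bytes, required):
    handler = ReportingChecker(required)
    xml.sax.parseString(xml_bytes, handler)
    return handler.report


report = check(b'<doc><p id="a"/><p/></doc>', {"p": "id"})
```

The input stream is left untouched; a downstream stage that wants the errors reads the report and follows the XPaths back into the original document.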

5) Decorating Post Processor  This would perform simple transformations: streamable insertions into the event stream.  (It could also be run before validation if needed.)
   Suggestion: Input: (enhanced) XML Event Stream, Output: (enhanced) XML Event stream
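
One kind of streamable insertion is attribute defaulting, which used to be a DTD job. The following is a sketch as a SAX filter, assuming a hypothetical rule table of the form {element name: {attribute: default value}}; the names Decorator and decorate() are invented for the example.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class Decorator(XMLFilterBase):
    """Insert default attributes into start-element events as they stream by."""

    def __init__(self, parent, defaults):
        super().__init__(parent)
        self.defaults = defaults  # hypothetical rules: {element: {attr: value}}

    def startElement(self, name, attrs):
        merged = dict(self.defaults.get(name, {}))
        merged.update(attrs.items())  # attributes present in the input win
        super().startElement(name, AttributesImpl(merged))


def decorate(xml_bytes, defaults):
    """Event stream in, event stream out; serialized here only for display."""
    out = io.StringIO()
    decorator = Decorator(xml.sax.make_parser(), defaults)
    decorator.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    decorator.parse(io.BytesIO(xml_bytes))
    return out.getvalue()


decorated = decorate(b'<doc><p/><p class="x"/></doc>', {"p": {"class": "body"}})
```

Because the filter needs no lookahead and keeps no state between events, it can sit anywhere in the pipeline, before or after validation.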

What would it take?

   1) Split apart an XML Processor into two parts. Dump DOCTYPE processing.  Define and add marked section logic expressions (AND | OR | etc.) to the Macro processor.  Implement as a text pipe or as an InputStream.  Add the error recovery.   (An existing XML processor will accept Fully Resolved XML as is.)

   2) Make some generic notation processor (annotated BNF + tag implication).  A standard language should be adopted. Make specialist processors for math and XML Quick Fix.   Allow invocation either by a PI as the first child of the parent to flag the notation, or by some config file. Implement as a text pipeline or SAX stream processor.   

    3) Validation technology exists.  But how to sequence it is an open question (that DSDL punted): please not XProc.   But does SAX support the PSVI? 

    4) A simple streaming substitution language would be trivial to define and implement as a SAX Stream. It would be a processing decision to add this, but there is no harm in notating this with a PI.   A standard language should be adopted.

So I don't see this as very disruptive at the API level.

20 years ago, when we were chopping up SGML to formulate XML, the thought was that we could afford to remove much useful functionality either because (such as with schemas) it could be upgraded into a different stage in the pipeline or (such as with conditional marked sections) because it was a back-end task suited inside servers rather than the wire format (SGML-on-the-Web.)

We left the job unfinished: the pipeline is incomplete, and the back-end uses turned out to be the main use case and have been neglected. The aim is not to reconstruct all of SGML, and certainly not to make a monolithic system with lots of feedback: we don't need an SGML Declaration 2.0!  But I suggest that filling out the pipeline would support many use cases. 

