Re: [xml-dev] XML 2 so far

Kurt Cagle
XML Architect
Lockheed / US National Archives ERA Project

On Sun, Dec 12, 2010 at 10:48 PM, Kurt Cagle <kurt.cagle@gmail.com> wrote:

Cool thread by all, and I think it's beginning to ask the hard questions:

> (1) allow leading whitespace before the XML declaration
>    <?xml ...?>
>
>    Why: It happens often by mistake, or as a result of copy and
>    paste, and results in the declaration becoming interpreted as a
>    processing instruction, or in a possibly-cryptic error message.
>    This seems not to be controversial.

I guess I should comment to make it controversial:

With my implementor hat on, I think it's bad design that you have to consume an arbitrary number of bytes before committing to a decoder. I'd go in the other direction and consider the possibility of having an arbitrary number of white space characters after <?xml but before the encoding pseudo-attribute a design flaw in XML.

Moreover, I think it's bad not to have a reliable magic number within a fixed number of bytes from the start of the file, so I think it's a flaw that <?xml isn't required, and making it potentially appear at a later offset wouldn't be an improvement.
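
To make the implementor's point concrete, here is a minimal sketch of a check that commits to a decision after looking at only a fixed-size prefix. The function name `sniff` and the returned labels are hypothetical, chosen for illustration; a real parser would go on to read the encoding pseudo-attribute.

```python
import codecs

# Assumption for illustration: five bytes, the length of b"<?xml", is the
# fixed window we are willing to inspect before committing to a decoder.
PREFIX_LEN = 5

def sniff(data: bytes) -> str:
    """Classify a byte stream by looking only at its first PREFIX_LEN bytes."""
    head = data[:PREFIX_LEN]
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8 (BOM)"
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16 (BOM)"
    if head == b"<?xml":
        return "xml-declaration"
    # Leading whitespace pushes "<?xml" outside the fixed window.
    return "unknown"
```

With leading whitespace allowed, the last branch would have to keep reading an unbounded number of bytes before it could answer, which is the cost being objected to.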

Most of the problems with this come from malformed RSS documents. I'd argue that a regex preprocessing filter for such files would make not just this but a number of issues with XML go away.
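
As a rough sketch of what such a preprocess might look like for the leading-whitespace case only: the helper names `strip_leading_whitespace` and `parse_feed` are hypothetical, and the filter works on raw bytes, so it assumes an ASCII-compatible encoding such as UTF-8.

```python
import re
import xml.etree.ElementTree as ET

# Whitespace at the very start of the input, but only when it is
# immediately followed by an XML declaration.
_LEADING_WS = re.compile(rb"\A[ \t\r\n]+(?=<\?xml[ \t\r\n])")

def strip_leading_whitespace(data: bytes) -> bytes:
    """Drop whitespace that precedes the XML declaration, if any."""
    return _LEADING_WS.sub(b"", data, count=1)

def parse_feed(data: bytes) -> ET.Element:
    """Clean the input, then hand it to an ordinary strict parser."""
    return ET.fromstring(strip_leading_whitespace(data))
```

Input without a declaration is passed through untouched, so the filter does not hide any other kind of error from the parser.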

> (2) character set
>    require the use of utf-8, or of utf-8 and -16, and forbid others.
>    Not complete consensus here.

No one should use anything except UTF-8 over the wire. UTF-16 is a legacy encoding.

As for "require", the big question is if you want XML 2 processors to be able to consume existing XML 1.0 content. If yes, you can't require stuff. If no, failure due to lack of positive network effects is likely.

For XML5, it is a goal that an XML5 processor will expose the same infoset as a non-validating, non-external-entity-resolving XML 1.0 processor when fed input that is a namespace-well-formed XML 1.0 document. (Well, at least when the document doesn't have an internal subset.)

Much of the difficulty here comes with legacy code; there's a lot of XML that was encoded as ISO-8859-1 early on that's still in the system. Agreed, would like to see UTF-8 become standard.

> (3) document type declaration - external DTD
>    Remove external DTDs.
>    Not complete consensus on what to do with entities.

I say predefine all the HTML5 named character names that end with a semicolon. (Except in XML, you wouldn't consider the trailing semicolon part of the name.)

Agreed - entirely too much of my career has been spent recoding HTML named entities to their numeric equivalents. The entity tables are well defined and would not take up a significant amount of memory or processing time on today's systems. There is some interesting work that was done in XSLT 2 on character encodings and mappings that should also be pushed into the parser.
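
For a sense of how small that recoding job is once the table is available, here is a sketch using the `html5` table from Python's standard `html.entities` module, whose keys include the trailing semicolon (for example `"nbsp;"`). The function name `to_numeric` is hypothetical.

```python
import re
from html.entities import html5  # maps names such as "nbsp;" to characters

# The five entities XML already predefines are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_numeric(text: str) -> str:
    """Rewrite HTML named entities as numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        chars = html5.get(name + ";")
        if chars is None:
            return match.group(0)  # unknown name: leave it for the parser
        # A few names expand to more than one code point.
        return "".join(f"&#{ord(c)};" for c in chars)
    return _ENTITY.sub(repl, text)
```

A parser with the table built in would do the same lookup at the point where it resolves an entity reference, rather than as a text rewrite.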

> (4) internal subset (e.g. element and entities declared in DTD-style
>    notation at the start of a document)
>    I don't see consensus here.  People do want a way to define
>    "macros" or something similar that can appear in attribute values as
>    well as in elements, and XInclude can't do that.

For XML5, I'd like to get rid of internal subset processing. The main problem is that existing XML content on the Web includes SVG files written by Adobe Illustrator, and those files not only have an internal subset but define namespace URLs as entities there and later use those entities in namespace declarations. (I'd be interested in knowing who at Adobe thought this was a good idea.)

The fear of getting dragged into implementing internal subset processing is probably the main reason why I haven't written an XML5 parser, yet. In SGML and in SGML-inspired languages, the number of tokenizer states required for a piece of syntax is inversely proportional to the usefulness of the piece of syntax. :-(

Agreed. Internal subset processing introduces semantics and complexity that would be better handled via a transformation process or some other formal processing tool post facto.

> (5) multiple root elements
>    Allow multiple root elements in a document.
>    Why? Because people want it. There's no technical need.
>    On the other hand, it may break existing APIs and tools.
>    Seems to be weak consensus on doing this one.

Seems like a recipe for severe API incompatibility.

This is one area where I'd be inclined to disagree. I think that there is a technical need, though not necessarily one that shows up in HTML. The primary use case I see here comes in query operations: most queries return multiple nodes of content (thinking XML databases here), with the enclosing node added primarily because XML currently does not support multiple roots (such a result is akin to a JSON array).
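
The workaround in use today looks roughly like the following sketch: the consumer wraps the returned sequence in a synthetic element just so a strict parser will accept it, then throws the wrapper away. The function name `parse_fragments` and the wrapper name `results` are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_fragments(text: str, wrapper: str = "results"):
    """Parse a sequence of sibling elements by adding a throwaway root.

    Assumes `text` carries no XML declaration or DOCTYPE of its own.
    """
    root = ET.fromstring(f"<{wrapper}>{text}</{wrapper}>")
    return list(root)  # the caller only ever wanted the children

items = parse_fragments("<item id='1'/><item id='2'>x</item>")
```

The wrapper carries no information, which is the sense in which the enclosing node exists only to satisfy the single-root rule.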

> (6) Lax syntax and error recovery
>    There's strong demand to allow processors to do error recovery,
>    from some user communities.  This mostly seems to me to be
>    Web browser programmers who deal with faulty RSS a lot; on the
>    other hand, e.g. SOAP people would fight hard to keep this out
>    (and it's certainly not a feature of JavaScript or JSON either).
>    Not clear consensus here.

Making a new version of XML and making it Draconian *again* would truly be tragic.

I've long referred to the lax syntax argument as the "Grandmother Argument": that my grandmother should be able to write invalid (fill in the blank language) and the system should be able to handle this laxness. It's a weak argument in HTML, if only because I believe that the amount of HTML being written by hand is a small (and, more importantly, shrinking) percentage of the overall production of HTML as more and more of it gets produced by automated mechanisms. It's a terrible argument in XML, in great part because the only way you can derive even marginal semantics is by incorporating an XSD or similar type definition language, and the ability to introduce mechanisms to compensate for such laxness assumes a greater degree of competency in schema design than I've seen evinced in most XSD developers.

What this does imply is that if lax XML is permitted, there needs to be some way of defining in the schemas how such laxness is handled. This would be analogous to saying that a P tag is lax (it resolves with no terminating tag) if one of a given set of opening tags is encountered (<DIV>, <Hn>, etc.). I don't necessarily see this as being a bad solution, but it would put more onus on the schema developer and would require rethinking XSDs in particular. Other areas, such as inversions, might also require such a set of rules.
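
To illustrate the kind of rule table a schema would have to supply, here is a minimal sketch over a pre-tokenized stream. Everything in it is an assumption made for illustration: the `IMPLIED_END` table, the token shapes, and the function name `insert_implied_ends` are hypothetical and not part of XSD or any existing schema language.

```python
# Hypothetical schema-supplied rules: for each lax element, the set of
# start tags that implicitly close it.
IMPLIED_END = {"p": {"p", "div", "h1", "h2"}}

def insert_implied_ends(tokens, rules=IMPLIED_END):
    """Return the token stream with the implied end tags made explicit.

    Tokens are (kind, value) pairs where kind is "start", "end" or "text".
    Elements without a rule stay draconian: a mismatch raises ValueError.
    """
    out, stack = [], []
    for kind, value in tokens:
        if kind == "start":
            # Close the innermost element if this start tag ends it.
            while stack and value in rules.get(stack[-1], ()):
                out.append(("end", stack.pop()))
            stack.append(value)
        elif kind == "end":
            # Close lax elements still open inside the one being ended.
            while stack and stack[-1] != value and stack[-1] in rules:
                out.append(("end", stack.pop()))
            if not stack or stack[-1] != value:
                raise ValueError(f"unexpected end tag: {value}")
            stack.pop()
        out.append((kind, value))
    # At end of input, only lax elements may remain open.
    while stack and stack[-1] in rules:
        out.append(("end", stack.pop()))
    if stack:
        raise ValueError(f"unclosed element: {stack[-1]}")
    return out
```

Even this toy version shows where the onus lands: every lax element needs its own entry, and the schema author has to get each set right.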

The question here is whether this benefits any language other than HTML.

> (7) Minimization
>    This overlaps with No. 6, lax syntax.  Many people want to use
>    a terser syntax, or have it as an option.  There is not (yet)
>    strong consensus on what that should be.  Some people want
>    <e>....</> or <e/..../ as per SGML. But there is not strong
>    support for the exact SGML OMITTAG rules I think (which are
>    complex and require a DTD)
>
>    Neither is there support for DATATAG or the other SGML features
>    exactly, but there do seem to be people who want some sort of
>    terser markup.
>
>    There has even been a LISP-like syntax suggested.
>    The counter-arguments are usually simplicity and robustness.
>    Not yet consensus.

FWIW, you can't have this *and* also have convergence between XML and HTML.

This is again an area where an E4X-like language would prove beneficial. If XML were a native format in JavaScript, then you could readily have JSON of the form {"a":"foo","b":10,"c":<bar><bat>text</bat></bar>} which would readily resolve all of these issues, which have been proposed largely because of the challenge of mixing JSON and XML. Minimization is not an issue on the XML side - outside of HTML, most people who work with XML have become quite accustomed to its form, and minimization would likely add a considerable learning curve and overhead to the process.

> What is the business case here?

That's indeed the big question. At TPAC, TimBL said on stage (roughly, not exact quote) that XML is used too much in the enterprise for XML to change.

The business case here is making XML more workable on the browser in the enterprise context, making JSON a reasonable mechanism for the transmission of multiple pieces of XML content as well as JavaScript encodings, and better supporting the rich data cases where document content plays a part. Again, I think that E4X may very well be the model that should be looked at, because it does a decent job of mixing the two message metaphors and has had the benefit of solid real world implementations. I'd call that a huge win.

Kurt Cagle