Re: [xml-dev] MicroXML

On Dec 13, 2010, at 01:01, James Clark wrote:

> http://blog.jclark.com/2010/12/microxml.html

> MicroXML - by this I mean a subset of XML 1.0 that is not intended to replace XML 1.0, but is intended for contexts where XML 1.0 is, or is perceived as, too heavyweight.

Who would implement MicroXML instead of implementing XML 1.0? That is, what problem is being solved and for whom?

My perspective is rather more data-centric. The key question from this perspective is why would somebody choose to restrict the format of the data they are producing to MicroXML rather than avail themselves of the full feature-set of XML 1.0. I would give two reasons: simplicity and polyglot capability.

Why is simplicity a good thing for a data format? First, I think it is easier to produce data that conforms to some format if you can easily understand what the rules for conformance to that format are. Second, I believe simplicity helps with data longevity. I would like to be able to use data I create today in ten or twenty years' time, when perhaps I am unable to run any of the software I have today; the simpler the data format, the easier that's going to be. Third, I believe simplicity helps with data reuse. I don't want to always have to rely on complex libraries. Libraries often come with baggage and assumptions that I may not wish to accept. Availability of libraries may limit my choice of programming language. In the long run, I think data reuse is facilitated if somebody can whip up a parser for a format in a few hours. Note that simplicity is not just about the format; it's also about the specification.

As for the polyglot capability, it makes documents more convenient to work with. People can cut and paste polyglot content and reuse it without problems in both XML and HTML5 contexts. Workflow is simplified compared to XML because I don't have to transform to HTML5 before using HTML tools or user agents. Workflow is simplified compared to HTML5, because I don't have to use special parsers or transformations before using XML tools.

So back to your original question: who would implement it? I think the old principle of "be conservative in what you do, be liberal in what you accept from others" is still a good one. So I see it being more useful to people and software that are producing content, rather than to software that is consuming content. However, I can also see it being useful to other specifications that feel that full XML is not appropriate (of which there are already many); instead of a specification having to roll its own subset of XML, they could simply reference MicroXML.

I think making the author jump through hoops in order for the consumer to be able to use XML tools is the wrong solution.

Given appropriate software support, creating MicroXML should be _easier_ than creating XML or HTML5 because it's so much simpler and more understandable. I don't view MicroXML as making the author's life harder.

I think the right solution is that the consumer uses an HTML5 parser that exposes an XML infoset instead of using an XML parser at the start of the pipeline.

That's a fine solution if people want to keep their data in unconstrained HTML5. I don't. I want to use XML, because I want to be able to use XML technologies (like RELAX NG) as part of the creation process to constrain which HTML elements and attributes I use. This is not just a matter of white-listing particular elements and attributes. For example, I want to be able to control sectioning (e.g. always using <h1> as the first child of a <section>), and to constrain the use of the class attribute. Your HTML5 parser isn't going to help if, for example, I am authoring using nxml-mode.
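
As a rough illustration of the kind of structural constraint mentioned above, here is a minimal Python sketch that checks the sectioning rule directly. It is not a RELAX NG schema (which would state the rule declaratively) and it is not how nxml-mode validates; the function name and the sample documents are hypothetical, and the input is assumed to be well-formed XML without namespace declarations.

```python
import xml.etree.ElementTree as ET


def sections_missing_h1(xml_text):
    """Return the <section> elements whose first child element is not <h1>."""
    root = ET.fromstring(xml_text)
    bad = []
    for section in root.iter("section"):
        children = list(section)  # child elements only, text is ignored
        if not children or children[0].tag != "h1":
            bad.append(section)
    return bad


# Hypothetical sample documents, for illustration only.
GOOD = "<body><section><h1>Title</h1><p>Text</p></section></body>"
BAD = "<body><section><p>Text</p><h1>Title</h1></section></body>"

print(len(sections_missing_h1(GOOD)))
print(len(sections_missing_h1(BAD)))
```

A check like this only works because the content is already XML; a schema language would additionally cover attribute constraints such as the permitted values of class.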

> It would be great if HTML5 provided an alternate way (using attributes or elements) to declare that an HTML document be parsed in standards mode. Perhaps a boolean "standard" attribute on the <meta> element?

That would fail to enable the standards mode in browsers that are already out there, so I can say with confidence that HTML5 isn't going to change like this.

I will take your word for it, but I would like to understand the logic better. Surely implementing HTML5 involves the browsers changing their behaviour. Why does the fact that a proposed change doesn't work in browsers today mean that it cannot be adopted by browsers in the future?

> I believe MicroXML should not impose any specific error handling policy;

This is a sure recipe for an interoperability failure. Well-specified behavior in error situations at least leads to interop even if the results are nonsensical at times. I think the right way to spec any successor of XML is to specify a normative tokenizer state machine in such a way that in every state, any possible input character always has a well-defined transition (like the HTML5 tokenizer has).

I think that would be a big mistake for the MicroXML specification. It would make it far too hard to understand and would limit its applicability. HTML5 is taking a very unusual, controversial approach. Most specifications manage to achieve good interoperability without taking this extreme approach (JSON is a good example). I don't think the HTML5 specification style is one that other specifications should follow.

If there really is some situation where interoperability of MicroXML parsers with respect to arbitrary byte streams is necessary, then a separate spec can be layered on top of the MicroXML spec.
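
For readers unfamiliar with the specification style being debated here, the following toy Python sketch shows what "every state has a defined transition for every input character" means in practice. It is a hypothetical three-state tokenizer invented for illustration; it is not the HTML5 tokenizer and not a proposal for MicroXML, and its recovery rules are arbitrary assumptions.

```python
def tokenize(text):
    """Toy total tokenizer: every (state, character) pair has a transition,
    so no input is ever rejected. Recognizes only text and bare start-tags."""
    tokens = []
    state = "data"
    buf = ""
    for ch in text:
        if state == "data":
            if ch == "<":
                if buf:
                    tokens.append(("text", buf))
                buf = ""
                state = "tag_open"
            else:
                buf += ch
        elif state == "tag_open":
            if ch.isalpha():
                buf = ch
                state = "tag_name"
            else:
                # Recovery rule (assumed): "<" not followed by a letter is text.
                buf = "<" + ch
                state = "data"
        elif state == "tag_name":
            if ch == ">":
                tokens.append(("start_tag", buf))
                buf = ""
                state = "data"
            else:
                buf += ch
    # End of input is a defined case too: flush pending input as text.
    if state == "tag_open":
        tokens.append(("text", "<"))
    elif state == "tag_name":
        tokens.append(("text", "<" + buf))
    elif buf:
        tokens.append(("text", buf))
    return tokens


print(tokenize("a<b>c"))
print(tokenize("1 < 2"))
```

Two implementations of these rules would agree on every input, including malformed ones, which is the interoperability argument; the cost is that the specification has to spell out each recovery rule, which is the complexity objection.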

HTML5 doesn't support Namespace declarations in the text/html syntax. However, the data model HTML5 uses for the document tree has Namespaces.

I think the inconsistency between the HTML syntax and the DOM with regards to namespaces is going to cause lots of confusion for DOM users.

If the data model doesn't support namespaces, how would one distinguish {http://www.w3.org/1999/xhtml}a and {http://www.w3.org/2000/svg}a in a code path that does things with the data model?

They would use the context (ancestor elements and their attributes) of the element. You do this all the time in XML processing. Inherited attributes (like xml:lang) are a very common pattern. CSS has a special selector based on the inherited value of xml:lang; it might be handy to have the same thing for xmlns.
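
A minimal sketch of the context-based approach described above, in Python. It assumes a namespace-free document in which an enclosing <svg> element marks the SVG vocabulary; the function name, the "html" default, and the sample document are all hypothetical, and a fuller version would also need to handle switching back (for example inside <foreignObject>).

```python
import xml.etree.ElementTree as ET


def classify_anchors(xml_text):
    """Return (vocabulary, element) pairs for each <a>, where the vocabulary
    is inherited from the nearest enclosing <svg>, defaulting to "html"."""
    root = ET.fromstring(xml_text)
    results = []

    def walk(elem, vocabulary):
        if elem.tag == "svg":
            vocabulary = "svg"  # inherited by all descendants
        if elem.tag == "a":
            results.append((vocabulary, elem))
        for child in elem:
            walk(child, vocabulary)

    walk(root, "html")
    return results


# Hypothetical sample document, for illustration only.
DOC = (
    "<html><body>"
    "<a href='x'>link</a>"
    "<svg><a href='y'><text>t</text></a></svg>"
    "</body></html>"
)

for vocabulary, elem in classify_anchors(DOC):
    print(vocabulary, elem.get("href"))
```

The vocabulary is threaded down the tree in the same way an inherited attribute such as xml:lang would be.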

> An element probably also needs to have a flag saying whether it's an empty element. This is unfortunate but HTML5 does not treat an empty element as equivalent to a start-tag immediately followed by an end-tag: elements like <br> cannot have an end-tag, and elements that can have content such as <a> cannot use the empty element syntax even if they happen to be empty. (It would be really nice if this could be fixed in HTML5.)

It can't due to existing content.

I would like to understand this a bit more. We have two distinct cases:

(a) Allowing something like <br></br>

(b) Allowing something like <a/>

For (a), wouldn't existing parsers simply ignore the end-tag? If so, why does existing content prevent allowing the end-tag?

For (b), I'm guessing existing parsers treat <a/> like <a>, so I can almost see why they would have a problem (assuming there are documents that do <a/> when they really want <a>, which seems a bit unlikely). But I thought HTML had a distinct standard parsing mode. Why can't the standard parsing mode (enabled by something that existing documents don't do, like an appropriate <meta> tag) treat <a/> as <a></a>?

I am genuinely mystified by the backwards compatibility design constraints of HTML5, so any illumination you can provide would be welcome.

James
