Re: [xml-dev] MicroXML


From: Henri Sivonen <hsivonen@iki.fi>
To: "xml-dev@lists.xml.org List" <xml-dev@lists.xml.org>
Date: Wed, 15 Dec 2010 22:31:03 -0800

On Dec 13, 2010, at 19:27, James Clark wrote:

> On Tue, Dec 14, 2010 at 6:55 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
>> Who would implement MicroXML instead of implementing XML 1.0? That is, what problem is being solved and for whom?
> 
> My perspective is rather more data-centric.  The key question from this perspective is why would somebody choose to restrict the format of the data they are producing to MicroXML rather than avail themselves of the full feature-set of XML 1.0.  I would give two reasons: simplicity and polyglot capability.

It seems a bit odd to self-impose restrictions on data if the consuming software is still an XML 1.0 processor.

> Workflow is simplified compared to HTML5, because I don't have to use special parsers or transformations before using XML tools.

I'm on a mission to make it so that an HTML5 parser will no longer be considered "special" as time goes on.

> So back to your original question: who would implement it?  I think the old principle of "be conservative in what you do, be liberal in what you accept from others" is still a good one.  So I see it being more useful to people and software that are producing content, rather than to software that is consuming content.  However, I can also see it being useful to other specifications that feel that full XML is not appropriate (of which there are already many); instead of a specification having to roll its own subset of XML, they could simply reference MicroXML.
>  
>> I think the right solution is that the consumer uses an HTML5 parser that exposes an XML infoset instead of using an XML parser at the start of the pipeline.
>> 
> That's a fine solution if people want to keep their data in unconstrained HTML5.  I don't.

The case where someone consumes their own data is a special case. Most of the time, one consumes content made by someone else.

> I want to use XML, because I want to be able to use XML technologies (like RELAX NG) as part of the creation process to constrain which HTML elements and attributes I use. This is not just a matter of white-listing particular elements and attributes.  For example, I want to be able to control sectioning (e.g. always using <h1> as the first child of a <section>), and to constrain the use of the class attribute.

Validator.nu uses an HTML5 parser to feed Jing, so the situation with RELAX NG isn't a matter of fundamental incompatibility but a matter of the solution so far existing for Java only.

>  Your HTML5 parser isn't going to help if, for example, I am authoring using nxml-mode.

I believe Edward O'Connor is working on an nxml-mode-inspired HTML5 parser in Emacs Lisp. https://github.com/hober/html5-el/

>>>  It would be great if HTML5 provided an alternate way (using attributes or elements) to declare that an HTML document be parsed in standards mode. Perhaps a boolean "standard" attribute on the <meta> element?
>>> 
>> That would fail to enable the standards mode in browsers that are already out there, so I can say with confidence that HTML5 isn't going to change like this.
>> 
> I will take your word for it, but I would like to understand the logic better. Surely implementing HTML5 involves the browsers changing their behaviour. Why does the fact that a proposed change doesn't work in browsers today mean that it cannot be adopted by browsers in the future?

The standards mode is not a new feature. It has existed for a bit over a decade. Changing the trigger mechanism would fail to activate the standards mode in existing browsers that have it. HTML5 generally uses existing stuff if available. For example, HTML5 doesn't rename <p> to <paragraph> or something like that just because there are also new features that need implementation work. This makes HTML5 work in legacy browsers as much as feasible.

>>> I believe MicroXML should not impose any specific error handling policy;
>>> 
>> This is a sure recipe for an interoperability failure. Well-specified behavior in error situations at least leads to interop even if the results are nonsensical at times. I think the right way to spec any successor of XML is to specify a normative tokenizer state machine in such a way that in every state, any possible input character always has a well-defined transition (like the HTML5 tokenizer has).
>> 
> I think that would be a big mistake for the MicroXML specification.  It would make it far too hard to understand and would limit its applicability.

Error handling isn't meant to be understood. It's meant to produce well-defined and consistent results across implementations even when the results don't make sense. This way, if an author makes an error, all consumers behave the same way and the author doesn't end up relying on the particularities of one product. When authors unknowingly rely on product-specific error handling, the result can be an anti-competitive situation in which new entrants to the market have to get on a treadmill of reverse engineering the incumbent market leader, on whose error handling existing content relies.
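
To make "every state has a well-defined transition for every input character" concrete, here is a minimal sketch in Python. The toy grammar (text plus bare <name> tags), the state names and the recovery rules are all assumptions made up for illustration; this is not the HTML5 tokenizer and not anything proposed for MicroXML. The point is only that no input can make it fail: errors are recorded, and the output is still fully determined.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()       # ordinary character data
    TAG_OPEN = auto()   # just saw "<"
    TAG_NAME = auto()   # inside a tag name, waiting for ">"


def tokenize(text):
    """Tokenize a toy markup language (hypothetical grammar).

    Returns (tokens, errors). Never raises: every (state, character)
    pair, and end-of-input in every state, has a defined outcome.
    """
    tokens = []   # ("text", str) or ("tag", str)
    errors = []   # (position, message) for each recovered error
    state = State.DATA
    chars = []    # pending character data
    name = []     # pending tag name

    def flush_text():
        if chars:
            tokens.append(("text", "".join(chars)))
            chars.clear()

    def emit_tag():
        tokens.append(("tag", "".join(name)))
        name.clear()

    for pos, ch in enumerate(text):
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                chars.append(ch)
        elif state is State.TAG_OPEN:
            if ch.isalpha():
                flush_text()
                name.append(ch)
                state = State.TAG_NAME
            elif ch == "<":
                # Recovery: the previous "<" becomes text; this one may open a tag.
                errors.append((pos, "stray '<' treated as text"))
                chars.append("<")
            else:
                # Recovery: "<" followed by a non-letter is plain text.
                errors.append((pos, "stray '<' treated as text"))
                chars.append("<")
                chars.append(ch)
                state = State.DATA
        else:  # State.TAG_NAME
            if ch == ">":
                emit_tag()
                state = State.DATA
            elif ch == "<":
                # Recovery: close the unterminated tag, start a new one.
                errors.append((pos, "unterminated tag closed implicitly"))
                emit_tag()
                state = State.TAG_OPEN
            else:
                name.append(ch)

    # End of input is also a defined transition in every state.
    if state is State.TAG_OPEN:
        errors.append((len(text), "stray '<' at end of input"))
        chars.append("<")
    elif state is State.TAG_NAME:
        errors.append((len(text), "unterminated tag at end of input"))
        emit_tag()
    flush_text()
    return tokens, errors
```

With these made-up rules, tokenize("1 < 2") yields a single text token plus one recorded error, and any other implementation following the same table would yield exactly the same thing.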

> Most specifications manage to achieve good interoperability without taking this extreme approach (JSON is a good example).

Actually, I think JSON is an example of how interop failures arise when extensions are allowed and error handling is Draconian in some implementations but not in others.
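
One illustration of the kind of divergence meant here, using Python's standard json module (my choice of example, not something from earlier in this thread): by default it accepts the NaN constant, which the JSON grammar does not contain, while rejecting a trailing comma outright. The reject_constant helper is a hypothetical name; it approximates a stricter consumer via the module's parse_constant hook.

```python
import json
import math

# Extension accepted by default: "NaN" is not part of the JSON grammar,
# yet json.loads turns it into a float.
lenient = json.loads("[NaN]")

# The same parser is Draconian about a different deviation: a trailing comma.
try:
    json.loads("[1,]")
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False


def reject_constant(name):
    """Hypothetical strict policy: refuse NaN, Infinity and -Infinity."""
    raise ValueError("non-standard JSON constant: " + name)


# A stricter consumer fed the very same bytes fails where the default succeeds.
try:
    json.loads("[NaN]", parse_constant=reject_constant)
    strict_accepts_nan = True
except ValueError:
    strict_accepts_nan = False
```

A producer tested only against the lenient configuration would never notice that it emits something a strict consumer refuses.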

> I think the inconsistency between the HTML syntax and the DOM with regards to namespaces is going to cause lots of confusion for DOM users.

Yeah. Too bad the SVG WG took Namespaces seriously and introduced local name collisions with HTML. And too bad Netscape took Namespaces seriously and implemented DOM Level 2 in Gecko and others did so, too. (As of IE9 even IE.) Now the namespace-aware data model is part of the legacy that we need to drag around.

> I would like to understand this a bit more.  We have two distinct cases:
> 
> (a) Allowing something like <br></br>

There's legacy content that looks "wrong" if that doesn't render as two line breaks.

> (b) Allowing something like <a/>

There's legacy content that looks "wrong" if that's treated as an empty element.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
