Re: [xml-dev] XML 2 so far


From: Henri Sivonen <hsivonen@iki.fi>
To: "xml-dev@lists.xml.org List" <xml-dev@lists.xml.org>
Date: Sun, 12 Dec 2010 18:58:11 -0800

On Dec 12, 2010, at 17:42, Liam R E Quin wrote:

> Here are my notes on feature requests so far for an XML 2.0.
> 
> After the items, a short discussion.
> 
> Liam
> 
> (1) allow leading whitespace before the XML declaration
>    <?xml ...?>
> 
>    Why: It happens often by mistake, or as a result of copy and
>    paste, and results in the declaration becoming interpreted as a
>    processing instruction, or in a possibly-cryptic error message.
>    This seems not to be controversial.

I guess I should comment to make it controversial:

With my implementor hat on, I think it's bad design that you have to consume an arbitrary number of bytes before committing to a decoder. I'd go in the other direction: I consider it a design flaw in XML that an arbitrary number of white space characters is allowed after <?xml but before the encoding pseudo-attribute.
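
To make the decoder problem concrete, here is a minimal Python sketch. It assumes an ASCII-compatible byte stream with no BOM, and it invents a 1024-byte sniffing cap; `SNIFF_LIMIT` and `sniff_encoding` are hypothetical names for illustration, not any real parser's API. XML 1.0 itself sets no such cap, which is exactly the complaint.

```python
import re

# Hypothetical cap on how many bytes we inspect before committing to a decoder.
# XML 1.0 defines no such limit.
SNIFF_LIMIT = 1024

# Matches an XML declaration at offset 0 and captures the encoding name.
_ENCODING_RE = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(prefix: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from the first SNIFF_LIMIT bytes of the stream."""
    window = prefix[:SNIFF_LIMIT]
    match = _ENCODING_RE.match(window)  # match() anchors at the start
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# A declaration that fits inside the window is found.
short_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'

# Legal XML 1.0, but the white space pushes the encoding past the window,
# so a bounded sniffer falls back to the default and may pick the wrong decoder.
padded_doc = b"<?xml" + b" " * 5000 + b'version="1.0" encoding="ISO-8859-1"?><a/>'
```

With these inputs, `sniff_encoding(short_doc)` finds the declared encoding, while `sniff_encoding(padded_doc)` falls back to the default even though the document is well-formed.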

Moreover, I think it's good to have a reliable magic number within a fixed number of bytes from the start of the file. So I think it's a flaw that <?xml isn't required, and making it potentially appear at a later offset wouldn't be an improvement.
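
A rough Python sketch of the difference, again assuming an ASCII-compatible stream without a BOM (the function names are made up for this illustration). The first check looks at a fixed five bytes; the second is what item (1) would require, and it has to skip a run of white space of unknown length first.

```python
MAGIC = b"<?xml"


def has_xml_magic(head: bytes) -> bool:
    """Fixed-offset check: True only if the stream starts with '<?xml'."""
    return head[:len(MAGIC)] == MAGIC


def has_xml_magic_after_whitespace(head: bytes) -> bool:
    """Check needed if leading white space were allowed: skip it, then compare.

    The caller cannot know in advance how many bytes `head` must contain.
    """
    return head.lstrip(b" \t\r\n").startswith(MAGIC)
```

Note that `has_xml_magic(b"<a/>")` is false even though `<a/>` is a well-formed XML 1.0 document, which is the "isn't required" flaw mentioned above.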

> (2) character set
>    require the use of utf-8, or of utf-8 and -16, and forbid others.
>    Not complete consensus here.

No one should use anything except UTF-8 over the wire. UTF-16 is a legacy encoding.

As for "require", the big question is whether you want XML 2 processors to be able to consume existing XML 1.0 content. If yes, you can't require stuff. If no, failure due to lack of positive network effects is likely.

For XML5, it is a goal that an XML5 processor will expose the same infoset as a non-validating, non-external-entity-resolving XML 1.0 processor when fed input that is a namespace-well-formed XML 1.0 document. (Well, at least when the document doesn't have an internal subset.)

> (3) document type declaration - external DTD
>    Remove external DTDs.
>    Not complete consensus on what to do with entities.

I say predefine all the HTML5 named character references whose names end with a semicolon. (Except in XML, you wouldn't consider the trailing semicolon part of the name.)
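
As a sketch of what that predefined set could look like, Python's standard `html.entities.html5` table already lists the HTML5 names, with and without the trailing semicolon. The variable name `predefined` is just for this example, and how an XML 2 processor would actually expose the table is an open assumption.

```python
from html.entities import html5

# Keep only the names that end with a semicolon, then drop the semicolon,
# since in XML the ";" terminates the reference and is not part of the name.
predefined = {
    name[:-1]: chars
    for name, chars in html5.items()
    if name.endswith(";")
}
```

After this, `predefined["hellip"]` is the horizontal ellipsis character and `predefined["nbsp"]` is the no-break space, so `&hellip;` and `&nbsp;` could be resolved without any DTD.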

> (4) internal subset (e.g. element and entities declared in DTD-style
>    notation at the start of a document)
>    I don't see consensus here.  People do want a way to define
>    "macros" or something similar that can appear in attribute values as
>    well as in elements, and XInclude can't do that.

For XML5, I'd like to get rid of internal subset processing. The main problem is that existing XML content on the Web includes SVG files written by Adobe Illustrator, and those files not only have an internal subset but define namespace URLs as entities there and later use those entities in namespace declarations. (I'd be interested in knowing who at Adobe thought this was a good idea.)
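
For illustration, here is a cut-down document in the style described above, fed to Python's `xml.etree.ElementTree`. The entity name `ns_svg` and the element attributes are chosen for this example rather than copied from any particular file. A parser that skipped the internal subset could not resolve the namespace declaration at all.

```python
import xml.etree.ElementTree as ET

# The namespace URL is declared as an entity in the internal subset
# and only then used in the xmlns attribute.
doc = """<!DOCTYPE svg [
  <!ENTITY ns_svg "http://www.w3.org/2000/svg">
]>
<svg xmlns="&ns_svg;" width="10" height="10"/>"""

root = ET.fromstring(doc)

# The element's namespace is only known after the entity has been expanded.
print(root.tag)
```

The underlying expat parser expands the entity inside the attribute value before namespace processing, so the root element ends up in the SVG namespace.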

The fear of getting dragged into implementing internal subset processing is probably the main reason why I haven't written an XML5 parser yet. In SGML and in SGML-inspired languages, the number of tokenizer states required for a piece of syntax is inversely proportional to the usefulness of the piece of syntax. :-(

> (5) multiple root elements
>    Allow multiple root elements in a document.
>    Why? Because people want it. There's no technical need.
>    On the other hand, it may break existing APIs and tools.
>    Seems to be weak consensus on doing this one.

Seems like a recipe for severe API incompatibility.
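
A small Python sketch of why: tree APIs such as `xml.etree.ElementTree` hand back exactly one root element, and today's parsers reject a second one outright. The helper name `root_tag_or_none` is made up for this example.

```python
import xml.etree.ElementTree as ET


def root_tag_or_none(text: str):
    """Return the tag of the single root element, or None if parsing fails."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError:
        # A second top-level element is a well-formedness error in XML 1.0.
        return None
```

`root_tag_or_none("<a/>")` returns the tag, while `root_tag_or_none("<a/><b/>")` returns `None`. Any API shaped like `fromstring()` or `getroot()` would need a different return type to represent several roots.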

> (6) Lax syntax and error recovery
>    There's strong demand to allow processors to do error recovery,
>    from some user communities.  This mostly seems to me to be
>    Web browser programmers who deal with faulty RSS a lot; on the
>    other hand, e.g. SOAP people would fight hard to keep this out
>    (and it's certainly not a feature of JavaScript or JSON either).
>    Not clear consensus here.

Making a new version of XML and making it Draconian *again* would truly be tragic.

> (7) Minimization
>    This overlaps with No. 6, lax syntax.  Many people want to use
>    a terser syntax, or have it as an option.  There is not (yet)
>    strong consensus on what that should be.  Some people want
>    <e>....</> or <e/..../ as per SGML. But there is not strong
>    support for the exact SGML OMITTAG rules I think (which are
>    complex and require a DTD)
> 
>    Neither is there support for DATATAG or the other SGML features
>    exactly, but there do seem to be people who want some sort of
>    terser markup.
> 
>    There has even been a LISP-like syntax suggested.
>    The counter-arguments are usually simplicity and robustness.
>    Not yet consensus.

FWIW, you can't have this *and* also have convergence between XML and HTML.

> What is the business case here?

That's indeed the big question. At TPAC, TimBL said on stage (roughly, not an exact quote) that XML is used too much in the enterprise for XML to change.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

