Re: [xml-dev] XML 2 so far

From: Liam R E Quin <liam@w3.org>
To: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 12 Dec 2010 22:36:53 -0500

On Sun, 2010-12-12 at 18:58 -0800, Henri Sivonen wrote:
> On Dec 12, 2010, at 17:42, Liam R E Quin wrote:
[...]
> I guess I should comment to make it controversial:

Thanks! Actually I happen to agree with you but was trying to be neutral
in that list (I know I didn't entirely manage it)...

An alternative I'll note for the next version of the list would be to
be clear that it's an error to have whitespace before the xml
declaration.

> [...]


> I'd go in the other direction and consider the possibility of having
> an arbitrary number of white space characters after <?xml but before
> the encoding pseudo-attribute a design flaw in XML.

HTML's <meta http-equiv> declaration shares this problem. But maybe it's
worth limiting; I do see your point. Noted, at any rate.

> Moreover, I think it's good to have a reliable magic number within a
> fixed number of bytes from the start of the file, so I think it's a
> flaw that <?xml isn't required and making it potentially appear at a
> later offset wouldn't be an improvement.

For any XML 2.0 there would have to be something at the start to
distinguish it, so I think this will have to happen anyway.
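
A minimal sketch of the kind of fixed-offset check being discussed,
assuming an ASCII-compatible encoding and an optional UTF-8 byte order
mark. The function name and the decision to reject leading whitespace
are assumptions for illustration, not anything an XML 2.0 draft says:

```python
XML_MAGIC = b"<?xml"
UTF8_BOM = b"\xef\xbb\xbf"


def starts_with_xml_declaration(data: bytes) -> bool:
    """Return True if the magic number sits at the very start of the data.

    Assumption of this sketch: only a UTF-8 BOM may precede the magic
    number; leading whitespace is treated as an error.
    """
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data.startswith(XML_MAGIC)


print(starts_with_xml_declaration(b'<?xml version="1.0"?><doc/>'))
print(starts_with_xml_declaration(b'  <?xml version="1.0"?><doc/>'))
```

The point of the sketch is that the check reads a bounded number of
bytes; allowing arbitrary whitespace first would remove that bound.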

> > (2) character set
> >    require the use of utf-8, or of utf-8 and -16, and forbid others.
> >    Not complete consensus here.
> 
> No one should use anything except UTF-8 over the wire. UTF-16 is a legacy encoding.
> 
> As for "require", the big question is if you want XML 2 processors to
> be able to consume existing XML 1.0 content. If yes, you can't require
> stuff. If no, failure due to lack of positive network effects is
> likely.

This is the big question for any XML 2.0 work, I think.
If it's compatible, you can't change enough to make it worthwhile;
if not, who wants it?  There are always specific communities (I'd
say XML5 is a good example) but it's hard to get cross-community
agreement.


> > (3) document type declaration - external DTD
> >    Remove external DTDs.
> >    Not complete consensus on what to do with entities.
> 
> I say predefine all the HTML5 named character names that end with a
> semicolon. (Except in XML, you wouldn't consider the trailing
> semicolon part of the name.)

Many XML vocabularies today come with a different set of predefined
entities; the HTML ones were based on a subset of an early version of
a larger SGML set.  I do know people who use many more, as well as
things like &publicationDay; or &productName;.

Another possible issue is that the names are in English.

But, the XHTML and MathML list is a good starting point for discussion.
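
For a sense of what Henri's suggestion would predefine, here is a rough
sketch using the HTML5 named character reference table that ships in
Python's standard library (html.entities.html5). Dropping the trailing
semicolon from each name follows his description; the variable name is
an assumption for illustration:

```python
from html.entities import html5

# Keep only the names that end with a semicolon, and strip the semicolon
# so that it is not considered part of the name, as it would be in XML.
predefined = {
    name[:-1]: chars
    for name, chars in html5.items()
    if name.endswith(";")
}

print(len(predefined))
print(repr(predefined["hellip"]))
```

Site-specific entities such as &productName; are of course not in this
table and would still need some declaration mechanism.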


> > (4) internal subset
[...]
> For XML5, I'd like to get rid of internal subset processing. The main
> problem is that existing XML content on the Web includes SVG files
> written by Adobe Illustrator, and those files not only have an
> internal subset but define namespace URLs as entities there and later
> use those entities in namespace declarations. (I'd be interested in
> knowing who at Adobe thought this was a good idea.)

I have no clue!
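
To illustrate the pattern Henri describes, here is a small sketch of a
document that defines a namespace URL as an entity in the internal
subset and then uses it in a namespace declaration. The entity name
ns_svg and the document itself are made up for illustration. A parser
that processes the internal subset, such as the expat-based one behind
Python's xml.etree.ElementTree, resolves the namespace; one that skips
the internal subset cannot:

```python
import xml.etree.ElementTree as ET

# Hypothetical SVG file in the style described above.
SVG = """<!DOCTYPE svg [
  <!ENTITY ns_svg "http://www.w3.org/2000/svg">
]>
<svg xmlns="&ns_svg;"><rect width="1" height="1"/></svg>"""

root = ET.fromstring(SVG)

# The namespace only comes out right because the entity was expanded
# before namespace processing.
print(root.tag)
```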

> 
> The fear of getting dragged into implementing internal subset
> processing is probably the main reason why I haven't written an XML5
> parser, yet. In SGML and in SGML-inspired languages, the number of
> tokenizer states required for a piece of syntax is inversely
> proportional to the usefulness of the piece of syntax. :-(

I didn't list some of the less useful XML features that I personally
would get rid of - e.g. NOTATION and NDATA entities.

It's possible that an XML 2 could have a cleaner, simpler syntax for an
internal subset. E.g. an xml-instance syntax,
<xml version="2.0">
  <head>
    <entity name="product">Product 3.1</entity>
  </head>
  <body>We're ready to ship &product; now!</body>
</xml>

where xml/head and xml/body are reserved names.

I don't know.
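
A rough sketch of how a processor might handle the instance syntax
above, assuming entities are plain text and are declared only in
xml/head. The regular expressions and the function name expand are
assumptions for illustration, not a proposed design:

```python
import re

DOC = """<xml version="2.0">
  <head>
    <entity name="product">Product 3.1</entity>
  </head>
  <body>We're ready to ship &product; now!</body>
</xml>"""

ENTITY_DEF = re.compile(r'<entity name="([^"]+)">(.*?)</entity>', re.S)
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")
BODY = re.compile(r"<body>(.*?)</body>", re.S)


def expand(doc: str) -> str:
    """Return the body text with entity references replaced."""
    entities = dict(ENTITY_DEF.findall(doc))
    body = BODY.search(doc).group(1)
    # Unknown references are left untouched rather than treated as errors.
    return ENTITY_REF.sub(
        lambda m: entities.get(m.group(1), m.group(0)), body
    )


print(expand(DOC))
```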

> > (5) multiple root elements
> >    Allow multiple root elements in a document.
> >    Why? Because people want it. There's no technical need.
> >    On the other hand, it may break existing APIs and tools.
> >    Seems to be weak consensus on doing this one.
> 
> Seems like a recipe for severe API incompatibility.

Yes. But it's been mentioned, so I listed it ;-)

> 
> > (6) Lax syntax and error recovery
> >    There's strong demand to allow processors to do error recovery,
> >    from some user communities.  This mostly seems to me to be
> >    Web browser programmers who deal with faulty RSS a lot; on the
> >    other hand, e.g. SOAP people would fight hard to keep this out
> >    (and it's certainly not a feature of JavaScript or JSON either).
> >    Not clear consensus here.
> 
> Making a new version of XML and making it Draconian *again* would truly be tragic.

Or, making a new version of XML and losing its best advantage over HTML
would be truly tragic :-) It depends who you ask.

> 
> > (7) Minimization
> >    This overlaps with No. 6, lax syntax.  Many people want to use
> >    a terser syntax, or have it as an option.  There is not (yet)
> >    strong consensus on what that should be.  Some people want
> >    <e>....</> or <e/..../ as per SGML. But there is not strong
> >    support for the exact SGML OMITTAG rules I think (which are
> >    complex and require a DTD)
> > 
> >    Neither is there support for DATATAG or the other SGML features
> >    exactly, but there do seem to be people who want some sort of
> >    terser markup.
> > 
> >    There has even been a LISP-like syntax suggested.
> >    The counter-arguments are usually simplicity and robustness.
> >    Not yet consensus.
> 
> FWIW, you can't have this *and* also have convergence between XML and HTML.

Convergence between XML and HTML hasn't been a strong topic on the list,
though. Should it be?

When I worked at SoftQuad, SGML minimization was a significant part of
our technical support costs - maybe as high as 80% at times - because
people would call up and say, "I've got this 5 megabyte SGML document
and Author/Editor won't open it, it says 'invalid content after the
end of the document' or 'mismatched tags', and it all looks fine to me".
But that doesn't really mean terse syntax was at fault so much as the
SGML rule, "if you see a tag you didn't expect, close elements until
you get a match".  The HTML 5 rules are laxer, although HTML has the
advantage that you can look and see if you're OK with the result; XML
tends to be used in more program-like ways, where you really need
rather more certainty. So again it's a different-user-community thing,
and the question is how to support both sets of usages, or whether it's
better to have XML do one and not the other.
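
The recovery rule quoted above can be sketched as follows. This is a
deliberate simplification that handles only end tags and ignores the
DTD-driven parts of the real SGML rules; the token format and the
function name recover are assumptions for illustration:

```python
def recover(tokens):
    """Apply "close elements until you get a match" to a token stream.

    tokens is a list of ("open", name) or ("close", name) pairs.
    Returns the events a parser would report after recovery.
    """
    stack, events = [], []
    for kind, name in tokens:
        if kind == "open":
            stack.append(name)
            events.append(("open", name))
        elif name in stack:
            # Implicitly close everything above the matching element.
            while stack:
                top = stack.pop()
                events.append(("close", top))
                if top == name:
                    break
        else:
            events.append(("error", name))
    # Anything still open at the end is closed implicitly too.
    while stack:
        events.append(("close", stack.pop()))
    return events


# A forgotten </b> is silently repaired when </a> arrives.
print(recover([("open", "a"), ("open", "b"), ("close", "a")]))
```

Because the repair is silent, a single missing end tag near the top of
a large document may only surface as an error much later, which is the
kind of support call described above.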

> 
> > What is the business case here?
> 
> That's indeed the big question. At TPAC, TimBL said on stage (roughly,
> not exact quote) that XML is used too much in the enterprise for XML
> to change.

I missed his talk, but I think that's true; we're very limited in what
we can do. On the other hand, all bets are off if it's a 2.0 -- except,
as you noted, it's not clear that enough people would want it.

Thanks for your comments!

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org