I think this XML 1.1 draft is a big step forward from previous drafts: the XML Core
WG has considerably toned down its initial features, to the point where XML 1.1
may now well be better than XML 1.0.
1) Normalization
Normalization is definitely a good thing. There should be more of it, especially
by other people:-) But currently we are not well-served by normalization libraries.
I use a stripped down version of ICU4J in a product for normalization: but the
off-the-shelf jars currently distributed for ICU4J are about 10 Meg. Unrealistically big.
So the XML 1.1 approach of saying normalization is good and may be checked for is
probably the most realistic approach. It allows natural movement in a positive
direction, like old underpants.
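To make "may be checked for" concrete, here is a minimal sketch in Python. It assumes the form being checked is NFC and relies on the standard unicodedata module rather than ICU4J; the function name is_nfc is invented for illustration. The answer depends on the Unicode version of the tables the library ships with.

```python
import unicodedata

def is_nfc(text):
    """Return True if text is already in Unicode Normalization Form C."""
    # Normalize and compare; the result reflects the Unicode version
    # of the tables bundled with this Python build.
    return unicodedata.normalize("NFC", text) == text

composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent

print(is_nfc(composed))
print(is_nfc(decomposed))
```

A checker built this way only reports; it does not rewrite the document, which fits the "good and may be checked for" stance.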
2) End-of-line handling
XML 1.1 takes the line of least resistance here as well: don't change the definition
of spaces (which would then have to propagate through other specs and technologies
that use XML tokens or the S production), but allow a couple more name characters.
I have implemented this in a product, and it really is trivial to put in.
So the XML 1.1 approach does no harm but opens the door for people who say
they need NELs.
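As a rough illustration of how small the change is, the following Python sketch maps the line-end sequences that XML 1.1 recognizes (CR LF, CR NEL, NEL, U+2028, and a lone CR) to a single LF. The function name and the use of a regular expression are my own choices for the sketch, not anything taken from a product.

```python
import re

# Two-character sequences are listed first so that CR LF and CR NEL
# collapse to one LF rather than two.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")

def normalize_eol_xml11(text):
    """Translate XML 1.1 line-end sequences to a single U+000A."""
    return _XML11_EOL.sub("\n", text)

sample = "a\r\nb\x85c\u2028d\re\r\x85f"
print(repr(normalize_eol_xml11(sample)))
```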
3) Characters
XML 1.1's new character production is, I think, a real step forward for XML.
It allows almost all kinds of characters to be sent, and so improves XML
for data exchange. But it also disallows controls from being sent directly
(numeric character references must be used instead), which takes a good stand that
XML is a textual format: that a control character sent in the data stream
*is* a control character and not data content.
The main reason I think this new character rule is a big step forward is that
(as argued in http://www.topologi.com/public/XML_Naming_Rules.html )
the control characters, especially the C1 controls U+0080-U+009F,
are excellent for detecting encoding-labelling errors (robustness).
XML 1.0 provided meagre but useful encoding-labelling error-detection,
but the XML 1.1 rules will work on non-ASCII data, not just non-ASCII
markup. See also the sidebar "How could XML 1.1 help?" in the Euro
article at http://www.xml.com/pub/a/2002/09/18/euroxml.html for more info.
So the XML 1.1 character rules are a step forward for coverage, robustness
and XML as textual.
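A small Python sketch of the robustness argument: a euro sign saved as windows-1252 but labelled ISO-8859-1 decodes to a C1 control, which the restricted ranges in XML 1.1 would flag if it appears literally. The function name restricted_chars and the sample string are made up for the example; the ranges are the ones I understand the XML 1.1 RestrictedChar production to cover.

```python
def restricted_chars(text):
    """Return (offset, 'U+XXXX') for each control that may not appear literally."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # C0 controls other than TAB, LF, CR; DEL; C1 controls other than NEL.
        if (0x01 <= cp <= 0x08 or 0x0B <= cp <= 0x0C or 0x0E <= cp <= 0x1F
                or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F):
            hits.append((i, "U+%04X" % cp))
    return hits

raw = "price: 5 \u20ac".encode("cp1252")   # the euro sign becomes the byte 0x80
mislabelled = raw.decode("iso-8859-1")      # wrong label: 0x80 turns into U+0080
correct = raw.decode("cp1252")              # right label: 0x80 turns into U+20AC

print(restricted_chars(mislabelled))
print(restricted_chars(correct))
```

The point is that the mislabelling shows up in ordinary data content, not only when the damaged character happens to sit in markup.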
4) Name Characters
XML 1.1's new name rules stink, but not as much as they used to, and not
so much that I couldn't get used to them.
The objections I had raised to the previous draft rules were:
* They reduced encoding-error detection: but the new Character rules do this
better, so that objection has been met.
* They cannot be justified by being "Unicode-version independent" because
normalization-checking is Unicode-version dependent anyway: but the
weaker normalization-checking requirement makes this objection lose force.
* They would allow line-breaks in bad places: the latest draft removes
many breaking characters (i.e., the space characters in the early U+2000s and
the ideographic space). I would prefer it went further...
* The earlier drafts did not pay adequate attention to XML as being textual:
the new control rules and the new whitespace rules for naming meet this objection.
* The initial drafts seemed to downplay the importance of the basic readability
(not to be confused with comprehensibility!) of XML documents: since the
April draft they put in Appendix B, and based it on Unicode character classes
rather than enumeration, which I think is a better approach. But a stricter
application of these guidelines would have been better. On the other hand,
specs such as XML Schemas reference XML 1.0, so they provide a nice
bit of inertia to prevent crazy characters. And checking that names are
nice might be better done by another layer, such as a schema tool or
* Some characters simply do not have any pronunciation or common name
in the language in which they are used: symbol characters and math characters, for
example. Consequently, they represent a real barrier to accessibility
(for programmers with impaired eyesight, for example): speech
synthesizers will typically remove unknown characters. I think
there is a strong difference between allowing an Ethiopian character
(which could be pronounced) and a dingbat in XML Names: the former
affords communication when used appropriately, the latter blocks it.
So this XML 1.1 goes some way in meeting my previous objections.
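The idea of leaving "nice name" checking to another layer could look something like the Python sketch below. It is only a loose approximation based on Unicode general categories, not the actual Appendix B suggestions, and the function name and the set of allowed punctuation are assumptions made for the example.

```python
import unicodedata

def questionable_name_chars(name):
    """Return the characters in name that are not letters, marks, digits or - . _ :"""
    allowed_punct = set("-._:")
    flagged = []
    for ch in name:
        major = unicodedata.category(ch)[0]   # 'L', 'M', 'N', 'S', 'P', ...
        if major in ("L", "M", "N") or ch in allowed_punct:
            continue
        flagged.append(ch)
    return flagged

print(questionable_name_chars("price-list"))
print(questionable_name_chars("\u1200\u1208"))   # two Ethiopic syllables
print(questionable_name_chars("item\u2708"))     # name ending in a dingbat
```

A check like this passes the Ethiopic letters and flags the dingbat, which is the distinction argued for above.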
5) Infoset
It was not at all clear from previous drafts whether XML 1.1 required a new
infoset. It seems now that while it changes the WF-ness of a document, it does
not change the XML infoset or require new Infoset specs. This is a good thing,
because it reduces cascading effects through XML-land.
So this XML 1.1 seems to meet my previous objection.
All in all, I congratulate the XML Core WG on this XML 1.1 draft, and all
the sensible compromises in it. It will be interesting to see whether it
Topologi, Pty. Ltd.
P.S., I think the reason that U+0000 is not allowed as an XML character
is that the standard C libraries (and maybe other libraries) cannot allow
nulls in strings. It is a sensible rule IMHO.
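For anyone who has not been bitten by this, a tiny Python sketch using the standard ctypes module shows what a C-style, NUL-terminated view of a buffer does with an embedded null. The variable names are made up for the example.

```python
import ctypes

data = b"ab\x00cd"
buf = ctypes.create_string_buffer(data)   # C char array holding data plus a trailing NUL

seen_as_c_string = buf.value   # reads up to the first NUL, as strlen would
all_bytes = buf.raw            # the whole array, embedded NUL included

print(seen_as_c_string)
print(all_bytes)
```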