OASIS Mailing List Archives





   Quick Review of XML 1.1 Candidate Recommendation


I think this XML 1.1 version is a big step forward from previous drafts: the XML Core
WG has toned down its initial features considerably, to the point where XML 1.1
may now well be better than XML 1.0.

1) Normalization 

Normalization is definitely a good thing. There should be more of it, especially
by other people:-) But currently we are not well-served by normalization libraries.
I use a stripped-down version of ICU4J for normalization in a product, because the
off-the-shelf jars currently distributed for ICU4J are about 10 MB: unrealistically big.

So the XML 1.1 approach of saying normalization is good and may be checked for is
probably the most realistic approach. It allows natural movement in a positive
direction, like old underpants.
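
A minimal sketch of what "checking for normalization" can look like, assuming
Python's standard unicodedata module rather than ICU4J. The helper name is_nfc
is hypothetical, and it tests plain NFC only; as I understand it, the spec's
notion of "fully normalized" adds further conditions that this does not cover.

```python
import unicodedata

def is_nfc(text):
    """Return True if text is already in Unicode Normalization Form C."""
    # Normalizing an already-normalized string leaves it unchanged.
    return unicodedata.normalize("NFC", text) == text

print(is_nfc("caf\u00e9"))   # precomposed e-acute
print(is_nfc("cafe\u0301"))  # "e" followed by a combining acute accent
```

The first call reports a normalized string; the second does not, because the
base letter and combining accent would be composed by NFC.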

2) End-of-line handling

XML 1.1 takes the line of least resistance here as well: don't change the definition
of spaces (which would then have to propagate through other specs and technologies
that use XML tokens or the S production), but allow a couple more line-end characters
(NEL, U+0085, and the line separator, U+2028).
I have implemented this in a product, and it really is trivial to put in.

So the XML 1.1 approach does no harm but opens the door for people who say
they need NELs.  
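
To show how small the change is, here is a sketch of line-end normalization on
input, assuming the set of line-end sequences from the XML 1.1 end-of-line
handling rules (CR LF, CR NEL, NEL, U+2028 and lone CR all become LF). The
function name normalize_line_ends is hypothetical.

```python
import re

# Two-character sequences come first so that CR LF and CR NEL each
# collapse to a single line feed rather than two.
_LINE_END = re.compile("\r\n|\r\x85|\r|\x85|\u2028")

def normalize_line_ends(text):
    """Translate every recognized line-end sequence to a single LF."""
    return _LINE_END.sub("\n", text)

sample = "a\r\nb\x85c\u2028d\r\x85e\rf"
print(normalize_line_ends(sample).split("\n"))
```

Only the input layer changes; the whitespace production itself is untouched,
which is why nothing downstream has to know about NEL.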

3) Characters

XML 1.1's new character production is, I think, a real step forward for XML.
It allows almost all kinds of characters to be sent, and so improves XML
for data exchange.  But it also disallows controls from being sent directly
(numeric character references must be used instead), which takes a good stand that
XML is a textual format: that a control character sent in the data stream
*is* a control character and not data content. 
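
A sketch of what this means for a serializer, assuming the restricted ranges
as they appear in the XML 1.1 Recommendation (U+0001-U+0008, U+000B-U+000C,
U+000E-U+001F, U+007F-U+0084, U+0086-U+009F); the draft under review may
differ in detail. The helper name escape_restricted is hypothetical.

```python
import re

# Controls that may not appear literally but may be written as
# numeric character references. U+0000 is not allowed at all.
_RESTRICTED = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")

def escape_restricted(text):
    """Replace each restricted control with a hexadecimal character reference."""
    return _RESTRICTED.sub(lambda m: "&#x%X;" % ord(m.group()), text)

print(escape_restricted("bell:\x07 escape:\x1b tab:\t"))
```

Tab, LF, CR and NEL pass through unchanged; everything else in the control
ranges has to be written out as a reference, so it is visibly markup.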

The main reason I think this new character rule is a big step forward is that
(as argued in http://www.topologi.com/public/XML_Naming_Rules.html )
the control characters, especially the C1 controls U+0080-U+009F,
are excellent for detecting encoding-labelling errors (robustness).  

XML 1.0 provided meagre but useful encoding-labelling error-detection,
but the XML 1.1 rules will work on non-ASCII data, not just non-ASCII
markup.  See also the sidebar "How could XML 1.1 help?" in the Euro 
article at http://www.xml.com/pub/a/2002/09/18/euroxml.html  for more info.
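
The Euro case makes a compact illustration. This is a sketch only, assuming
Python's built-in cp1252 and iso-8859-1 codecs; the function name
literal_c1_controls is hypothetical. Data saved as Windows-1252 but labelled
ISO-8859-1 decodes its Euro byte (0x80) to the C1 control U+0080, which a
parser that rejects literal C1 controls would then catch.

```python
def literal_c1_controls(raw, declared_encoding):
    """Decode raw bytes and list the C1 controls (other than NEL) found."""
    text = raw.decode(declared_encoding)
    return ["U+%04X" % ord(ch) for ch in text
            if 0x80 <= ord(ch) <= 0x9F and ord(ch) != 0x85]

raw = "price: 5\u20ac".encode("cp1252")         # Euro sign becomes byte 0x80
print(literal_c1_controls(raw, "iso-8859-1"))   # wrong label
print(literal_c1_controls(raw, "cp1252"))       # right label
```

With the wrong label the list is non-empty, which is the signal; with the
right label it is empty. Note that the Euro sign here is data, not markup.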

So the XML 1.1 character rules are a step forward for coverage, robustness
and XML as textual. 

4) Name Characters

XML 1.1's new name rules stink, but not as much as they used to, and not
so much that I couldn't get used to them.  

The objections I had raised to the previous draft rules were:

 * They reduced encoding-error detection: but the new Character rules do this
   better, so that objection has been met.
 * They cannot be justified by being "Unicode-version independent" because
   normalization-checking is Unicode-version dependent anyway: but the
   weaker normalization-checking requirement makes this objection lose force.
 * They would allow line-breaks in bad places: the latest draft removes
   many breaking characters (i.e., the space characters in the early U+2000s and
   the ideographic space). I would prefer it went further...
 * The earlier drafts did not pay adequate attention to XML as being textual:
   the new control rules and the new whitespace rules for naming meet this
   objection.
 * The initial drafts seemed to downplay the importance of the basic readability
   (not to be confused with comprehensibility!) of XML documents: since the
   April draft they put in Appendix B, and based it on Unicode character classes
   rather than enumeration, which I think is a better approach.  But a stricter
   application of these guidelines would have been better. On the other hand,
   specs such as XML Schemas reference XML 1.0, so they provide a nice
   bit of inertia to prevent crazy characters.  And checking that names are
   nice might be better done by another layer, such as a schema tool or
   editor, anyway.
 * Some characters simply do not have any pronunciation or common name
   in the language they are used in: symbol characters and math characters, for
   example.  Consequently, they represent a real barrier to accessibility
   (for programmers with impaired eyesight, for example):  speech
   synthesizers will typically remove unknown characters.  I think
   there is a strong difference between allowing an Ethiopic character
   (which could be pronounced) and a dingbat in XML Names: the former
   affords communication when used appropriately, the latter blocks it.
So this XML 1.1 goes some way in meeting my previous objections.
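
As a sketch of the "another layer" idea from the list above, a schema tool or
editor could flag name characters whose Unicode general category is a symbol
class. This assumes Python's unicodedata module; the function name
symbol_chars_in_name and the choice of "any S* category" as the test are
hypothetical, not anything the spec prescribes.

```python
import unicodedata

def symbol_chars_in_name(name):
    """List characters in a name whose general category is a symbol (Sm, Sc, Sk, So)."""
    return [ch for ch in name if unicodedata.category(ch).startswith("S")]

print(symbol_chars_in_name("\u1200\u1208"))   # two Ethiopic syllables (letters)
print(symbol_chars_in_name("brand\u2122"))    # ends with the trade mark sign
```

Letters from any script pass, while the trade mark sign is reported, so the
lint separates the pronounceable case from the symbol case without needing an
enumerated list of characters.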

5) Versions

It was not at all clear from previous drafts whether XML 1.1 required a new
infoset.  It seems now that while it changes the WF-ness of a document, it does
not change the XML infoset or require new Infoset specs. This is a good thing,
because it reduces cascading effects through XML-land.

So this XML 1.1 seems to meet my previous objection.

All in all, I congratulate the XML Core WG on this XML 1.1 draft, and all 
the sensible compromises in it.  It will be interesting to see whether it 
takes off. 

Rick Jelliffe
Topologi, Pty. Ltd.

P.S. I think the reason that U+0000 is not allowed as an XML character
is that the standard C libraries (and maybe other libraries) treat a null
as the end of a string, so they cannot allow nulls inside strings. It is a
sensible rule IMHO.



Copyright 2001 XML.org. This site is hosted by OASIS