OASIS Mailing List Archives


Re: XML Blueberry

This Blueberry issue is not a slam-dunk either way, it's
a genuinely hard issue.  Actually it's several genuinely
hard issues in an unattractive package, namely:

 - the NEL character as a line separator
 - the proper relationship to Unicode
 - how to version XML

At the moment, based on a certain amount of introductory 
thought, but not by an overwhelming margin, I lean to 
doing nothing; simply because the cost of the Blueberry 
action seems to outweigh its benefits.  

Benefits first:  I think that the specific Blueberry 
suggestions (NEL and the new Name characters) are probably 
technically correct from any sensible reading of 
Unicode/ISO10646.  Per Unicode, NEL is a first-class
line delimiter, at least equal in status to CR and LF, and 
arguably superior since it's a single character with a clear 
semantic, not a holdover from archaic typewriter-cylinder 
control characters.
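
As a small illustration of the gap (a sketch, not anything taken from
the Blueberry proposal): Python's str.splitlines() treats NEL (U+0085)
as a line boundary, while the XML 1.0 line-end set is only CR and LF.
The sample text and the helper name xml10_lines are made up for this
example.

```python
import re

NEL = "\u0085"  # NEXT LINE, the line terminator produced by EBCDIC-derived text

# Hypothetical sample mixing NEL, LF and CRLF line ends.
text = "first" + NEL + "second\nthird\r\nfourth"

# Unicode-aware splitting: NEL counts as a line boundary.
unicode_lines = text.splitlines()

def xml10_lines(s):
    """Hypothetical helper: split only on the XML 1.0 line ends (CRLF, CR, LF)."""
    return re.split(r"\r\n|\r|\n", s)

print(unicode_lines)
print(xml10_lines(text))
```

Under the XML 1.0 rules the NEL stays embedded in the first "line",
which is the mismatch the mainframe tools run into.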

Secondly, the benefit is significant.  The IBM mainframe
people followed all the rules in adopting NEL as a line
separator in their mainstream software libraries, and if
XML doesn't change it means you can either use standard
software text-handling tools OR you can use XML, but not
both.  OS/MVS and its successors are hardly hip or 
fashionable, but they serve as the stewards of remarkably
huge amounts of high-quality data, and any move to enable
the XMLification of this stuff is praiseworthy.

But the costs seem pretty darn high to me.  If Blueberry
is adopted and is given a new <?xml version number, this 
means that the mass of already-deployed XML software will 
correctly throw such data on the ground, at some considerable 
cost to interoperability.  If there is no <?xml version 
number, then such software will try to read it but then 
unpredictably throw it on the ground upon encountering the 
first NEL that appears inside a tag - or the first 
element-type/attribute-name using one of the non-XML-1.0 
name characters.  Of these two, the second problem seems more 
damaging, so I'd argue pretty strongly for signaling 
Blueberry documents with some value of <?xml version="X" ?> 
where X is not "1.0".
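
To make the trade-off concrete, here is a rough sketch of the check a
1.0-only reader could make up front. The regular expression is a
simplification of the real XML declaration grammar, and the function
names and the "X" label are placeholders, not anything a real parser
exposes.

```python
import re

# Simplified pattern for the version pseudo-attribute of an XML declaration.
# Assumption: the document is already decoded to a str and starts with "<?xml".
_DECL = re.compile(r"""^<\?xml\s+version\s*=\s*(['"])([^'"]+)\1""")

def declared_version(document):
    """Return the declared version, or "1.0" when there is no declaration."""
    m = _DECL.match(document)
    return m.group(2) if m else "1.0"

def accepts(document, supported=("1.0",)):
    """Hypothetical gate: refuse documents labelled with an unknown version."""
    return declared_version(document) in supported

print(accepts('<?xml version="1.0"?><doc/>'))
print(accepts('<?xml version="X" ?><doc/>'))
```

With the label, the rejection happens predictably at the first line
rather than at whatever point the first NEL or new name character
turns up.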

And the cost of this is very high.  At the moment, XML 1.0
is pretty effectively one thing and just one thing, and if
I claim to ship out XML and you claim to be able to read it,
we can usually interoperate, especially since we're both
probably using expat or xerces or msxml.  Introducing 
Blueberry will impair this admirable simplicity.

A subsidiary issue, by the way:  If you add NEL to the 
set of line-end characters there are a bunch of other 
Unicode space and space-like characters that, to be fair,
you're going to have to consider adding to the production
for "S". 
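
To get a feel for how many candidates there are, the following sketch
lists the code points that Python's str.isspace() accepts but that are
not in the XML 1.0 "S" production (space, tab, CR, LF). Note the
assumptions: str.isspace() is Python's notion of white space, driven by
whichever Unicode database the interpreter ships, and the function name
and the scan limit are arbitrary choices for the example.

```python
# XML 1.0 production [3]: S ::= (#x20 | #x9 | #xD | #xA)+
XML10_S = {"\u0020", "\u0009", "\u000D", "\u000A"}

def spacelike_outside_s(limit=0x3001):
    """Code points below `limit` that Python calls white space but S omits."""
    return [cp for cp in range(limit)
            if chr(cp).isspace() and chr(cp) not in XML10_S]

for cp in spacelike_outside_s():
    print(f"U+{cp:04X}")
```

NEL (U+0085) shows up in that list alongside NO-BREAK SPACE, LINE
SEPARATOR, IDEOGRAPHIC SPACE and others.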

And a possibly minor point: at the moment, all the "syntax"
characters in XML (<, >, /, =, &, ;, [, ], ', and ") are 
in the one-byte (ASCII) range 0-127 of Unicode, which does enable 
some sneaky parser construction tricks - probably not a big deal.

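One such trick, sketched under the assumption that the input is UTF-8:
every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a
scanner can look for '<' and '&' in the raw bytes without decoding
anything. The function name and sample document are invented for the
example.

```python
def markup_offsets(data: bytes):
    """Byte offsets of '<' (0x3C) and '&' (0x26) in UTF-8 input, no decoding."""
    return [i for i, b in enumerate(data) if b in (0x3C, 0x26)]

# Hypothetical sample with non-ASCII text between the markup characters.
doc = "<p>caf\u00e9 &amp; th\u00e9</p>".encode("utf-8")
print(markup_offsets(doc))

# NEL, by contrast, is two bytes in UTF-8, so it cannot be matched
# with a single-byte comparison.
nel_bytes = "\u0085".encode("utf-8")
print(nel_bytes)
```
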
Then another potential problem: if you decide to push XML
past version 1.0, why not take the opportunity to pour 
in namespaces?  And fix the white-space handling?  And...
well, probably nobody is willing to step off this cliff,
so maybe I'm raising a red herring.

The final issue, and I'm not sure whether it's a problem
or an opportunity, is the nature of the relationship between
XML and Unicode.  In XML 1.0 1st & 2nd editions, it is clear
that all Unicode characters (except for a few low-valued
control characters, sigh) are legal in XML text, and then
there's this exhaustive enumeration of the characters that
are legal in XML names.  This causes problems in two areas:
how to keep up with changing versions of Unicode, and how
to justify XML's private-label collection of Name characters.

One could just outsource the problem to Unicode and say 
"a name character is what Unicode says", but XML 1.0 decided
not to do this, after exhaustive consideration (most of which
I've thankfully forgotten) and I've never heard a really
powerful (either in conviction or logic) argument that this
choice was wrong.  There's a coherent but rather ad-hoc
(and non-normative) explanation of the choice of XML name 
characters in one of the appendices.  I believe John Cowan
suggested that he has a better algorithm/heuristic?  Please
share. Having a cleaner simpler relationship between XML and 
Unicode is arguably a good thing.  
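
For the sake of discussion, here is roughly what "a name character is
what Unicode says" could look like. It is only a sketch: it applies
the general-category rule described in the XML 1.0 appendix (letters
and letter-numbers to start a name, plus marks, modifier letters and
decimal digits afterwards) against whatever Unicode database the
Python interpreter ships, and it ignores the appendix's exclusions and
additions as well as the punctuation XML allows in names. It is not
John Cowan's algorithm, and the function names are made up.

```python
import unicodedata

# General categories allowed at the start of a name, and additionally inside one.
START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
OTHER_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}

def is_name_start(ch):
    """True if the character's Unicode category qualifies it to start a name."""
    return unicodedata.category(ch) in START_CATEGORIES

def is_name_char(ch):
    """True if the character's Unicode category qualifies it anywhere in a name."""
    return unicodedata.category(ch) in START_CATEGORIES | OTHER_CATEGORIES

for ch in ("a", "1", " ", "\u1200"):  # U+1200 is ETHIOPIC SYLLABLE HA
    print(f"U+{ord(ch):04X}", is_name_start(ch), is_name_char(ch))
```

A rule of this shape tracks new Unicode versions automatically, which
is both its attraction and its risk: the set of legal names changes
with the database rather than with the XML spec.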

To summarize: The lack of ability to support standard 
mainframe software and certain language groups' characters 
in markup, while regrettable, is a problem whose cost is a 
judgement call.  It is possible and reasonable to compare 
this cost with the cost described above of bifurcating XML 
3 years into its life (another judgement call), and make a 
third judgement call as to the relative magnitude of those 
costs.

To cast it in the starkest possible light: Is it a 
reasonable trade-off to say that we will live with 
an incorrect interpretation of Unicode in certain specific
areas, with the consequences of complicating the lives of
mainframe users and impoverishing the tools available to
worthy users of certain minority languages, to achieve 
the benefit of keeping XML monolithic and unitary?  Yes,
it's reasonable.  I might be convinced that it's wrong,
but it's a reasonable argument that needs to be addressed.
Corollary: it's not enough simply to say "Blueberry is 
more correct per Unicode thus we have to do it, end of
story."

So I think it would be appropriate, in this discussion,
to have some people in the mainframe trenches give us
a briefing on the scale and the difficulty of the problems
they face, and for some of our i18n gurus to highlight
the problems faced by an XML language designer who wants
to use one of the newly-added languages.

On the other side, we should consider the practicalities
and costs of upgrading (or not) the installed base in the
face of the deployment of data encoded in XML Blueberry.

I.e., let's keep this pragmatic. 

Pardon the length, I was sitting in SFO with an hour to
kill. -Tim