Re: XML Blueberry
- From: Tim Bray <tbray@textuality.com>
- To: xml-dev@lists.xml.org
- Date: Thu, 21 Jun 2001 23:02:42 -0700
This Blueberry issue is not a slam-dunk either way; it's
a genuinely hard issue. Actually, it's several genuinely
hard issues in an unattractive package, namely:
- the NEL character as a line separator
- the proper relationship to Unicode
- how to version XML
At the moment, based on a certain amount of introductory
thought, but not by an overwhelming margin, I lean to
doing nothing; simply because the cost of the Blueberry
action seems to outweigh its benefits.
Benefits first: I think that the specific Blueberry
suggestions (NEL and the new Name characters) are probably
technically correct from any sensible reading of
Unicode/ISO10646. Per Unicode, NEL is a first-class
line delimiter, at least equal in status to CR and LF, and
arguably superior since it's a single character with a clear
semantic, not a holdover from archaic typewriter-cylinder
control characters.
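For concreteness, here's a sketch (Python, purely illustrative) of what line-end normalization might look like if NEL were admitted alongside the existing delimiters; treating CR followed by NEL as a single break is my assumption, not anything the Blueberry proposal has specified.

```python
# Sketch: line-end normalization in a hypothetical NEL-aware parser.
# NEL is U+0085. Two-character sequences collapse first so that each
# becomes one LF rather than two.
NEL = "\u0085"

def normalize_line_ends(text: str) -> str:
    for seq in ("\r\n", "\r" + NEL):   # two-character breaks first
        text = text.replace(seq, "\n")
    for ch in ("\r", NEL):             # then lone CR and lone NEL
        text = text.replace(ch, "\n")
    return text
```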
Secondly, the benefit is significant. The IBM mainframe
people followed all the rules in adopting NEL as a line
separator in their mainstream software libraries, and if
XML doesn't change, you can either use standard
software text-handling tools OR you can use XML, but not
both. OS/MVS and its successors are hardly hip or
fashionable, but they serve as the stewards of remarkably
huge amounts of high-quality data, and any move to enable
the XMLification of this stuff is praiseworthy.
But the costs seem pretty darn high to me. If Blueberry
is adopted and is given a new <?xml version number, this
means that the mass of already-deployed XML software will
correctly throw such data on the ground, at some considerable
cost to interoperability. If there is no <?xml version
number, then such software will try to read it but then
unpredictably throw it on the ground upon encountering the
first NEL that appears inside a tag - or the first
element-type/attribute-name using one of the non-XML-1.0
name characters. Of these two, the second problem seems more
damaging, so I'd argue pretty strongly for signaling
Blueberry documents with some value of <?xml version="X" ?>
where X is not "1.0".
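To make the trade-off concrete, here's roughly what the version gate looks like (a Python sketch; the "1.1" value in the comment is hypothetical, since no successor version number has been chosen):

```python
import re

# A deployed XML 1.0 processor that checks the declared version will
# correctly refuse a Blueberry document labeled with a new number,
# e.g. a hypothetical version="1.1".
VERSION_RE = re.compile(r'^<\?xml\s+version="([^"]+)"')

def accepts(document: str, supported=("1.0",)) -> bool:
    m = VERSION_RE.match(document)
    version = m.group(1) if m else "1.0"  # a missing declaration implies 1.0
    return version in supported
```

A processor without such a check would instead plow ahead and fail unpredictably at the first NEL or new name character, which is the worse of the two failure modes.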
And the cost of this is very high. At the moment, XML 1.0
is pretty effectively one thing and just one thing, and if
I claim to ship out XML and you claim to be able to read it,
we can usually interoperate, especially since we're both
probably using expat or xerces or msxml. Introducing
Blueberry will impair this admirable simplicity.
A subsidiary issue, by the way: If you add NEL to the
set of line-end characters there are a bunch of other
Unicode space and space-like characters that, to be fair,
you're going to have to consider adding to the production
for "S".
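The scale of that question is easy to see from the production itself: XML 1.0's S is just four characters, and any fair-minded extension would have a queue of Unicode space characters knocking at the door. The candidate set below is illustrative, not a proposal.

```python
# XML 1.0:  S ::= (#x20 | #x9 | #xD | #xA)+
XML10_S = {"\x20", "\x09", "\x0d", "\x0a"}

# Hypothetical extension: NEL plus a few of the many Unicode space
# and space-like characters one would then have to consider
# (LINE SEPARATOR, NO-BREAK SPACE, EM SPACE).
CANDIDATE_S = XML10_S | {"\u0085", "\u2028", "\u00a0", "\u2003"}

def is_s(ch: str, production=XML10_S) -> bool:
    return ch in production
```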
And a possibly minor point: at the moment, all the "syntax"
characters in XML (<, >, /, =, &, ;, [, ], ', and ") are
in the one-byte Unicode range 0-127 which does enable some
sneaky parser construction tricks - probably not a big deal
though.
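For those who haven't seen the trick: because every syntax character is below U+0080, its UTF-8 form is a single byte that can never appear inside a multi-byte sequence, so a parser can hunt for markup in raw bytes without decoding anything. A Python sketch of the idea:

```python
def find_tags(utf8_bytes: bytes):
    """Yield (start, end) byte offsets of tags; no decoding needed,
    even when the content contains multi-byte characters."""
    pos = 0
    while (start := utf8_bytes.find(b"<", pos)) != -1:
        end = utf8_bytes.find(b">", start)
        if end == -1:
            break
        yield start, end + 1
        pos = end + 1
```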
Then another potential problem: if you decide to push XML
past version 1.0, why not take the opportunity to pour
in namespaces? And fix the white-space handling? And...
well, probably nobody is willing to step off this cliff,
so maybe I'm raising a red herring.
The final issue, and I'm not sure whether it's a problem
or an opportunity, is the nature of the relationship between
XML and Unicode. In XML 1.0 1st & 2nd editions, it is clear
that all Unicode characters (except for a few low-valued
control characters, sigh) are legal in XML text, and then
there's this exhaustive enumeration of the characters that
are legal in XML names. This causes problems in two areas:
keeping up with changing versions of Unicode, and
justifying XML's private-label collection of Name characters.
One could just outsource the problem to Unicode and say
"a name character is what Unicode says", but XML 1.0 decided
not to do this, after exhaustive consideration (most of which
I've thankfully forgotten) and I've never heard a really
powerful (either in conviction or logic) argument that this
choice was wrong. There's a coherent but rather ad-hoc
(and non-normative) explanation of the choice of XML name
characters in one of the appendices. I believe John Cowan
has suggested that he has a better algorithm/heuristic; if
so, please share. Having a cleaner, simpler relationship
between XML and Unicode is arguably a good thing.
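For illustration, the "outsource it to Unicode" approach might look something like this (Python; the category sets are my guess at one plausible rule, not John's algorithm and not the XML 1.0 rule, which enumerates its characters exhaustively):

```python
import unicodedata

# Name-start characters: letters and letter-like numbers, plus "_" and ":".
NAME_START_CATS = {"Ll", "Lu", "Lt", "Lo", "Lm", "Nl"}
# Subsequent characters also admit marks, digits, and connectors.
NAME_CATS = NAME_START_CATS | {"Mn", "Mc", "Nd", "Pc"}

def is_unicode_name(s: str) -> bool:
    if not s:
        return False
    cats = [unicodedata.category(c) for c in s]
    if not (cats[0] in NAME_START_CATS or s[0] in "_:"):
        return False
    return all(cat in NAME_CATS or c in "_:.-"
               for c, cat in zip(s[1:], cats[1:]))
```

The appeal is that such a rule tracks Unicode automatically as new scripts are added, instead of freezing an enumeration at whatever version of Unicode the spec happened to cite.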
To summarize: The inability to support standard
mainframe software and certain language groups' characters
in markup, while regrettable, is a problem whose cost is a
judgement call. It is possible and reasonable to compare
this cost with the cost described above of bifurcating XML
3 years into its life (another judgement call), and make a
third judgement call as to the relative magnitude of those
costs.
To cast it in the starkest possible light: Is it a
reasonable trade-off to say that we will live with
an incorrect interpretation of Unicode in certain specific
areas, with the consequences of complicating the lives of
mainframe users and impoverishing the tools available to
worthy users of certain minority languages, to achieve
the benefit of keeping XML monolithic and unitary? Yes,
it's reasonable. I might be convinced that it's wrong,
but it's a reasonable argument that needs to be addressed.
Corollary: it's not enough simply to say "Blueberry is
more correct per Unicode thus we have to do it, end of
debate."
So I think it would be appropriate, in this discussion,
to have some people in the mainframe trenches give us
a briefing on the scale and the difficulty of the problems
they face, and for some of our i18n gurus to highlight
the problems faced by an XML language designer who wants
to use one of the newly-added languages.
On the other side, we should consider the practicalities
and costs of upgrading (or not) the installed base in the
face of the deployment of data encoded in XML Blueberry.
I.e., let's keep this pragmatic.
Pardon the length, I was sitting in SFO with an hour to
kill. -Tim