Blueberry/Unicode/XML
- From: Tim Bray <tbray@textuality.com>
- To: xml-dev@lists.xml.org
- Date: Mon, 09 Jul 2001 21:33:12 -0700
Boy, this one's tough. I buy neither Elliotte's assertion that
changing XML is unthinkable, nor John Cowan's assertion that the
depth of the cultural affront to users of pre-Unicode-3.1
languages is so high as to outweigh consideration of cost.
I just went and reviewed the Blueberry requirements at
http://www.w3.org/TR/xml-blueberry-req and I'm not very comfy
with them. They refer repeatedly and specifically to the
problem as the one posed by Unicode 3.1. The problem isn't
3.1, it's that Unicode is an unfinished standard that
continues to grow actively, whereas it would be nice if
we could declare XML syntax finished and go back to our
plows.
XML 1.0 took a design decision in favor of enumeration of
name characters, simply because the alternative - outsourcing
the problem to the Unicode/ISO10646 process - had two
problems:
(a) We didn't know them well enough to trust them, and
(b) writing a satisfying set of rules for XML name chars
based solely on Unicode metadata is pretty hard.
The force of argument (b) is unabated. (a) seems less of
a worry now simply because the Unicode and XML gangs have
gotten pretty comfy with each other. But I do have a worry
at the back of my mind whether the W3C *institutionally*
ought to trust the consortium *institutionally* with
something of this magnitude. And if ISO and
Unicode stop getting along one of these centuries, whose
side is XML on?
A few weeks ago, I was in favor of leaving it the way it
is, but only by about 55-45. I found the most convincing
argument on the other side was the person who postulated
a Khmer user typing away in emacs and having a disconnect
because there are lots of characters they can use for
people's names but not as attribute names. On the other
hand, this problem is not unique to Khmer - just ask
Mr. O'Hara.
And the notion of having a single monolithic XML whose
interoperability, while not perfect, is pretty $#!%* good,
partially based on those unwieldy character-class
productions, is something that it will hurt to lose. And
it is a reasonable position to say "The markup name character
class snapshot was based on Unicode 2.0, sorry 'bout that."
Realistically, there are 3 options:
1. Leave it the way it is.
2. Do Blueberry and then repeat the process for Unicode 3.2
and 4.0 and so on every couple of years forever.
3. Bite the bullet, write the rules in terms of Unicode
metadata and go to a pure use-by-reference architecture,
probably adding a syntactic signal to reference the
Unicode version number.
I think (3.) will prove to be really hard to do well - and
then the Unicode metadata fields might get changed and screw
it all up. I think (2.) is not unreasonable, but has the
institutional disadvantage that the XML standardization effort
has to become an ongoing process ad infinitum.
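
To make option 3 a little more concrete, here is a minimal Python sketch of
name rules written purely against Unicode metadata. The category sets are the
ones XML 1.0 Appendix B gives as its starting point; the function names are
hypothetical, and the sketch deliberately ignores the exclusions Appendix B
goes on to list (compatibility characters and the like), which is exactly
where the "hard to do well" part lives. It also answers according to whatever
Unicode version the runtime ships, so the same name can be accepted by one
installation and rejected by another.

```python
import unicodedata

# General categories that XML 1.0 Appendix B names as the starting point.
# Assumption: no further exclusions are applied in this sketch.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
NAME_OTHER_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}


def is_name_start(ch: str) -> bool:
    """Hypothetical check: may this character begin an XML name?"""
    return ch in "_:" or unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char(ch: str) -> bool:
    """Hypothetical check: may this character appear after the first one?"""
    return (
        is_name_start(ch)
        or ch in ".-"
        or unicodedata.category(ch) in NAME_OTHER_CATEGORIES
    )


def is_name(s: str) -> bool:
    """True if every character passes the metadata-based rules above."""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])


if __name__ == "__main__":
    # The answers depend on this version, which is why option 3 would
    # probably need a syntactic signal naming the Unicode version.
    print("Unicode data version:", unicodedata.unidata_version)
    for candidate in ["title", "O'Hara", "\u1780"]:
        print(repr(candidate), is_name(candidate))
```

Under these rules a Khmer letter such as U+1780 is accepted as a name, while
Mr. O'Hara is still out of luck, since the apostrophe is punctuation in any
version of the metadata.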
I still go for (1.). My opposition to NEL has hardened,
because of a strong fear that this one will cause real
wreckage on a widespread basis, not just in linguistic
corner cases.
But I really can't see how anyone can get behind any of
these positions and feel entirely comfortable with where
they find themselves standing. I sure don't. -Tim