Re: Blueberry/Unicode/XML
- From: John Cowan <cowan@mercury.ccil.org>
- To: Tim Bray <tbray@textuality.com>
- Date: Tue, 10 Jul 2001 09:10:57 -0400 (EDT)
Tim Bray scripsit:
> The problem isn't
> 3.1, it's that Unicode is an unfinished standard that
> continues to grow actively, whereas it would be nice if
> we could declare XML syntax finished and go back to our
> plows.
It surely would, but that isn't the Real World. You will
note that one of the Requirements is for the Core WG to
consider the future evolution of Unicode.
> XML 1.0 took a design decision in favor of enumeration of
> name characters, simply because the alternative - outsourcing
> the problem to the Unicode/ISO10646 process - had two
> problems:
>
> (a) We didn't know them well enough to trust them, and
> (b) writing a satisfying set of rules for XML name chars
> based solely on Unicode metadata is pretty hard.
>
> The force of argument (b) is unabated.
Actually, it turns out to be pretty easy. The following isn't official,
but it's what I have in mind (and so far nobody has really poked holes
in it):
1. Basic name-start characters are Unicode classes Ll (lower case), Lu (upper
case), Lm (modifier letters), Lo (other letters, including ideographs), and
Nl (a handful of oddballs).
2. Basic name characters are the above plus Mn (non-spacing combining
marks), Mc (Indic vowels and the like), Nd (digits), and Pc (connective
punctuation like KATAKANA MIDDLE DOT).
These two rules constitute the Unicode 3.1 rules for "what is an identifier"
(except that Unicode allows invisible formatting characters that are
also invisible to name matching, a concept that doesn't fit XML),
so already XML and Unicode are in good alignment.
3. Exclude all compatibility characters, and all characters in
the Compatibility Zone (which are mostly, but not entirely, compatibility
characters) except the 12 IBM ideographs that aren't unifiable with
anything else. Unicode rules would leave these in, but only if loose matching
is allowed. With XML's strict name matching, they would just cause
hopeless confusion.
4. Add the XML-specific name-start characters colon and underscore, and
the XML-specific name characters hyphen, dot, and middle dot.
5. Finally, there are 21 characters (18 are name-start) that XML 1.0
included that aren't covered by these rules for a variety of reasons, so just
include them as a fixed list of exceptions. 21 out of 90,000+ isn't bad.
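As a rough illustration of how little machinery rules 1 through 5 need, here is a minimal Python sketch, not anything official. Several things in it are assumptions: it uses whatever Unicode version the running Python's unicodedata module ships (not 3.1), it treats "compatibility character" as "has a tagged compatibility decomposition", it takes the Compatibility Zone to be U+F900 through U+FFFF, and it leaves the rule 5 exception lists empty because the 21 characters are not enumerated here. All function and constant names are made up for the example.

```python
import unicodedata

# Rule 1: general categories allowed at the start of a name.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lm", "Lo", "Nl"}
# Rule 2: additional categories allowed after the first character.
NAME_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}

# Rule 3: the twelve ideographs in the CJK Compatibility Ideographs block
# that are ordinary unified ideographs rather than compatibility characters.
IBM_IDEOGRAPHS = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}

# Rule 4: XML-specific additions.
XML_NAME_START_EXTRAS = {":", "_"}
XML_NAME_EXTRAS = {"-", ".", "\u00B7"}  # hyphen, dot, middle dot

# Rule 5: placeholders only. The real exception lists (21 characters,
# 18 of them name-start) are not given in this message, so they stay empty.
LEGACY_NAME_START = frozenset()
LEGACY_NAME = frozenset()


def is_excluded(ch):
    """Rule 3: compatibility characters and the Compatibility Zone."""
    cp = ord(ch)
    if 0xF900 <= cp <= 0xFFFF:  # assumed extent of the zone
        return cp not in IBM_IDEOGRAPHS
    # Compatibility decompositions carry a tag such as <compat> or <super>;
    # canonical decompositions have no tag and are not excluded.
    return unicodedata.decomposition(ch).startswith("<")


def is_name_start(ch):
    if ch in XML_NAME_START_EXTRAS or ch in LEGACY_NAME_START:
        return True
    return (unicodedata.category(ch) in NAME_START_CATEGORIES
            and not is_excluded(ch))


def is_name_char(ch):
    if is_name_start(ch) or ch in XML_NAME_EXTRAS or ch in LEGACY_NAME:
        return True
    return (unicodedata.category(ch) in NAME_EXTRA_CATEGORIES
            and not is_excluded(ch))


def is_name(s):
    """True if s is non-empty, starts with a name-start character,
    and continues with name characters only."""
    return (bool(s) and is_name_start(s[0])
            and all(is_name_char(c) for c in s[1:]))
```

For example, is_name("foo:bar-1.2") holds while is_name("1abc") does not, since a digit is a name character but not a name-start character. Python also exposes a frozen 3.2.0 database as unicodedata.ucd_3_2_0 with the same methods, which would be closer to the data these rules were written against.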
> And what happens if ISO and
> Unicode stop getting along one of these centuries, whose
> side is XML on?
Sooner the moon will fall from heaven!
> 1. Leave it the way it is.
> 2. Do Blueberry and then repeat the process for Unicode 3.2
> and 4.0 and so on every couple of years forever.
One thing to say about this is that the list of characters to be added
is shrinking all the time. Unicode 3.2 will add only 139 name
characters, of which fewer than 20 are actually used by modern
scripts. If we add another rule:
6. Omit all characters from archaic scripts, as they have
no native users any more.
then the next change will be scarcely a ripple, affecting IIRC
only Ainu (a minority language of Japan that uses additional
katakana).
> I think (3.) will prove to be really hard to do well - and
> then the Unicode metadata fields might get changed and screw
> it all up.
Unicode has come a long way toward stabilizing the relevant
categories.
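Anyone who wants to test that claim against data can do so mechanically. The following is only a sketch under assumptions: it relies on Python's unicodedata module, which carries a frozen Unicode 3.2.0 database (unicodedata.ucd_3_2_0) alongside its current one, and it buckets categories the way rules 1 and 2 do. The helper names are invented for the example. It lists every character assigned in 3.2.0 whose name-start/name/other status differs in the newer database.

```python
import unicodedata

NAME_START = {"Ll", "Lu", "Lm", "Lo", "Nl"}   # rule 1 categories
NAME_ONLY = {"Mn", "Mc", "Nd", "Pc"}          # rule 2 additions


def name_class(category):
    """Bucket a general category as 'start', 'name', or 'other'."""
    if category in NAME_START:
        return "start"
    if category in NAME_ONLY:
        return "name"
    return "other"


def reclassified(old=unicodedata.ucd_3_2_0, new=unicodedata):
    """Return (code point, old class, new class) for every character
    assigned in the old database whose class differs in the new one."""
    changed = []
    for cp in range(0x110000):
        ch = chr(cp)
        old_cat = old.category(ch)
        if old_cat == "Cn":      # unassigned in the old database, so skip
            continue
        before = name_class(old_cat)
        after = name_class(new.category(ch))
        if before != after:
            changed.append((cp, before, after))
    return changed
```

Calling reclassified() and printing each entry with unicodedata.name() gives the characters a category-based rule would have treated differently over time. This looks only at general categories, so it says nothing about rule 3 or about characters added after 3.2.0.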
> But I really can't see how anyone can get behind any of
> these positions and feel entirely comfortable with where
> they find themselves standing. I sure don't. -Tim
a) Slippery slopes can get to be a habit, I guess.
b) It's a dirty job, but someone's got to do it.
--
John Cowan cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter