There is a lot of confusion arising from sentences that use "character"
without clarifying whether it is the glyph, collatable unit, Unicode code,
or UTF-* code that is meant. The confusion is one of wrong expectations.
You are right that a Java Character is a UTF-16 code. But making
Java Characters into 24- or 32-bit codes would still not make them
characters in the plain English sense (which is closest to "collatable
units"). A combining umlaut is not really a character, for example;
radicals are not independent characters, though they may have codepoints.
So, paradoxically, an API that handles real characters properly
probably never has arguments or return results of Character
(or something that is 8, 16, 24, or 32 bits) but instead uses String
(and its variants).
One reason I like normalization is that it removes as many combining
character sequences as possible, which brings the Java Character closer
to the collatable unit. So surrogates may add a level of handling for
Java Characters, but they don't add any more complexity for Java
characters (taken in the plain English/collatable unit sense).
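To make the normalization point concrete, here is a minimal sketch. It assumes a JDK that ships java.text.Normalizer (Java 6 or later, so newer than the platform discussed here); the class name NormalizeSketch and the helper nfc are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of any library.
class NormalizeSketch {
    // Compose combining sequences into precomposed characters (NFC)
    // wherever Unicode defines a precomposed form.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String uUmlaut = "u\u0308"; // 'u' + COMBINING DIAERESIS: two Characters
        String qUmlaut = "q\u0308"; // no precomposed form exists for this one

        // Composes to U+00FC, so one Character now matches one collatable unit.
        System.out.println(nfc(uUmlaut).length()); // prints 1

        // Stays a two-Character sequence: normalization removes as many
        // combining sequences as possible, not all of them.
        System.out.println(nfc(qUmlaut).length()); // prints 2
    }
}
```

The second case is why the claim is only "as many as possible": code that needs real characters still has to cope with sequences after normalizing.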
The semantics of String and Character in the Java documentation
may indeed need to be updated now that Unicode goes beyond the BMP
(reminiscent of the banks of the Rhine overflowing in Wagner).
Probably the Java documentation should be proofread to check that
"character" is never used when "Character" is meant. But it is not
that the length of a String is no longer reliably the number of characters:
it never was--it is the number of Characters, to labour my point.
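The three possible meanings of "length" can be put side by side in a small sketch. It assumes a later JDK than the one under discussion (String.codePointCount arrived in Java 5), and it uses java.text.BreakIterator only as an approximation of "collatable unit"; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration class; not part of any library.
class LengthSketch {
    // Number of Java Characters, i.e. UTF-16 code units.
    static int javaCharacters(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Rough count of plain-English characters: character boundaries as
    // seen by BreakIterator, which keeps a base letter together with
    // its combining marks.
    static int userCharacters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String combining = "u\u0308";       // 'u' + COMBINING DIAERESIS
        String beyondBmp = "\uD835\uDD04";  // U+1D504, one surrogate pair

        System.out.println(javaCharacters(combining)); // prints 2
        System.out.println(userCharacters(combining)); // prints 1
        System.out.println(javaCharacters(beyondBmp)); // prints 2
        System.out.println(codePoints(beyondBmp));     // prints 1
    }
}
```

Note that the combining case already disagrees with String.length() without leaving the BMP, which is the sense in which the length never was the number of characters.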
Anyway, my point was that Java, one of the leaders of the pack, is still
catching up, not that it has arrived at Unicode 4.0. Hence as long
as XML 1.1 is out there in PR for any early implementers who need it,
the critical path for getting better Unicode 3.2+ support in XML is
not an XML 1.1 REC (nor even IRIs) but API/platform infrastructure.
That is why I suggested that XML 1.1 is not urgent.
I think Tim read my "only just reaching" as "has already reached",
which is not what I intended to say.