Re: Java/Unicode brain damage
- From: Tim Bray <email@example.com>
- To: firstname.lastname@example.org
- Date: Thu, 26 Jul 2001 21:11:08 -0700
At 08:24 PM 26/07/01 -0700, David Brownell wrote:
>> A Java 'char' is a 16 bit data type, so it simply isn't possible for
>> it to directly represent a Unicode character.
>Could you elaborate? There's a section in my Unicode book
>(in another city :) that talks about surrogates. There's a sense
>in which "if it's listed there, it's a kind of character".
>The word "character" is heavily overloaded, but I think it's
>clear that in at least one sense a Java "char" _is_ what folk
>call a "character". That's just how the word is used, even
>if it's arguably sloppy usage for other contexts.
>It would likely be instructive to have someone explain
>the senses in which "char" is, and isn't, a character.
It is clear that a Java "char" (hereinafter jchar) cannot
represent an XML character (xchar), simply because a jchar
can be in the surrogate range and an XML character can't;
also because a jchar can't represent a value outside of
the BMP, but such values are legal xchars.
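A minimal Java sketch of those two mismatches follows. The class and method names are made up for illustration, and the range test is transcribed from the Char production in XML 1.0; treat it as a sketch, not a validated XML name checker.

```java
// Illustrative only: XmlCharCheck and isXmlChar are hypothetical names.
class XmlCharCheck {
    // True if the code point falls in the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // A perfectly legal jchar that sits in the surrogate range.
        char lone = '\uD800';
        System.out.println(isXmlChar(lone));        // false: not an xchar

        // A perfectly legal xchar that lies outside the BMP.
        int beyondBmp = 0x10000;
        System.out.println(isXmlChar(beyondBmp));   // true
        // It is larger than the biggest value a 16-bit jchar can hold.
        System.out.println(beyondBmp > Character.MAX_VALUE); // true
    }
}
```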
As for combiners and so on, XML and Java agree that
COMBINING ACUTE ACCENT and so on are characters - yes,
there's a problem in that there are multiple ways to
represent things that will render identically, that's
why the W3C published a canonical character composition
I think it's clear that a jchar can represent a UTF-16
encoding unit, but Java currently doesn't know about
the semantics associated with surrogates, i.e. they
have to appear in pairs which represent non-BMP chars.
I think I still believe that a jchar is really trying
to represent UCS-2.
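For reference, the pairing rule can be sketched by hand in a few lines of Java, using only arithmetic on jchars. The names are hypothetical, and the sketch assumes its input is a valid code point outside the surrogate range; it does no validation.

```java
// Illustrative only: SurrogatePairs, encode and decode are hypothetical names.
class SurrogatePairs {
    // Encode one code point as UTF-16: one unit inside the BMP, two outside.
    // Assumes cp is a valid, non-surrogate code point (no checking is done).
    static char[] encode(int cp) {
        if (cp < 0x10000) {
            return new char[] { (char) cp };
        }
        int v = cp - 0x10000;                    // 20 significant bits remain
        char high = (char) (0xD800 + (v >> 10)); // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombine a high/low surrogate pair into the code point it stands for.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] units = encode(0x1D11E);          // MUSICAL SYMBOL G CLEF
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D834 DD1E
        System.out.printf("%X%n", decode(units[0], units[1]));           // 1D11E
    }
}
```

Neither half of the pair means anything by itself, which is the sense in which a jchar holds an encoding unit rather than a character.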
>ISO-10646 code points
>are (as I understand) not necessarily going to be able
>to represent a "character" either (32 bits v. 16).
Well, an xchar is by definition a Unicode/ISO10646 code
point (hereinafter uchar). Yes, there are things that
a typographer would consider a "character" that can't be
represented in a single xchar or uchar. But damn few,
actually; there are uchars for pretty well anything
you're apt to encounter outside the domain of bleeding-
edge math research.
The worrying thing is that for 99.9999999999% of all
real-world XML processing, if you pretend that a jchar
represents an xchar, you won't get in any trouble. So
I bet there's a huge amount of Java code out there right
now that makes this assumption. I don't think we have
much understanding now as to what flavor of breakage is
apt to occur when (if) non-BMP data starts flowing
through such code. -Tim
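One plausible flavor of that breakage, sketched in Java. The class and method names are hypothetical. Note that Character.toChars and String.codePointCount were only added in Java 5, so they postdate this message; they are used here purely to build and measure the test string.

```java
// Illustrative only: NaiveCharCode and its methods are hypothetical names.
class NaiveCharCode {
    // Counts "characters" the way jchar == xchar code does: one per jchar.
    static int naiveLength(String s) {
        return s.length();
    }

    // Counts code points, treating a surrogate pair as one character.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Reverses jchar by jchar, which swaps the halves of any surrogate pair.
    static String naiveReverse(String s) {
        char[] in = s.toCharArray();
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E, a non-BMP character.
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(naiveLength(s));      // 3
        System.out.println(codePointLength(s));  // 2

        // After the naive reversal the string starts with a low surrogate,
        // so it is no longer well-formed UTF-16.
        String r = naiveReverse(s);
        System.out.println(Character.isLowSurrogate(r.charAt(0))); // true
    }
}
```

With BMP-only input both counts agree and the reversal is harmless, which is why code like this can run for years without anyone noticing.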