RE: Java/Unicode brain damage
- From: Miles Sabin <firstname.lastname@example.org>
- To: email@example.com
- Date: Fri, 27 Jul 2001 09:17:34 +0100
David Brownell wrote,
> Miles Sabin wrote,
> > A Java 'char' is a 16 bit data type, so it simply isn't possible
> > for it to directly represent a Unicode character.
> Could you elaborate?
[I'll use Tim's 'jchar' and 'uchar']
Tim's and John's replies are exactly right as far as a single jchar is
concerned: a single jchar in isolation can't represent uchars outside
the BMP, and it can represent non-uchars (e.g. surrogate values).
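
To make the single-jchar point concrete, here is a minimal sketch. It
assumes the java.lang.Character helpers (Character.SIZE, charCount,
isHighSurrogate, isLowSurrogate) that were only added in later Java
releases; the class and method names are made up for illustration.

```java
final class SingleJchar {
    /** True if the uchar (code point) fits in one 16 bit jchar, i.e. lies in the BMP. */
    static boolean fitsInOneJchar(int codePoint) {
        return Character.charCount(codePoint) == 1;
    }

    /** True if this jchar value is a surrogate, i.e. not a uchar on its own. */
    static boolean isSurrogateValue(char c) {
        return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);                    // 16
        System.out.println(fitsInOneJchar(0x00E9));            // true  (U+00E9, in the BMP)
        System.out.println(fitsInOneJchar(0x1D11E));           // false (U+1D11E, outside the BMP)
        System.out.println(isSurrogateValue((char) 0xD834));   // true  (legal jchar, not a uchar)
    }
}
```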
But of course jchars often don't appear in isolation. In char[]s and
in java.lang.Strings they appear in sequences, and in those cases
pairs of adjacent jchars can represent non-BMP uchars. Pairs of jchars
can also represent all sorts of other nonsense too, but that's not
necessarily a problem unless you absolutely insist that semantic
constraints be enforced programmatically.
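
A rough sketch of both halves of that: a pair of jchars standing for
one non-BMP uchar, and a perfectly legal String that holds nonsense.
It assumes Character.toChars and String.codePointCount, which arrived
in later Java releases; isWellFormed is a hypothetical helper, not a
library method.

```java
final class JcharPairs {
    /** True if every surrogate in s belongs to a correctly ordered high/low pair. */
    static boolean isWellFormed(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate with no low surrogate after it
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no high surrogate before it
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+1D11E is outside the BMP, so it needs two jchars: 0xD834 0xDD1E.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 jchars
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 uchar

        // The same two jchars in the wrong order: the String is accepted anyway.
        String nonsense = new String(new char[] {(char) 0xDD1E, (char) 0xD834});
        System.out.println(isWellFormed(clef));      // true
        System.out.println(isWellFormed(nonsense));  // false
    }
}
```

Nothing in String or char[] runs a check like this for you; if you
want the constraint enforced, you have to write it yourself.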
> The word "character" is heavily overloaded, but I think it's clear
> that in at least one sense a Java "char" _is_ what folk call a
> "character". That's just how the word is used, even if it's
> arguably sloppy usage for other contexts.
> It would likely be instructive to have someone explain the senses in
> which "char" is, and isn't, a character.
I don't think that can be done. A jchar is a 16 bit unsigned scalar.
Its association with a uchar is pretty much conventional, although
that association is almost always made. There's no way of telling from
just the syntax of a Java program whether or not a jchar (or jbyte, or
jint, or anything else for that matter) is or isn't being used to
represent a uchar. To tell that you have to know what the program
So I think it boils down to this: a jchar is a 16 bit unsigned scalar
which is typically appropriate for representing a BMP uchar; and jchar
sequences are typically appropriate for representing uchar sequences.
With the proviso that some jchars (resp. jchar sequences) don't
represent legal uchars (legal uchar sequences).
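
As an illustration only, that summary could be sketched like this: a
jchar behaves as a plain unsigned 16 bit number, and a jchar sequence
either decodes to a uchar sequence or it doesn't. The sketch assumes
Character.codePointAt and Character.charCount from later Java releases;
toUchars is a made-up name, and throwing on an unpaired surrogate is
just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;

final class JcharDecoding {
    /**
     * Decodes a jchar sequence into uchars (code points).
     * Throws IllegalArgumentException if it meets an unpaired surrogate.
     */
    static List<Integer> toUchars(char[] jchars) {
        List<Integer> uchars = new ArrayList<>();
        int i = 0;
        while (i < jchars.length) {
            // Combines a valid high/low pair; otherwise returns the jchar itself.
            int cp = Character.codePointAt(jchars, i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                throw new IllegalArgumentException("unpaired surrogate at index " + i);
            }
            uchars.add(cp);
            i += Character.charCount(cp); // advance by 1 or 2 jchars
        }
        return uchars;
    }

    public static void main(String[] args) {
        // A jchar is just an unsigned 16 bit number: it wraps like one.
        char c = Character.MAX_VALUE;
        c++;
        System.out.println((int) c); // 0

        // Three jchars, two uchars: U+0061 followed by U+1D11E.
        char[] jchars = {'a', (char) 0xD834, (char) 0xDD1E};
        System.out.println(toUchars(jchars).size()); // 2
    }
}
```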
Oh, I guess I should point out that the above is my view, and doesn't
necessarily represent that of the JSR 51 EG (or anyone else, for that
matter).
Miles Sabin InterX
Internet Systems Architect 27 Great West Road
+44 (0)20 8817 4030 Middx, TW8 9AS, UK