RE: Java/Unicode brain damage
- From: Miles Sabin <msabin@interx.com>
- To: xml-dev@lists.xml.org
- Date: Fri, 27 Jul 2001 09:17:34 +0100
David Brownell wrote,
> Miles Sabin wrote,
> > A Java 'char' is a 16 bit data type, so it simply isn't possible
> > for it to directly represent a Unicode character.
>
> Could you elaborate?
[I'll use Tim's 'jchar' and 'uchar']
Tim's and John's replies are exactly right as far as a single jchar is
concerned: a single jchar in isolation can't represent uchars outside
the BMP, and it can represent non-uchars (e.g. surrogate values).
But of course jchars often don't appear in isolation. In char[]s and
in java.lang.Strings they appear in sequences, and in those cases
pairs of adjacent jchars can represent non-BMP uchars. Pairs of jchars
can also represent all sorts of other nonsense too, but that's not
necessarily a problem unless you absolutely insist that semantic
constraints be enforced programmatically.
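
A minimal sketch of how a pair of adjacent jchars can carry one non-BMP
uchar, assuming the standard UTF-16 surrogate arithmetic. The class and
method names are hypothetical, chosen only for illustration.

```java
// Hypothetical helper: maps one non-BMP uchar to two jchars and back,
// using the UTF-16 surrogate ranges D800..DBFF (high) and DC00..DFFF (low).
final class JcharPairSketch {

    // Split a code point in U+10000..U+10FFFF into a surrogate pair.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a non-BMP code point");
        }
        int v = codePoint - 0x10000;               // 20 significant bits
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombine a high/low surrogate pair into the code point it stands for.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1D11E);  // MUSICAL SYMBOL G CLEF
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        System.out.printf("%X%n", decode(pair[0], pair[1]));            // 1D11E
    }
}
```

Neither jchar of the pair is a uchar on its own; only the pair, read
in order, stands for U+1D11E.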
> The word "character" is heavily overloaded, but I think it's clear
> that in at least one sense a Java "char" _is_ what folk call a
> "character". That's just how the word is used, even if it's
> arguably sloppy usage for other contexts.
>
> It would likely be instructive to have someone explain the senses in
> which "char" is, and isn't, a character.
I don't think that can be done. A jchar is a 16 bit unsigned scalar.
Its association with a uchar is pretty much conventional, although
that association is almost always made. There's no way of telling from
just the syntax of a Java program whether or not a jchar (or jbyte, or
jint, or anything else for that matter) is or isn't being used to
represent a uchar. To tell that you have to know what the program
means.
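
A small sketch of the "16 bit unsigned scalar" reading: the code below
uses a jchar purely as a number, and nothing in its syntax says
whether a uchar is intended. The class and method names are
hypothetical.

```java
// Hypothetical illustration: a Java char behaving as a plain unsigned
// 16-bit integer, with no character semantics involved.
final class JcharAsScalar {

    // Widening a char to int never yields a negative value.
    static int widen(char c) {
        return c;
    }

    // Incrementing wraps modulo 2^16, like any unsigned 16-bit counter.
    static char next(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        char max = Character.MAX_VALUE;        // 0xFFFF
        System.out.println(widen(max));        // 65535
        System.out.println(widen(next(max)));  // 0
    }
}
```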
So I think it boils down to this: a jchar is a 16 bit unsigned scalar
which is typically appropriate for representing a BMP uchar; and jchar
sequences are typically appropriate for representing uchar sequences,
with the proviso that some jchars (resp. jchar sequences) don't
represent legal uchars (resp. legal uchar sequences).
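
As a rough sketch of that proviso, the check below tests whether a
jchar sequence consists only of BMP values and correctly ordered
surrogate pairs. It assumes that unpaired surrogates are the only
illegality of interest and ignores other cases (e.g. noncharacters
such as U+FFFE). The class and method names are hypothetical.

```java
// Hypothetical checker: does a jchar sequence decode to a uchar sequence
// without any unpaired or misordered surrogates?
final class JcharSequenceCheck {

    static boolean isHigh(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    static boolean isLow(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    static boolean isWellFormed(char[] s) {
        for (int i = 0; i < s.length; i++) {
            if (isHigh(s[i])) {
                // A high surrogate must be followed immediately by a low one.
                if (i + 1 >= s.length || !isLow(s[i + 1])) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (isLow(s[i])) {
                // A low surrogate with no preceding high surrogate.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(new char[] { 'a', 0xD834, 0xDD1E })); // true
        System.out.println(isWellFormed(new char[] { 0xDD1E, 0xD834 }));      // false
    }
}
```

Java itself accepts both arrays without complaint; the constraint is
one of meaning and is enforced only if a program chooses to check it.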
Oh, I guess I should point out that the above is my view, and doesn't
necessarily represent that of the JSR 51 EG (or anyone else, for that
matter ;-)
Cheers,
Miles
--
Miles Sabin, Internet Systems Architect
InterX, 27 Great West Road, Middx, TW8 9AS, UK
+44 (0)20 8817 4030
msabin@interx.com  http://www.interx.com/