RE: Java/Unicode brain damage

David Brownell wrote,
> Miles Sabin wrote,
> > A Java 'char' is a 16 bit data type, so it simply isn't possible 
> > for it to directly represent a Unicode character. 
> Could you elaborate?

[I'll use Tim's 'jchar' and 'uchar']

Tim's and John's replies are exactly right as far as a single jchar is 
concerned: a single jchar in isolation can't represent uchars outside 
the BMP, and it can represent non-uchars (e.g. surrogate values).

But of course jchars often don't appear in isolation. In char[]s and
in java.lang.Strings they appear in sequences, and in those cases
pairs of adjacent jchars can represent non-BMP uchars. Pairs of jchars
can also represent all sorts of other nonsense, but that's not
necessarily a problem unless you absolutely insist that semantic
constraints be enforced programmatically.
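
To make the pairing concrete, here is a minimal sketch of the surrogate
arithmetic, assuming the standard UTF-16 ranges (high surrogates
0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF). The class and method names
are hypothetical, made up for illustration, and nothing here checks that
its inputs are actually in range.

```java
// Hypothetical helper, for illustration only: no range checking is done.
final class SurrogatePairs {
    // Combine a high and a low surrogate jchar into one non-BMP uchar value.
    public static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Split a non-BMP uchar value (0x10000..0x10FFFF) into two jchars.
    public static char[] split(int uchar) {
        int v = uchar - 0x10000;               // 20 bits remain
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = split(0x1D11E);          // a uchar outside the BMP
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        System.out.printf("%X%n", combine(pair[0], pair[1]));
    }
}
```

Note that combine() will happily produce a number from any two jchars,
which is exactly the "other nonsense" case: the types don't stop you.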

> The word "character" is heavily overloaded, but I think it's clear 
> that in at least one sense a Java "char" _is_ what folk call a 
> "character".  That's just how the word is used, even if it's 
> arguably sloppy usage for other contexts.
> It would likely be instructive to have someone explain the senses in 
> which "char" is, and isn't, a character.

I don't think that can be done. A jchar is a 16 bit unsigned scalar.
Its association with a uchar is pretty much conventional, although
that association is almost always made. There's no way of telling from 
just the syntax of a Java program whether or not a jchar (or jbyte, or 
jint, or anything else for that matter) is or isn't being used to
represent a uchar. To tell that you have to know what the program

So I think it boils down to this: a jchar is a 16 bit unsigned scalar 
which is typically appropriate for representing a BMP uchar; and jchar 
sequences are typically appropriate for representing uchar sequences, 
with the proviso that some jchars (resp. jchar sequences) don't
represent legal uchars (resp. legal uchar sequences).
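
If you did want to enforce part of that proviso programmatically, a
check might look roughly like the sketch below. It only tests surrogate
pairing (every high surrogate immediately followed by a low one, and no
stray low surrogates); it makes no attempt to reject other values that
aren't legal uchars. The class and method names are hypothetical.

```java
// Hypothetical checker: tests surrogate pairing only, nothing else.
final class WellFormedCheck {
    public static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    public static boolean isLowSurrogate(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    // True if every surrogate in the jchar sequence is part of a
    // correctly ordered high/low pair.
    public static boolean isWellFormed(char[] s) {
        for (int i = 0; i < s.length; i++) {
            char c = s[i];
            if (isHighSurrogate(c)) {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 >= s.length || !isLowSurrogate(s[i + 1])) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate.
                return false;
            }
        }
        return true;
    }
}
```

For example, the sequence { 0xD834, 0xDD1E } passes, while a lone 0xD834
or the reversed sequence { 0xDD1E, 0xD834 } does not.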

Oh, I guess I should point out that the above is my view, and doesn't
necessarily represent that of the JSR 51 EG (or anyone else, for that
matter ;-)



Miles Sabin                                     InterX
Internet Systems Architect                      27 Great West Road
+44 (0)20 8817 4030                             Middx, TW8 9AS, UK
msabin@interx.com                               http://www.interx.com/