
Re: Java/Unicode brain damage

At 08:24 PM 26/07/01 -0700, David Brownell wrote:
>> A Java 'char' is a 16 bit data type, so it simply isn't possible for
>> it to directly represent a Unicode character. 
>Could you elaborate?  There's a section in my Unicode book
>(in another city :) that talks about surrogates.  There's a sense
>in which "if it's listed there, it's a kind of character".
>The word "character" is heavily overloaded, but I think it's
>clear that in at least one sense a Java "char" _is_ what folk
>call a "character".  That's just how the word is used, even
>if it's arguably sloppy usage for other contexts.
>It would likely be instructive to have someone explain
>the senses in which "char" is, and isn't, a character.

It is clear that a Java "char" (hereinafter jchar) cannot
represent an XML character (xchar), simply because a jchar
can be in the surrogate range and an XML character can't;
also because a jchar can't represent a value outside of
the BMP, but such values are legal xchars.
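
To make the mismatch concrete, here is a minimal sketch in Java. The
class and method names (XmlCharCheck, isXmlChar) are made up for
illustration; the ranges are the ones in the XML 1.0 "Char"
production. It shows a jchar value that is not an xchar, and an
xchar value that does not fit in a jchar.

```java
final class XmlCharCheck {
    // Hypothetical helper: true if cp matches the XML 1.0 Char production,
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // A jchar may hold a lone surrogate, which is never an xchar.
        char loneSurrogate = (char) 0xD800;
        System.out.println(isXmlChar(loneSurrogate));      // prints false

        // A non-BMP value is a legal xchar but exceeds the 16-bit jchar range.
        int nonBmp = 0x10400;
        System.out.println(isXmlChar(nonBmp));             // prints true
        System.out.println(nonBmp > Character.MAX_VALUE);  // prints true
    }
}
```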

As for combiners and so on, XML and Java agree that 
COMBINING ACUTE ACCENT and so on are characters - yes,
there's a problem in that there are multiple ways to
represent things that will render identically, that's
why the W3C published a canonical character composition

I think it's clear that a jchar can represent a UTF-16
code unit, but Java currently doesn't know about
the semantics associated with surrogates, i.e. that they
have to appear in pairs, each pair representing one
non-BMP char.  I think I still believe that a jchar is
really trying to represent UCS-2.
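
The pairing rule is just arithmetic, and code that wants to honor it
has to do that arithmetic itself. The following is a rough sketch,
not anything Java provides: the names SurrogateMath, toCodePoint and
toSurrogates are invented here, and the constants are the ones from
the UTF-16 definition.

```java
final class SurrogateMath {
    // Combine a high (D800-DBFF) and low (DC00-DFFF) surrogate into one code point.
    static int toCodePoint(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a well-formed surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Split a non-BMP code point (10000-10FFFF) into its two UTF-16 code units.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a non-BMP code point");
        }
        int offset = cp - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10400);
        // prints "d801 dc00"
        System.out.println(Integer.toHexString(pair[0]) + " " + Integer.toHexString(pair[1]));
        // prints "10400"
        System.out.println(Integer.toHexString(toCodePoint(pair[0], pair[1])));
    }
}
```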

>ISO-10646 code points
>are (as I understand) not necessarily going to be able
>to represent a "character" either (32 bits v. 16).

Well, an xchar is by definition a Unicode/ISO 10646 code
point (hereinafter uchar).  Yes, there are things that
a typographer would consider a "character" that can't be
represented in a single xchar or uchar.  But damn few,
actually; there are uchars for pretty well anything
you're apt to encounter outside the domain of
bleeding-edge math research.

The worrying thing is that for 99.9999999999% of all
real-world XML processing, if you pretend that a jchar
represents an xchar, you won't get into any trouble.  So
I bet there's a huge amount of Java code out there right
now that makes this assumption.  I don't think we have
much understanding now as to what flavor of breakage is
apt to occur when (if) non-BMP data starts flowing
through such code.  -Tim
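
One plausible flavor of that breakage, sketched in Java. This assumes
a Java release that has the code point methods (Character.toChars,
String.codePointCount, Character.isHighSurrogate), which arrived in
Java 5; the class and helper names (NonBmpBreakage, naiveReverse,
isWellFormed) are hypothetical. Code that treats each jchar as one
character miscounts the string and, when it reorders jchars, produces
ill-formed UTF-16.

```java
final class NonBmpBreakage {
    // Reverses jchar by jchar, i.e. assumes one jchar == one character.
    static String naiveReverse(String s) {
        char[] in = s.toCharArray();
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return new String(out);
    }

    // True if every surrogate in s is part of a high+low pair.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return false; // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "a", then U+10400 (one non-BMP character, two jchars), then "b"
        String s = "a" + new String(Character.toChars(0x10400)) + "b";
        System.out.println(s.length());                       // prints 4
        System.out.println(s.codePointCount(0, s.length()));  // prints 3
        System.out.println(isWellFormed(s));                  // prints true
        System.out.println(isWellFormed(naiveReverse(s)));    // prints false
    }
}
```

The reversed string is no longer legal input for an XML serializer,
since its two surrogates are now in the wrong order.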