From: "Tim Bray" <firstname.lastname@example.org>
> The point about doing strings, not characters, is well-taken, and one of
> the things in the W3C i18n draft that gave me an "aha" moment. On the
> other hand, I think that when I say a "Unicode character", that has a
> very well-defined semantic, and COMBINING UMLAUT is one while codepoints
> from the surrogate blocks aren't, and any API that doesn't make that
> clear is, well, wrong. Put another way, something that is a Unicode
> character in UTF-16 should also be a character in UTF-8 and UTF-32,
> which the surrogates aren't, so they are just not characters in any
> meaningful sense of the word.
I'm puzzled. What is the "aha" moment here? Your point seems to be that Java
char != Unicode character. True. But exactly the same holds for UTF-8:
octet != Unicode character. The fact that half a surrogate pair is not a
Unicode character doesn't seem like breaking news.
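To make that concrete, here is a minimal Java sketch, assuming a reasonably
recent JDK (U+1D11E MUSICAL SYMBOL G CLEF is just an arbitrary example of a
character outside the BMP): one Unicode character, two Java chars, four UTF-8
octets, and neither half of the surrogate pair is a character by itself.

    import java.nio.charset.StandardCharsets;

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E lies outside the BMP, so a Java String stores it
            // as a UTF-16 surrogate pair: two char values.
            String clef = "\uD834\uDD1E";

            System.out.println(clef.length());                         // 2 code units
            System.out.println(clef.codePointCount(0, clef.length())); // 1 character

            // Neither half of the pair is a Unicode character by itself.
            System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true

            // The same single character is four octets in UTF-8.
            byte[] utf8 = clef.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length);                            // 4
        }
    }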
Do you mean to say that the use of the UTF-16 character encoding in a
programming language is broken as designed? In the perfect language of your
own design, would you have the "char" type be 32 bits? Is that what this is
all about?
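If the answer is yes, it's worth noting what that 32-bit view looks like in
Java as it already stands: the code point methods on String let you walk a
string one character at a time rather than one code unit at a time. A small
sketch, assuming a JDK with the Java 5 code point APIs (the string contents
here are just an example):

    public class CodePointDemo {
        public static void main(String[] args) {
            String s = "G\uD834\uDD1E";  // 'G' followed by U+1D11E

            System.out.println(s.length());  // 3 sixteen-bit code units

            // Walk the string one 32-bit code point at a time,
            // effectively the "char is 32 bits" view: 2 characters.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);
            }
        }
    }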