From: "Tim Bray" <firstname.lastname@example.org>
> The point about doing strings, not characters, is well-taken, and one of
> the things in the W3C i18n draft that gave me an "aha" moment. On the
> other hand, I think that when I say a "Unicode character", that has a
> very well-defined semantic, and COMBINING UMLAUT is one while codepoints
> from the surrogate blocks aren't, and any API that doesn't make that
> clear is, well, wrong. Put another way, something that is a Unicode
> character in UTF-16 should also be a character in UTF-8 and UTF-32,
> which the surrogates aren't, so they are just not characters in any
> meaningful sense of the word.
I'm puzzled. What is the "aha" moment here? Your point seems to be that Java
char != Unicode character. True. But exactly the same holds for UTF-8:
octet != Unicode character. The fact that half a surrogate pair is not a
Unicode character doesn't seem like breaking news.
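To make that concrete, here is a minimal Java sketch, assuming a reasonably
recent JDK (U+1D11E MUSICAL SYMBOL G CLEF is just an arbitrary example of a
character outside the BMP): one Unicode character, two Java chars, four UTF-8
octets, and neither half of the surrogate pair is a character by itself.

    import java.nio.charset.StandardCharsets;

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E lies outside the BMP, so a Java String stores it
            // as a UTF-16 surrogate pair: two char values.
            String clef = "\uD834\uDD1E";

            System.out.println(clef.length());                         // 2 code units
            System.out.println(clef.codePointCount(0, clef.length())); // 1 character

            // Neither half of the pair is a Unicode character by itself.
            System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true

            // The same single character is four octets in UTF-8.
            byte[] utf8 = clef.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length);                            // 4
        }
    }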
Do you mean to say that the use of the UTF-16 character encoding in a
programming language is broken as designed? In the perfect language of your
own design, would you have the "char" type be 32 bits? Is that what this is
all about?
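If the answer is yes, it's worth noting what that 32-bit view looks like in
Java as it already stands: the code point methods on String let you walk a
string one character at a time rather than one code unit at a time. A small
sketch, assuming a JDK with the Java 5 code point APIs (the string contents
here are just an example):

    public class CodePointDemo {
        public static void main(String[] args) {
            String s = "G\uD834\uDD1E";  // 'G' followed by U+1D11E

            System.out.println(s.length());  // 3 sixteen-bit code units

            // Walk the string one 32-bit code point at a time,
            // effectively the "char is 32 bits" view: 2 characters.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);
            }
        }
    }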