Tim Bray wrote:
> Really? I just looked at a recent set of Java docs, and it's pretty
> clear that a Java char isn't really a character, it's a UTF-16
> codepoint, and the semantics of String are wrong for non-BMP characters,
> and that the attempt at UTF-8 support remains pretty laughably
> nonstandard and wrong. I'd be *delighted* to hear that I'm looking at
> wrong/obsolete docs. Pointers anyone? -Tim
It's true that Java chars are UTF-16 code units; changing that would be
nothing less than revolutionary. I don't understand what's wrong with
the semantics of String, unless you mean that it's indexed by UTF-16
code unit rather than by character, and the two coincide for the 99% of
text that stays within the BMP.
ICU/J provides correct-but-less-efficient indexing for when you need it.
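To make the indexing point concrete, here is a minimal sketch. It assumes a
Java release whose String has the code point methods (codePointCount,
codePointAt), which may be newer than the docs under discussion; the class
and helper names are made up for illustration.

```java
public class CodeUnitDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF, a non-BMP character written as its surrogate pair.
    static final String CLEF = "\uD834\uDD1E";

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (assumes the String code point API is available).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a" + CLEF + "b";
        System.out.println(codeUnits(s));                          // prints 4
        System.out.println(codePoints(s));                         // prints 3
        // charAt indexes code units, so index 1 is only the high surrogate.
        System.out.println(Integer.toHexString(s.charAt(1)));      // prints d834
        // codePointAt combines the surrogate pair starting at index 1.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // prints 1d11e
    }
}
```

For text that stays within the BMP the two counts are equal, so the
difference only shows up once a surrogate pair appears.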
As for UTF-8, that's a canard. The methods DataOutputStream.writeUTF and
DataInputStream.readUTF have nothing to do with UTF-8 text transport:
they are *binary* methods that write and read a 16-bit byte length
followed by modified UTF-8 (no 0x00 bytes). You use those only if you
are doing roll-your-own binary serialization. The actual UTF-8 support
is in InputStreamReader and OutputStreamWriter and is entirely compliant.
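A rough sketch of the difference, for anyone who wants to see the bytes. The
class and helper names are invented for illustration, and
UncheckedIOException assumes Java 8 or later; the point is only that
writeUTF emits a length-prefixed binary record while OutputStreamWriter
emits plain UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class Utf8Demo {
    // Binary record: two-byte length (a byte count) followed by modified UTF-8.
    static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Text transport: standard UTF-8, no length prefix.
    static byte[] viaWriter(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(buf, "UTF-8");
            w.write(s);
            w.close();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lowercase hex dump with single spaces between bytes.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "A\u0000";
        System.out.println(hex(viaWriteUTF(s))); // prints 00 03 41 c0 80
        System.out.println(hex(viaWriter(s)));   // prints 41 00
    }
}
```

The NUL character comes out as c0 80 in the writeUTF record, which is why
no 0x00 byte appears in the encoded string data, whereas the writer
produces the single 00 byte that standard UTF-8 requires.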
--
John Cowan <jcowan@reutershealth.com> www.ccil.org/~cowan www.reutershealth.com
Micropayment advocates mistakenly believe that efficient allocation of
resources is the purpose of markets. Efficiency is a byproduct of market
systems, not their goal. The reasons markets work are not because users
have embraced efficiency but because markets are the best place to allow
users to maximize their preferences, and very often their preferences are
not for conservation of cheap resources. --Clay Shirky