Elliotte Rusty Harold wrote,
> At 7:42 AM -0700 4/29/03, Tim Bray wrote:
> > Really? I just looked at a recent set of Java docs, and it's pretty
> > clear that a Java char isn't really a character, it's a UTF-16
> > codepoint, and the semantics of String are wrong for non-BMP
> > characters, and that the attempt at UTF-8 support remains pretty
> > laughably nonstandard and wrong. I'd be *delighted* to hear that
> > I'm looking at wrong/obsolete docs. Pointers anyone? -Tim
>
> Unfortunately, you're more than half right. The InputStreamReader and
> OutputStreamWriter classes do handle UTF-8 correctly. The readUTF and
> writeUTF methods in DataInputStream/DataOutputStream don't. This
> wouldn't be a problem if they were simply called readString/
> writeString instead.
Yup, that's right ... for all intents and purposes, readUTF and writeUTF
should be treated as specifying a non-standard encoding solely for the
use of Java RMI.
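
To make the difference concrete, here is a minimal sketch that encodes the same strings both ways and dumps the bytes. The class and helper names (ModifiedUtf8Demo, viaWriteUTF, viaWriter, hex) are made up for illustration; the only library calls are DataOutputStream.writeUTF and an OutputStreamWriter opened with the UTF-8 charset. It assumes a runtime that has java.nio.charset.StandardCharsets.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class ModifiedUtf8Demo {
    // Encode with DataOutputStream.writeUTF: a 2-byte length prefix
    // followed by the "modified UTF-8" form of the string.
    static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encode with OutputStreamWriter, which produces standard UTF-8.
    static byte[] viaWriter(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8);
            w.write(s);
            w.close();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";              // U+0000
        String gClef = "\uD834\uDD1E";  // U+1D11E, outside the BMP

        System.out.println(hex(viaWriteUTF(nul)));   // 00 02 C0 80
        System.out.println(hex(viaWriter(nul)));     // 00
        System.out.println(hex(viaWriteUTF(gClef))); // 00 06 ED A0 B4 ED B4 9E
        System.out.println(hex(viaWriter(gClef)));   // F0 9D 84 9E
    }
}
```

The two places the outputs diverge are U+0000 (written as the two bytes C0 80) and anything outside the BMP (each surrogate written as its own three-byte sequence), on top of the length prefix. A standard UTF-8 decoder should reject both forms.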
> However, your comments about the char types are dead on.
They're dead on, but unhelpful. There's really nothing that can be done
right now that wouldn't break an awful lot of existing code. At least
redesignating Java chars as UTF-16 code units is honest.
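
For anyone who hasn't run into this, a small sketch of what "chars are UTF-16 units" means in practice. It uses only String.length() and String.charAt(); the class name and the countCodePoints helper are hypothetical, written by hand here rather than taken from any library.

```java
// Hypothetical demo class; the names are illustrative only.
class Utf16UnitDemo {
    // MUSICAL SYMBOL G CLEF, U+1D11E: one character, two Java chars.
    static final String G_CLEF = "\uD834\uDD1E";

    // Count characters by treating a high surrogate followed by a
    // low surrogate as a single code point. Hand-rolled helper, not
    // a library method.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++; // skip the low half of the pair
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(G_CLEF.length());          // 2
        System.out.println(countCodePoints(G_CLEF));  // 1

        String mixed = "a" + G_CLEF + "b";
        System.out.println(mixed.length());           // 4
        System.out.println(countCodePoints(mixed));   // 3
    }
}
```

So length(), charAt() and substring() all index UTF-16 units, and any code that assumes one char per character can split a pair down the middle.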
If we're lucky, the output of JSR 204,
http://www.jcp.org/en/jsr/detail?id=204
might help in the not-too-distant future.
Cheers,
Miles