Rick Jelliffe scripsit:
> They have *almost* been abstracted away: a Java "character" is UTF-16.
> Some Unicode characters require more than one Java "character" to
> represent them. All *implementations* of characters have one (or more)
> underlying encodings. A nominal getEncoding() method on a Java 1.n
> character stream, even TeeWriter, should always produce "UTF-16".
Well, if you like. But *diversity* of encodings is lost.
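For readers who want to see the "more than one Java character" case concretely, here is a minimal sketch. The class name SurrogateDemo and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) are illustrative assumptions, not anything from the thread; the calls used are the standard java.lang ones available since Java 5.

```java
public class SurrogateDemo {
    // Build a String holding a single code point outside the Basic
    // Multilingual Plane. U+1D11E is an arbitrary example choice.
    static String gClef() {
        return new String(Character.toChars(0x1D11E));
    }

    public static void main(String[] args) {
        String clef = gClef();
        // Number of UTF-16 code units, i.e. Java "characters".
        System.out.println(clef.length());
        // Number of Unicode code points.
        System.out.println(clef.codePointCount(0, clef.length()));
        // The two code units form a high/low surrogate pair.
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));
    }
}
```

The one code point occupies two char values, which is the sense in which the abstraction is only "almost" complete.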
> This should upset no-one, because some real characters may require
> more than one Unicode "character" to represent them, anyway.
> Take Vietnamese, please: if I have a u with a horn accent above plus
> a dot underneath [1], that is one real character (according to what
> people think of as characters) but three Unicode characters, 3 UTF-16
> characters, 6 bytes of storage.
Actually, you can also represent any Vietnamese letter with a single
Unicode (and UTF-16) character, U+1EF1 in this case.
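The two spellings of this letter can be compared with a short sketch. It assumes java.text.Normalizer (Java 6 and later, so newer than the "Java 1.n" streams under discussion); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class VietnameseNfc {
    // Decomposed spelling: u + COMBINING HORN (U+031B) + COMBINING DOT BELOW (U+0323).
    static final String DECOMPOSED = "u\u031B\u0323";
    // Precomposed spelling: LATIN SMALL LETTER U WITH HORN AND DOT BELOW (U+1EF1).
    static final String PRECOMPOSED = "\u1EF1";

    // Canonical composition (NFC): combine base and marks where a
    // precomposed character exists.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD): split into base letter plus marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Storage size in bytes when encoded as UTF-16 without a byte-order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(DECOMPOSED.length() + " chars, " + utf16Bytes(DECOMPOSED) + " bytes");
        System.out.println(PRECOMPOSED.length() + " chars, " + utf16Bytes(PRECOMPOSED) + " bytes");
        System.out.println(compose(DECOMPOSED).equals(PRECOMPOSED));
        System.out.println(decompose(PRECOMPOSED).equals(DECOMPOSED));
    }
}
```

The decomposed form is the three-character, six-byte case Rick describes; the precomposed form is one character in two bytes, and normalization converts between them.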
The story with Vietnamese, for those who are curious, is that it has 12
vowel letters (a e i o u y a-breve a-circ e-circ o-circ o-horn u-horn),
each of which may bear one of five tone marks (acute, grave, hook above,
tilde, dot below).
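As a rough check of that inventory, the 12 vowels and 5 tone marks can be combined mechanically. This is only a sketch: the class name is hypothetical, lowercase letters alone are covered, and it again assumes java.text.Normalizer from Java 6 or later.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class VietnameseVowels {
    // The 12 vowel letters: a e i o u y, a-breve, a-circ, e-circ,
    // o-circ, o-horn, u-horn (lowercase only).
    static final String[] BASES = {
        "a", "e", "i", "o", "u", "y",
        "\u0103", "\u00E2", "\u00EA", "\u00F4", "\u01A1", "\u01B0"
    };

    // The 5 tone marks as combining characters:
    // acute, grave, hook above, tilde, dot below.
    static final String[] TONES = {
        "\u0301", "\u0300", "\u0309", "\u0303", "\u0323"
    };

    // Attach every tone mark to every vowel and compose with NFC.
    static List<String> tonedVowels() {
        List<String> out = new ArrayList<>();
        for (String base : BASES) {
            for (String tone : TONES) {
                out.add(Normalizer.normalize(base + tone, Normalizer.Form.NFC));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> all = tonedVowels();
        System.out.println(all.size() + " toned vowels");
        for (String s : all) {
            // Each composed letter should be a single UTF-16 code unit.
            System.out.printf("U+%04X length=%d%n", (int) s.charAt(0), s.length());
        }
    }
}
```

Each of the 60 combinations should compose to a single precomposed character, U+1EF1 among them.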
--
It was impossible to inveigle John Cowan <jcowan@reutershealth.com>
Georg Wilhelm Friedrich Hegel http://www.ccil.org/~cowan
Into offering the slightest apology http://www.reutershealth.com
For his Phenomenology. --W. H. Auden, from "People" (1953)