Re: Java/Unicode brain damage
- From: John Cowan <cowan@mercury.ccil.org>
- To: David Brownell <david-b@pacbell.net>
- Date: Fri, 27 Jul 2001 00:17:40 -0400 (EDT)
David Brownell scripsit:
> It would likely be instructive to have someone explain
> the senses in which "char" is, and isn't, a character.
A Java char is a 16-bit unsigned integral value. Fully representing a
Unicode character requires a 21-bit unsigned integer. UTF-16 is a
representation scheme in which the Unicode characters with values
(in hex) between 0 and D7FF or between E000 and FFFF are represented
by a single 16-bit value, and the rest are represented by two
consecutive 16-bit values, the first ranging from D800 to DBFF and
the second ranging from DC00 to DFFF.
Fortunately, all the commonly used Unicode characters are of the
first kind.
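
To make the UTF-16 scheme above concrete, here is a minimal sketch in Java of how one Unicode value maps to one or two chars. The class and method names (Utf16Sketch, encode) are made up for illustration, and the step of subtracting 10000 (hex) before splitting the value into two 10-bit halves comes from the UTF-16 definition rather than from the description above.

```java
public class Utf16Sketch {
    // Encode one Unicode value (0..10FFFF hex, excluding D800..DFFF)
    // as one or two 16-bit UTF-16 units. Hypothetical helper, for illustration.
    public static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable Unicode value");
        }
        if (codePoint <= 0xFFFF) {
            // First kind: a single 16-bit value.
            return new char[] { (char) codePoint };
        }
        // Second kind: 20 bits remain after subtracting 0x10000.
        int v = codePoint - 0x10000;
        char first = (char) (0xD800 + (v >> 10));    // top 10 bits, D800..DBFF
        char second = (char) (0xDC00 + (v & 0x3FF)); // low 10 bits, DC00..DFFF
        return new char[] { first, second };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x41, 0x20AC, 0x10400 }) {
            StringBuilder sb = new StringBuilder();
            for (char c : encode(cp)) {
                sb.append(String.format("%04X ", (int) c));
            }
            System.out.println(String.format("U+%04X -> %s", cp, sb.toString().trim()));
        }
    }
}
```

For U+10400 this yields the pair D801 DC00, so a String holding that one character has length() 2.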
> Likewise the senses in which combining marks relate
> to the concept of a character ... "character" is actually
> a rather complex notion, and ISO-10646 code points
> are (as I understand) not necessarily going to be able
> to represent a "character" either (32 bits v. 16).
Indeed, "characters" in this sense (often called "graphemes"
by Unicode people, though a better term is sought) can
contain arbitrarily long strings of Unicode characters:

    In European scripts, a base letter may be followed by up to
    three diacritics in practice, and in theory there is no
    limit at all;

    Korean syllables are composed of up to three letters;

    Indic syllables can have any number of basic letters
    separated by viramas and non-joiners, followed by
    a vowel sign and possibly a diacritic;

    Tibetan script is much the same, except that the
    consonants after the first are represented by
    separate "subjoined" letters, so no virama is needed.
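
As a rough illustration of the European and Korean cases above, here is a small Java sketch that compares the number of chars in a string with the number of "characters" in the grapheme sense. It assumes a JDK that provides java.text.BreakIterator and java.text.Normalizer, and it takes BreakIterator's character boundaries as an approximation of graphemes, which may not match every script's expectations. The class and method names (GraphemeSketch, countGraphemes) are invented for the example.

```java
import java.text.BreakIterator;
import java.text.Normalizer;

public class GraphemeSketch {
    // Count user-perceived characters with the JDK's character-boundary
    // iterator. This is an approximation of "graphemes", not a definition.
    public static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        it.first();
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Base letter 'a' followed by two diacritics:
        // COMBINING DIAERESIS (U+0308) and COMBINING ACUTE ACCENT (U+0301).
        String stacked = "a\u0308\u0301";
        System.out.println("chars: " + stacked.length());
        System.out.println("graphemes: " + countGraphemes(stacked));

        // One precomposed Korean syllable (U+D55C), decomposed by NFD
        // into its three letters: U+1112, U+1161, U+11AB.
        String letters = Normalizer.normalize("\uD55C", Normalizer.Form.NFD);
        System.out.println("Korean letters: " + letters.length());
    }
}
```

The stacked example occupies three chars but is reported as one boundary-delimited character, and the Korean syllable decomposes into three letters.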
--
John Cowan cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter