
Re: Java/Unicode brain damage

From: John Cowan <cowan@mercury.ccil.org>
To: David Brownell <david-b@pacbell.net>
Date: Fri, 27 Jul 2001 00:17:40 -0400 (EDT)

David Brownell scripsit:

> It would likely be instructive to have someone explain
> the senses in which "char" is, and isn't, a character.

A Java char is a 16-bit unsigned integral value.  Unicode characters require
21 bits of unsigned integer to fully represent them.  UTF-16 is a
representation scheme in which the Unicode characters with values
between 0 and D7FF or between E000 and FFFF (hex) are represented by
a single 16-bit value, and the rest are represented by two
consecutive 16-bit values, the first ranging from D800 to DBFF and the
second ranging from DC00 to DFFF.  A char is therefore one UTF-16 code
unit: sometimes a whole character, sometimes only half of one.
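
To make the arithmetic concrete, here is a rough sketch of the encoding
in Java.  The class name SurrogateSketch and the method encode are
made up for illustration; they are not part of any Java library, and
this is only one way to write it.

```java
// Hypothetical helper, for illustration only.
class SurrogateSketch {
    // Encode one Unicode character value as one or two 16-bit chars.
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable Unicode value");
        }
        if (codePoint <= 0xFFFF) {
            // First kind: fits in a single 16-bit value.
            return new char[] { (char) codePoint };
        }
        // Second kind: subtract 0x10000, leaving 20 bits to split in two.
        int offset = codePoint - 0x10000;
        char first = (char) (0xD800 + (offset >> 10));     // top 10 bits
        char second = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { first, second };
    }

    public static void main(String[] args) {
        char[] one = encode(0x00E9);    // LATIN SMALL LETTER E WITH ACUTE
        char[] two = encode(0x1D11E);   // MUSICAL SYMBOL G CLEF
        System.out.println(one.length + " char for U+00E9");
        System.out.printf("%d chars for U+1D11E: %04X %04X%n",
                two.length, (int) two[0], (int) two[1]);
    }
}
```

The two halves each carry 10 bits, which together with the 16-bit
range of the first kind is how UTF-16 covers the full 21-bit space.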

Fortunately, all the commonly used Unicode characters are of the
first kind.

> Likewise the senses in which combining marks relate
> to the concept of a character ... "character" is actually
> a rather complex notion, and ISO-10646 code points
> are (as I understand) not necessarily going to be able
> to represent a "character" either (32 bits v. 16).

Indeed, "characters" in this sense (often called "graphemes"
by Unicode people, though a better term is sought) can
contain arbitrarily long strings of Unicode characters:

In European scripts, a base letter may be followed by up to
three diacritics in practice, and in theory there is no
limit at all;

Korean syllables are composed of up to three letters;

Indic syllables can have any number of basic letters
separated by viramas and non-joiners, followed by
a vowel sign and possibly a diacritic;

Tibetan script is much the same, except that the
consonants after the first are represented by
separate "subjoined" letters, so no virama is needed.
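
As a small illustration of the European case, the sketch below counts
user-visible "characters" with java.text.BreakIterator, which is the
standard library's character-boundary iterator.  The class name
GraphemeSketch and the method countGraphemes are hypothetical, and how
well the iterator handles the Korean, Indic and Tibetan cases depends
on the Java runtime, so only the base-letter-plus-diacritics case is
shown.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class GraphemeSketch {
    // Count the boundaries the runtime's character iterator reports.
    static int countGraphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Base letter e, then COMBINING ACUTE ACCENT and COMBINING DOT BELOW.
        String s = "e\u0301\u0323";
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("graphemes:   " + countGraphemes(s));
    }
}
```

The point is only that the three counts need not agree: the number of
chars, the number of Unicode characters, and the number of things a
reader would call a character are three different measurements.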

-- 
John Cowan                                   cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
	--Douglas Hofstadter
