Re: Java/Unicode brain damage

David Brownell scripsit:

> It would likely be instructive to have someone explain
> the senses in which "char" is, and isn't, a character.

A Java char is a 16-bit unsigned integral value.  Unicode characters
need 21 bits to represent them fully, since their values run from 0 to
10FFFF (hex).  UTF-16 is a representation scheme in which the Unicode
characters with values between 0 and D7FF, or between E000 and FFFF,
are represented by a single 16-bit value, and the rest are represented
by two consecutive 16-bit values (a surrogate pair), the first ranging
from D800 to DBFF and the second from DC00 to DFFF.  A char is
therefore a UTF-16 code unit: sometimes a whole character, sometimes
half of one.
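
The mapping just described can be sketched in a few lines of Java.
This is only an illustration: the class and method names (Utf16Sketch,
encode) are made up for this example, and the cross-check assumes a
JDK that provides Character.toChars(int) (Java 5 or later).

```java
import java.util.Arrays;

// Illustrative sketch only: Utf16Sketch and encode are hypothetical names.
final class Utf16Sketch {

    /** Encodes one Unicode character value as one or two 16-bit units. */
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode range");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            // D800..DFFF is reserved for surrogates; these are not characters.
            throw new IllegalArgumentException("surrogate value");
        }
        if (codePoint <= 0xFFFF) {
            // First kind: a single 16-bit value.
            return new char[] { (char) codePoint };
        }
        // Second kind: subtract 10000 (hex), leaving 20 bits, and split them.
        int v = codePoint - 0x10000;
        char first = (char) (0xD800 + (v >> 10));     // top 10 bits: D800..DBFF
        char second = (char) (0xDC00 + (v & 0x3FF));  // low 10 bits: DC00..DFFF
        return new char[] { first, second };
    }

    public static void main(String[] args) {
        char[] units = encode(0x1D11E); // MUSICAL SYMBOL G CLEF
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(units, Character.toChars(0x1D11E)));
    }
}
```

For U+1D11E this yields the pair D834 DD1E, so a String holding that
one character has length() 2.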

Fortunately, all the commonly used Unicode characters are of the
first kind.

> Likewise the senses in which combining marks relate
> to the concept of a character ... "character" is actually
> a rather complex notion, and ISO-10646 code points
> are (as I understand) not necessarily going to be able
> to represent a "character" either (32 bits v. 16).

Indeed, "characters" in this sense (often called "graphemes"
by Unicode people, though a better term is sought) can
contain arbitrarily long strings of Unicode characters:

In European scripts, a base letter may be followed by up to
three diacritics in practice, and in theory there is no
limit at all;

Korean syllables are composed of up to three letters;

Indic syllables can have any number of basic letters
separated by viramas and non-joiners, followed by
a vowel sign and possibly a diacritic;

Tibetan script is much the same, except that the
consonants after the first are represented by
separate "subjoined" letters, so no virama is needed.
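
A rough Java sketch of two of the cases above: a base letter carrying
two diacritics, and a Korean syllable spelled with three letters.  The
names GraphemeSketch, countGraphemes and composeHangul are invented for
this example.  It assumes that java.text.BreakIterator's character
instance is an acceptable approximation of "grapheme" boundaries, which
is adequate for simple combining marks but is not claimed to be
correct for every script mentioned here.

```java
import java.text.BreakIterator;
import java.text.Normalizer;

// Illustrative sketch only: all names in this class are hypothetical.
final class GraphemeSketch {

    /** Counts user-perceived characters as the JDK's BreakIterator sees them. */
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        // Each call to next() advances to the following boundary.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    /** Composes conjoining Korean letters into a precomposed syllable (NFC). */
    static String composeHangul(String letters) {
        return Normalizer.normalize(letters, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // e + COMBINING ACUTE ACCENT (U+0301) + COMBINING DOT BELOW (U+0323)
        String accented = "e\u0301\u0323";
        System.out.println(accented.length());        // chars (UTF-16 units)
        System.out.println(countGraphemes(accented)); // perceived characters

        // Three Korean letters: U+1112, U+1161, U+11AB
        String syllable = composeHangul("\u1112\u1161\u11AB");
        System.out.println(syllable.length());        // chars after composition
    }
}
```

The accented string occupies three chars yet is treated as a single
unit, and the three Korean letters compose to the single syllable
U+D55C.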

John Cowan                                   cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
	--Douglas Hofstadter