On Aug 13, 2005, at 14:19, Alan Gutierrez wrote:
> Am I seeing that with Unicode in Java, you need to work with
> String and not with individual char? That puts a dent in my
> algorithm, which advanced along the characters in the string.
It depends on what exactly you are doing. A Java char is not a Unicode
character but a UTF-16 code unit. The values \u0000 and \uFFFF should
never occur in XML and can be used as sentinels if your algorithm works
on UTF-16 code units. For the purpose of indexing text, working on
UTF-16 code units as opposed to working on Unicode characters may well
be good enough. In that case, a surrogate pair can be treated as two
adjacent "characters". (Note that even when operating on UTF-32, you
can have tightly-coupled characters when there is a base character
followed by combining marks, so working on Unicode characters does not
buy you inter-character independence.)
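
As a rough illustration of the difference, here is a minimal sketch in Java. The class and method names (CodeUnitScan, countCodeUnits, countCodePoints) are made up for this example and are not from the original poster's code. It assumes the input text never contains U+FFFF, which holds for text that came out of an XML parser.

```java
final class CodeUnitScan {
    // U+FFFF is a noncharacter that XML does not allow, so (by assumption)
    // it never appears in the text and can mark the end of the buffer.
    static final char SENTINEL = '\uFFFF';

    /** Counts UTF-16 code units by scanning until the appended sentinel. */
    static int countCodeUnits(String text) {
        char[] buf = (text + SENTINEL).toCharArray();
        int i = 0;
        while (buf[i] != SENTINEL) {
            i++; // a surrogate pair is simply two adjacent steps here
        }
        return i;
    }

    /** Counts Unicode code points, so a surrogate pair counts once. */
    static int countCodePoints(String text) {
        int count = 0;
        int i = 0;
        while (i < text.length()) {
            // charCount is 2 for a supplementary code point, otherwise 1
            i += Character.charCount(text.codePointAt(i));
            count++;
        }
        return count;
    }
}
```

For a string holding "a", U+1D11E (stored as the surrogate pair D834 DD1E) and "b", the first method sees four code units and the second sees three code points. For "e" followed by the combining acute accent U+0301, both methods see two items, even though a reader sees one accented letter.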
--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/