From: "Richard Tobin" <richard@cogsci.ed.ac.uk>
> >The nasty fact is that
> >I suspect many Java application programmers will end up
> >simply blowing off non-BMP text either through ignorance
> >or based on a decision that it's not cost-effective.
>
> It depends what they want to do with it. Won't they just end up
> passing it through as pairs of surrogates?
And also, do surrogate pairs really introduce any issues that
are not already present in combining character sequences?
I have been going through this recently for our markup editor.
For the first version, we have decided not to barf on, but also
not to provide support for, combining character sequences
or surrogates, because the 1 Java char = 1 glyph assumption
makes life very easy.
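To make the comparison concrete, here is a rough sketch of where the 1 Java char = 1 glyph assumption stops holding. It is not code from our editor: the class name CharCounts and its methods are made up for illustration, and it assumes a JDK that has String.codePointCount (Java 5 or later). A surrogate pair and a combining character sequence both occupy two Java chars while showing up as one character to the user.

```java
import java.text.BreakIterator;

// Hypothetical helper, for illustration only.
final class CharCounts {
    // UTF-16 code units: what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // User-perceived characters, as segmented by the JDK's
    // character BreakIterator (base character plus combining marks).
    static int userCharacters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String surrogatePair = "\uD835\uDD38"; // one non-BMP character, U+1D538
        String combining = "e\u0301";          // 'e' followed by a combining acute accent
        for (String s : new String[] { surrogatePair, combining }) {
            System.out.println(codeUnits(s) + " chars, "
                    + codePoints(s) + " code points, "
                    + userCharacters(s) + " user-perceived characters");
        }
    }
}
```

Code that only passes strings through untouched never notices the difference; it is cursor movement, deletion and width measurement that need something like the third count.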
Using IBM's Internationalization Classes for Unicode
(bulk kudos to Mark Davis), it is quite straightforward
to add normalization to data import and character
entry in an interactive application. This means that
your application uses combined characters where they
are available rather than combining character sequences.
For most Western languages written in the Latin script,
Unicode provides precomposed characters: enough even to
support Vietnamese with multiple levels of accent.
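As a minimal sketch of normalizing on import: the example below uses java.text.Normalizer from the standard JDK (Java 6 or later) as a stand-in, not the IBM classes mentioned above, whose API I am not reproducing here. The class name ImportNormalizer and the method toComposed are hypothetical. NFC is the normalization form that prefers precomposed characters.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class ImportNormalizer {
    // Compose combining character sequences into precomposed
    // characters wherever Unicode defines one (form NFC).
    static String toComposed(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent
        String eAcute = "e\u0301";
        // Vietnamese: 'a' + combining circumflex + combining acute
        String aCircumflexAcute = "a\u0302\u0301";

        System.out.println(eAcute.length() + " -> " + toComposed(eAcute).length());
        System.out.println(aCircumflexAcute.length() + " -> "
                + toComposed(aCircumflexAcute).length());
    }
}
```

Applied at data import and on character entry, this keeps text for which a precomposed form exists inside the one-char-per-glyph case. Sequences with no precomposed form are left as they are, so it narrows the problem rather than removing it.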
The other issue here is that the 1 Java char = 1 glyph
assumption does not imply that every character is
the same width: if you support proportional-width
characters you can still support Chinese and Japanese.
The W3C I18n WG has a new version of their "Character
Model for the WWW" at http://www.w3.org/TR/
which is looking pretty good. It is really well written
and anyone who wants to get a grip on internationalization
or character issues should find it a good place to start.
Cheers
Rick Jelliffe
Topologi Pty. Ltd.