On Thu, 2002-01-10 at 20:24, Rick Jelliffe wrote:
> And also, do surrogate pairs really introduce any issues that
> are not already present in combining character sequences?
Perhaps not, but they happen at a different level of processing.
Surrogate pairs have to be resolved into characters before combining
character sequences can be processed, unless there's a ban I haven't
heard of on surrogates participating in combinations.
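
A minimal sketch of that ordering, under stated assumptions: the class
and method names here are hypothetical, and the arithmetic is the
standard UTF-16 pairing rule rather than anything taken from Gorille.
The pair is resolved into a single code point first, and only then is
the following combining mark visible as a separate character.

```java
final class SurrogateDecode {
    static boolean isHighSurrogate(char c) { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isLowSurrogate(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

    // Combine a high/low surrogate pair into one code point (standard UTF-16 rule).
    static int toCodePoint(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Turn UTF-16 code units into code points; unpaired surrogates are rejected.
    static int[] decode(String s) {
        int[] buffer = new int[s.length()];
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isHighSurrogate(c)) {
                if (i + 1 < s.length() && isLowSurrogate(s.charAt(i + 1))) {
                    buffer[n++] = toCodePoint(c, s.charAt(i + 1));
                    i++; // skip the low half that was just consumed
                } else {
                    throw new IllegalArgumentException("unpaired high surrogate at index " + i);
                }
            } else if (isLowSurrogate(c)) {
                throw new IllegalArgumentException("unpaired low surrogate at index " + i);
            } else {
                buffer[n++] = c;
            }
        }
        int[] result = new int[n];
        System.arraycopy(buffer, 0, result, 0, n);
        return result;
    }

    public static void main(String[] args) {
        // U+1D400 (written as a surrogate pair) followed by U+0301 COMBINING ACUTE ACCENT.
        for (int cp : decode("\uD835\uDC00\u0301")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

Three chars go in and two code points come out; any combining-sequence
logic would then run over the second form.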
Surrogates also interact more directly with the productions in XML 1.0.
A surrogate pair has to be decoded before you can tell whether the
character it represents is acceptable. Combining chars are permitted,
but there's no need to perform the combination to see if your
characters are acceptable. Normalization is a good idea, but not
required for basic syntactic checking.
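
As an illustration only (this is not Gorille's code, and the class and
method names are hypothetical), a check against the XML 1.0 Char
production might look like the sketch below. It assumes a JDK with the
code point methods on String and Character (Java 5 or later). No
normalization is involved: each code point is tested as it stands.

```java
final class XmlCharCheck {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walk the string by code point. codePointAt returns the bare surrogate
    // value for an unpaired surrogate, which the range test above rejects.
    static boolean allXmlChars(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

A decomposed "e" plus combining acute passes unchanged, a well-formed
surrogate pair passes as one supplementary character, and a lone
surrogate fails.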
> Using IBM's Internationalization Classes for Unicode
> (bulk kudos to Mark Davis), it is quite straightforward
> to add normalization to data import and character
> entry in an interactive application. This means that
> your application uses combined characters where they
> are available rather than combining character sequences.
> For most Western Latin languages, Unicode provides
> pre-combined characters: enough even to support
> Vietnamese with multiple levels of accent.
This looks very cool, but it also seems like a lot more overhead than is
necessary for a trivial character check like Gorille performs.
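
For comparison, the import-time normalization described in the quoted
text can be sketched as follows. This is an assumption made to keep the
example dependency-free: it uses java.text.Normalizer from later JDKs
(Java 6 onward) rather than the IBM ICU classes the quoted text refers
to, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

final class ComposeOnImport {
    // Normalize to NFC so that precomposed characters are used where Unicode has them.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "a" + COMBINING CIRCUMFLEX ACCENT + COMBINING ACUTE ACCENT,
        // a Vietnamese letter carrying two levels of accent.
        String decomposed = "a\u0302\u0301";
        String composed = compose(decomposed);
        System.out.println(decomposed.length() + " chars in, " + composed.length() + " out");
    }
}
```

The pass itself is a single call, but it touches every string that
comes in, which is a different kind of job from checking characters
against a fixed set of ranges.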
> The other issue here is that 1 Java char = 1 glyph
> assumption does not imply that every character is
> the same width: if you support proportional width
> characters you can still support Chinese and Japanese.
> The W3C I18n WG has a new version of their "Character
> Model for the WWW" at http://www.w3.org/TR/
> which is looking pretty good. It is really well written
> and anyone who wants to get a grip on internationalization
> or character issues should find it a good place to start.
It's a great document, but its call for processing at the character
string level doesn't mesh well with the current exigencies of Java,
where a char is a glyph under many circumstances and not a glyph under
others. Normalizing combining characters doesn't help with surrogate
processing issues, so I don't think normalization answers the kinds of
issues Gorille is designed to address.
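
A small sketch of that mismatch, assuming a later JDK that offers
String.codePointCount (Java 5) and java.text.Normalizer (Java 6); the
class and method names are hypothetical. Normalization collapses a
combining sequence into one char, but a supplementary character still
occupies two chars afterwards.

```java
import java.text.Normalizer;

final class CharVersusCodePoint {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int chars(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String combining = "e\u0301";          // e + COMBINING ACUTE ACCENT
        String supplementary = "\uD835\uDC00"; // U+1D400 as a surrogate pair

        System.out.println(chars(combining) + " -> " + chars(nfc(combining)));
        System.out.println(chars(supplementary) + " -> " + chars(nfc(supplementary)));
        System.out.println("code points: " + codePoints(supplementary));
    }
}
```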
Fortunately, I don't think surrogates will be a common problem for most
people (both developers and users), but they'll continue to irk a lot of
people dealing with Java.
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!