   Re: [xml-dev] ANN: Gorille 0.3

On Thu, 2002-01-10 at 20:24, Rick Jelliffe wrote:
> And also, do surrogate pairs really introduce any issues that
> are not already present in combining character sequences?

Perhaps not, but they happen at a different level of processing.
Surrogate pairs have to be resolved into characters before combining
character sequences can be processed, unless there's a ban I haven't
heard of on surrogates participating in combinations.

Surrogates also interact more directly with the productions in XML 1.0 -
combining chars are permitted, but there's no need to perform the
combination to see if your characters are acceptable.  Normalization is
a good idea, but not required for basic syntactical checking.
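
To make the point about the XML 1.0 productions concrete, here is a
minimal sketch of that kind of check. It is not Gorille's code: the
class and method names are made up for illustration, and it assumes a
newer JDK (5 or later) that provides Character.isHighSurrogate,
Character.isLowSurrogate and Character.toCodePoint. The ranges are those
of the Char production in XML 1.0. Note that a surrogate pair has to be
joined before it can be tested, while a combining sequence passes
without any combination being performed.

```java
final class XmlCharCheck {

    /** XML 1.0 Char production, applied to a whole code point. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index of the first Java char that cannot be part of a
     * legal XML 1.0 character, or -1 if the whole string is acceptable.
     */
    static int firstInvalidIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate is only legal as the first half of a pair.
                if (i + 1 >= s.length()
                        || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return i;
                }
                int cp = Character.toCodePoint(c, s.charAt(i + 1));
                if (!isXmlChar(cp)) {
                    return i;
                }
                i += 2; // consume both halves of the pair
            } else {
                // Covers ordinary chars; a stray low surrogate falls in
                // D800-DFFF and is rejected by isXmlChar.
                if (!isXmlChar(c)) {
                    return i;
                }
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // U+1D11E (a supplementary character) written as a surrogate pair.
        System.out.println(firstInvalidIndex("a\uD834\uDD1Eb"));
        // Same string with the low surrogate missing.
        System.out.println(firstInvalidIndex("a\uD834b"));
        // "e" followed by a combining acute accent: no combination needed.
        System.out.println(firstInvalidIndex("e\u0301"));
    }
}
```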

> Using IBM's Internationalization Classes for Unicode
> (bulk kudos to Mark Davis), it is quite straightforward
> to add normalization to data import and character
> entry in an interactive application. This means that
> your application uses combined characters where they
> are available rather than combining character sequences.
> For most Western Latin languages, Unicode provides
> pre-combined characters: enough even to support
> Vietnamese with multiple levels of accent. 

This looks very cool, but it also seems like a lot more overhead than is
necessary for a trivial character check like Gorille performs.

> The other issue here is that 1 Java char = 1 glyph 
> assumption does not imply that every character is
> the same width: if you support proportional width 
> characters you can still support Chinese and Japanese.
> The W3C I18n WG has a new version of their "Character
> Model for the WWW" at http://www.w3.org/TR/
> which is looking pretty good.  It is really well written
> and anyone who wants to get a grip on internationalization
> or character issues should find it a good place to start. 

It's a great document, but its call for processing at the character
string level doesn't mesh well with the current exigencies of Java,
where a char corresponds to a glyph under many circumstances and not
under others. Normalizing combining characters also doesn't help with
surrogate processing issues, so I don't think normalization answers the
kinds of issues Gorille is designed to address.
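
A small sketch of why normalization and surrogates are separate
problems. It uses java.text.Normalizer from later JDKs (6 and up) rather
than the ICU classes mentioned above, purely as an assumption to keep
the example self-contained; the class name is made up. NFC folds a
combining sequence into one char, but a character outside the BMP still
occupies two Java chars afterwards.

```java
import java.text.Normalizer;

final class NormalizeVsSurrogates {

    /** Canonical composition (NFC) of the input string. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "e" + combining acute accent composes to the single char U+00E9.
        String accented = nfc("e\u0301");
        System.out.println(accented.length());

        // U+1D11E has no precomposed BMP form, so NFC leaves the
        // surrogate pair in place: one character, two Java chars.
        String clef = nfc("\uD834\uDD1E");
        System.out.println(clef.length());
        System.out.println(clef.codePointCount(0, clef.length()));
    }
}
```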

Fortunately, I don't think surrogates will be a common problem for most
developers or users, but they'll continue to irk a lot of people dealing
with Java.

Simon St.Laurent
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!


