xml-dev - Re: [xml-dev] ANN: Gorille 0.3


To: Rick Jelliffe <ricko@allette.com.au>
Subject: Re: [xml-dev] ANN: Gorille 0.3
From: "Simon St.Laurent" <simonstl@simonstl.com>
Date: 10 Jan 2002 22:27:36 -0500
Cc: xml-dev@lists.xml.org
In-reply-to: <005501c19a3e$a84b94e0$4bc8a8c0@AlletteSystems.com>
References: <200201102338.XAA06564@mcilvanney.cogsci.ed.ac.uk> <005501c19a3e$a84b94e0$4bc8a8c0@AlletteSystems.com>

On Thu, 2002-01-10 at 20:24, Rick Jelliffe wrote:
> And also, do surrogate pairs really introduce any issues that
> are not already present in combining character sequences?

Perhaps not, but they happen at a different level of processing.
Surrogate pairs have to be resolved into characters before combining
character sequences can be processed, unless there's a ban I haven't
heard of on surrogate-encoded characters participating in combinations.

Surrogates also interact more directly with the productions in XML 1.0.
A surrogate pair has to be decoded before you can check the character it
encodes, while combining chars are permitted as they stand: there's no
need to perform the combination to see if your characters are
acceptable.  Normalization is a good idea, but not required for basic
syntactic checking.
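
As an illustration of that ordering, here is a minimal Java sketch. It is
not Gorille's actual code, and the names XmlCharCheck, isXmlChar, and
firstInvalidIndex are made up for this example. It pairs up surrogates by
hand first, then tests each resulting code point against the XML 1.0 Char
production. Combining characters pass through untouched, since the
production accepts them without any composition step.

```java
// Hypothetical helper, written for illustration only.
final class XmlCharCheck {
    private XmlCharCheck() {}

    // XML 1.0 Char production, applied to a whole code point:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first unacceptable Java char, or -1 if
    // every character in the string matches the Char production.
    static int firstInvalidIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High surrogate: it must be followed by a low surrogate.
                if (i + 1 >= s.length()) {
                    return i;
                }
                char low = s.charAt(i + 1);
                if (low < 0xDC00 || low > 0xDFFF) {
                    return i;
                }
                // Decode the pair into a single code point, then check it.
                int cp = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                if (!isXmlChar(cp)) {
                    return i;
                }
                i += 2;
            } else {
                // BMP char; a lone low surrogate falls outside every
                // permitted range and is rejected here.
                if (!isXmlChar(c)) {
                    return i;
                }
                i++;
            }
        }
        return -1;
    }
}
```

Calling XmlCharCheck.firstInvalidIndex on a string containing "e" followed
by a combining acute accent returns -1 with no normalization involved,
while a string that ends in an unpaired high surrogate is reported at the
index of that surrogate.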

> Using IBM's Internationalization Classes for Unicode
> (bulk kudos to Mark Davis), it is quite straightforward
> to add normalization to data import and character
> entry in an interactive application. This means that
> your application uses combined characters where they
> are available rather than combining character sequences.
> For most Western Latin languages, Unicode provides
> pre-combined characters: enough even to support
> Vietnamese with multiple levels of accent. 

This looks very cool, but it also seems like a lot more overhead than is
necessary for a trivial character check like Gorille performs.

> The other issue here is that 1 Java char = 1 glyph 
> assumption does not imply that every character is
> the same width: if you support proportional width 
> characters you can still support Chinese and Japanese.
> 
> The W3C I18n WG has a new version of their "Character
> Model for the WWW" at http://www.w3.org/TR/
> which is looking pretty good.  It is really well written
> and anyone who wants to get a grip on internationalization
> or character issues should find it a good place to start. 

It's a great document, but its call for processing at the level of
character strings doesn't mesh well with the current exigencies of Java,
where a char is a glyph under many circumstances but not under others.
Normalizing combining characters doesn't help with surrogate processing
issues either, so I don't think normalization answers the kinds of
issues Gorille is designed to address.
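
A small sketch of that point, assuming a Java runtime that ships
java.text.Normalizer (added in Java 6, so later than this message; ICU's
normalizer classes play the same role). The class and method names are
hypothetical. Normalizing to NFC folds a combining sequence into one
char, but a character outside the BMP still occupies two chars
afterwards.

```java
import java.text.Normalizer;

// Hypothetical demo class, written for illustration only.
final class NormalizationVsSurrogates {
    private NormalizationVsSurrogates() {}

    // Canonical composition (NFC) using the JDK normalizer.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "e" followed by COMBINING ACUTE ACCENT (U+0301): two chars in.
        String combining = "e\u0301";
        // DESERET CAPITAL LETTER LONG I (U+10400), stored as a surrogate pair.
        String deseret = "\uD801\uDC00";

        String composed = nfc(combining);
        String stillPaired = nfc(deseret);

        // The combining sequence collapses to a single precomposed char.
        System.out.println(composed.length());      // prints 1
        // The supplementary character is untouched by normalization:
        // still two Java chars, still one code point.
        System.out.println(stillPaired.length());   // prints 2
        System.out.println(stillPaired.codePointCount(0, stillPaired.length())); // prints 1
    }
}
```

So code that indexes strings by char has to handle the pair whether or
not the text has been normalized.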

Fortunately, I don't think surrogates will be a common problem for most
people (both developers and users), but they'll continue to irk a lot of
people dealing with Java.

-- 
Simon St.Laurent
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!
http://simonstl.com
