xml-dev - Re: [xml-dev] Unicode normalization in XML 1.1


To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Unicode normalization in XML 1.1
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Fri, 4 Apr 2003 02:36:23 +1000
References: <000001c2f9f8$b333d780$6401a8c0@pcukmka>

From: "Michael Kay" <michael.h.kay@ntlworld.com>

> While this policy makes sense, its translation into rules for software
> components is unfortunately full of absurdities. The fact that the
> character model [1] bans text processing software from doing
> normalization [2] means that senders are going to have a tough job
> meeting the requirement to normalize the text, because they won't be
> able to find any text processing software that does the job for them.

> [1] http://www.w3.org/TR/charmod/
> [2] Section 4.4: "A text processing component .... must not normalize
> suspect text".

When reading Charmod, you must keep the context in mind: it is written
"for interoperable text manipulation on the World Wide Web".  Note that the
spec talks of Web components (servers, proxies, clients, etc.), not text
processing applications in general.

In other words, Charmod may well not apply to
  * systems of private exchange (where interoperability can be defined by
    agreement rather than policy)
  * intranets (where you are not on the World Wide Web)
  * processing documents locally on a machine (again, you are not on the
    World Wide Web)

For example, my company's editor normalizes all text coming in. But I do
not believe that goes against Charmod. Indeed, it is one way to ensure
early uniform normalization.

What Charmod says is that it is the sender's job to make sure their data
is uniformly normalized: the receiver should not have to worry.  Rather than
considering it an added burden on senders, we can think of it as letting
recipients off the hook: you don't have to normalize both your inputs and
your outputs, just your outputs.  And, if you are a sender, you can largely
do this by selecting an output encoding that only has normalized characters.
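
As a rough illustration of "normalize your outputs only", here is a
minimal Python sketch. It assumes NFC is the target form, and send_text
is a hypothetical hook at the point where text leaves the sender; neither
the name nor the shape of the function comes from Charmod.

```python
import unicodedata

def send_text(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical sender hook: normalize to NFC, then encode for the wire."""
    return unicodedata.normalize("NFC", text).encode(encoding)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = "cafe\u0301"

wire_utf8 = send_text(decomposed)                  # NFC, encoded as UTF-8
wire_latin1 = send_text(decomposed, "iso-8859-1")  # NFC fits in Latin-1

# ISO-8859-1 has no combining accent, so the decomposed form cannot be
# encoded without normalizing first.
try:
    decomposed.encode("iso-8859-1")
    raw_latin1_ok = True
except UnicodeEncodeError:
    raw_latin1_ok = False
```

The Latin-1 case is the point about output encodings: an encoding with
only precomposed characters cannot carry the decomposed sequence at all.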

To implement it, you might put a switch in SAXON for "server mode" which
switches off any normalization.  The other point is that normalization
should really be performed automatically by transcoders: one effect of
Charmod will be that transcoder developers can be expected to move over to
generating normalized characters rather than trying to round-trip combining
characters.  That will largely reduce early normalization to a matter of
generating warnings.
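
A sketch of what such a switch and a normalizing transcoder could look
like, in Python rather than in SAXON itself. The transcode function and
its server_mode flag are hypothetical names for illustration only, NFC is
assumed as the target form, and unicodedata.is_normalized needs
Python 3.8 or later.

```python
import unicodedata
import warnings

def transcode(data: bytes, source_encoding: str, server_mode: bool = False) -> str:
    """Decode bytes from a legacy encoding (hypothetical transcoder).

    By default the result is normalized to NFC. With server_mode=True the
    text is passed through untouched and a warning is issued if it is not
    already in NFC.
    """
    text = data.decode(source_encoding)
    if not server_mode:
        return unicodedata.normalize("NFC", text)
    if not unicodedata.is_normalized("NFC", text):
        warnings.warn("text is not in NFC; passing it through unchanged")
    return text

raw = "cafe\u0301".encode("utf-8")  # decomposed input
normalized = transcode(raw, "utf-8")
```

With server_mode=True the same input comes back byte-for-byte as sent,
plus a warning, which is the "warning generation" role described above.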

Cheers
Rick Jelliffe
(Invited expert, W3C I18n IG, views my own)
