


   Re: [xml-dev] Unicode normalization in XML 1.1


From: "Michael Kay" <michael.h.kay@ntlworld.com>

> While this policy makes sense, its translation into rules for software
> components is unfortunately full of absurdities. The fact that the
> character model [1] bans text processing software from doing
> normalization [2] means that senders are going to have a tough job
> meeting the requirement to normalize the text, because they won't be
> able to find any text processing software that does the job for them.

> [1] http://www.w3.org/TR/charmod/
> [2] Section 4.4: "A text processing component .... must not normalize
> suspect text".

When reading Charmod, you must keep the context in mind: it is written
"for interoperable text manipulation on the World Wide Web".  Note that the
spec talks of Web components: servers, proxies, clients, etc., not text
processing applications in general.

In other words, Charmod may well not apply to
  * systems of private exchange (where interoperability can be defined by
    agreement rather than policy)
  * intranets (where you are not on the World Wide Web)
  * processing documents locally on a machine (again, you are not on the
    World Wide Web)

For example, my company's editor normalizes all text coming in. But I do
not believe that goes against Charmod. Indeed, it is one way to ensure
early uniform normalization.

What Charmod does is say that it is the sender's job to make sure
the data is uniform: the receiver should not have to worry.  Rather
than considering it an added burden on senders, we can think of it as
letting recipients off the hook: you don't have to normalize both your
inputs and your outputs, just your outputs.  And, if you are a sender,
you can largely do this by selecting an output encoding that only has
normalized characters.
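
As a rough illustration of "normalize your outputs", here is a minimal
Python sketch. It assumes NFC is the target normalization form and uses
ISO-8859-1 as an example of an encoding that has only precomposed
characters; the function names are hypothetical and not taken from any
product.

```python
import unicodedata


def normalize_for_output(text: str) -> str:
    """Return text in Unicode Normalization Form C (assumed target form)."""
    return unicodedata.normalize("NFC", text)


def encode_for_sending(text: str, encoding: str = "iso-8859-1") -> bytes:
    """Normalize, then encode (hypothetical helper for a sender).

    Normalizing first turns a base letter plus combining mark into the
    precomposed character where one exists, so the text can then be
    written in an encoding that has no combining characters at all.
    """
    return normalize_for_output(text).encode(encoding)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT (a decomposed sequence)
decomposed = "cafe\u0301"
payload = encode_for_sending(decomposed)
```

Here the decomposed sequence is composed to U+00E9 before encoding, so
the receiver sees a single precomposed character. A string that has no
precomposed equivalent in the chosen encoding would raise an error at
the encode step, which is one way a sender could notice the problem.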

To implement it, you might put a switch in SAXON for "server mode"
which switches off any normalization.  The other point is that
normalization should really be performed automatically by
transcoders: one effect of Charmod will be that transcoder
developers can be expected to move over to generating normalized
characters rather than trying to round-trip combining characters.
That will largely reduce early normalization to an issue of
warning generation.
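
The "server mode" switch could look something like the following
Python sketch. This is only an assumption about how such a switch
might behave, not a description of SAXON: `process_text` and
`server_mode` are hypothetical names, and NFC is assumed as the
normalization form.

```python
import unicodedata
import warnings


def process_text(text: str, server_mode: bool = False) -> str:
    """Normalize text unless running in (hypothetical) server mode.

    In server mode the text is passed through untouched, and a warning
    is generated if it is not already in NFC.
    """
    normalized = unicodedata.normalize("NFC", text)
    if server_mode:
        if normalized != text:
            warnings.warn("input is not in NFC; passing it through unchanged")
        return text
    return normalized
```

With the switch off, the component normalizes as an editor would; with
it on, the component leaves the text alone and only reports the problem.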

Rick Jelliffe
(Invited expert, W3C I18n IG, views my own)



Copyright 2001 XML.org. This site is hosted by OASIS