From: "Michael Kay" <firstname.lastname@example.org>
> While this policy makes sense, its translation into rules for software
> components is unfortunately full of absurdities. The fact that the
> character model [1] bans text processing software from doing
> normalization [2] means that senders are going to have a tough job
> meeting the requirement to normalize the text, because they won't be
> able to find any text processing software that does the job for them.
> [1] http://www.w3.org/TR/charmod/
> [2] Section 4.4: "A text processing component .... must not normalize
When reading charmod, you must keep the context in mind: it is written
"for interoperable text manipulation on the World Wide Web". Note that the
spec talks of Web components: servers, proxies, clients, etc., not text
processing applications in general.
In other words, Charmod may well not apply to
* systems of private exchange (where interoperability can be defined by
agreement rather than policy)
* intranets (where you are not on the World Wide Web)
* processing documents locally on a machine (again you are not on the
World Wide Web.)
For example, my company's editor normalizes all text coming in. But I do
not believe that goes against Charmod. Indeed, it is one way to ensure
early uniform normalization.
What Charmod says is that it is the
data sender's job to make sure the data is uniform: the receiver
should not have to worry. Rather than considering it an added burden
on senders, we can think of it as letting recipients off the hook:
you don't have to normalize both your inputs and your outputs, just your
outputs. And, if you are a sender, you can largely do
this by selecting an output encoding that only has
To implement it, you might put a switch in SAXON for
"server mode" which switches off any normalization.
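As a rough illustration of such a switch (a minimal sketch only, not how
SAXON does or would do it), here is the idea in Python. The function name
write_output and the server_mode flag are hypothetical, and I am assuming
NFC is the normalization form wanted:

```python
import unicodedata

def write_output(text: str, server_mode: bool = False) -> str:
    """Return the text as it should be sent out.

    server_mode is a hypothetical flag: when set, the component is
    acting as a Web component and passes text through untouched.
    Otherwise it normalizes its output (assumed here to mean NFC).
    """
    if server_mode:
        return text  # no normalization at all in server mode
    return unicodedata.normalize("NFC", text)

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(write_output(decomposed) == "\u00e9")                    # True
print(write_output(decomposed, server_mode=True) == decomposed)  # True
```

Only the output path is touched; nothing is done to inputs, which is the
point of putting the burden on the sender.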
The other thing is that really normalization should be
performed by transcoders automatically: one effect
of Charmod will be that transcoder developers can
be expected to move over to generating normalized
characters rather than trying to round-trip combining
characters. That will reduce early normalization largely
to an issue of warning generation.
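To make the transcoder idea concrete, here is a small sketch, again in
Python and again only under assumptions: the function name transcode is
hypothetical, NFC is assumed to be the target form, and the warning text
is invented for illustration. The transcoder decodes the source bytes,
emits normalized characters, and warns when the input was not already
normalized:

```python
import unicodedata
import warnings

def transcode(raw: bytes, source_encoding: str) -> str:
    """Decode raw bytes and return NFC text (hypothetical helper)."""
    text = raw.decode(source_encoding)
    normalized = unicodedata.normalize("NFC", text)
    if normalized != text:
        # The source used combining sequences that have a composed form.
        warnings.warn("input was not normalized; generating NFC output")
    return normalized

# UTF-8 bytes for 'e' + COMBINING ACUTE ACCENT come out as a single U+00E9.
result = transcode("e\u0301".encode("utf-8"), "utf-8")
print(result == "\u00e9")  # True
```

A component downstream of such a transcoder has nothing left to
normalize; at most it checks and warns.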
(Invited expert, W3C I18n IG, views my own)