   Re: [xml-dev] Unicode normalization in XML 1.1

Lars Marius Garshol scripsit:

>  - clearly, documents that are not normalized are still well-formed,
>    so if the application is to have any guarantees here the processor
>    must do normalization before passing on the information,

Not so.  A processor in normalization-check mode will report non-normalized
input, so the application may make up its mind whether or not to accept it.
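A minimal sketch of that division of labour, assuming Python 3.8+ and using plain NFC as a stand-in for the stricter "fully normalized" notion in the XML 1.1 drafts. The names check_normalization and receive, and the accept_unnormalized flag, are hypothetical; the point is only that the checker reports and never rewrites, and the application decides.

```python
import unicodedata


def check_normalization(text, form="NFC"):
    """Report whether text is already in the given form; never modifies it."""
    return unicodedata.is_normalized(form, text)


def receive(text, accept_unnormalized=False):
    """Hypothetical application policy layered on top of the check."""
    if check_normalization(text) or accept_unnormalized:
        # The text is passed through untouched in either case.
        return text
    raise ValueError("input is not NFC-normalized")
```

An application that cares calls receive(text) and handles the error; one that does not passes accept_unnormalized=True and gets the original characters back unchanged.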

>  - the text says that "XML processors must not transform the input to
>    be in fully normalized form." This seems to say that processors are
>    not allowed to do the transformation.


> Wouldn't it be far better if the application could be certain
> that an XML 1.1 processor would provide normalized character data and
> to ignore the whole issue of how the document was encoded? After all,
> isn't the whole purpose of *having* XML parsers to insulate
> applications from worries about the lexical details of documents?

The point is that normalization is expensive, and it may be too expensive
to do at all in small systems.  Therefore, the W3C's choice (expressed
in the Character Model) is to have senders normalize, and receivers check
for normalization.  In this way documents are normalized once at creation
(or publication) time, rather than every time a document is received; this
conserves net-wide cycles, since checking is cheaper than normalizing.
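The sender side of this arrangement might look like the following sketch, again assuming Python 3.8+ and NFC as an approximation of the form the Character Model asks for. The function names publish and is_acceptable are illustrative only.

```python
import unicodedata


def publish(text):
    """Sender side: normalize once, at creation or publication time."""
    return unicodedata.normalize("NFC", text)


def is_acceptable(text):
    """Receiver side: check only, do not transform."""
    return unicodedata.is_normalized("NFC", text)


draft = "caf" + "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT
published = publish(draft)     # now ends in precomposed U+00E9
```

Normalizing an already normalized string returns it unchanged, so a sender can call publish more than once without harm.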

> In other words, why not rewrite this so that processors are required
> to normalize character data? 

Forcibly normalizing incoming documents can spoof signature schemes, and
can also render documents well-formed that were not well-formed before
(e.g. if a start-tag uses A WITH ACUTE and the end-tag uses A followed
by COMBINING ACUTE).  http://www.w3.org/TR/charmod/#sec-Normalization
goes into more detail.
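Both hazards can be reproduced in a few lines. This is a sketch under stated assumptions: Python's bundled expat-based parser stands in for "an XML processor", NFC stands in for full normalization, and a SHA-256 digest stands in for whatever a real signature scheme would compute over the document. The helper names is_well_formed and digest are made up for the illustration.

```python
import hashlib
import unicodedata
import xml.etree.ElementTree as ET

PRECOMPOSED = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
DECOMPOSED = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# Start-tag and end-tag look alike on screen but are different names.
doc = f"<{PRECOMPOSED}>text</{DECOMPOSED}>"


def is_well_formed(xml_text):
    """True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def digest(xml_text):
    """Stand-in for the value a signature would be computed over."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


# Forcibly normalizing makes the two tag names identical.
normalized = unicodedata.normalize("NFC", doc)
```

Here is_well_formed(doc) is false because the tags do not match, while is_well_formed(normalized) is true; and digest(doc) differs from digest(normalized), so anything signed over the original no longer corresponds to what the application sees.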

John Cowan           http://www.ccil.org/~cowan              cowan@ccil.org
To say that Bilbo's breath was taken away is no description at all.  There
are no words left to express his staggerment, since Men changed the language
that they learned of elves in the days when all the world was wonderful.
        --_The Hobbit_


Copyright 2001 XML.org. This site is hosted by OASIS