Re: [xml-dev] ArchForms and LPDs

{{ Normalization:

For background, for readers who don't know what normalization is: consider A with a ring above (the angstrom letter, Å). A legacy character set may use two characters, one to represent A and one to represent the combining ring, or it may use a single precomposed character. Unicode supports both forms (U+0041 U+030A, i.e. NFD, and U+00C5, i.e. NFC). The difference between them is invisible to the eye but disruptive for simple collating and string matching. So Unicode supports various kinds of decomposing and composing operations, called normalization. W3C has a Character Model specification which recommends using Unicode Normalization Form C.

}}
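
To make the two forms concrete, here is a minimal sketch in Python using only the standard unicodedata module. The variable names are illustrative only.

```python
import unicodedata

composed = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE, one code point
decomposed = "A\u030A"   # LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE

# The two strings render the same but are different code point sequences,
# so a simple binary comparison treats them as different.
print(composed == decomposed)   # False

# Normalizing to a common form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)          # True
print(nfd == decomposed)        # True

# The dedicated ANGSTROM SIGN (U+212B) also maps to U+00C5 under NFC.
print(unicodedata.normalize("NFC", "\u212B") == composed)   # True
```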

First, to confirm the status quo: as I understand it:

XML 1.0 (5th edition) says, non-normatively and as best practice, to use NFC for names: element names, attribute names, entity names, and name tokens in attribute values.
XML 1.1 "defines a set of constraints called "full normalization" on XML documents, which document creators SHOULD adhere to, and document processors SHOULD verify. Using fully normalized documents ensures that identity comparisons of names, attribute values, and character content can be made correctly by simple binary comparison of Unicode strings." This uses NFC.

W3C Charmod (https://www.w3.org/TR/charmod-norm/#unicodeNormalization) does not endorse blanket normalization of a document before parsing. (I believe one reason is that many fonts are normalization-form dependent, so arbitrary normalization can be unproductive.) It likes NFC for comparisons etc. Therefore, it seems to me that XML 1.1 may not conform to W3C Charmod, while XML 1.0 does, in this respect.

(My proposal for my system is that normalization of names (to NFC) is a server-side responsibility, which clients may check for; or they may build name normalization in themselves too. This only applies to tokens that are not in double quotes, not to strings or literals. I will update the documentation on www.schematron.com for RAN (Random Access Notation) with this.)

On Sat, Jul 31, 2021 at 8:04 AM John Cowan <johnwcowan@gmail.com> wrote:

On Tue, Jul 27, 2021 at 11:44 AM Rick Jelliffe <rjelliffe@allette.com.au> wrote:

In XML, it is needed because XML supports data coming in with legacy character sets;

Not at all. Conversion from legacy charsets to Unicode ones already produces NFC normalization (except in a few rare cases like XCCS), because those charsets don't have combining characters, nor both Hangul jamo and Hangul syllables. It's data in Unicode charsets that may or may not be normalized.

I don't understand this. I don't think we disagree, but clearly there are transcoders in the wild that actually do not produce NFC for every legacy charset. (I think John may be reading "it is needed" as "the only reason it is needed" but I meant "it is at least needed".)
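
Whether a particular transcoder yields NFC can be tested rather than assumed. The following is a rough sketch, assuming Python 3.8 or later (for unicodedata.is_normalized) and Python's built-in codecs. The helper name decode_and_check is hypothetical. It uses Windows-1258 (cp1258) as an example of a legacy charset that, as far as I know, does contain combining marks, so a plain table-driven decode of it need not produce NFC.

```python
import unicodedata


def decode_and_check(data: bytes, charset: str):
    """Decode legacy bytes and report whether the result is already in NFC.

    Hypothetical helper for illustration; it relies on whatever mapping
    Python's codec for `charset` implements.
    """
    text = data.decode(charset)
    return text, unicodedata.is_normalized("NFC", text)


# Latin-1 has only precomposed letters, so the decoded text is already NFC.
latin_text, latin_ok = decode_and_check(b"\xc5", "latin-1")
print(repr(latin_text), latin_ok)

# In cp1258 the byte 0xEC is a combining acute accent, so "a" + 0xEC decodes
# to a base letter plus a combining mark, which is not NFC.
viet_text, viet_ok = decode_and_check(b"a\xec", "cp1258")
print(repr(viet_text), viet_ok)
```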

Normalization had to be the responsibility of the receiver system because it could not be the responsibility of the generating system.

Well, it was originally the *creating* system that is supposed to NFC-normalize, and neither the receiving system nor a retransmitting system. But that has never applied to XML or HTML, and as a systems property is too hard to manage. So you should normalize just in case you need to compare: it's not normalization but equality under normalization that really matters.

Yes. But hard does not mean impossible: if you have a media-type intended for speed or random-access, then it may become in the sender's interest to produce normalized data. (And especially if the media-type was developed mainly for trusted and private use.)
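
The two options in this exchange can be sketched side by side: the receiver comparing under normalization, or the sender normalizing once so that plain binary comparison works downstream. This is a minimal Python sketch using the standard unicodedata module; the function names are made up for illustration.

```python
import unicodedata


def equal_under_nfc(a: str, b: str) -> bool:
    """Receiver-side: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def normalize_on_send(names):
    """Sender-side: normalize each name once so receivers can compare raw strings."""
    return [unicodedata.normalize("NFC", n) for n in names]


precomposed = "\u00C5ngstr\u00F6m"    # Å and ö as single code points
combining = "A\u030Angstro\u0308m"    # base letters plus combining marks

print(precomposed == combining)                 # False: binary comparison fails
print(equal_under_nfc(precomposed, combining))  # True: equal under normalization

sent = normalize_on_send([combining])
print(sent[0] == precomposed)                   # True: binary comparison now works
```

The second approach moves the cost to the sender, which is the trade-off that matters for a format meant for speed or random access.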

Really, the issue is building normalization checking into the APIs for creating element objects, etc., which requires doing it on the ground floor.
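
As a rough idea of what checking at the ground floor might look like, here is a hedged Python sketch. The Element class, check_name function, and repair flag are all hypothetical and do not correspond to any existing XML API. It assumes Python 3.8 or later for unicodedata.is_normalized. Only names are checked; attribute values are left alone.

```python
import unicodedata
from typing import Dict, Optional


def check_name(name: str, repair: bool = False) -> str:
    """Return a name that is in NFC, or raise if it is not and repair is off."""
    if unicodedata.is_normalized("NFC", name):
        return name
    if repair:
        return unicodedata.normalize("NFC", name)
    raise ValueError(f"name {name!r} is not in Normalization Form C")


class Element:
    """Hypothetical element object that validates names when it is created."""

    def __init__(self, name: str,
                 attributes: Optional[Dict[str, str]] = None,
                 repair: bool = False):
        self.name = check_name(name, repair)
        # Attribute names are checked; attribute values are stored untouched.
        self.attributes = {
            check_name(key, repair): value
            for key, value in (attributes or {}).items()
        }


ok = Element("\u00C5ngstr\u00F6m", {"unit": "A\u030A"})
fixed = Element("A\u030A", repair=True)

try:
    Element("A\u030A")
    rejected = False
except ValueError:
    rejected = True
print(rejected)   # True: a non-NFC name is refused at construction time
```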

Rick