Re: [xml-dev] ArchForms and LPDs

On Mon, Jul 26, 2021 at 6:02 AM Liam R. E. Quin <liam@fromoldbooks.org> wrote:

In-place parsing isn't going to fly in a world with XInclude, nor for
that matter with NFC normalization.

Yes, for XInclude.

And it is good to bring up Unicode normalization: is it really a showstopper? I don't think so.

In XML, it is needed because XML supports data coming in with legacy character sets, either directly (i.e. the encoding declared in the XML header) or indirectly (e.g. from an editor or database that uses a legacy character set, with the data simply transcoded into UTF-n, maybe via Java's UTF-16). Normalization had to be the responsibility of the receiving system because it could not be the responsibility of the generating system.

But I think the trade-offs change when

1. the new language only supports UTF-8 and UTF-16

2. it is now harder to find systems that expose legacy encodings than systems that do not: remember that the first version of Unicode will be 30 in a couple of months!

3. the main platforms provide normalization functions

4. XML, Java, C#, JSON etc. have moved most web-based systems to Unicode

5. developers are much more aware of what is needed (I mean developers in locales where the issue can arise: Thailand, for example) than they were 25 years ago, when most developers only ever worked with a single encoding, which was whatever was used by their platform in their locale.

So, given that, it seems feasible to just make normalization the responsibility of the generator of documents, not the receiver (parser)? We could specify some normalization form (NFC?) as the nominal default, but because the parser does not rely on it, deployers of private systems don't need to bear the cost of unnecessary or unwanted normalization. If the generator of the document sends characters in unexpected forms, the document may not match or parse correctly.
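
To make that concrete, here is a rough sketch in Python (chosen only for illustration). The function names generate and find_name are hypothetical, not part of any proposal: the generator normalizes to NFC once, and the receiver does nothing but a raw byte comparison on the UTF-8 data.

```python
import unicodedata

def generate(text: str) -> bytes:
    """Generator side (hypothetical): normalize to NFC once, then encode."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

def find_name(document: bytes, name: bytes) -> int:
    """Receiver side (hypothetical): raw byte search, no normalization."""
    return document.find(name)

composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The name the receiver is looking for, in the nominal default form (NFC).
key = composed.encode("utf-8")

# A well-behaved generator: the name matches byte for byte.
well_formed = generate(decomposed)
print(find_name(well_formed, key))

# A generator that skipped normalization: the name silently fails to match.
unnormalized = decomposed.encode("utf-8")
print(find_name(unnormalized, key))
```

The second lookup failing is the cost of the approach: the receiver never notices or repairs the unexpected form, it just does not match.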

This approach, less helpful/nannying/infallible than XML's, fits in with, say, the idea that in lexing/parsing a RAN document, only errors in the parts actually looked at need to be reported.

(Which is not to say that there are no normalizations that can be done efficiently using SIMD or GPUs: the normalizations that are merely looking up normalization classes and reducing one string to another seem amenable, but the normalizations involving re-ordering multiple accents may not be so amenable.)
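
To illustrate the distinction, here is a minimal Python sketch, assuming the standard unicodedata module; the function names are mine. It covers only the canonical ordering step, not decomposition or composition. The first function is a per-position lookup of combining classes, where each adjacent pair can be checked independently (the kind of thing that looks SIMD-friendly). The second does the actual re-ordering, which needs a stable sort over variable-length runs of combining marks.

```python
import unicodedata

def needs_reordering(s: str) -> bool:
    """Lookup-only pass: flag any adjacent pair of combining marks that is
    out of canonical order (ccc(a) > ccc(b) > 0). Each pair is independent."""
    ccc = [unicodedata.combining(c) for c in s]
    return any(a > b > 0 for a, b in zip(ccc, ccc[1:]))

def canonical_reorder(s: str) -> str:
    """Stable-sort each run of non-starters (ccc > 0) by combining class.
    The runs have data-dependent lengths, so this is less regular work."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# 'a' + combining acute (class 230) + combining dot below (class 220):
# the two accents are out of canonical order.
s = "a\u0301\u0323"
print(needs_reordering(s))
print(canonical_reorder(s) == unicodedata.normalize("NFD", s))
```

A receiver that wanted a cheap check could run only the first pass and reject or flag the input, leaving the re-ordering to the generator.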

Cheers

Rick