OASIS Mailing List Archives

Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms

Hi Folks,

I propose the following as Best Practice:

	For element and attribute names that contain
	accented characters, allow users to express
	them in either composed normalized form (NFC)
	or decomposed normalized form (NFD).

Example: suppose that your XML vocabulary is to contain this element:

	<résumé>...</résumé>

Notice the two accented characters.

There are two standard, canonical ways to express those accented characters:

1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)

2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
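The difference between the two forms is easy to see by inspecting code points. Here is a minimal Python sketch using the standard library's unicodedata module:

```python
import unicodedata

word = "résumé"
nfc = unicodedata.normalize("NFC", word)  # each é stored as U+00E9
nfd = unicodedata.normalize("NFD", word)  # each é stored as U+0065 + U+0301

print(len(nfc))  # 6 code points
print(len(nfd))  # 8 code points -- each é is two code points in NFD

# The strings are not equal code point by code point...
print(nfc == nfd)  # False

# ...but normalizing both to the same form makes them comparable:
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

This is why tools that compare text byte-for-byte can disagree about two strings that look identical on screen.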

In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD (the root element is illustrative):

	<?xml version="1.0" encoding="UTF-8"?>
	<résumés>
		<résumé>...</résumé>  <!-- NFC: é is U+00E9 -->
		<résumé>...</résumé>  <!-- NFD: é is U+0065 U+0301 -->
	</résumés>
The two <résumé> elements appear the same, don’t they? That’s a neat thing about NFC and NFD -- visualization tools display them the same way.

In order for users to express accented elements and attributes in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé declaration is NFC and the second is NFD (the two names display identically but differ in code points):

	<xs:choice>
		<xs:element name="résumé" type="xs:string" />  <!-- NFC: é is U+00E9 -->
		<xs:element name="résumé" type="xs:string" />  <!-- NFD: é is U+0065 U+0301 -->
	</xs:choice>

By designing your schemas in this fashion you empower your instance document authors to use whatever normalization form they prefer (or their tools prefer).
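The flip side is that a consuming application must cope with both spellings, because XML compares names code point by code point, not by normalized form. A small Python sketch (the document and names here are illustrative) shows an NFC name and an NFD name parsing as two distinct tags until the consumer normalizes them:

```python
import unicodedata
import xml.etree.ElementTree as ET

nfc = unicodedata.normalize("NFC", "résumé")  # 6 code points
nfd = unicodedata.normalize("NFD", "résumé")  # 8 code points

# Build a document containing one element in each normalization form.
doc = f"<docs><{nfc}>one</{nfc}><{nfd}>two</{nfd}></docs>"
root = ET.fromstring(doc)

# The parser compares names code point by code point, so it sees two tags:
raw_tags = {child.tag for child in root}
print(len(raw_tags))  # 2

# Normalizing tag names lets the consumer treat both spellings as one:
norm_tags = {unicodedata.normalize("NFC", child.tag) for child in root}
print(len(norm_tags))  # 1
```

So a schema that accepts both forms should be paired with consumers that normalize names before comparing them.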

I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:

- Most text exchanged on the Internet is NFC-encoded. However, you can't count on text to always be NFC-encoded. In fact, there are definite advantages to NFD-encoding text.

- Some operating systems store filenames in NFD encoding.

- It's easier to remember a handful of useful combining accents than the much larger number of composed forms.

- NFD makes the regular expressions used to validate text content much, *much* simpler. I imagine that things like fuzzy text matching are easier in NFD, too.

- There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools that generate content that is not normalized. This content enters the Web in a non-NFC state.

- It is easier to use a few keystrokes for combining accents than to set up compose-key sequences for all the possible composed characters.

- It's easier to do searches and other text processing on NFD-encoded text.

- Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So uppercasing a lowercase NFC character may yield a decomposed (i.e., NFD) character sequence.
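As an illustration of the text-processing point above: once text is in NFD, accent-insensitive matching reduces to dropping the combining marks. A sketch in Python:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop all combining marks."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

print(strip_accents("résumé"))  # resume

# Accent-insensitive comparison then becomes plain string equality:
print(strip_accents("résumé") == strip_accents("resume"))  # True
```

Doing the same on NFC text would require a lookup table mapping every precomposed character to its base letter.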



    I have approximate answers and possible beliefs 
    in different degrees of certainty about different 
    things, but I'm not absolutely sure of anything.

                                             Richard Feynman



Copyright 1993-2007 XML.org. This site is hosted by OASIS