Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms

From: "Costello, Roger L." <costello@mitre.org>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Sat, 2 Feb 2013 19:03:49 +0000

Hi Folks,

I propose the following as Best Practice:

	For element and attribute names that contain
	accented characters, allow users to express them
	in either composed normalized form (NFC) or
	decomposed normalized form (NFD).

Example: suppose that your XML vocabulary is to contain this element:

	<résumé>

Notice the two accented characters. 

There are two standard, canonical ways to express those accented characters:

1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)

2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
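
As a small illustration of the two forms, here is a Python sketch using the standard unicodedata module. The variable names and the code_points helper are mine, chosen only for this example:

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Converting between the two canonical forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)


def code_points(text):
    """List the code points of a string in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(nfc))
print(code_points(nfd))
```

The first list should contain one code point and the second two, even though both strings render as the same accented letter.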

In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD:

	<?xml version="1.0" encoding="UTF-8"?>
	<Test>
	        <résumé>____</résumé>
	        <résumé>____</résumé>
	</Test>

The two <résumé> elements appear the same, don't they? That's a neat thing about NFC and NFD -- visualization tools display them the same way.
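
Although the two elements look the same on screen, a parser sees two different names. Here is a rough Python sketch of that, assuming the standard library's ElementTree parser. The document string is built in code so that the normalization form of each name is explicit:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Same visible name, two different code point sequences.
nfc_name = unicodedata.normalize("NFC", "r\u00e9sum\u00e9")
nfd_name = unicodedata.normalize("NFD", nfc_name)

document = (
    "<Test>"
    f"<{nfc_name}>____</{nfc_name}>"
    f"<{nfd_name}>____</{nfd_name}>"
    "</Test>"
)

root = ET.fromstring(document)
tags = [child.tag for child in root]

# The parser reports the names exactly as written; it does not normalize them.
print([len(tag) for tag in tags])
print(tags[0] == tags[1])
```

The two tag names have different lengths and do not compare equal, which is why a schema that declares only one of them would not match the other.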

In order for users to express accented elements and attributes in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé is NFC and the second is NFD:

            <xs:choice>
                <xs:element name="résumé" type="xs:string" />
                <xs:element name="résumé" type="xs:string" />
            </xs:choice>
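
Typing both forms by hand is error-prone because they look identical in an editor. One possible way to generate the pair is sketched below in Python. The choice_for_name function is a hypothetical helper of mine, not part of any XSD tool, and it only builds the text of the snippet:

```python
import unicodedata


def choice_for_name(name, xsd_type="xs:string"):
    """Build an XSD fragment declaring the NFC and NFD spellings of a name.

    Hypothetical helper for illustration; it does no schema validation.
    """
    forms = []
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, name)
        if normalized not in forms:
            forms.append(normalized)

    # A name without accents has a single form, so no choice is needed.
    if len(forms) == 1:
        return f'<xs:element name="{forms[0]}" type="{xsd_type}" />'

    lines = ["<xs:choice>"]
    lines += [f'    <xs:element name="{n}" type="{xsd_type}" />' for n in forms]
    lines.append("</xs:choice>")
    return "\n".join(lines)


snippet = choice_for_name("r\u00e9sum\u00e9")
print(snippet)
```

For a name with no accented characters the helper falls back to a single element declaration, since NFC and NFD are then the same string.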

By designing your schemas in this fashion you empower your instance document authors to use whatever normalization form they prefer (or their tools prefer).

I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:

Most text exchanged on the Internet is NFC-encoded. However, you can't count on text to always be NFC-encoded. In fact, there are definite advantages to NFD-encoding text.

Some operating systems store filenames in NFD encoding.

It's easier to remember a handful of useful composing accents than the much larger number of combined forms.

NFD makes the regular expressions used to qualify its contents much, *much* simpler.  I imagine that things like fuzzy text matching are easier in NFD.

There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools which generate content that is not normalized. This content enters the Web in a non-NFC state.

It is easier to use a few keystrokes for combining accents than to set up compose key sequences for all the possible composed characters.

It's easier to do searches and other text processing on NFD-encoded text.
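
To make the searching point concrete, here is a minimal Python sketch of accent-insensitive matching that relies on NFD. Both function names are my own, and this is only one simple approach to fuzzy matching, not something the Unicode list prescribed:

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop the combining marks (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_find(needle, haystack):
    """Report whether needle occurs in haystack, ignoring accents and case."""
    return strip_accents(needle).casefold() in strip_accents(haystack).casefold()


print(strip_accents("r\u00e9sum\u00e9"))
print(accent_insensitive_find("resume", "My R\u00e9sum\u00e9"))
```

Once the text is decomposed, the base letters and the accents are separate characters, so the accents can be filtered out without a table of every composed form.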

Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So the result of converting a lowercase character in NFC may be a decomposed uppercase character sequence (i.e., NFD).
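
One case that appears to show this is sketched below in Python, assuming an interpreter whose str.upper() applies the full Unicode case mappings (Python 3.3 or later, as far as I know). The character is my own choice of example, not one given on the Unicode list:

```python
import unicodedata

# U+0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS, a single
# composed character that is already in NFC.
lower = "\u0390"
upper = lower.upper()

# There is no single uppercase code point for this letter, so the case
# mapping yields a base letter followed by two combining marks.
print([f"U+{ord(ch):04X}" for ch in upper])

lower_is_nfc = unicodedata.normalize("NFC", lower) == lower
upper_is_nfc = unicodedata.normalize("NFC", upper) == upper
upper_is_nfd = unicodedata.normalize("NFD", upper) == upper
print(lower_is_nfc, upper_is_nfc, upper_is_nfd)
```

If the assumption holds, the lowercase input is in NFC while the uppercase result is a decomposed sequence that is no longer in NFC, so text would need to be re-normalized after case conversion.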

Thoughts?

/Roger

    I have approximate answers and possible beliefs 
    in different degrees of certainty about different 
    things, but I'm not absolutely sure of anything.

                                             Richard Feynman

Follow-Ups:
- Re: [xml-dev] Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms
  - From: Michael Kay <mike@saxonica.com>
- Re: [xml-dev] Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms
  - From: Jim Melton <jim.melton@oracle.com>
- Re: [xml-dev] Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms
  - From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Re: [xml-dev] Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms
  - From: Liam R E Quin <liam@w3.org>
- RE: Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms
  - From: David Lee <dlee@calldei.com>
