Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms
- From: "Costello, Roger L." <costello@mitre.org>
- To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Sat, 2 Feb 2013 19:03:49 +0000
Hi Folks,
I propose the following as Best Practice:
For element and attribute names that contain
accented characters, allow users to express
them in either composed normalized form (NFC)
or decomposed normalized form (NFD).
Example: suppose that your XML vocabulary is to contain this element:
<résumé>
Notice the two accented characters.
There are two standard, canonical ways to express those accented characters:
1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
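To make the difference concrete, here is a minimal sketch using Python's standard unicodedata module (my illustration, not part of the proposal):

import unicodedata

# "résumé" in its two canonical forms.
nfc = unicodedata.normalize("NFC", "r\u00e9sum\u00e9")
nfd = unicodedata.normalize("NFD", nfc)

print([hex(ord(c)) for c in nfc])
# ['0x72', '0xe9', '0x73', '0x75', '0x6d', '0xe9']
print([hex(ord(c)) for c in nfd])
# ['0x72', '0x65', '0x301', '0x73', '0x75', '0x6d', '0x65', '0x301']

# Different code point sequences, yet canonically equivalent:
print(nfc == nfd)                                # False
print(unicodedata.normalize("NFC", nfd) == nfc)  # True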
In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD:
<?xml version="1.0" encoding="UTF-8"?>
<Test>
<résumé>____</résumé>
<résumé>____</résumé>
</Test>
The two <résumé> elements appear the same, don’t they? That’s a neat thing about NFC and NFD -- the two forms are canonically equivalent, so rendering tools display them identically.
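They render identically, but an XML parser compares names code point by code point, so it sees two different element names. Here is a quick check with Python's standard xml.etree.ElementTree (my illustration; it relies on the parser accepting combining marks in names, which XML 1.0 permits in non-initial position):

import unicodedata
import xml.etree.ElementTree as ET

nfc = "r\u00e9sum\u00e9"                    # composed
nfd = unicodedata.normalize("NFD", nfc)     # decomposed

root = ET.fromstring(f"<Test><{nfc}>_</{nfc}><{nfd}>_</{nfd}></Test>")

# Both tags display as "résumé", yet they are distinct names:
print(root[0].tag == root[1].tag)   # False
print(len(root.findall(nfc)))       # 1 -- tag matching is code-point exact

Unless a schema declares both spellings, one of them will fail validation -- which motivates the following design.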
In order for users to express accented elements and attributes in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé is NFC and the second is NFD:
<xs:choice>
<!-- NFC: é is the single character U+00E9 -->
<xs:element name="résumé" type="xs:string" />
<!-- NFD: é is the two-character sequence U+0065 U+0301 -->
<xs:element name="résumé" type="xs:string" />
</xs:choice>
By designing your schemas in this fashion you empower your instance document authors to use whatever normalization form they prefer (or their tools prefer).
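To check that a schema designed this way really accepts both spellings, here is a rough validation sketch. It uses the third-party xmlschema package (one of several XSD validators for Python), and it wraps the choice in a Test root element -- my addition, purely to make the example runnable:

import unicodedata
import xmlschema   # pip install xmlschema

nfc = "r\u00e9sum\u00e9"
nfd = unicodedata.normalize("NFD", nfc)

xsd = f"""<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Test">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="{nfc}" type="xs:string"/>
        <xs:element name="{nfd}" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = xmlschema.XMLSchema(xsd)
print(schema.is_valid(f"<Test><{nfc}>_</{nfc}></Test>"))  # True
print(schema.is_valid(f"<Test><{nfd}>_</{nfd}></Test>"))  # True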
I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:
Most text exchanged on the Internet is NFC-normalized. However, you can't count on text always being in NFC. In fact, there are definite advantages to NFD-normalized text.
Some operating systems store filenames in NFD encoding.
It’s easier to remember a handful of useful composing accents than the much larger number of combined forms.
NFD makes the regular expressions used to validate or match text content much, *much* simpler. I imagine that things like fuzzy text matching are easier in NFD. (See the sketch after these notes.)
There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools which generate content that is not normalized. This content enters the Web in a non-NFC state.
It is easier to use a few keystrokes for combining accents than to set up compose key sequences for all the possible composed characters.
It’s easier to do searches and other text processing on NFD-normalized text.
Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So the result of converting a lowercase character in NFC may be a decomposed uppercase character sequence (i.e., NFD).
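To illustrate the regex and case-conversion points above, here is a small sketch with Python's standard re and unicodedata modules (the ǰ example is mine, not from the Unicode list):

import re
import unicodedata

text = unicodedata.normalize("NFD", "r\u00e9sum\u00e9")

# On NFD text, one small character class covers every combining
# diacritic, so accent-insensitive matching is a one-liner:
print(re.sub(r"[\u0300-\u036f]", "", text))      # resume

# Case conversion need not preserve composed forms: U+01F0 (ǰ) has
# no precomposed uppercase, so upper-casing it yields the decomposed
# sequence J (U+004A) + COMBINING CARON (U+030C):
print([hex(ord(c)) for c in "\u01f0".upper()])   # ['0x4a', '0x30c']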
Thoughts?
/Roger
I have approximate answers and possible beliefs
in different degrees of certainty about different
things, but I'm not absolutely sure of anything.
Richard Feynman