Re: [xml-dev] Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms


Roger, stop reinventing the wheel. This is all known territory you are 
exploring. Read

http://www.w3.org/TR/charmod-norm/

and if you think it's wrong, tell us why.

Michael Kay
Saxonica


On 02/02/2013 19:03, Costello, Roger L. wrote:
> Hi Folks,
>
> I propose the following as Best Practice:
>
> 	For elements and attributes that have accents,
> 	allow users to express them in either composed
> 	normalized form (NFC) or decomposed normalized
> 	form (NFD).
>
> Example: suppose that your XML vocabulary is to contain this element:
>
> 	<résumé>
>
> Notice the two accented characters.
>
> There are two standard, canonical ways to express those accented characters:
>
> 1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
>
> 2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
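
The two forms listed above can be converted into each other mechanically. As a minimal sketch (assuming Python 3 and its standard unicodedata module; the variable names are illustrative only and not part of the original post):

```python
import unicodedata

# NFC spelling: each accented e is the single code point U+00E9.
composed = "r\u00e9sum\u00e9"

# NFD spelling: each accented e becomes U+0065 followed by U+0301.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))                        # 6 8
print(decomposed == "re\u0301sume\u0301")                    # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

The strings differ in length and in code points, yet each normalizes to the other without loss.
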
>
> In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD:
>
> 	<?xml version="1.0" encoding="UTF-8"?>
> 	<Test>
> 	        <résumé>____</résumé>
> 	        <résumé>____</résumé>
> 	</Test>
>
> The two <résumé> elements appear the same, don’t they? That’s a neat thing about NFC and NFD -- visualization tools display them the same way.
>
> In order for users to express accented elements and attributes in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé is NFC and the second is NFD:
>
>              <xs:choice>
>                  <xs:element name="résumé" type="xs:string" />
>                  <xs:element name="résumé" type="xs:string" />
>              </xs:choice>
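
The reason the snippet needs two declarations is that the two spellings are different names when compared code point by code point. A small sketch that dumps the code points of each name (assuming Python 3; code_points is a hypothetical helper introduced here for illustration):

```python
def code_points(name):
    """Return the code points of a name in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in name]

nfc_name = "r\u00e9sum\u00e9"      # name as declared in NFC
nfd_name = "re\u0301sume\u0301"    # name as declared in NFD

print(code_points(nfc_name))
print(code_points(nfd_name))
print(nfc_name == nfd_name)  # False: an exact comparison treats them as two names
```
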
>
> By designing your schemas in this fashion you empower your instance document authors to use whatever normalization form they prefer (or their tools prefer).
>
> I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:
>
> Most text exchanged on the Internet is NFC-encoded. However, you can't count on text to always be NFC-encoded. In fact, there are definite advantages to NFD-encoding text.
>
> Some operating systems store filenames in NFD encoding.
>
> It’s easier to remember a handful of useful composing accents than the much larger number of combined forms.
>
> NFD makes the regular expressions used to qualify its contents much, *much* simpler.  I imagine that things like fuzzy text matching are easier in NFD.
>
> There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools which generate content that is not normalized. This content enters the Web in a non-NFC state.
>
> It is easier to use a few keystrokes for combining accents than to set up compose key sequences for all the possible composed characters.
>
> It’s easier to do searches and other text processing on NFD-encoded text.
>
> Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So the result of converting a lowercase character in NFC may be a decomposed uppercase character sequence (i.e., NFD).
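
The point about capitalization can be checked empirically. The sketch below (assuming Python 3, whose str.upper() applies Unicode case mappings; the function names are hypothetical) tests whether uppercasing keeps a string in NFC and scans a range of code points for characters where it does not. Which characters turn up depends on the Unicode data bundled with the interpreter, so no particular result is claimed here.

```python
import unicodedata

def is_nfc(text):
    """True when text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

def upper_preserves_nfc(text):
    """True when text is NFC and its uppercase form is still NFC."""
    return is_nfc(text) and is_nfc(text.upper())

def find_counterexamples(limit=0x3000):
    """List code points below limit that are NFC but whose uppercase is not."""
    found = []
    for cp in range(limit):
        ch = chr(cp)
        if is_nfc(ch) and not is_nfc(ch.upper()):
            found.append(f"U+{cp:04X}")
    return found

print(upper_preserves_nfc("r\u00e9sum\u00e9"))  # True
print(find_counterexamples())
```
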
>
> Thoughts?
>
> /Roger
>
>      I have approximate answers and possible beliefs
>      in different degrees of certainty about different
>      things, but I'm not absolutely sure of anything.
>
>                                               Richard Feynman
>
