Re: [xml-dev] Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms

From: Jim Melton <jim.melton@oracle.com>
To: "Costello, Roger L." <costello@mitre.org>
Date: Sat, 02 Feb 2013 14:03:50 -0700

Roger,

This is, IMHO, a Really Bad Idea.  It would be 
far better to automatically (e.g., via a script) 
normalize all input documents before validating or otherwise processing them.
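
A minimal sketch of such a normalizing script, in Python. The function name normalize_xml_text and the choice of NFC as the target form are assumptions made for illustration; neither comes from this thread. It only shows that a document written in NFD parses to the same element name as its NFC counterpart once it has been normalized.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_xml_text(xml_text: str, form: str = "NFC") -> str:
    """Return xml_text converted to the given Unicode normalization form.

    Hypothetical helper: normalizes the whole document as one string.
    """
    return unicodedata.normalize(form, xml_text)


# An instance document whose element name is in decomposed form (NFD):
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
nfd_doc = "<Test><re\u0301sume\u0301>____</re\u0301sume\u0301></Test>"

# Normalize first, then parse (or validate) the result.
nfc_doc = normalize_xml_text(nfd_doc)
root = ET.fromstring(nfc_doc)

# The element name now uses the precomposed U+00E9.
print(root[0].tag == "r\u00e9sum\u00e9")
```

One caveat on this design: normalizing the raw text can change markup when a combining character directly follows a delimiter (for example, ">" followed by U+0338 composes to U+226F), so a more careful tool might normalize names and character data separately after parsing.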

Your proposal addresses only a tiny fraction of 
the possible character-based "gotchas" and 
probably not the most important fraction, either.

Hope this helps,
    Jim


At 2/2/2013 12:03 PM, Costello, Roger L. wrote:
>Hi Folks,
>
>I propose the following as Best Practice:
>
>     For elements and attributes that have accents,
>     allow users to express them in either composed
>     normalized form (NFC) or decomposed normalized
>     form (NFD).
>
>Example: suppose that your XML vocabulary is to contain this element:
>
>     <résumé>
>
>Notice the two accented characters.
>
>There are two standard, canonical ways to express those accented characters:
>
>1. Normalization Form Composed (NFC): the 
>accented character is expressed as a single 
>composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
>
>2. Normalization Form Decomposed (NFD): the 
>accented character is expressed as a decomposed 
>sequence of two characters (U+0065 LATIN SMALL 
>LETTER E, U+0301 COMBINING ACUTE ACCENT)
>
>In the following XML document the first <résumé> 
>element is expressed using NFC. The second is expressed using NFD:
>
>     <?xml version="1.0" encoding="UTF-8"?>
>     <Test>
>             <résumé>____</résumé>
>             <résumé>____</résumé>
>     </Test>
>
>The two <résumé> elements appear the same, don't 
>they? That's a neat thing about NFC and NFD -- 
>visualization tools display them the same way.
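
To make the difference visible, here is a small Python sketch that lists the code points of the two spellings. The variable names and the code_points helper are illustrative assumptions; the code points themselves are the ones named in the quoted message.

```python
import unicodedata

# The same element name in the two normalization forms.
nfc_name = "r\u00e9sum\u00e9"        # precomposed U+00E9
nfd_name = "re\u0301sume\u0301"      # "e" + U+0301 COMBINING ACUTE ACCENT


def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(nfc_name))  # 6 code points
print(code_points(nfd_name))  # 8 code points

# A plain string comparison treats the two names as different ...
print(nfc_name == nfd_name)

# ... but they compare equal once both are in the same form.
print(unicodedata.normalize("NFC", nfd_name) == nfc_name)
```

An XML processor compares names code point by code point, which is why the two elements look identical on screen yet count as different names.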
>
>In order for users to express accented elements 
>and attributes in either NFC or NFD, design your 
>XML Schemas using a <xs:choice> element. In the 
>following XSD snippet the first résumé is NFC and the second is NFD:
>
>             <xs:choice>
>                 <xs:element name="résumé" type="xs:string" />
>                 <xs:element name="résumé" type="xs:string" />
>             </xs:choice>
>
>By designing your schemas in this fashion you 
>empower your instance document authors to use 
>whatever normalization form they prefer (or their tools prefer).
>
>I inquired on the Unicode mailing list about 
>NFD. Here are my notes on their responses:
>
>Most text exchanged on the Internet is 
>NFC-encoded. However, you can't count on text to 
>always be NFC-encoded. In fact, there are 
>definite advantages to NFD-encoding text.
>
>Some operating systems store filenames in NFD encoding.
>
>It's easier to remember a handful of useful 
>composing accents than the much larger number of combined forms.
>
>NFD makes the regular expressions used to 
>qualify its contents much, *much* simpler.  I 
>imagine that things like fuzzy text matching are easier in NFD.
>
>There are well-documented cases of, for example, 
>keyboards that generate de-normalized sequences, 
>file systems that use other forms, and tools 
>which generate content that is not normalized. 
>This content enters the Web in a non-NFC state.
>
>It is easier to use a few keystrokes for 
>combining accents than to set up compose key 
>sequences for all the possible composed characters.
>
>It's easier to do searches and other text processing on NFD-encoded text.
>
>Some Unicode-defined processes, such as 
>capitalization, are not guaranteed to preserve 
>normalization forms. So the result of converting 
>a lowercase character in NFC may be a decomposed 
>uppercase character sequence (i.e., NFD).
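
One concrete instance of the capitalization point, sketched in Python. The character chosen here (U+0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS) is an assumption picked for illustration and is not taken from the thread; the is_nfc helper is likewise hypothetical. The behaviour relies on Python's str.upper() applying Unicode's full case mappings.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


lower = "\u0390"          # a single precomposed character, already NFC
upper = lower.upper()     # full case mapping yields three code points

print([f"U+{ord(ch):04X}" for ch in upper])
print(is_nfc(lower))      # the lowercase input is NFC
print(is_nfc(upper))      # the uppercase result is not NFC

# Re-normalizing after the case conversion restores NFC.
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", upper)])
```

This suggests that a pipeline which changes case should normalize again afterwards rather than assume the form is preserved.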
>
>Thoughts?
>
>/Roger
>
>     I have approximate answers and possible beliefs
>     in different degrees of certainty about different
>     things, but I'm not absolutely sure of anything.
>
>                                              Richard Feynman

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG    Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA  Personal email: SheltieJim at xmission dot com
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================