Re: [xml-dev] Best Practice for designing XML vocabularies containing accented characters -- allow both composed and decomposed forms
- From: Jim Melton <jim.melton@oracle.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Sat, 02 Feb 2013 14:03:50 -0700
Roger,
This is, IMHO, a Really Bad Idea. It would be
far better to automatically (e.g., via a script)
normalize all input documents before validating or otherwise processing them.
Your proposal addresses only a tiny fraction of
the possible character-based "gotchas" and
probably not the most important fraction, either.
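
As a rough sketch of what such a normalizing script might look like
(Python, standard library only): the function name normalize_xml_text
is made up for illustration, and NFC is only an assumed choice of
target form. It shows that the composed and decomposed spellings of
the element differ as raw strings but become identical once
normalized, so the schema needs to declare only one of them.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_xml_text(xml_text: str, form: str = "NFC") -> str:
    """Return xml_text converted to the given Unicode normalization form."""
    return unicodedata.normalize(form, xml_text)


# The same element written with precomposed U+00E9 ...
composed = "<r\u00e9sum\u00e9>text</r\u00e9sum\u00e9>"
# ... and with U+0065 followed by U+0301 (combining acute accent).
decomposed = "<re\u0301sume\u0301>text</re\u0301sume\u0301>"

# The raw strings are different sequences of code points.
assert composed != decomposed

# After normalization they are the same string.
assert normalize_xml_text(composed) == normalize_xml_text(decomposed)

# Parse only after normalizing; the element name is then the NFC spelling.
root = ET.fromstring(normalize_xml_text(decomposed))
print(root.tag == "r\u00e9sum\u00e9")
```

Note that this works on the raw text only. Numeric character
references such as &#x301; are expanded later by the parser, so a real
pipeline would probably need to normalize after parsing as well.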
Hope this helps,
Jim
At 2/2/2013 12:03 PM, Costello, Roger L. wrote:
>Hi Folks,
>
>I propose the following as Best Practice:
>
> For elements and attributes that have accents,
> allow users to express them in either composed
> normalized form (NFC) or decomposed normalized
> form (NFD).
>
>Example: suppose that your XML vocabulary is to contain this element:
>
> <résumé>
>
>Notice the two accented characters.
>
>There are two standard, canonical ways to express those accented characters:
>
>1. Normalization Form Composed (NFC): the
>accented character is expressed as a single
>composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
>
>2. Normalization Form Decomposed (NFD): the
>accented character is expressed as a decomposed
>sequence of two characters (U+0065 LATIN SMALL
>LETTER E, U+0301 COMBINING ACUTE ACCENT)
>
>In the following XML document the first <résumé>
>element is expressed using NFC. The second is expressed using NFD:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <Test>
> <résumé>____</résumé>
> <résumé>____</résumé>
> </Test>
>
>The two <résumé> elements appear the same, don’t
>they? That’s a neat thing about NFC and NFD --
>visualization tools display them the same way.
>
>In order for users to express accented elements
>and attributes in either NFC or NFD, design your
>XML Schemas using a <xs:choice> element. In the
>following XSD snippet the first résumé is NFC and the second is NFD:
>
> <xs:choice>
> <xs:element name="résumé" type="xs:string" />
> <xs:element name="résumé" type="xs:string" />
> </xs:choice>
>
>By designing your schemas in this fashion you
>empower your instance document authors to use
>whatever normalization form they prefer (or their tools prefer).
>
>I inquired on the Unicode mailing list about
>NFD. Here are my notes on their responses:
>
>Most text exchanged on the Internet is
>NFC-encoded. However, you can't count on text to
>always be NFC-encoded. In fact, there are
>definite advantages to NFD-encoding text.
>
>Some operating systems store filenames in NFD encoding.
>
>It’s easier to remember a handful of useful
>composing accents than the much larger number of combined forms.
>
>NFD makes the regular expressions used to
>qualify its contents much, *much* simpler. I
>imagine that things like fuzzy text matching are easier in NFD.
>
>There are well-documented cases of, for example,
>keyboards that generate de-normalized sequences,
>file systems that use other forms, and tools
>which generate content that is not normalized.
>This content enters the Web in a non-NFC state.
>
>It is easier to use a few keystrokes for
>combining accents than to set up compose key
>sequences for all the possible composed characters.
>
>It’s easier to do searches and other text processing on NFD-encoded text.
>
>Some Unicode-defined processes, such as
>capitalization, are not guaranteed to preserve
>normalization forms. So the result of converting
>a lowercase character in NFC may be a decomposed
>uppercase character sequence (i.e., NFD).
>
>Thoughts?
>
>/Roger
>
> I have approximate answers and possible beliefs
> in different degrees of certainty about different
> things, but I'm not absolutely sure of anything.
>
> Richard Feynman
>
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG Fax : +1.801.942.3345
Oracle Corporation Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com
========================================================================
= Facts are facts. But any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =
========================================================================