UTF's considered best practice [was: Re: [xml-dev] nextml]


From: Jim DeLaHunt <from.xml-dev@jdlh.com>
To: Amelia A Lewis <amyzing@talsever.com>, Uche Ogbuji <uche@ogbuji.net>
Date: Wed, 8 Dec 2010 22:31:02 -0800

Amy, and all:

At 10:18 PM -0700 12/8/10, Uche Ogbuji wrote:
>On Wed, Dec 8, 2010 at 9:27 PM, Amelia A Lewis <amyzing@talsever.com> wrote:
>
>  >I've
>  >seen a number of "only UTF" comments, and I think that they're rather
>  >western-centric, so I'm thinking "no," there (if someone whose native
>  >language *isn't* west european proposes it, I might rethink)
>
>
>Rick Jelliffe brings one of the most complete and coherent
>Eastern/Western perspectives I've ever encountered, and his proposal
>says:
>
>"A Nuke document is UTF-8 in its external form. Inside a program,
>after parsing, it would typically use UTF16."
>
>Yes, we all know about the politics and inertia that have affected
>uptake of Unicode in some geographies, but the "UTF-8 or UTF-16" is
>there for a very strong pragmatic reason.  Dealing with a pretty
>open-ended world of character sets, as in XML 1.0 is one of the
>biggest factors that complicate and slow down parsers, even if you
>get someone else (e.g. ICU) to do the relatively hard bits....

I don't know much about XML (which is why I lurk here and learn), but 
I do know something about internationalisation.  Amy, I applaud your 
caution against western-centric limitations to any nextml.  But I'm 
with Uche in saying that limiting any nextml proposal to the Unicode 
Transformation Formats (UTF-8, UTF-16BE, UTF-16LE) is good 
internationalisation, not western-centric.  In contrast, any other 
text encoding will lock out some language or other.

Best internationalisation practice is to process text in Unicode: 
convert into a Unicode format on input, and convert back (if needed) 
on output.  I'm a regular attendee at the Internationalisation and 
Unicode Conferences, and this is the consistent recommendation. See:

"Handling character encodings in HTML and CSS"
<http://www.w3.org/International/tutorials/tutorial-char-enc/>

"Unicode nearing 50% of the web"
Key quote: "[Google has] long used Unicode as the internal format for 
all the text we search: any other encoding is first converted to 
Unicode for processing."
<http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html> 
(2010/01/28)
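
A minimal sketch of that convert-on-input, process-in-Unicode, 
convert-on-output pattern, in Python.  The function name, the 
Shift_JIS sample and the upper-casing step are illustrative 
assumptions, not anything taken from the nextml proposals:

```python
def process_legacy_text(raw: bytes, source_encoding: str) -> bytes:
    """Decode legacy bytes to Unicode, process, and re-encode as UTF-8."""
    # Input boundary: bytes in some legacy encoding become a Unicode string.
    text = raw.decode(source_encoding)
    # All processing happens on Unicode text (upper-casing is a stand-in).
    processed = text.upper()
    # Output boundary: serialise to a Unicode Transformation Format.
    return processed.encode("utf-8")


# Hypothetical input: a Shift_JIS byte string from an older system.
legacy = "日本語 xml".encode("shift_jis")
print(process_legacy_text(legacy, "shift_jis").decode("utf-8"))  # 日本語 XML
```

Only the two boundary calls know about the legacy encoding; everything 
in between sees Unicode.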

For nextml, I think it's fine to limit document encodings to UTF-8 
only, or UTF-8 plus UTF-16.  Let the generators and consumers 
transcode to and from other character sets if they think it 
important.  Ten years ago that wasn't a reasonable stance to take; 
documents encoded in Unicode were rare.  But now, more than 50% of 
the web is in Unicode:
<http://twitter.com/mark_e_davis/statuses/22673110887> (2010/08/31)
[Mark Davis is Internationalization Architect for Google, and 
President of the Unicode Consortium. He knows his stuff.]

Sometimes UTF-16 is a more compact representation, sometimes UTF-8 
is. It depends on the frequency distribution of characters in the 
document. But they have equivalent descriptive power; either can 
represent any sequence of Unicode characters.  If nextml adopts 
UTF-16, be aware that it can be serialised to bytes in either 
little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml 
should account for those possibilities. It should also allow for the 
special Byte-Order Mark character (BOM), which is used to distinguish 
the two.

See also:
"Benefits of the Unicode Character Standard" 
<http://www.i18nguy.com/UnicodeBenefits.html>

"Unicode in XML and other Markup Languages" 
<http://www.unicode.org/reports/tr20/>
<http://www.w3.org/TR/unicode-xml/>

"Best Practices for XML Internationalization" 
<http://www.w3.org/TR/xml-i18n-bp/>

So, even though my native language is western european, I hope you'll 
reconsider, and say "yes" to UTF-8 and/or UTF-16 only for nextml.

At 10:18 PM -0700 12/8/10, Uche Ogbuji continued:
...
>If we want to have a strong diversity of well-performing and
>conforming tools, which I suspect is an important component of
>success for most of us considering XML-NG, I think "UTF-*-only" is
>the simple reality.  For me, UTF-8 or UTF-16 is certainly an
>improvement over JSON's UTF-8 only.
>
>I'm curious as to how that JSON limitation is affecting trends in
>text processing conventions in non-Western countries as "Web 2.0"
>becomes pervasive.

-- 
     --Jim DeLaHunt, jdlh@jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant

       157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
          Canada mobile +1-604-376-8953

References:
- nextml
  - From: Amelia A Lewis <amyzing@talsever.com>
- Re: [xml-dev] nextml
  - From: Uche Ogbuji <uche@ogbuji.net>
