UTF's considered best practice [was: Re: [xml-dev] nextml]
- From: Jim DeLaHunt <from.xml-dev@jdlh.com>
- To: Amelia A Lewis <amyzing@talsever.com>, Uche Ogbuji <uche@ogbuji.net>
- Date: Wed, 8 Dec 2010 22:31:02 -0800
Amy, and all:
At 10:18 PM -0700 12/8/10, Uche Ogbuji wrote:
>On Wed, Dec 8, 2010 at 9:27 PM, Amelia A Lewis <amyzing@talsever.com> wrote:
>
> >I've
> >seen a number of "only UTF" comments, and I think that they're rather
> >western-centric, so I'm thinking "no," there (if someone whose native
> >language *isn't* west european proposes it, I might rethink)
>
>
>Rick Jelliffe brings one of the most complete and coherent
>Eastern/Western perspectives I've ever encountered, and his proposal
>says:
>
>"A Nuke document is UTF-8 in its external form. Inside a program,
>after parsing, it would typically use UTF16."
>
>Yes, we all know about the politics and inertia that have affected
>uptake of Unicode in some geographies, but the "UTF-8 or UTF-16" is
>there for a very strong pragmatic reason. Dealing with a pretty
>open-ended world of character sets, as in XML 1.0 is one of the
>biggest factors that complicate and slow down parsers, even if you
>get someone else (e.g. ICU) to do the relatively hard bits....
I don't know much about XML (which is why I lurk here and learn), but
I do know something about internationalisation. Amy, I applaud your
caution against western-centric limitations to any nextml. But I'm with
Uche in saying that limiting any nextml proposal to Unicode
Transformation Formats (UTF-8, UTF-16BE, UTF-16LE) is good
internationalisation, not western-centric. In contrast, any other
text encoding will lock out some language or other.
Best internationalisation practice is to convert text into a Unicode
format on input, process it in Unicode, and convert back (if needed)
on output. I'm a regular attendee at the Internationalisation and
Unicode Conferences, and this is the consistent recommendation. See:
"Handling character encodings in HTML and CSS"
<http://www.w3.org/International/tutorials/tutorial-char-enc/>
"Unicode nearing 50% of the web"
Key quote: "[Google has] long used Unicode as the internal format for
all the text we search: any other encoding is first converted to
Unicode for processing."
<http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html>
(2010/01/28)
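To make the "convert on input, process in Unicode, convert back on
output" pattern concrete, here is a minimal Python sketch. The function
names (decode_input, encode_output) and the Shift_JIS sample are my own
illustrative assumptions, not anything from a nextml proposal. It also
shows how a non-Unicode encoding such as Latin-1 locks out text it
cannot represent.

```python
def decode_input(raw: bytes, declared_encoding: str) -> str:
    """Input boundary: turn bytes in some declared encoding into Unicode text."""
    return raw.decode(declared_encoding)


def encode_output(text: str, target_encoding: str = "utf-8") -> bytes:
    """Output boundary: serialise Unicode text, UTF-8 unless told otherwise."""
    return text.encode(target_encoding)


def can_represent(text: str, encoding: str) -> bool:
    """Report whether an encoding is able to represent the given text at all."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical legacy input: Japanese text that arrived as Shift_JIS bytes.
legacy_bytes = "日本語".encode("shift_jis")

# 1. Convert to Unicode on input.
text = decode_input(legacy_bytes, "shift_jis")

# 2. Process in Unicode (here, simply counting characters, not bytes).
char_count = len(text)

# 3. Convert back on output, only if the consumer needs the legacy form.
utf8_bytes = encode_output(text)
round_trip = encode_output(text, "shift_jis")

# Latin-1 cannot hold this text; UTF-8 and UTF-16 both can.
latin1_ok = can_represent(text, "latin-1")
utf8_ok = can_represent(text, "utf-8")
utf16_ok = can_represent(text, "utf-16")
```

The point of the sketch is that only the two boundary functions need to
know about legacy character sets; everything in between sees Unicode.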
For nextml, I think it's fine to limit document encodings to UTF-8
only, or UTF-8 plus UTF-16. Let the generators and consumers
transcode to other character sets if they think it important. Ten
years ago that wasn't a reasonable stance to take; documents encoded
in Unicode were rare. But now, more than 50% of the web is in
Unicode:
<http://twitter.com/mark_e_davis/statuses/22673110887> (2010/08/31)
[Mark Davis is Internationalization Architect for Google, and
President of the Unicode Consortium. He knows his stuff.]
Sometimes UTF-16 is a more compact representation, sometimes UTF-8
is. It depends on the frequency distribution of characters in the
document. But they have equivalent descriptive power; either can
represent any sequence of Unicode characters. If nextml adopts
UTF-16, be aware that it can be serialised to bytes in either
little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml
should account for those possibilities. It should also allow for the
special Byte-Order Mark character (BOM), which is used to distinguish
the two.
See also:
"Benefits of the Unicode Character Standard"
<http://www.i18nguy.com/UnicodeBenefits.html>
"Unicode in XML and other Markup Languages"
<http://www.unicode.org/reports/tr20/>
<http://www.w3.org/TR/unicode-xml/>
"Best Practices for XML Internationalization"
<http://www.w3.org/TR/xml-i18n-bp/>
So, even though my native language is western european, I hope you'll
reconsider and say "yes" to UTF-8 and/or UTF-16 only for nextml.
At 10:18 PM -0700 12/8/10, Uche Ogbuji continued:
...
>If we want to have a strong diversity of well-performing and
>conforming tools, which I suspect is an important component of
>success for most of us considering XML-NG, I think "UTF-*-only" is
>the simple reality. For me, UTF-8 or UTF-16 is certainly an
>improvement over JSON's UTF-8 only.
>
>I'm curious as to how that JSON limitation is affecting trends in
>text processing conventions in non-Western countries as "Web 2.0"
>becomes pervasive.
--
--Jim DeLaHunt, jdlh@jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
multilingual websites consultant
157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
Canada mobile +1-604-376-8953