Alaric B Snell <alaric@alaric-snell.com> wrote at Fri, 21 Nov 2003 13:36:24 +0000:
> Tony Graham wrote:
>
> > Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
> > Chinese characters in the Basic Multilingual Plane (i.e., most of the
> > Chinese characters in the message) since as UTF-16, one Chinese
> > character is 16 bits, and as UTF-8, one Chinese character is three
> > bytes.
>
> Exactly - efficient representation of Unicode text currently sadly
> involves the user or the application doing a frequency analysis and
> deciding whether to use UTF-8 or UTF-16... I think very, very few do
> this right now; UTF-8 seems the almost ubiquitous choice, mainly due to
> the software industry being driven from places that use the Roman alphabet.
>
> Perhaps we need a new UTF that loses many of UTF-8's nice properties with
> respect to lexical sorting and so on, but is less discriminatory against
> character sets that live far into the BMP, perhaps working along the
> lines of:
For a moment there, I thought you were inventing SCSU [1].
You might also be interested in BOCU-1 [2].
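
The size figures quoted above can be checked with a short Python sketch. The function name and the sample strings are illustrative assumptions, not anything from the thread; it only uses the standard str.encode() method, with "utf-16-be" chosen so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-be" avoids the 2-byte BOM that plain "utf-16" prepends
        "utf-16": len(text.encode("utf-16-be")),
    }

# Four Chinese characters, all in the Basic Multilingual Plane
# (sample chosen purely for illustration).
chinese = encoded_sizes("中文字符")
# Five ASCII characters for comparison.
ascii_text = encoded_sizes("hello")

# Relative growth when moving the Chinese sample from UTF-16 to UTF-8.
increase = (chinese["utf-8"] - chinese["utf-16"]) / chinese["utf-16"]

print("Chinese:", chinese)
print("ASCII:  ", ascii_text)
print(f"UTF-16 -> UTF-8 growth for the Chinese sample: {increase:.0%}")
```

For BMP Chinese text UTF-8 is the larger encoding, while for ASCII text UTF-16 is, which is why picking the smaller one by hand requires looking at what the text actually contains.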
Regards,
Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin
Sun Microsystems Ireland Ltd Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3 x(70)19708
[1] http://www.unicode.org/reports/tr6/
[2] http://www.unicode.org/notes/tn6/