OASIS Mailing List Archives


   Re: [xml-dev] Microsoft FUD on binary XML...


Alaric B Snell <alaric@alaric-snell.com> wrote at Fri, 21 Nov 2003 13:36:24 +0000:
> Tony Graham wrote:
> > Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
> > Chinese characters in the Basic Multilingual Plane (i.e., most of the
> > Chinese characters in the message) since as UTF-16, one Chinese
> > character is 16 bits, and as UTF-8, one Chinese character is three
> > bytes.
> Exactly - efficient representation of Unicode text currently sadly
> involves the user or the application doing a frequency analysis and
> deciding whether to use UTF-8 or UTF-16... I think very, very few do
> this right now; UTF-8 seems the almost ubiquitous choice, mainly due to
> the software industry being driven from places that use the Roman alphabet.
> Perhaps we need a new UTF that loses many of UTF-8's nice properties with
> respect to lexical sorting and so on, but is less discriminatory against
> character sets that live far into the BMP, perhaps working along the
> lines of:

For a moment there, I thought you were inventing SCSU [1].

You might also be interested in BOCU-1 [2].
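
The size arithmetic quoted above (16 bits per BMP Chinese character in
UTF-16 versus three bytes in UTF-8) can be checked with a minimal Python
sketch. The function name encoded_sizes and the sample string are
illustrative assumptions, not anything from the thread, and the UTF-16
count deliberately excludes a byte order mark.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-16 byte count, UTF-8 byte count) for text.

    Hypothetical helper for illustration. "utf-16-be" is used so that
    no byte order mark is added to the UTF-16 count.
    """
    utf16 = len(text.encode("utf-16-be"))
    utf8 = len(text.encode("utf-8"))
    return utf16, utf8


# Assumed sample: four Chinese characters, all in the Basic Multilingual Plane.
sample = "中文字符"
utf16, utf8 = encoded_sizes(sample)
increase = (utf8 - utf16) / utf16
print(utf16, utf8, f"{increase:.0%}")  # prints: 8 12 50%
```

For pure ASCII text the comparison reverses: UTF-8 needs one byte per
character where UTF-16 needs two, which is why the choice depends on the
mix of characters in the message.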


Tony Graham
XML Technology Center - Dublin
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708

[1] http://www.unicode.org/reports/tr6/
[2] http://www.unicode.org/notes/tn6/

