OASIS Mailing List Archives

   Re: [xml-dev] Microsoft FUD on binary XML...

Tony Graham wrote:

> Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
> Chinese characters in the Basic Multilingual Plane (i.e., most of the
> Chinese characters in the message) since as UTF-16, one Chinese
> character is 16 bits, and as UTF-8, one Chinese character is three
> bytes.

Exactly. Sadly, efficient representation of Unicode text currently
requires the user or the application to do a frequency analysis and
decide whether to use UTF-8 or UTF-16. I think very, very few do this
right now; UTF-8 seems the almost ubiquitous choice, mainly because the
software industry is driven from places that use the Roman alphabet.

Perhaps we need a new UTF that loses many of UTF-8's nice properties with
respect to lexical sorting and so on, but is less discriminatory against
character sets that live far into the BMP, perhaps working along the
lines of:

Code points 0..127 represented as-is.

Code points 128+ represented by switching mode; to start a sequence of
up to 128 wide characters, output a byte consisting of 128 + (length-1),
then that many 16-bit UTF-16 code units (in network byte order).

Plus some canonicalisation requirements, such as: an encoder must not
emit two sequences of wide characters next to each other unless the
first one is 128 characters long (so there is no choice in how you split
up blocks of more than 128 wide characters; you must output sequences of
128 characters until there are fewer than 128 left).

That way text that lies entirely outside the 0..127 range would only be
penalised by an extra byte per 256 bytes (128 characters). Pure US-ASCII
would still come out as pure US-ASCII so it'd be readable in legacy viewers.

People who use pound signs and accented characters, like us Europeans, 
would see each such symbol taking 3 bytes, but they currently take 2 
bytes in UTF-8 and occur only occasionally interspersed with US-ASCII 
characters anyway, so the hit would be nowhere near as bad as the hit 
UTF-8 incurs for the Chinese and their neighbours.
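
The size claims can be checked with a little arithmetic. The sketch below compares byte counts for UTF-8, UTF-16 (without a byte order mark) and the scheme outlined above. proposed_size is a hypothetical helper that only counts bytes rather than encoding anything, and the two sample strings are made up for illustration.

```python
import re


def proposed_size(text: str) -> int:
    """Byte count under the sketched scheme (hypothetical helper)."""
    total = 0
    # Split the text into maximal runs of ASCII and of non-ASCII characters.
    for run in re.findall(r"[^\x00-\x7f]+|[\x00-\x7f]+", text):
        if ord(run[0]) < 128:
            total += len(run)  # one byte per ASCII character
        else:
            units = len(run.encode("utf-16-be")) // 2
            headers = -(-units // 128)  # one header byte per 128 code units
            total += 2 * units + headers
    return total


def sizes(text: str) -> dict:
    """Byte counts for the three encodings being compared."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "proposed": proposed_size(text),
    }


chinese = "\u4e2d\u6587" * 64  # 128 BMP Chinese characters
european = "It cost \u00a35 at the caf\u00e9"  # mostly ASCII, two accents

for label, sample in (("chinese", chinese), ("european", european)):
    print(label, sizes(sample))
```

For the 128 Chinese characters this gives 384 bytes in UTF-8, 256 in UTF-16 and 257 in the sketched scheme; for the mostly-ASCII sentence it gives 24, 44 and 26 bytes respectively.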

> Regards,
> Tony Graham




Copyright 2001 XML.org. This site is hosted by OASIS