xml-dev - Re: [xml-dev] Microsoft FUD on binary XML...


To: Tony Graham <Tony.Graham@Sun.COM>
Subject: Re: [xml-dev] Microsoft FUD on binary XML...
From: Alaric B Snell <alaric@alaric-snell.com>
Date: Fri, 21 Nov 2003 13:36:24 +0000
Cc: xml-dev@lists.xml.org
In-reply-to: <20031121.121152.50253888.Tony.Graham@Sun.COM>
References: <004201c3af99$e9944010$650aa8c0@BOBDEV> <3FBDAFDA.3010905@allette.com.au> <3FBDF386.9090307@alaric-snell.com> <20031121.121152.50253888.Tony.Graham@Sun.COM>
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030704 Debian/1.4-1

Tony Graham wrote:

> Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
> Chinese characters in the Basic Multilingual Plane (i.e., most of the
> Chinese characters in the message) since as UTF-16, one Chinese
> character is 16 bits, and as UTF-8, one Chinese character is three
> bytes.

Exactly. Efficient representation of Unicode text sadly still requires
the user or the application to do a frequency analysis and decide
between UTF-8 and UTF-16. I think very, very few do this right now;
UTF-8 seems the almost ubiquitous choice, mainly because the software
industry is driven from places that use the Roman alphabet.
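
As a rough illustration of that frequency analysis, the sketch below
simply encodes the text both ways and keeps the shorter result. The
function name smaller_encoding is made up for this example, and it
assumes UTF-16 is written big-endian without a byte order mark.

```python
def smaller_encoding(text: str) -> str:
    """Return whichever of UTF-8 / UTF-16 yields fewer bytes for `text`.

    Assumption: UTF-16 is big-endian with no byte order mark.
    Ties go to UTF-8, the first name listed.
    """
    sizes = {name: len(text.encode(name)) for name in ("utf-8", "utf-16-be")}
    return min(sizes, key=sizes.get)


if __name__ == "__main__":
    for sample in ("hello", "中文中文"):
        print(sample, smaller_encoding(sample))
```

For four BMP Chinese characters this compares 12 bytes of UTF-8 against
8 bytes of UTF-16, which is the 50% increase Tony describes.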

Perhaps we need a new UTF that loses many of UTF-8's nice properties with
respect to lexical sorting and so on, but is less discriminatory against
character sets that live far into the BMP, perhaps working along the
lines of:

Code points 0..127 represented as-is.

Code points 128+ represented by switching mode; to start a sequence of
up to 128 wide characters, output a byte consisting of 128 + (length-1),
then that many UTF-16 code units (in network byte order).

Plus some canonicalisation requirements, like the system must not have
two sequences of wide characters next to each other unless the first one
is 128 characters long (so there is no choice in how you split up blocks
of more than 128 wide characters; you must output sequences of 128
characters until there are fewer than 128 left).
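
The scheme above can be sketched in a few lines of Python. This is only
one reading of the proposal, not a worked-out specification: the names
encode and decode are invented here, block lengths are counted in UTF-16
code units, and a surrogate pair is allowed to straddle two adjacent
blocks, which the description above does not address.

```python
def encode(text: str) -> bytes:
    """Encode `text` with the mode-switching scheme sketched above."""
    raw = text.encode("utf-16-be")   # two bytes per UTF-16 code unit
    out = bytearray()
    run = bytearray()                # pending wide code units, as bytes

    def flush() -> None:
        # Emit full 128-unit (256-byte) blocks first, so there is only
        # one legal way to split a long run (the canonical form).
        for i in range(0, len(run), 256):
            block = run[i:i + 256]
            out.append(128 + len(block) // 2 - 1)   # header: 128 + (length-1)
            out.extend(block)
        run.clear()

    for i in range(0, len(raw), 2):
        hi, lo = raw[i], raw[i + 1]
        if hi == 0 and lo < 128:     # code point 0..127: written as-is
            flush()
            out.append(lo)
        else:                        # anything else joins the wide run
            run.extend((hi, lo))
    flush()
    return bytes(out)


def decode(data: bytes) -> str:
    """Inverse of encode(); assumes well-formed input."""
    units = bytearray()              # rebuilt UTF-16BE byte stream
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b < 128:
            units.extend((0, b))     # widen ASCII back to a code unit
        else:
            n = b - 127              # header byte carries length - 1
            units.extend(data[i:i + 2 * n])
            i += 2 * n
    return units.decode("utf-16-be")
```

With this reading, "abc" encodes to the same three bytes as US-ASCII, a
lone "é" becomes the three bytes 80 00 E9, and 200 BMP Chinese
characters take 402 bytes (two header bytes plus 400 bytes of UTF-16).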

That way text that was entirely outside the 0..127 range would only be
penalised by an extra byte per 256 bytes (128 characters). Pure US-ASCII
would still come out as pure US-ASCII so it'd be readable in legacy viewers.

People who use pound signs and accented characters, like us Europeans, 
would see each such symbol taking 3 bytes, but they currently take 2 
bytes in UTF-8 and occur only occasionally interspersed with US-ASCII 
characters anyway, so the hit would be nowhere near as bad as the hit 
UTF-8 incurs for the Chinese and their neighbours.
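
To put numbers on those size claims, here is a small size-only
calculation. The helper proposed_size is hypothetical and follows the
same reading of the scheme as the description above: one byte per code
point in 0..127, and for everything else two bytes per UTF-16 code unit
plus one header byte per block of up to 128 units.

```python
def proposed_size(text: str) -> int:
    """Byte count of `text` under the proposed scheme (size only)."""
    raw = text.encode("utf-16-be")
    size = 0
    run = 0                          # wide code units in the current run
    for i in range(0, len(raw), 2):
        if raw[i] == 0 and raw[i + 1] < 128:
            size += 1                # code point 0..127 costs one byte
            run = 0
        else:
            if run % 128 == 0:
                size += 1            # a new block needs a header byte
            size += 2                # the code unit itself
            run += 1
    return size


if __name__ == "__main__":
    for sample in ("cafe", "café", "中" * 128):
        print(
            len(sample),
            proposed_size(sample),
            len(sample.encode("utf-8")),
            len(sample.encode("utf-16-be")),
        )
```

For "café" this gives 6 bytes against 5 in UTF-8 and 8 in UTF-16, while
128 BMP Chinese characters come to 257 bytes against 384 in UTF-8 and
256 in UTF-16.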

> 
> Regards,
> 
> 
> Tony Graham

ABS
