Tony Graham wrote:
> Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
> Chinese characters in the Basic Multilingual Plane (i.e., most of the
> Chinese characters in the message) since as UTF-16, one Chinese
> character is 16 bits, and as UTF-8, one Chinese character is three
> bytes.
Exactly - efficient representation of Unicode text currently, sadly,
involves the user or the application doing a frequency analysis and
deciding whether to use UTF-8 or UTF-16... I think very, very few do
this right now; UTF-8 seems the almost ubiquitous choice, mainly due to
the software industry being driven from places that use the Roman alphabet.
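
To make the trade-off concrete, here is a minimal Python sketch of the simplest form of that analysis: encode the text both ways and keep the shorter result. The function names encoded_sizes and smaller_encoding are made up for illustration, and a real application might sample the text rather than encode all of it.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
    }


def smaller_encoding(text):
    """Name the encoding that gives fewer bytes; a tie goes to UTF-8."""
    sizes = encoded_sizes(text)
    return min(sizes, key=sizes.get)


# Two BMP Chinese characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
print(encoded_sizes("\u4e2d\u6587"))   # {'utf-8': 6, 'utf-16': 4}
print(smaller_encoding("hello"))       # utf-8
```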
Perhaps we need a new UTF that loses many of UTF-8's nice properties with
respect to lexical sorting and so on, but is less discriminatory against
character sets that live far into the BMP, perhaps working along these
lines:
Code points 0..127 represented as-is.
Code points 128+ represented by switching mode; to start a sequence of
up to 128 wide characters, output a byte consisting of 128 + (length-1),
then that many UTF-16 characters (in network byte order).
Plus some canonicalisation requirements, like the system must not have
two sequences of wide characters next to each other unless the first one
is 128 characters long (so there is no choice in how you split up blocks
of more than 128 wide characters; you must output sequences of 128
characters until there are fewer than 128 left).
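
The rules above can be sketched in Python roughly as follows. This is only one reading of the proposal, not a defined format: the names encode_run_utf and decode_run_utf are invented here, a "wide character" is taken to mean one UTF-16 code unit (so a character outside the BMP counts as two), and a run is assumed to end at the next code point in 0..127.

```python
def encode_run_utf(text):
    """Encode text using the run-length scheme sketched in the message."""
    out = bytearray()
    i, n = 0, len(text)
    while i < n:
        cp = ord(text[i])
        if cp < 128:
            out.append(cp)              # code points 0..127 go out as-is
            i += 1
            continue
        j = i
        while j < n and ord(text[j]) >= 128:
            j += 1                      # find the end of the non-ASCII run
        units = text[i:j].encode("utf-16-be")   # network byte order
        # Canonical split: full blocks of 128 code units (256 bytes),
        # then whatever is left over.
        for k in range(0, len(units), 256):
            chunk = units[k:k + 256]
            out.append(128 + (len(chunk) // 2 - 1))
            out += chunk
        i = j
    return bytes(out)


def decode_run_utf(data):
    """Inverse of encode_run_utf; assumes well-formed input."""
    parts = []
    wide = bytearray()                  # pending UTF-16 bytes
    i = 0
    while i < len(data):
        b = data[i]
        if b < 128:
            if wide:                    # an ASCII byte ends the wide run
                parts.append(wide.decode("utf-16-be"))
                wide.clear()
            parts.append(chr(b))
            i += 1
        else:
            nbytes = 2 * (b - 127)      # length byte is 128 + (length - 1)
            wide += data[i + 1:i + 1 + nbytes]
            i += 1 + nbytes
    if wide:
        parts.append(wide.decode("utf-16-be"))
    return "".join(parts)
```

The decoder joins adjacent blocks before decoding, because with this reading a surrogate pair could straddle the boundary between two 128-unit blocks.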
That way text that was all out of the 0..127 range would only be
penalised by an extra byte per 256 bytes (128 characters). Pure US-ASCII
would still come out as pure US-ASCII so it'd be readable in legacy viewers.
People who use pound signs and accented characters, like us Europeans,
would see each such symbol taking 3 bytes, but they currently take 2
bytes in UTF-8 and occur only occasionally interspersed with US-ASCII
characters anyway, so the hit would be nowhere near as bad as the hit
UTF-8 incurs for the Chinese and their neighbours.
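
As a rough check on those numbers, the following Python sketch counts bytes under the proposed scheme without building the output, and sets the result next to UTF-8 and UTF-16. The names proposed_size and compare_sizes are hypothetical, and the sketch assumes a wide character is one UTF-16 code unit.

```python
def proposed_size(text):
    """Byte count under the scheme described above."""

    def run_cost(units):
        # 2 bytes per code unit, plus one length byte per block of up to 128
        return 2 * units + (units + 127) // 128

    total = 0
    run_units = 0                       # code units in the current wide run
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            total += run_cost(run_units) + 1
            run_units = 0
        else:
            run_units += 2 if cp > 0xFFFF else 1
    return total + run_cost(run_units)


def compare_sizes(text):
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "proposed": proposed_size(text),
    }


# 1000 BMP Chinese characters: 8 length bytes on top of 2000 bytes of UTF-16.
print(compare_sizes("\u4e2d" * 1000))
# {'utf-8': 3000, 'utf-16': 2000, 'proposed': 2008}

# One accented letter among ASCII: 3 bytes here against 2 in UTF-8.
print(compare_sizes("caf\u00e9"))
# {'utf-8': 5, 'utf-16': 8, 'proposed': 6}
```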
> Tony Graham