Elliotte Rusty Harold wrote:
> One should keep in mind that Chinese and similar languages are quite
> compressed to start with, far more so than English text is. For example,
> in UTF-8 the English word "tree" takes four bytes. The Japanese word for
> tree takes three bytes.
Good point, actually... I suppose that any language which routinely
uses more than 256 code points is quite likely to be one that uses
roughly one code point per word. So languages like Arabic, which are
alphabet-based but not very compact in UTF-8 because their characters
have high code points (although I'm not sure how high, so I don't know
whether they would mainly be 2 or 3 bytes each), would be better served
by an encoding that mainly uses a shiftable window of single-byte
characters, I guess.
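
For what it's worth, the byte counts are easy to check. The following is a minimal Python sketch, not anything from the original thread: the helper name `utf8_len` and the choice of sample words (including the Arabic word for "tree") are my own assumptions for illustration. It prints the number of code points and UTF-8 bytes for each word.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample words chosen for illustration (an assumption, not from the thread).
samples = {
    "English": "tree",
    "Japanese": "\u6728",                  # the single kanji for "tree"
    "Arabic": "\u0634\u062c\u0631\u0629",  # four Arabic letters, "tree"
}

for language, word in samples.items():
    print(f"{language}: {len(word)} code points, {utf8_len(word)} bytes")
```

Arabic letters sit in the U+0600 to U+06FF block, which UTF-8 encodes as two bytes per letter, so the four-letter Arabic word comes out at eight bytes against four for English and three for Japanese.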