At 1:36 PM +0000 11/21/03, Alaric B Snell wrote:
>People who use pound signs and accented characters, like us
>Europeans, would see each such symbol taking 3 bytes, but they
>currently take 2 bytes in UTF-8 and occur only occasionally
>interspersed with US-ASCII characters anyway, so the hit would be
>nowhere near as bad as the hit UTF-8 incurs for the Chinese and
>their neighbours.
>
One should keep in mind that written Chinese and similar scripts are
quite compact to start with, far more so than English text is. For
example, in UTF-8 the English word "tree" takes four bytes. The
Japanese word for tree takes three bytes. The English word "grove"
takes five bytes. The Japanese word for grove takes three bytes. The
English word "forest" takes six bytes. The Japanese word for forest
still takes only three bytes. I don't know the Japanese word for
antidisestablishmentarianism, but whatever it is, it's probably a lot
smaller than the English one. Comparing alphabetic languages to
ideographic ones is really apples to oranges. Word for word, Chinese
documents tend to be smaller, even in UTF-8.
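
The byte counts above are easy to check. Here is a minimal sketch in
Python. The message does not spell out the Japanese words, so the kanji
used here (木 for tree, 林 for grove, 森 for forest) are an assumption
about which ones were meant, and the names PAIRS, utf8_len and compare
are purely illustrative.

```python
# Compare UTF-8 sizes of English words and single-kanji Japanese words.
# The kanji are an assumption: the original message does not name them.
PAIRS = [
    ("tree", "木"),
    ("grove", "林"),
    ("forest", "森"),
]


def utf8_len(text: str) -> int:
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def compare(pairs):
    """Return (english, english_bytes, japanese, japanese_bytes) tuples."""
    return [(en, utf8_len(en), ja, utf8_len(ja)) for en, ja in pairs]


if __name__ == "__main__":
    for en, en_bytes, ja, ja_bytes in compare(PAIRS):
        print(f"{en}: {en_bytes} bytes, {ja}: {ja_bytes} bytes")
```

Each ASCII letter costs one byte, while each of these kanji falls in the
range that UTF-8 encodes in three bytes, so the English side grows with
the length of the word and the Japanese side stays at three.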
--
Elliotte Rusty Harold
elharo@metalab.unc.edu
Effective XML (Addison-Wesley, 2003)
http://www.cafeconleche.org/books/effectivexml
http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA