Elliotte Rusty Harold wrote:
> One should keep in mind that Chinese and similar languages are quite
> compressed to start with, far more so than English text is. For
> example, in UTF-8 the English word "tree" takes four bytes. The
> Japanese word for tree takes three bytes. Word for word, Chinese
> documents tend to be smaller, even in UTF-8.
Sure, but the point is that anyone who says "Look at how much size
reduction we can get with our binary/compression system!" (i.e., on
documents with significant text portions) should be shouted at "Your
figures are for ASCII data and markup; please come back when you have
figures that also demonstrate the characteristics for non-Latin data
and non-Latin markup."
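By way of a hedged illustration (the sample strings here are my own
throwaway examples, not from any published benchmark), this is the kind
of measurement such figures should include: the same generic compressor
run over ASCII-only markup and over CJK data, so the ratios can be
compared per script:

    import zlib

    # Hypothetical sample documents; a real benchmark would use large,
    # representative corpora rather than repeated snippets.
    ascii_doc = '<item><name>tree</name><note>a tall plant</note></item>' * 100
    cjk_doc = '<item><name>木</name><note>高い植物</note></item>' * 100

    for label, doc in (('ASCII', ascii_doc), ('CJK', cjk_doc)):
        raw = doc.encode('utf-8')
        packed = zlib.compress(raw, 9)
        print(f'{label}: {len(raw)} bytes raw, {len(packed)} bytes '
              f'compressed ({len(packed) / len(raw):.0%})')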
Similarly, we should largely ignore all benchmarks which do not include
at least 50% of document data in non-Latin scripts. If someone is making
a test suite or a sample to allow a benchmark index to be created for
comparison purposes, I suggest something like the following mix would be
useful:
25% ASCII text (English, Bahasa, etc.)
25% Accented Latin (French, German, Polish, etc.)
25% CJK (including at least 5% Traditional Chinese, 5% Simplified
    Chinese, 5% Japanese, 5% Korean)
25% Other, e.g. any mix of Greek, Russian, Indic, Arabic, Hebrew
And where about half of each group of non-ASCII samples uses non-Latin
characters in the markup as well as in the data. (A quick sketch of this
mix as data follows below.)
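Here is a minimal sketch of that mix expressed as data; the names and
structure are my own invention, not any standard test suite:

    # Hypothetical benchmark corpus mix; the weights are the suggestion
    # above, nothing more official than that.
    CORPUS_MIX = {
        'ascii_text': 0.25,      # English, Bahasa, etc.
        'accented_latin': 0.25,  # French, German, Polish, etc.
        'cjk': 0.25,             # >=5% each: Trad./Simp. Chinese, Japanese, Korean
        'other': 0.25,           # Greek, Russian, Indic, Arabic, Hebrew
    }

    # Within each non-ASCII group, about half the samples should also
    # use non-Latin characters in element and attribute names, not just
    # in character data.
    NON_LATIN_MARKUP_FRACTION = 0.5

    assert abs(sum(CORPUS_MIX.values()) - 1.0) < 1e-9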
Just because ideographs are terser than alphabetic letters does not mean
that there is any less value to their users in compressing them. UTF-8
has not proved popular in CJK countries, as far as I can see, because of
the 50% size penalty compared to regional encodings, which typically use
two bytes per ideograph where UTF-8 uses three: transmission and storage
size is always important.
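That penalty is easy to check. A quick sketch, using my own example
character, comparing UTF-8 against two regional encodings:

    # 木 (U+6728), the Japanese/Chinese word for "tree".
    text = '木'

    for enc in ('utf-8', 'shift_jis', 'gb2312'):
        print(enc, len(text.encode(enc)), 'bytes')
    # Prints 3 bytes for utf-8 and 2 bytes for each regional encoding:
    # UTF-8 is 50% larger per ideograph.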
Non-Latin requirements in general, and CJK requirements in particular,
should not be an afterthought for benchmarking, crumbs given to the dogs
under the table after we have finished our feast, IYKWIM. I am sure that
no-one thinks that way, but the issue deserves to be raised: people
always assume that the particular issues they face are universal. I have
not finished reading all the papers from the W3C meeting, but I have not
seen any mention of this issue so far. Maybe everyone is just shipping
around numbers?
My recollection from some conference is that writers (of Traditional
Chinese and Japanese) rarely have more than a 3000-character vocabulary,
if only because people write about a topic, so there are many words that
won't appear in the same discourse as others: "crepuscular" probably
doesn't appear in any military tank manual (pedants get googling now!),
nor "kangaroo" in books on US quasi-extraterritorial military legal
practice (though maybe it should).
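That sort of claim is cheap to test on any particular document. A
throwaway sketch (the file path comes from the command line; nothing
here refers to a real corpus) that counts the distinct characters
actually used:

    import sys

    # Count the distinct characters in a UTF-8 text file.
    with open(sys.argv[1], encoding='utf-8') as f:
        distinct = set(f.read())

    print(len(distinct), 'distinct characters')

A small distinct-character vocabulary is exactly what makes such text a
good target for dictionary- or table-based compression schemes.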
Cheers
Rick Jelliffe