   Re: [xml-dev] Microsoft FUD on binary XML...

Elliotte Rusty Harold wrote:

> One should keep in mind that Chinese and similar languages are quite
> compressed to start with, far more so than English text is. For
> example, in UTF-8 the English word "tree" takes four bytes. The
> Japanese word for tree takes three bytes.  Word for word, Chinese
> documents tend to be smaller, even in UTF-8.

Sure, but the point is that anyone who says "Look at how much size
reduction we can get with our binary/compression system!" (i.e., on
documents with significant text portions) should be shouted at: "Your
figures are for ASCII data and markup, please come back when you have
figures that also demonstrate the characteristics for non-Latin data
and non-Latin markup."

Similarly, we should largely ignore all benchmarks which do not include
at least 50% of document data in non-Latin scripts. If someone is making
a test suite or a sample to allow a benchmark index to be created for
comparison purposes, I suggest something like the following mix would
be useful:
  25% ASCII text (English, Bahasa, etc.)
  25% Accented Latin (French, German, Polish, etc.)
  25% CJK (including at least 5% Traditional Chinese, 5% Simplified,
      5% Japanese, 5% Korean)
  25% Other, e.g. any mix of Greek, Russian, Indic, Arabic, Hebrew

And where about half of each group of non-ASCII samples uses non-Latin
characters in markup.
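
For anyone assembling such a sample, here is a rough Python sketch of
how the mix of an existing corpus might be measured. Everything in it is
an assumption for illustration, not part of any benchmark: the names
bucket() and script_mix() are hypothetical, shares are counted per
letter (the mix above does not say whether it means letters, bytes or
documents), and scripts are guessed from Unicode character names, which
misses edge cases such as halfwidth katakana and combining marks.

```python
import unicodedata
from collections import Counter

# The four groups from the suggested mix; the key names are made up here.
BUCKETS = ("ascii", "accented_latin", "cjk", "other")


def bucket(ch):
    """Assign one character to a bucket, or None if it is not a letter."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, whitespace, combining marks
    if ch.isascii():
        return "ascii"
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "accented_latin"
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
        return "cjk"
    return "other"  # Greek, Cyrillic, Arabic, Hebrew, Indic letters, ...


def script_mix(text):
    """Return the share of letters falling in each bucket (0.0 to 1.0)."""
    counts = Counter(b for b in map(bucket, text) if b is not None)
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in BUCKETS}
    return {b: counts[b] / total for b in BUCKETS}


if __name__ == "__main__":
    sample = "tree arbre Bäume 木 나무 дерево"
    for name, share in script_mix(sample).items():
        print(f"{name:15s} {share:6.1%}")
```

Running the same function separately over element and attribute names
would give a similar rough check on the share of non-Latin markup.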

Just because ideographs are terser than alphabetic letters does not mean
that there is any less value to their users in compressing them.  UTF-8
has not proved popular in CJK countries AFAIKS because of the 50% penalty
compared to regional encodings: transmission and storage size is always
important.
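
The 50% figure can be checked with a few lines of Python. This is only
a sketch: the sample strings and the helper name utf8_penalty() are
chosen here for illustration, and the codec names are the ones Python
ships with. The penalty is exactly 50% only for text made up entirely
of characters that take two bytes in the regional encoding and three in
UTF-8; ASCII markup and Latin text mixed into a document lower it.

```python
# Short samples paired with a common regional encoding for each.
SAMPLES = {
    "shift_jis": "日本語",    # Japanese
    "gb2312": "简体中文",     # Simplified Chinese
    "big5": "繁體中文",       # Traditional Chinese
    "euc_kr": "한국어",       # Korean
}


def utf8_penalty(text, regional):
    """Extra size of UTF-8 relative to a regional encoding (0.5 = 50%)."""
    regional_size = len(text.encode(regional))
    utf8_size = len(text.encode("utf-8"))
    return (utf8_size - regional_size) / regional_size


if __name__ == "__main__":
    for encoding, text in SAMPLES.items():
        print(f"{encoding:10s} {utf8_penalty(text, encoding):.0%}")
```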

Non-Latin requirements in general, and CJK requirements in particular,
should not be an afterthought for benchmarking, crumbs given to the dogs
under the table after we have finished our feast IYKWIM. I am sure that
no-one thinks that way, but the issue deserves to be raised: people
always assume that the particular issues they face are universal.  I
have not finished reading all the papers from the W3C
meeting, but I have not seen any mention of this issue so far. Maybe 
is just shipping around numbers?

My recollection from some conference is that writers (of Traditional
Chinese and Japanese) rarely have more than a 3000-character vocabulary,
even if just because they write about a topic, so there are many words
that won't appear in the same discourse as others: "crepuscular"
probably doesn't appear in any military tank manual (pedants get
googling now!), nor "kangaroo" in books on US military legal practice
(though maybe it should).

Rick Jelliffe



Copyright 2001 XML.org. This site is hosted by OASIS