Elliotte Rusty Harold wrote:
> One should keep in mind that Chinese and similar languages are quite
> compressed to start with, far more so than English text is. For
> example, in UTF-8 the English word "tree" takes four bytes. The
> Japanese word for tree takes three bytes. Word for word, Chinese
> documents tend to be smaller, even in UTF-8.
Sure, but the point is that anyone who says "Look at how much size
reduction we can get with our binary/compression system!" (i.e., on
documents with significant text portions) should be shouted at "Your
figures are for ASCII data and markup; please come back when you have
figures that also demonstrate the characteristics for non-Latin data
and non-Latin markup."
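By way of a hedged illustration (the sample strings here are my own
throwaway examples, not from any published benchmark), this is the kind
of measurement such figures should include: the same generic compressor
run over ASCII-only markup and over CJK data, so the ratios can be
compared per script:

    import zlib

    # Hypothetical sample documents; a real benchmark would use large,
    # representative corpora rather than repeated snippets.
    ascii_doc = '<item><name>tree</name><note>a tall plant</note></item>' * 100
    cjk_doc = '<item><name>木</name><note>高い植物</note></item>' * 100

    for label, doc in (('ASCII', ascii_doc), ('CJK', cjk_doc)):
        raw = doc.encode('utf-8')
        packed = zlib.compress(raw, 9)
        print(f'{label}: {len(raw)} bytes raw, {len(packed)} bytes '
              f'compressed ({len(packed) / len(raw):.0%})')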
Similarly, we should largely ignore all benchmarks which do not include
at least 50% of document data in non-Latin scripts. If someone is making
a test suite or a sample to allow a benchmark index to be created for
comparison purposes, I suggest something like the following mix would be
useful:
25% ASCII text (English, Bahasa, etc.)
25% Accented Latin (French, German, Polish, etc.)
25% CJK (including at least 5% Traditional Chinese, 5% Simplified
    Chinese, 5% Japanese, 5% Korean)
25% Other, e.g. any mix of Greek, Russian, Indic, Arabic, Hebrew
And where about half of each group of non-ASCII samples uses non-Latin
characters in the markup as well as in the data. (A quick sketch of this
mix as data follows below.)
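Here is a minimal sketch of that mix expressed as data; the names and
structure are my own invention, not any standard test suite:

    # Hypothetical benchmark corpus mix; the weights are the suggestion
    # above, nothing more official than that.
    CORPUS_MIX = {
        'ascii_text': 0.25,      # English, Bahasa, etc.
        'accented_latin': 0.25,  # French, German, Polish, etc.
        'cjk': 0.25,             # >=5% each: Trad./Simp. Chinese, Japanese, Korean
        'other': 0.25,           # Greek, Russian, Indic, Arabic, Hebrew
    }

    # Within each non-ASCII group, about half the samples should also
    # use non-Latin characters in element and attribute names, not just
    # in character data.
    NON_LATIN_MARKUP_FRACTION = 0.5

    assert abs(sum(CORPUS_MIX.values()) - 1.0) < 1e-9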
Just because ideographs are terser than alphabetic letters does not mean
that there is any less value to their users in compressing them. UTF-8
has not proved popular in CJK countries, as far as I can see, because of
the 50% size penalty compared to regional encodings, which typically use
two bytes per ideograph where UTF-8 uses three: transmission and storage
size is always important.
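That penalty is easy to check. A quick sketch, using my own example
character, comparing UTF-8 against two regional encodings:

    # 木 (U+6728), the Japanese/Chinese word for "tree".
    text = '木'

    for enc in ('utf-8', 'shift_jis', 'gb2312'):
        print(enc, len(text.encode(enc)), 'bytes')
    # Prints 3 bytes for utf-8 and 2 bytes for each regional encoding:
    # UTF-8 is 50% larger per ideograph.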
Non-Latin requirements in general, and CJK requirements in particular,
should not be an afterthought for benchmarking, crumbs given to the dogs
under the table after we have finished our feast, IYKWIM. I am sure that
no-one thinks that way, but the issue deserves to be raised: people
always assume that the particular issues they face are universal. I have
not finished reading all the papers from the W3C meeting, but I have not
seen any mention of this issue so far. Maybe everyone is just shipping
around numbers?
My recollection from some conference is that writers (of Traditional
Chinese and Japanese) rarely have more than a 3000-character vocabulary,
if only because people write about a topic, so there are many words that
won't appear in the same discourse as others: "crepuscular" probably
doesn't appear in any military tank manual (pedants get googling now!),
nor "kangaroo" in books on US quasi-extraterritorial military legal
practice (though maybe it should).
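That sort of claim is cheap to test on any particular document. A
throwaway sketch (the file path comes from the command line; nothing
here refers to a real corpus) that counts the distinct characters
actually used:

    import sys

    # Count the distinct characters in a UTF-8 text file.
    with open(sys.argv[1], encoding='utf-8') as f:
        distinct = set(f.read())

    print(len(distinct), 'distinct characters')

A small distinct-character vocabulary is exactly what makes such text a
good target for dictionary- or table-based compression schemes.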
Cheers
Rick Jelliffe