OASIS Mailing List Archives

 


   Re: [xml-dev] Microsoft FUD on binary XML...


At 1:36 PM +0000 11/21/03, Alaric B Snell wrote:

>People who use pound signs and accented characters, like us 
>Europeans, would see each such symbol taking 3 bytes, but they 
>currently take 2 bytes in UTF-8 and occur only occasionally 
>interspersed with US-ASCII characters anyway, so the hit would be 
>nowhere near as bad as the hit UTF-8 incurs for the Chinese and 
>their neighbours.
>

One should keep in mind that Chinese and similar languages are quite 
compressed to start with, far more so than English text is. For 
example, in UTF-8 the English word "tree" takes four bytes. The 
Japanese word for tree takes three bytes.  The English word "grove" 
takes five bytes. The Japanese word for grove takes three bytes. The 
English word "forest" takes six bytes. The Japanese word for forest 
still takes only three bytes. I don't know the Japanese word for 
antidisestablishmentarianism, but whatever it is, it's probably a lot 
smaller than the English one. Comparing alphabetic languages to 
ideographic ones is really apples to oranges. Word for word, Chinese 
documents tend to be smaller, even in UTF-8.
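
The byte counts above are easy to check. A minimal sketch, assuming the Japanese words meant here are the single kanji U+6728 (tree), U+6797 (grove), and U+68EE (forest); the message does not name the characters, so that choice, like the names PAIRS, utf8_len, and byte_counts, is only illustrative:

```python
# Assumed word pairs: the kanji are an assumption, not taken from the message.
PAIRS = [
    ("tree", "\u6728"),    # U+6728, kanji for tree
    ("grove", "\u6797"),   # U+6797, kanji for grove
    ("forest", "\u68ee"),  # U+68EE, kanji for forest
]


def utf8_len(text: str) -> int:
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def byte_counts(pairs):
    """Return (english, english_bytes, japanese, japanese_bytes) per pair."""
    return [(en, utf8_len(en), ja, utf8_len(ja)) for en, ja in pairs]


for en, en_bytes, ja, ja_bytes in byte_counts(PAIRS):
    # Print the code point rather than the glyph so the output is ASCII-only.
    print(f"{en}: {en_bytes} bytes; U+{ord(ja):04X}: {ja_bytes} bytes")
```

Each ASCII letter costs one byte in UTF-8, while any character from U+0800 through U+FFFF costs three, so the English side grows with word length and the single-kanji side stays at three bytes.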
-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Effective XML (Addison-Wesley, 2003)
   http://www.cafeconleche.org/books/effectivexml
   http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA




 


Copyright 2001 XML.org. This site is hosted by OASIS