OASIS Mailing List Archives





   Re: [xml-dev] Microsoft FUD on binary XML...


Elliotte Rusty Harold wrote:

> One should keep in mind that Chinese and similar languages are quite 
> compressed to start with, far more so than English text is. For example, 
> in UTF-8 the English word "tree" takes four bytes. The Japanese word for 
> tree takes three bytes. 

Good point, actually... I suppose that any language that needs more than 
256 code points in everyday use is quite likely to be one that uses 
roughly one code point per word. So languages like Arabic, which are 
alphabet-based but not very compact in UTF-8 because their characters 
have high code points (I'm not sure how high, so I don't know whether 
they would mainly be 2 or 3 bytes or whatever), would be better served 
by an encoding that mainly uses single-byte characters within a 
shiftable window, I guess.
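
For what it's worth, the byte counts are easy to check. The basic Arabic 
letters sit in U+0600..U+06FF, which UTF-8 encodes as 2 bytes each. Below 
is a minimal Python sketch comparing the word for "tree" in the three 
scripts. The sample words and the helper name utf8_len are illustrative 
choices, not from the thread. ISO-8859-6 is used only as a stand-in for 
"one byte per Arabic letter"; it is a fixed single-byte code page, not a 
shiftable-window scheme such as SCSU, for which Python ships no codec.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))

# Illustrative sample words, written as escapes so the source stays ASCII.
samples = {
    "English": "tree",
    "Japanese": "\u6728",               # one ideograph meaning "tree"
    "Arabic": "\u0634\u062c\u0631\u0629",  # four letters, "shajara" (tree)
}

for name, word in samples.items():
    print(f"{name}: {len(word)} code points, {utf8_len(word)} UTF-8 bytes")

# Stand-in for a single-byte Arabic encoding (an assumption for comparison,
# not the windowed encoding the post speculates about).
arabic_single_byte = len(samples["Arabic"].encode("iso-8859-6"))
print(f"Arabic in ISO-8859-6: {arabic_single_byte} bytes")
```

So the Arabic word costs twice as many bytes in UTF-8 as it would with 
one byte per letter, while the Japanese word is already shorter than the 
English one.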




Copyright 2001 XML.org. This site is hosted by OASIS