
   Re: [xml-dev] Microsoft FUD on binary XML...


Alaric B Snell <alaric@alaric-snell.com> wrote at Fri, 21 Nov 2003 11:14:14 +0000:
> Rick Jelliffe wrote:
> 
> > Also, it would be interesting to see binary people use Chinese (Japanese
> > or Korean) text and markup for their test data.  Compressing or packing
> > ASCII is quite different to compressing or packing UTF-16 Chinese, which
> > has a more random-seeming distribution of byte values.  It is not
> > dishonest to make the case for binary using data that is most
> > compressible; but businesses who are looking at compression strategies
> > for world-wide use need to factor CJK compressibility into their
> > evaluations.
> 
> That only makes a difference if you're actually compressing the text 
> fields - most binary interchange formats will just write the text in 
> UTF-8 and leave it at that; however lower-level byte sequence 

Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
Chinese characters in the Basic Multilingual Plane (i.e., most of the
Chinese characters in the message), since in UTF-16 one such character
is two bytes (16 bits), and in UTF-8 it is three bytes.

Only characters in the ASCII range take less space as UTF-8 than as
UTF-16.  It's 1:1 for &#x80; to &#x7FF; (two bytes either way) and for
&#x10000; and above (four bytes either way), but for &#x800; to
&#xFFFF; (excluding &#xD800; to &#xDFFF;), which includes the most
frequently used Chinese, Japanese, and Korean characters, UTF-8 uses
three bytes where UTF-16 uses two.
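
The ranges above can be checked with a few lines of Python.  This is
only a sketch: the helper name encoded_sizes and the four sample
characters are illustrative choices, one per range, not anything from
the thread.

```python
def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-be" is used so that no byte-order mark is counted;
    # plain "utf-16" would prepend a 2-byte BOM.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# One illustrative character from each range discussed above.
samples = {
    "ASCII (U+0041)": "A",
    "U+0080 to U+07FF (U+00E9)": "\u00e9",
    "U+0800 to U+FFFF (U+4E2D)": "\u4e2d",
    "U+10000 and above (U+20000)": "\U00020000",
}

for label, ch in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

For the BMP Chinese character this gives three bytes against two,
which is where the 50% figure comes from.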

> compressors will just see the text as bytes rather than as characters. 
> I've yet to see an implementation of the deflate algorithm (as used by 
> gzip) for UCS-4 codepoints rather than just bytes, but it could be done 
> and would be very interesting (but if you use a wide range of characters 
> in the input, your Huffman tree will be a bit memory-intensive! :-)
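
For anyone who wants to try the byte-level case on their own data, a
minimal sketch using Python's zlib module (a byte-oriented deflate
implementation) might look like the following.  The function name
deflate_sizes and the sample string are hypothetical, and the repeated
sample is far more compressible than real text, so substitute a
realistic CJK document before drawing any conclusions.

```python
import zlib


def deflate_sizes(text):
    """Map each encoding to (raw_bytes, deflated_bytes) for the given text."""
    results = {}
    for encoding in ("utf-8", "utf-16-be"):
        raw = text.encode(encoding)
        packed = zlib.compress(raw, 9)  # deflate sees only bytes, not characters
        # Sanity check: the compressed stream must round-trip exactly.
        assert zlib.decompress(packed) == raw
        results[encoding] = (len(raw), len(packed))
    return results


# Placeholder input only; replace with representative test data.
sample = "\u4e2d\u6587\u6587\u672c" * 50

for encoding, (raw_len, packed_len) in deflate_sizes(sample).items():
    print(f"{encoding}: {raw_len} bytes raw, {packed_len} bytes deflated")
```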

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708




 
