xml-dev - Re: [xml-dev] Microsoft FUD on binary XML...


To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] Microsoft FUD on binary XML...
From: Tony Graham <Tony.Graham@Sun.COM>
Date: Fri, 21 Nov 2003 12:11:52 +0000 (GMT)
In-reply-to: <3FBDF386.9090307@alaric-snell.com>
References: <004201c3af99$e9944010$650aa8c0@BOBDEV><3FBDAFDA.3010905@allette.com.au> <3FBDF386.9090307@alaric-snell.com>

Alaric B Snell <alaric@alaric-snell.com> wrote at Fri, 21 Nov 2003 11:14:14 +0000:
> Rick Jelliffe wrote:
> 
> > Also, it would be interesting to see binary people use Chinese
> > (Japanese or Korean) text and markup for their test data.
> > Compressing or packing ASCII is quite different to compressing or
> > packing UTF-16 Chinese, which has a more random-seeming distribution
> > of byte values.  It is not dishonest to make the case for binary
> > using data that is most compressible; but businesses who are looking
> > at compression strategies for world-wide use need to factor in CJK
> > compressibility into their evaluations.
> 
> That only makes a difference if you're actually compressing the text 
> fields - most binary interchange formats will just write the text in 
> UTF-8 and leave it at that; however lower-level byte sequence 

Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
Chinese characters in the Basic Multilingual Plane (i.e., most of the
Chinese characters in the message), since as UTF-16 one such character
is two bytes, and as UTF-8 it is three bytes.

Only characters in the ASCII range take less space as UTF-8 than as
UTF-16.  It's 1:1 for &#x80; to &#x7FF; (two bytes either way) and for
&#x10000; and above (four bytes either way), but for &#x800; to
&#xFFFF; (excluding &#xD800; to &#xDFFF;), which includes the most
frequently used Chinese, Japanese, and Korean characters, UTF-8 uses
three bytes where UTF-16 uses two.
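
For anyone who wants to check those ranges, here is a minimal Python
sketch.  The helper name encoded_sizes and the sample characters are my
own choices for illustration; "utf-16-be" is used only so that no byte
order mark is counted.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" avoids the 2-byte BOM that plain "utf-16" would prepend.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# One illustrative character from each range discussed above.
samples = {
    "U+0041 (ASCII)": "A",
    "U+00E9 (U+0080 to U+07FF)": "\u00e9",
    "U+4E2D (U+0800 to U+FFFF, CJK)": "\u4e2d",
    "U+20000 (U+10000 and above)": "\U00020000",
}

for label, ch in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Only the third sample, the BMP Chinese character, comes out larger in
UTF-8 (three bytes against two), which is the 50% increase.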

> compressors will just see the text as bytes rather than as characters. 
> I've yet to see an implementation of the deflate algorithm (as used by 
> gzip) for UCS-4 codepoints rather than just bytes, but it could be done 
> and would be very interesting (but if you use a wide range of characters 
> in the input, your Huffman tree will be a bit memory-intensive! :-)
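
A byte-oriented deflate can at least be tried on both encodings of the
same text.  The following is only a rough sketch using Python's zlib
module; the function name compare_encodings and the sample string are
hypothetical, and a short repeated string is not representative test
data, so run it on real CJK documents before drawing any conclusions.

```python
import zlib


def compare_encodings(text):
    """Return {encoding: (raw bytes, deflated bytes)} for UTF-8 and UTF-16."""
    results = {}
    for name, codec in (("utf-8", "utf-8"), ("utf-16", "utf-16-be")):
        raw = text.encode(codec)          # the compressor sees bytes only
        packed = zlib.compress(raw, 9)    # zlib-wrapped deflate, max effort
        assert zlib.decompress(packed) == raw  # sanity check: lossless
        results[name] = (len(raw), len(packed))
    return results


# Hypothetical sample; substitute a real Chinese, Japanese, or Korean text.
sample = "\u4e2d\u6587\u4fe1\u606f\u5904\u7406" * 50

for name, (raw_len, packed_len) in compare_encodings(sample).items():
    print(f"{name}: {raw_len} bytes raw, {packed_len} bytes deflated")
```

The raw sizes always differ by the 3:2 ratio for BMP Chinese text; how
the deflated sizes compare depends entirely on the input.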

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708
