Alaric B Snell <alaric@alaric-snell.com> wrote at Fri, 21 Nov 2003 11:14:14 +0000:
> Rick Jelliffe wrote:
>
> > Also, it would be interesting to see binary people use Chinese (Japanese
> > or Korean) text and markup for their test data. Compressing or packing
> > ASCII is quite different to compressing or packing UTF-16 Chinese, which
> > has a more random-seeming distribution of byte values. It is not
> > dishonest to make the case for binary using data that is most
> > compressible; but businesses who are looking at compression strategies
> > for world-wide use need to factor CJK compressibility into their
> > evaluations.
>
> That only makes a difference if you're actually compressing the text
> fields - most binary interchange formats will just write the text in
> UTF-8 and leave it at that; however lower-level byte sequence
Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
Chinese characters in the Basic Multilingual Plane (i.e., most of the
Chinese characters in the message) since as UTF-16, one Chinese
character is 16 bits, and as UTF-8, one Chinese character is three
bytes.
Only characters in the ASCII range take less space as UTF-8 than
UTF-16. It's 1:1 for U+0080 to U+07FF and for U+10000 and above,
but for U+0800 to U+FFFF (excluding U+D800 to U+DFFF), which
includes the most frequently used Chinese, Japanese, and Korean
characters, UTF-8 uses three bytes.
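
The per-character sizes described above can be checked with a minimal
Python sketch. The sample code points and the helper name encoded_sizes
are illustrative choices, not part of the original message; "utf-16-be"
is used so that no byte order mark is counted.

```python
# Bytes per character in UTF-8 versus UTF-16 for a few sample code points.
# The samples are illustrative: one from each size class discussed above.
samples = {
    "U+0041 (ASCII)": "\u0041",
    "U+07FF": "\u07ff",
    "U+4E2D (CJK, in the BMP)": "\u4e2d",
    "U+10000": "\U00010000",
}


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-be" avoids the 2-byte byte order mark that plain "utf-16" prepends.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for label, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

For the CJK sample this reports three bytes against two, which is the
50% increase mentioned above.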
> compressors will just see the text as bytes rather than as characters.
> I've yet to see an implementation of the deflate algorithm (as used by
> gzip) for UCS-4 codepoints rather than just bytes, but it could be done
> and would be very interesting (but if you use a wide range of characters
> in the input, your Huffman tree will be a bit memory-intensive! :-)
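
A byte-level compressor's view of the two encodings can be compared with
a short Python sketch using the standard zlib module (a deflate
implementation). The sample markup and the helper name deflate_sizes are
made up for illustration, and no particular outcome is assumed: the
ratios depend heavily on the input, so real evaluations should use
representative documents.

```python
import zlib


def deflate_sizes(text):
    """Return {encoding: (raw_bytes, deflated_bytes)} for UTF-8 and UTF-16."""
    result = {}
    for encoding in ("utf-8", "utf-16-be"):
        raw = text.encode(encoding)
        packed = zlib.compress(raw, 9)  # deflate sees only bytes, not characters
        assert zlib.decompress(packed) == raw  # sanity check: lossless round trip
        result[encoding] = (len(raw), len(packed))
    return result


# Hypothetical test data: a small piece of Chinese markup, repeated.
sample = "<文档><标题>压缩测试</标题><段落>这是一个用于比较编码大小的简短示例。</段落></文档>" * 20

for encoding, (raw, packed) in deflate_sizes(sample).items():
    print(f"{encoding}: {raw} bytes raw, {packed} bytes deflated")
```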
Regards,
Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin
Sun Microsystems Ireland Ltd Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3 x(70)19708