Rick Jelliffe wrote:
> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16 files
> will usually be smaller.
First a correction: UTF-8 never uses six bytes for anything. The largest
UTF-8 character you'll ever see is 4 bytes wide.
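A quick way to check this is to encode a few code points and count the bytes. The snippet below is only an illustrative sketch; the sample code points and the helper name utf8_width are my own choices, not anything from the original thread.

```python
# Sample code points: ASCII 'A', Latin 'é', a common CJK ideograph,
# a CJK Extension B ideograph, and the highest Unicode code point.
SAMPLES = (0x41, 0xE9, 0x4E2D, 0x20000, 0x10FFFF)


def utf8_width(code_point):
    """Return the number of bytes UTF-8 uses for one code point (hypothetical helper)."""
    return len(chr(code_point).encode("utf-8"))


def utf16_width(code_point):
    """Return the number of bytes UTF-16 uses for one code point, without a BOM."""
    return len(chr(code_point).encode("utf-16-le"))


utf8_widths = [utf8_width(cp) for cp in SAMPLES]
utf16_widths = [utf16_width(cp) for cp in SAMPLES]

for cp, w8, w16 in zip(SAMPLES, utf8_widths, utf16_widths):
    print(f"U+{cp:04X}: UTF-8 {w8} bytes, UTF-16 {w16} bytes")
```

Even U+10FFFF, the last code point Unicode defines, takes four bytes in UTF-8. Characters outside the Basic Multilingual Plane take four bytes in both encodings.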
UTF-16 files may well be smaller, but it's not a sure thing. Even
Chinese XML contains lots of ASCII characters such as <, >, &, =, ", and
the space. Text-heavy documents like novels and stories may well be
smaller in UTF-16. Technical documents that also contain the digits 0-9
and other non-Chinese ASCII characters may even be larger in UTF-16.
Either way, the size difference is not likely to be important. The
reasons for choosing UTF-8 have little to do with size. See
http://www-128.ibm.com/developerworks/xml/library/x-utf8/ for a slightly
longer discussion of this issue.
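To make the trade-off concrete, here is a rough sketch that measures both encodings on two made-up documents: one that is mostly Chinese text and one that is mostly markup. The sample strings and the encoded_sizes helper are invented for illustration, and "utf-16-le" is used so the two-byte byte order mark doesn't enter the count.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for a string (hypothetical helper)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Text-heavy sample: a long run of Chinese characters inside one element.
prose = "<p>" + "中文文本" * 50 + "</p>"

# Markup-heavy sample: many small elements with ASCII names, attributes,
# and digits, each holding a single Chinese character.
markup = '<item id="1" qty="20">中</item>' * 50

prose_utf8, prose_utf16 = encoded_sizes(prose)
markup_utf8, markup_utf16 = encoded_sizes(markup)

print(f"prose:  UTF-8 {prose_utf8} bytes, UTF-16 {prose_utf16} bytes")
print(f"markup: UTF-8 {markup_utf8} bytes, UTF-16 {markup_utf16} bytes")
```

With these particular samples the prose comes out smaller in UTF-16 and the markup comes out smaller in UTF-8. Real documents fall somewhere in between, depending on the ratio of tags to text.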
--
Elliotte Rusty Harold elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim