OASIS Mailing List Archives

   Re: [xml-dev] Use of UTF-8 and UTF-16

Rick Jelliffe wrote:

> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16 files
> will usually be smaller.

First a correction: UTF-8 never uses six bytes for anything. The largest 
UTF-8 character you'll ever see is 4 bytes wide.
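
The four-byte ceiling is easy to check. The following is a minimal Python
sketch; the sample characters and the helper name encoded_widths are
illustrative choices, not anything taken from the thread. It uses the
"utf-16-be" codec so that no byte order mark is counted.

```python
# Encoded width of a few sample characters in UTF-8 and UTF-16.
# The samples and the helper name are illustrative assumptions.
samples = {
    "A": "U+0041 ASCII letter",
    "\u00e9": "U+00E9 Latin small e with acute",
    "\u4e2d": "U+4E2D CJK ideograph",
    "\U00020000": "U+20000 CJK ideograph outside the BMP",
    "\U0010FFFF": "U+10FFFF highest Unicode code point",
}


def encoded_widths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" writes no byte order mark, so only the character is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch, label in samples.items():
    utf8_len, utf16_len = encoded_widths(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Even the highest code point, U+10FFFF, takes four bytes in UTF-8. Ideographs
in the Basic Multilingual Plane take three bytes in UTF-8 and two in UTF-16;
those outside it take four bytes in both.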

UTF-16 files may well be smaller, but it's not a sure thing. Even 
Chinese XML contains lots of ASCII characters such as <, >, &, =, ", and 
the space. Text-heavy documents like novels and stories may well be 
smaller. Technical documents that also contain the digits 0-9 and other 
ASCII characters may even be larger in UTF-16. Either way, 
the size difference is not likely to be important. The reasons for 
choosing UTF-8 have little to do with size. See 
http://www-128.ibm.com/developerworks/xml/library/x-utf8/ for a slightly 
longer discussion of this issue.
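
To see how the balance tips, here is a minimal Python sketch that measures
two made-up fragments, one text-heavy and one markup-heavy. Both fragments
and the helper name sizes are hypothetical examples, not real documents. The
"utf-16-le" codec is used so the two-byte byte order mark does not skew the
count.

```python
# Compare encoded sizes of text-heavy and markup-heavy Chinese XML.
# Both fragments are hypothetical illustrations.

def sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Text-heavy: 100 ideographs wrapped in a single short element.
prose = "<p>" + "\u4e2d\u6587" * 50 + "</p>"

# Markup-heavy: tags, attributes, and digits around two ideographs.
table = ('<row id="1024" width="80" unit="px">'
         "<c>\u4e2d</c><c>\u6587</c></row>")

for name, text in (("prose", prose), ("table", table)):
    utf8_size, utf16_size = sizes(text)
    print(f"{name}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

For a document with a ASCII characters and c BMP ideographs, UTF-8 needs
a + 3c bytes and UTF-16 needs 2a + 2c, so UTF-16 comes out smaller only when
the ideographs outnumber the ASCII characters.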

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim




 


Copyright 2001 XML.org. This site is hosted by OASIS