
   Re: [xml-dev] Use of UTF-8 and UTF-16


On Wed, 2 Nov 2005, Philippe Poulard wrote:

> Elliotte Harold wrote:
> > Rick Jelliffe wrote:
> >
> >> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> >> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16
> >> files will usually be smaller.
> >
> >
> > First a correction: UTF-8 never uses six bytes for anything. The largest
> > UTF-8 character you'll ever see is 4 bytes wide.
> >
>
> hi,
>
> I read somewhere that :
>
> UTF-8 uses 6 bytes for ISO/IEC 10646
> UTF-8 uses 4 bytes for Unicode
>
> Unicode is a subset of ISO/IEC 10646 (in terms of addressing)
> ISO/IEC 10646 is a subset of Unicode (in terms of semantic)
>
> XML uses Unicode

10646 reserves the codes U+D800..U+DFFF (the surrogates) for use in pairs
to address characters with codes up to 21 bits long
(U-00010000..U-0010FFFF).  Each of these reserved values takes 3 bytes if
it is encoded on its own in UTF-8, so addressing a value 17 to 21 bits
long through a surrogate pair takes 6 bytes.  Well-formed UTF-8 does not
encode surrogates, however: it encodes the Unicode values
U-00010000..U-0010FFFF directly as 4 bytes.
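
To make the 4-versus-6 byte count concrete, here is a small Python sketch.
The function name surrogate_pair and the sample character U+20000 are my
own choices for illustration, not anything from the thread.  It splits the
code point into its two surrogates, encodes each one separately (the byte
layout described as CESU-8), and compares that with the direct UTF-8
encoding.  Python normally refuses to encode surrogates as UTF-8, so the
sketch has to ask for the "surrogatepass" error handler explicitly.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


ch = "\U00020000"  # illustrative choice: a CJK Extension B ideograph

# Direct UTF-8 encoding of the code point itself.
direct = ch.encode("utf-8")

# Each surrogate encoded on its own as a 3-byte sequence.  This is not
# well-formed UTF-8; "surrogatepass" makes the codec emit it anyway.
high, low = surrogate_pair(ord(ch))
via_pair = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(len(direct), direct.hex())      # 4 bytes
print(len(via_pair), via_pair.hex())  # 6 bytes
```

A strict UTF-8 decoder rejects the 6-byte form, which is why only the
4-byte form should appear in an XML document.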

<http://czyborra.com/utf/> explains some of the details.

Chris Gray
University of Waterloo Library




 


Copyright 2001 XML.org. This site is hosted by OASIS