On Wed, 2 Nov 2005, Philippe Poulard wrote:
> Elliotte Harold wrote:
> > Rick Jelliffe wrote:
> >
> >> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> >> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16
> >> files
> >> will usually be smaller.
> >
> >
> > First a correction: UTF-8 never uses six bytes for anything. The largest
> > UTF-8 character you'll ever see is 4 bytes wide.
> >
>
> hi,
>
> I read somewhere that :
>
> UTF-8 uses 6 bytes for ISO/IEC 10646
> UTF-8 uses 4 bytes for Unicode
>
> Unicode is a subset of ISO/IEC 10646 (in terms of addressing)
> ISO/IEC 10646 is a subset of Unicode (in terms of semantic)
>
> XML uses Unicode

ISO/IEC 10646 reserves the codes U+D800..U+DFFF for use in pairs to
address characters with codes 17 to 21 bits long
(U-00010000..U-0010FFFF). If each of these reserved values is encoded
separately, it takes 3 bytes in UTF-8, so it takes 6 bytes to address
one such character via the surrogate-pair scheme. However, UTF-8 can
encode the Unicode values U-00010000..U-0010FFFF directly as 4 bytes,
and that is the form Unicode requires.

<http://czyborra.com/utf/> explains some of the details.
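
The arithmetic above can be checked with a small sketch. It is only an illustration: the helper names `surrogate_pair` and `raw_utf8` are made up for this example, and `raw_utf8` deliberately applies the UTF-8 bit layout without rejecting surrogates, which a conforming encoder would refuse to do.

```python
def surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into two surrogate code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                      # 20 bits of payload remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def raw_utf8(cp):
    """Apply the UTF-8 bit layout to one value, with no surrogate check.

    Hypothetical helper for illustration only; real encoders reject
    U+D800..U+DFFF.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


cp = 0x1D11E                              # a character beyond U+FFFF
high, low = surrogate_pair(cp)
six_byte_form = raw_utf8(high) + raw_utf8(low)   # each surrogate as 3 bytes
four_byte_form = raw_utf8(cp)                    # the code point directly

print(hex(high), hex(low))                # 0xd834 0xdd1e
print(six_byte_form.hex(), len(six_byte_form))
print(four_byte_form.hex(), len(four_byte_form))
```

The direct form matches what Python's own encoder produces for `chr(0x1D11E).encode("utf-8")`; the six-byte form is the one a strict UTF-8 decoder will reject.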
Chris Gray
University of Waterloo Library