xml-dev - Re: [xml-dev] Use of UTF-8 and UTF-16

To: Philippe Poulard <Philippe.Poulard@sophia.inria.fr>
Subject: Re: [xml-dev] Use of UTF-8 and UTF-16
From: Chris Gray <cpgray@library.uwaterloo.ca>
Date: Wed, 2 Nov 2005 12:43:28 -0500 (EST)
Cc: Elliotte Harold <elharo@metalab.unc.edu>, Rick Jelliffe <rjelliffe@allette.com.au>, Xml-Dev <xml-dev@lists.xml.org>
In-reply-to: <4368C783.3070008@sophia.inria.fr>
References: <NBBBIBMKFOFCNEBAKDPLCEKIKDAA.xml-dev@boynings.co.uk><39325.203.51.20.11.1130759065.squirrel@intranet.allette.com.au><43660951.3090707@metalab.unc.edu> <4368C783.3070008@sophia.inria.fr>

On Wed, 2 Nov 2005, Philippe Poulard wrote:

> Elliotte Harold wrote:
> > Rick Jelliffe wrote:
> >
> >> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> >> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16
> >> files will usually be smaller.
> >
> >
> > First a correction: UTF-8 never uses six bytes for anything. The largest
> > UTF-8 character you'll ever see is 4 bytes wide.
> >
>
> hi,
>
> I read somewhere that:
>
> UTF-8 uses 6 bytes for ISO/IEC 10646
> UTF-8 uses 4 bytes for Unicode
>
> Unicode is a subset of ISO/IEC 10646 (in terms of addressing)
> ISO/IEC 10646 is a subset of Unicode (in terms of semantics)
>
> XML uses Unicode

ISO/IEC 10646 reserves the codes U+D800..U+DFFF (the surrogates) for use in
pairs to address characters beyond U+FFFF (U-00010000..U-0010FFFF); each
pair carries 20 bits.  Run through the UTF-8 bit pattern, each of these
reserved values comes out at 3 bytes, so it takes 6 bytes to address such a
character via surrogates.  That 6-byte form is not well-formed UTF-8,
though (it is what is now called CESU-8).  UTF-8 proper encodes the Unicode
values U-00010000..U-0010FFFF directly as 4 bytes.
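
To make the 6-versus-4 byte count concrete, here is a rough Python sketch.
The helper names (surrogate_pair, three_byte_pattern) and the sample code
point U+10400 are my own choices for illustration, not anything taken from
the standards; the bit patterns themselves are the usual UTF-16 and UTF-8
ones.

```python
def surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into two surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside U+10000..U+10FFFF")
    offset = cp - 0x10000             # the 20 bits a pair can carry
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def three_byte_pattern(value):
    """Apply the 3-byte UTF-8 bit pattern to a value in U+0800..U+FFFF.

    Hypothetical helper; it deliberately does not reject surrogates.
    """
    return bytes([
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


cp = 0x10400  # arbitrary sample code point beyond U+FFFF
high, low = surrogate_pair(cp)

# Route 1: encode each surrogate separately, 3 bytes apiece.
six_bytes = three_byte_pattern(high) + three_byte_pattern(low)

# Route 2: let Python's UTF-8 codec encode the code point directly.
four_bytes = chr(cp).encode("utf-8")

print("surrogates: U+%04X U+%04X" % (high, low))
print("via surrogates:", six_bytes.hex(), len(six_bytes), "bytes")
print("direct UTF-8:  ", four_bytes.hex(), len(four_bytes), "bytes")
```

For U+10400 the surrogates are U+D801 U+DC00, the surrogate route gives
eda081edb080 (6 bytes) and the direct route gives f0909080 (4 bytes).
Python's strict UTF-8 decoder refuses the 6-byte form.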

<http://czyborra.com/utf/> explains some of the details.

Chris Gray
University of Waterloo Library
