   Re: UTF-8 vs UTF-16...? (Was: Feeling good about SML)

  • From: Tony Graham <tgraham@mulberrytech.com>
  • To: xml-dev@ic.ac.uk
  • Date: Wed, 17 Nov 1999 10:51:56 -0400 (EST)

At 17 Nov 1999 14:29 GMT, Steve Schafer wrote:
 > On 17 Nov 1999 13:24:27 +0100, you wrote:
 > >Not sure if I understand the UTF-16 bit above, but I'm reading this:
 > >        <URL:http://www.unicode.org/unicode/faq/#UTF-16 and UCS-4>
 > >to UTF-16 being able to represent the full UCS-4, which is what you
 > >say UTF-8 can do, if I interpret you correctly...?
 > Section C.3 of the Unicode 2.0 spec, paragraph 4:
 > "UTF-16 does not support the representation of all the UCS-4 code
 > space but is limited to the BMP and the next 16 planes...."

True, but that's more code values than anybody expects to ever
standardise (although that's the opinion of the same people that
thought that they'd never need more than the BMP).
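
The "BMP and the next 16 planes" limit falls out of the surrogate
arithmetic. A rough sketch in Python, assuming the surrogate ranges
U+D800..U+DBFF (high) and U+DC00..U+DFFF (low); the variable names are
mine and purely illustrative:

```python
# Surrogate ranges used by UTF-16 to reach beyond the BMP.
HIGH_SURROGATES = range(0xD800, 0xDC00)   # 1024 values
LOW_SURROGATES = range(0xDC00, 0xE000)    # 1024 values
PLANE_SIZE = 0x10000                      # code values per plane

# Each high/low pair addresses one code value outside the BMP.
supplementary_code_points = len(HIGH_SURROGATES) * len(LOW_SURROGATES)

# How many whole planes that covers, and the last addressable value.
planes_beyond_bmp = supplementary_code_points // PLANE_SIZE
highest_code_point = PLANE_SIZE + supplementary_code_points - 1

print(supplementary_code_points)
print(planes_beyond_bmp)
print(hex(highest_code_point))
```

So UTF-16 tops out at the end of Plane 16, which is still over a
million code values beyond the BMP.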

All of the currently defined Unicode and ISO/IEC 10646 characters
(both standards define the same characters) are in the BMP.  It won't
be long until characters are defined in Plane 1 and Plane 2 (with
possible spill-over into Plane 3), plus Planes 15 and 16 are reserved
for private use.

Currently the only things defined beyond Plane 16 of Group 00
(i.e. beyond the characters addressable with UTF-16) are more areas
available for private use.

The fuss over UTF-8 or UTF-16 is over the number of bytes used to
represent the characters in the BMP, i.e. the currently defined
characters.  UTF-16 uses two bytes per character for everything in
the BMP.  UTF-8 uses one byte per character for the ASCII characters,
two bytes per character for the next 1,920 code values (U+0080 to
U+07FF), and three bytes per character for the rest of the BMP.  Both
UTF-8 and UTF-16 use four bytes per character to represent the
characters in Planes 1 to 16.

(There's also UTF-32, which is four bytes per character for all the
characters that you can represent with UTF-16.)
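
The byte counts above are easy to check for yourself. A minimal
sketch in Python; the helper name and the sample characters are my own
choices for illustration, and the "-be" codec variants are used only
so that no byte order mark is counted:

```python
def bytes_per_character(ch):
    """Return the encoded length of one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),   # no BOM
        "utf-32": len(ch.encode("utf-32-be")),   # no BOM
    }

# One sample from each UTF-8 length class (illustrative picks).
samples = [
    ("U+0041 (ASCII)", "\u0041"),
    ("U+00E9 (Latin-1 range)", "\u00e9"),
    ("U+4E2D (CJK, in the BMP)", "\u4e2d"),
    ("U+10000 (first code value in Plane 1)", "\U00010000"),
]

for label, ch in samples:
    print(label, bytes_per_character(ch))
```

The last sample is there only to show the four-byte case; nothing
here depends on whether a character has been assigned at that code
value.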

UTF-8 is efficient if you use a lot of ASCII, e.g. if you're an
English speaker and all you use is ASCII, but it's more bytes per
character than UTF-16 for a whole lot of other scripts (plus it's more
bytes per character than a lot of current script-specific encodings).
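
To put rough numbers on that, here is a small sketch comparing whole
strings rather than single characters. The sample strings and the
choice of ISO 8859-7 and Shift_JIS as the "script-specific" encodings
are my own assumptions for illustration; any similar legacy encoding
would make the same point:

```python
def encoded_size(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

# (label, sample text, a script-specific encoding for that text)
texts = [
    ("English", "hello", "ascii"),
    ("Greek", "\u03b1\u03b2\u03b3\u03b4\u03b5", "iso8859_7"),
    ("Japanese", "\u3053\u3093\u306b\u3061\u306f", "shift_jis"),
]

for label, text, legacy in texts:
    print(
        label,
        "utf-8:", encoded_size(text, "utf-8"),
        "utf-16:", encoded_size(text, "utf-16-be"),  # no BOM
        legacy + ":", encoded_size(text, legacy),
    )
```

Each sample is five characters long, so the totals can be compared
directly across the three rows.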

So the issue isn't how many characters the different encodings can
represent, but how efficiently (or how uniformly) they represent the
currently defined characters.


Tony Graham
Tony Graham                            mailto:tgraham@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
