xml-dev - Re: UTF-8 vs UTF-16...?


From: kragen@pobox.com (Kragen Sitaker)
To: xml-dev@ic.ac.uk
Date: Wed, 17 Nov 1999 13:10:13 -0500 (EST)

According to the latest Unicode book (is it version 2.0? Or 3.0?)
UTF-8 does not allow you to encode more than the first 17 planes of ISO
10646. If I remember correctly, the formats are as follows (leading
zero bits of the output are omitted):

one byte:
0xxxxxxx -> xxxxxxx
two bytes:
110yyyyy 10xxxxxx -> yyy yyxxxxxx
three bytes:
1110zzzz 10yyyyyy 10xxxxxx -> zzzzyyyy yyxxxxxx
four bytes:
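11110uuu 10uuzzzz 10yyyyyy 10xxxxxx -> uuuuu zzzzyyyy yyxxxxxx
where uuuuu is at most 10000 binary, which keeps the result within the
first 17 planes. (These characters are encoded with surrogate pairs in
UTF-16, where the high surrogate stores wwww = uuuuu - 1.) I may be
mistaken about the details of this one; my book is at home.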
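
To make the bit layouts above concrete, here is a minimal sketch of an encoder in Python. The function name utf8_encode is my own, and the rejection of lone surrogates is an assumption about how a strict encoder should behave, not something taken from the tables above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the one- to four-byte formats."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the first 17 planes")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogate values are not encoded on their own.
        raise ValueError("lone surrogate")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, utf8_encode(0x20AC) gives the three bytes E2 82 AC, and utf8_encode(0x10FFFF), the last code point of plane 16, gives F4 8F BF BF.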

No five-byte or longer sequences are listed. No valid sequences
starting with more than four ones are listed. Presumably these two
omissions correspond, and an extended UTF-8 with these additions would
allow you to handle larger character sets.

It may be that other standards actually specify such an extended UTF-8.
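
As a sketch of what such an extension could look like, the code below simply continues the same lead-byte pattern to five and six bytes. As far as I know this matches the scheme in RFC 2279, which covers values up to 0x7FFFFFFF, but treat that as an assumption to check against the RFC; the function name extended_utf8_encode is hypothetical.

```python
def extended_utf8_encode(value: int) -> bytes:
    """Encode a value up to 31 bits using one to six bytes."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, lead-byte marker, continuation byte count)
    formats = (
        (0x800, 0xC0, 1),        # 110xxxxx
        (0x10000, 0xE0, 2),      # 1110xxxx
        (0x200000, 0xF0, 3),     # 11110xxx
        (0x4000000, 0xF8, 4),    # 111110xx (five bytes)
        (0x80000000, 0xFC, 5),   # 1111110x (six bytes)
    )
    for limit, lead, count in formats:
        if value < limit:
            out = [lead | (value >> (6 * count))]
            # Remaining bits go into 10xxxxxx bytes, high bits first.
            for i in range(count - 1, -1, -1):
                out.append(0x80 | ((value >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: range was checked above")
```

With this, 0x200000 (the first value needing five bytes) becomes F8 88 80 80 80, and 0x7FFFFFFF becomes FD BF BF BF BF BF.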

So "bigger character range" is probably not a valid reason for wanting
to use UTF-8 -- quite aside from the question of whether you really
need more than the million or so characters UTF-16 can encode --
because UTF-8 decoders implemented according to Unicode's spec will
choke if you try to encode bigger characters in it.
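
To illustrate the choking, here is a small sketch that uses Python's built-in UTF-8 codec as a stand-in for a decoder that follows the Unicode tables. That choice is my assumption; other decoders may be more lenient. The helper name strict_decode is hypothetical.

```python
def strict_decode(data: bytes):
    """Return the decoded text, or None if the decoder rejects the bytes."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

# Last code point of plane 16: the largest value the four-byte form allows.
largest = bytes([0xF4, 0x8F, 0xBF, 0xBF])
# Four-byte pattern for 0x110000, one past the 17 planes.
too_big = bytes([0xF4, 0x90, 0x80, 0x80])
# Five-byte pattern starting with 111110xx.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

for sample in (largest, too_big, five_byte):
    result = strict_decode(sample)
    print(sample.hex(), "accepted" if result is not None else "rejected")
```

Only the first sample is accepted; the other two are rejected even though they follow the same bit pattern.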

--
<kragen@pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>

