- From: kragen@pobox.com (Kragen Sitaker)
- To: xml-dev@ic.ac.uk
- Date: Wed, 17 Nov 1999 13:10:13 -0500 (EST)
According to the latest Unicode book (is it version 2.0? Or 3.0?)
UTF-8 does not allow you to encode more than the first 17 planes of ISO
10646. If I remember correctly, the formats are (omitting leading
output zero bits):
one byte:
0xxxxxxx -> xxxxxxx
two bytes:
110yyyyy 10xxxxxx -> yyy yyxxxxxx
three bytes:
1110zzzz 10yyyyyy 10xxxxxx -> zzzzyyyy yyxxxxxx
four bytes:
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx -> uuuuu zzzzyyyy yyxxxxxx
where uuuuu is at most 10000 binary, which is what stops at plane 16.
(These characters are encoded with surrogate pairs in UTF-16, whose
high surrogate stores wwww = uuuuu - 1.) I may be mistaken about this
one; my book is at home.
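A minimal Python sketch of the four layouts above, for anyone who wants
to check the bit patterns by hand. It assumes the four-byte form yields
the five bits uuuuu unchanged and that the minus-one adjustment belongs
to the UTF-16 surrogate split; the names utf8_encode and
utf16_surrogates are made up for illustration.

```python
def utf8_encode(cp):
    """Encode one code point (an int) with the one- to four-byte layouts."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_surrogates(cp):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("no surrogate pair for %#x" % cp)
    v = cp - 0x10000                   # this subtraction is wwww = uuuuu - 1
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10000, 0x10FFFF):
        # Compare against Python's own encoder as a sanity check.
        assert utf8_encode(cp) == chr(cp).encode("utf-8")
        print("%06X" % cp, utf8_encode(cp).hex())
```

The largest lead byte this ever produces is 0xF4, which is why plane 16
is the ceiling.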
No five-byte or longer sequences are listed. No valid sequences
starting with more than four ones are listed. Presumably these two
omissions correspond, and an extended UTF-8 with these additions would
allow you to handle larger character sets.
It may be that other standards actually specify such an extended UTF-8.
So "bigger character range" is probably not a valid reason for wanting
to use UTF-8 -- quite aside from the question of whether you really
need more than the million or so characters UTF-16 can encode --
because UTF-8 decoders implemented according to Unicode's spec will
choke if you try to encode bigger characters in it.
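To make "choke" concrete, here is a small Python sketch that feeds a
strict decoder one value inside the 17 planes and two outside. Python's
built-in UTF-8 codec stands in for "a decoder implemented according to
Unicode's spec"; the helper name decodes is an assumption for
illustration.

```python
def decodes(raw):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


largest = b"\xf4\x8f\xbf\xbf"        # U+10FFFF, last code point of plane 16
too_big = b"\xf4\x90\x80\x80"        # would be 0x110000, just past plane 16
five_byte = b"\xf8\x88\x80\x80\x80"  # 111110xx lead byte, would be 0x200000

print(decodes(largest), decodes(too_big), decodes(five_byte))
# prints: True False False
```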
--
<kragen@pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>