   UTF-8

  • From: emberson@faslab.com (Richard Emberson)
  • To: xml-dev@ic.ac.uk
  • Date: Fri, 16 Oct 1998 15:48:38 -0700

Does the UTF-8 encoding require that the minimum byte count
be used when a character is encoded?
Recall that the form of a UTF-8 encoding is:

 0xxxxxxx
 110xxxxx 10xxxxxx
 1110xxxx 10xxxxxx 10xxxxxx
 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx


So one could, for example, claim that:

 00111111

and

 11000000 10111111

represent the same character, #x3F, or

 11110001 10111111 10111111 10111111

and

 11111000 10000001 10111111 10111111 10111111

represent #x7FFFF (note: x10000 < x7FFFF < x10FFFF, and so it is legal).
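
To make the claim concrete, here is a rough Python sketch of a decoder that
follows only the bit patterns listed above. It deliberately performs no
shortest-form or range checks, so it is not meant as a conforming decoder;
the name naive_decode is made up for illustration.

```python
def naive_decode(data: bytes) -> int:
    """Decode one sequence using only the lead/continuation bit patterns.

    Hypothetical helper: no shortest-form or range checks are performed.
    """
    first = data[0]
    if first < 0x80:
        # 0xxxxxxx: a single byte carries the value directly.
        length, value = 1, first
    else:
        # Count the leading 1 bits of the first byte to get the length.
        length = 0
        mask = 0x80
        while first & mask:
            length += 1
            mask >>= 1
        if length < 2 or length > 6:
            raise ValueError("invalid lead byte")
        # The bits below the terminating 0 are the payload of the lead byte.
        value = first & (mask - 1)
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


print(hex(naive_decode(bytes([0b00111111]))))              # 0x3f
print(hex(naive_decode(bytes([0b11000000, 0b10111111]))))  # 0x3f
print(hex(naive_decode(bytes([0b11110001, 0b10111111,
                              0b10111111, 0b10111111]))))  # 0x7ffff
print(hex(naive_decode(bytes([0b11111000, 0b10000001, 0b10111111,
                              0b10111111, 0b10111111]))))  # 0x7ffff
```

Under these assumptions the short and the long form of each pair do come
out as the same value.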

I ask because I want to know whether an XML parser has to worry about
5- and 6-byte UTF-8 encodings, or whether it can *always* assume that the
values represented by such encodings are not legal Unicode characters.
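
The arithmetic behind the question can be sketched as follows. Counting the
x bits in each form gives the largest value that fits in a given number of
bytes; the table and the function name minimum_length are my own, for
illustration only. If the minimum byte count is required, and x10FFFF is
taken as the upper limit, then a 5- or 6-byte sequence is either longer than
it needs to be or encodes a value above that limit.

```python
# Largest value that fits in 1..6 bytes, from the number of x bits in
# each form: 7, 11, 16, 21, 26 and 31 bits respectively.
MAX_FOR_LENGTH = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def minimum_length(value: int) -> int:
    """Return the smallest number of bytes whose x bits can hold value."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for length, limit in enumerate(MAX_FOR_LENGTH, start=1):
        if value <= limit:
            return length
    raise ValueError("value does not fit in 6 bytes")


print(minimum_length(0x3F))      # 1
print(minimum_length(0x7FFFF))   # 4
print(minimum_length(0x10FFFF))  # 4
print(minimum_length(0x200000))  # 5
```

So under that reading, the first value that genuinely needs five bytes is
x200000, which is already beyond x10FFFF.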

Thanks.

Richard Emberson
