Re: [xml-dev] [Summary] UTF-8 Question: e with acute accent should require two bytes, right?

From: Rick Jelliffe <rjelliffe@allette.com.au>
To: xml-dev@lists.xml.org
Date: Mon, 01 Oct 2007 14:13:14 +1000

Internationalization experts, who need precision in order to be clear
about their meaning when discussing things, tend to use the following
terms distinctly:

* Character repertoire: unordered bag of characters. E.g. Latin 1
repertoire.

* Coded character set (CCS): ordered set of characters: one or more
repertoires mapped to numbers (usually, but not always, distinct
numbers). E.g. ISO 646-US.

* Character encoding scheme (CES): a function that gives a sequence of
bytes for a string of characters from a character set (or from multiple
character sets in the case of escaped encodings.) E.g. UTF-8

* Higher order protocol: e.g. XML numeric character references.

So "character" is only used either to mean
* the thing that is the same between a repertoire, CCS and CES, or
* character in a particular repertoire, CCS or CES.
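
To make the four levels above concrete, here is a minimal Python sketch
that looks at one character (e with acute accent) at each level. The
variable names are illustrative only, and the Latin 1 repertoire is
approximated as code points 0-255, which is a simplification.

```python
# One abstract character, viewed at each level described above.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Repertoire: is the character a member of the (unordered) collection?
# Assumption: Latin 1 is approximated here as code points 0x00-0xFF.
latin1_repertoire = {chr(n) for n in range(0x100)}
in_latin1 = ch in latin1_repertoire

# CCS: the number assigned to the character in Unicode / ISO 10646.
code_point = ord(ch)

# CES: the byte sequence produced for that character.
utf8_bytes = ch.encode("utf-8")      # two bytes
latin1_bytes = ch.encode("latin-1")  # one byte

# Higher order protocol: an XML numeric character reference, which lets
# an ASCII-only byte stream still carry the character.
xml_ref = ch.encode("ascii", errors="xmlcharrefreplace")

print(in_latin1, hex(code_point), utf8_bytes, latin1_bytes, xml_ref)
```

The same character has one code point (0xE9) but different byte
sequences depending on the CES, which is why "how many bytes?" is a CES
question.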

Two terms that are rarely used, or are used only condescendingly or
pedagogically, are ASCII and ANSI (as names for a character
repertoire/set/encoding scheme), for several reasons. For a start,
"ANSI" is not from ANSI. Also, ASCII has regional variants, so very
often it is ISO 646 that is meant, and ISO 646-US is used to make clear
which member of the ASCII family is meant. (In other words, people in
English-speaking countries use ASCII to mean two different concepts:
7-bit clean strings (which could be in any ISO 646 variant) and actual
ASCII characters.) But perhaps the main reason ASCII and ANSI are
avoided is that they come from a time before the three-fold distinction
above was widely accepted. Sometimes people use US-ASCII rather than ISO
646-US or IS646-US (http://en.wikipedia.org/wiki/Character_encoding is
good).
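
The difference between "7-bit clean" and "actual ASCII characters" can
be sketched as follows. The helper names and the override table are
hypothetical, and only one differing position is modeled (0x23, which
is the pound sign in the GB variant of ISO 646); real national variants
differ in more positions.

```python
# Hypothetical table: positions where a variant differs from ISO 646-US.
# Only 0x23 is modeled for the GB variant; this is not a complete mapping.
ISO646_OVERRIDES = {
    "US": {},
    "GB": {0x23: "\u00a3"},  # POUND SIGN in place of NUMBER SIGN
}

def is_7bit_clean(data: bytes) -> bool:
    """True if every byte fits in 7 bits, whatever the variant."""
    return all(b < 0x80 for b in data)

def decode_iso646(data: bytes, variant: str) -> str:
    """Decode 7-bit bytes under the named (partially modeled) variant."""
    if not is_7bit_clean(data):
        raise ValueError("not 7-bit clean")
    overrides = ISO646_OVERRIDES[variant]
    return "".join(overrides.get(b, chr(b)) for b in data)

data = b"#5"
us_text = decode_iso646(data, "US")
gb_text = decode_iso646(data, "GB")
print(is_7bit_clean(data), us_text, gb_text)
```

The same 7-bit clean bytes yield different characters under the two
variants, so "7-bit clean" alone does not tell you the characters are
ASCII.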

Another term that is rarely used is plain "Character set", because no-one
knows whether you mean repertoire, CCS or CES. And so most material on
the web and even in standards that dates from before 1990 (and perhaps
even 1999) is terribly confused in terminology. Originally Unicode was a
16-bit CES (UCS-2) but now it is the CCS and UTF-* are the CES, for
example.
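
As a rough illustration of Unicode as the CCS with the UTF-* family as
alternative CESs, the sketch below encodes two code points three ways
using Python's built-in codecs. The choice of the second character (a
musical symbol outside the 16-bit range) is mine, picked to show why a
fixed 16-bit encoding is no longer enough.

```python
e_acute = "\u00e9"      # U+00E9, fits in 16 bits
g_clef = "\U0001d11e"   # U+1D11E, does not fit in 16 bits

encodings = ("utf-8", "utf-16-be", "utf-32-be")

# For each CES, record the encoded length in bytes of each character.
sizes = {
    name: (len(e_acute.encode(name)), len(g_clef.encode(name)))
    for name in encodings
}

# UTF-16 needs two 16-bit units (a surrogate pair) for U+1D11E.
g_clef_utf16 = g_clef.encode("utf-16-be")

for name in encodings:
    print(name, sizes[name])
```

One code point per character in the CCS, but the byte count depends
entirely on which CES is chosen.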

People interested in studying this should look at Dan Connolly's
"Charset considered harmful"
http://www.w3.org/MarkUp/html-spec/charset-harmful.html
The XML encoding declaration is "encoding" not "charset" on purpose.

It probably goes without saying on this forum, but there is also "ASCII"
considered as a set of glyphs (e.g. an "ASCII font"). People who want to
get up to speed on the character issue might well start with the ISO
document
http://standards.iso.org/ittf/PubliclyAvailableStandards/c027163_ISO_IEC_TR_15285_1998(E).zip

So what is the point of this? That any discussion of characters, other
than a trivial one, does well to state explicitly whether "character" is
being used to mean a member of a repertoire, a code point in a CCS, a
byte sequence from a CES, or something else. Roger's question was
clearly about CES, and responses in terms of repertoire and CCS, though
interesting, are surely tangential.

So ISO 646-US (i.e. ASCII) as a repertoire is a subset of the ISO 10646
repertoire. As a CCS it is a subset of the Unicode CCS. And as a CES it
is a subset of the UTF-8 CES.
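
The CCS and CES subset claims can be checked mechanically. This is a
small sketch using Python's built-in "ascii" and "utf-8" codecs; the
variable names are illustrative only.

```python
# CCS: each ASCII byte value decodes to the character with the same
# Unicode code point.
ccs_subset = all(ord(bytes([n]).decode("ascii")) == n for n in range(128))

# CES: every ASCII character encodes to the same single byte in UTF-8.
ces_subset = all(
    chr(n).encode("ascii") == chr(n).encode("utf-8") for n in range(128)
)

# The reverse does not hold: e with acute accent is outside ASCII.
try:
    "\u00e9".encode("ascii")
    outside_ascii = False
except UnicodeEncodeError:
    outside_ascii = True

print(ccs_subset, ces_subset, outside_ascii)
```

So any ISO 646-US byte stream is already valid UTF-8, while e with acute
accent needs the two-byte UTF-8 form.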

Cheers
Rick Jelliffe

P.S. Even the three-fold repertoire/CCS/CES distinction is not really
good enough for every case. However, to get more complicated drowns us
in the sea of details rather than rescuing us.
