RE: [xml-dev] [Summary] UTF-8 Question: e with acute accent should require two bytes, right?

From: "Alessandro Triglia" <sandro@mclink.it>
To: "'Amelia A Lewis'" <amyzing@talsever.com>,<xml-dev@lists.xml.org>
Date: Sat, 29 Sep 2007 14:40:27 -0400

> -----Original Message-----
> From: Amelia A Lewis [mailto:amyzing@talsever.com] 
> Sent: Saturday, September 29, 2007 13:59
> To: xml-dev@lists.xml.org
> Subject: RE: [xml-dev] [Summary] UTF-8 Question: e with acute 
> accent should require two bytes, right?
> 
> On 2007-09-29 10:51:36 -0400 "Michael Kay" <mike@saxonica.com> wrote:
> >> I read "ASCII character" in a similar way as I read "TCP/IP packet"
> >> or "SOAP envelope" or "HTTP header".  Perhaps other people read it
> >> differently.
> > No, I read it the same.
> >
> > I think that an ASCII character is a Unicode character in the same way
> > that an XML document is an SGML document. One thing can conform to
> > more than one description.
> 
> We were speaking specifically of "ASCII" and "UTF-8", no?
> 
> The ASCII character set is a proper subset of UTF-8 (and a 
> proper subset of ISO-8859-x, and of several other encoding 
> schemes).  Identical bit-patterns identify identical characters.
> 
> So I agree that it is over-precise, tending toward confusion, 
> to claim that the "A" in UTF-8 encoding is something 
> different from "A" in ASCII encoding, or from "A" in 
> ISO8859-1, -2, -8, or whatever, since *the design of those 
> larger character repertoires deliberately and consciously 
> intended to leave the ASCII subset unchanged.*  And 
> consequently it is perfectly correct to say that "A" is an 
> ASCII character, but Á is not.  (In this email, if I recall 
> how I set up the client correctly, the latter is a UTF-8 
> encoded Latin capital A with acute accent; while this 
> character is also found in the repertoire of ISO8859-1, it is 
> encoded differently so that it is far more justifiable to 
> claim that it is in some sense a "different" character (it 
> is, at least, a different encoding of the character)).

I don't see what "UTF-8 character" could mean other than "a (Unicode)
character encoded in UTF-8".  Some people even use it to mean a single byte
of such an encoding.  Likewise, lots of people (I just did a Google search)
who say "ASCII character" are actually referring to the bit pattern as well
as to the character represented by that bit pattern.  So some people refer
to the character, some to the encoding, and some to both.  I think that,
since the ASCII standard specifies both, the two aspects of an ASCII
character are inseparable as long as one uses the phrase "ASCII character".
So I disagree that the word "ASCII" as an adjective means "supported in
ASCII" to everybody.  It means something closer to "supported in ASCII and
encoded in this way".  That matches UTF-8, but not Unicode in general.
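
To make the character/encoding distinction concrete, here is a minimal
Python sketch.  The helper name encoded_bytes is only illustrative (it is
not part of any library), and the three encodings compared are simply the
ones mentioned in this thread.  It shows that "A" has the same bit pattern
in ASCII, ISO-8859-1 and UTF-8, while A-acute and e-acute are not ASCII at
all, take one byte in ISO-8859-1, and take two bytes in UTF-8.

```python
def encoded_bytes(ch, encoding):
    """Return the bytes of ch in the given encoding as a hex string,
    or None if the character cannot be represented in that encoding."""
    try:
        return ch.encode(encoding).hex()
    except UnicodeEncodeError:
        return None


# "A", capital A with acute (U+00C1), small e with acute (U+00E9)
for ch in ("A", "\u00c1", "\u00e9"):
    row = {enc: encoded_bytes(ch, enc)
           for enc in ("ascii", "iso-8859-1", "utf-8")}
    print(f"U+{ord(ch):04X}", row)
```

Only the first row is identical across all three columns, which is the
sense in which "ASCII character" can name both a character and its bit
pattern without ambiguity.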

Alessandro

Follow-Ups:
- Re: [xml-dev] [Summary] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: richard@inf.ed.ac.uk (Richard Tobin)

References:
- RE: [xml-dev] [Summary] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: "Michael Kay" <mike@saxonica.com>
- RE: [xml-dev] [Summary] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: Amelia A Lewis <amyzing@talsever.com>