UTF-8 Question: e with acute accent should require two bytes, right?


From: "Costello, Roger L." <costello@mitre.org>
To: <xml-dev@lists.xml.org>
Date: Fri, 28 Sep 2007 11:12:52 -0400

Hi Folks,
 
Consider this element:
 
<title>My Resumé</title>

Notice: é (the character "e" with an acute accent). It is U+00E9.

Since its code point is greater than U+007F, it requires more than one
byte.

Hex E9 = Decimal 233.  In binary this is: 11101001

I believe that it is encoded in UTF-8 as two bytes:

  11000011 10101001

These bytes correspond to hex C3 and hex A9.

Thus, é should be encoded in UTF-8 as:

  C3A9
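
The bit arithmetic above can be checked with a short Python sketch. The helper name encode_two_byte is hypothetical (it is not from the post), and it only handles the two-byte range U+0080..U+07FF, where the lead byte has the form 110xxxxx and the trailing byte has the form 10xxxxxx:

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the two-byte form only covers U+0080..U+07FF")
    # Lead byte: 110xxxxx, carrying the top 5 bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Trailing byte: 10xxxxxx, carrying the low 6 bits of the code point.
    trail = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, trail])


encoded = encode_two_byte(0xE9)
print(encoded.hex().upper())                 # C3A9
# Cross-check against Python's built-in UTF-8 encoder.
print(encoded == "\u00e9".encode("utf-8"))   # True
```

This agrees with the hand calculation: 11101001 splits into 00011 and 101001, giving 11000011 10101001.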

The code points of the other characters (My Resum) are all less than
U+0080, and so the UTF-8 encoding of each of those characters should be
only one byte.

So, this is what I believe should be the bytes:

 M y    R  e s  u m   é
4D79 2052 6573 756D C3A9

Do you agree?
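
As a cross-check of the expected bytes, here is a minimal Python sketch that encodes the element's text as UTF-8 and prints it in the same four-hex-digit grouping used above. The variable names are only for illustration:

```python
text = "My Resum\u00e9"          # the element content, with U+00E9 written as an escape
utf8_bytes = text.encode("utf-8")

# Group the hex digits four at a time to match the layout used in the post.
hex_digits = utf8_bytes.hex().upper()
groups = [hex_digits[i:i + 4] for i in range(0, len(hex_digits), 4)]

print(" ".join(groups))          # 4D79 2052 6573 756D C3A9
print(len(text), len(utf8_bytes))  # 9 characters, 10 bytes
```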

However, when I view the bytes in my hex editor I get this:

 M y    R  e s  u m  é
4D79 2052 6573 756D E9

Notice that é uses only one byte.
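
One way to test the observed bytes is to ask a decoder whether they form valid UTF-8 at all. The sketch below does that in Python, and also tries ISO-8859-1 (Latin-1) purely as a candidate single-byte encoding. That candidate is an assumption for the sake of the experiment; it does not establish how the file was actually saved:

```python
# The bytes as shown by the hex editor (bytes.fromhex ignores the spaces).
observed = bytes.fromhex("4D79 2052 6573 756D E9")

try:
    observed.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # A lone E9 byte at the end is not a complete UTF-8 sequence.
    valid_utf8 = False

print(valid_utf8)                    # False

# Assumption: try Latin-1, where each byte maps directly to the same code point.
as_latin1 = observed.decode("latin-1")
print(as_latin1 == "My Resum\u00e9")  # True
```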

Something is wrong.  Here's what I think may be wrong:
- the editor that I am using to display the hex values is displaying
the code points and not the encoded bytes. However, I have now tried two
editors, and they both display the same thing (E9).  So perhaps the
editor isn't the problem.  Perhaps I'm the problem, and am
misunderstanding something.  Help!

/Roger

Follow-Ups:
- [Summary] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: "Costello, Roger L." <costello@mitre.org>
- RE: [xml-dev] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: "Waters, Michael, Springer US" <Mike.Waters@springer.com>
- Re: [xml-dev] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: Philippe Poulard <philippe.poulard@sophia.inria.fr>
- RE: [xml-dev] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: "Michael Kay" <mike@saxonica.com>
- Re: [xml-dev] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: Jonathan Robie <jonathan.robie@redhat.com>
- Re: [xml-dev] UTF-8 Question: e with acute accent should require two bytes, right?
  - From: David Carlisle <davidc@nag.co.uk>
