UTF-8 Question: e with acute accent should require two bytes, right?
- From: "Costello, Roger L." <costello@mitre.org>
- To: <xml-dev@lists.xml.org>
- Date: Fri, 28 Sep 2007 11:12:52 -0400
Hi Folks,
Consider this element:
<title>My Resumé</title>
Notice: é (the character "e" with an acute accent). It is U+00E9.
Since its code point is at or above U+0080, it requires more than one
byte.
Hex E9 = Decimal 233. This has the binary: 11101001
I believe that it is encoded in UTF-8 as two bytes:
11000011 10101001
These bytes correspond to hex C3 and hex A9.
Thus, é should be encoded in UTF-8 as:
C3A9
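
The bit arithmetic above can be checked with a short Python sketch. It assumes the code point lies in the two-byte range U+0080 to U+07FF, and the function name encode_two_byte is made up for illustration:

```python
def encode_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    # Lead byte: 110xxxxx carries the top 5 of the 11 payload bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: 10xxxxxx carries the low 6 payload bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])

encoded = encode_two_byte(0xE9)
print(encoded.hex().upper())  # C3A9
```

For U+00E9 the payload bits 00011 101001 split into a lead byte of 11000011 and a continuation byte of 10101001, matching the two bytes worked out by hand.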
The code points of the other characters (My Resum) are all less than
U+0080, and so the UTF-8 encoding of those characters should be only
one byte.
So, this is what I believe should be the bytes:
M  y     R  e  s  u  m  é
4D 79 20 52 65 73 75 6D C3 A9
Do you agree?
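
One way to cross-check the expected byte sequence is to let Python's built-in str.encode produce the UTF-8 bytes. This is only a sanity check of the arithmetic, not a statement about what any particular editor writes to disk:

```python
# "\u00e9" is é, written as an escape so the source file's own
# encoding cannot affect the result.
text = "My Resum\u00e9"
utf8_bytes = text.encode("utf-8")

# Show each byte as a two-digit uppercase hex pair.
hex_pairs = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_pairs)  # 4D 79 20 52 65 73 75 6D C3 A9
```

The nine characters become ten bytes, because é is the only character that needs two.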
However, when I view the bytes in my hex editor I get this:
M  y     R  e  s  u  m  é
4D 79 20 52 65 73 75 6D E9
Notice that é uses only one byte.
Something is wrong. Here's what I think may be wrong:
- the editor that I am using to display the hex values is displaying
the code points and not the encoded bytes. However, I have now tried two
editors, and they both display the same thing (E9). So perhaps the
editor isn't the problem. Perhaps I'm the problem, and am
misunderstanding something. Help!
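
A way to take the editor out of the picture is to inspect the raw bytes directly and ask whether they decode as UTF-8 at all. The sketch below is a minimal illustration; the helper name inspect_bytes and the file name in the comment are hypothetical, and the byte string is simply the one reported above:

```python
def inspect_bytes(data):
    """Return the bytes as hex pairs and whether they form valid UTF-8."""
    hex_pairs = " ".join(f"{b:02X}" for b in data)
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
        is_valid_utf8 = True
    except UnicodeDecodeError:
        is_valid_utf8 = False
    return hex_pairs, is_valid_utf8

# In practice the bytes would come from the file itself, for example
# open("resume.xml", "rb").read(), where "resume.xml" is a made-up name.
observed = bytes.fromhex("4D79 2052 6573 756D E9")
expected = "My Resum\u00e9".encode("utf-8")

print(inspect_bytes(observed))
print(inspect_bytes(expected))
```

If the observed bytes fail the UTF-8 check while the expected ones pass, the difference lies in the bytes stored in the file rather than in how the hex editor displays them.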
/Roger