RE: [xml-dev] UTF-8 Question: e with acute accent should require two bytes, right?

From: "Waters, Michael, Springer US" <Mike.Waters@springer.com>
To: "Costello, Roger L." <costello@mitre.org>, xml-dev@lists.xml.org
Date: Fri, 28 Sep 2007 12:51:56 -0400

> Notice: é (the character "e" with an acute accent). It is U+00E9
> 
> Since its code point is greater than U+0080, it requires more than one
> byte. 

It depends on the encoding. In ISO 8859-1 (Latin-1) and Windows-1252 (the default for many editors), only 1 byte is required: 0xE9.
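
As a quick illustration of that point, here is a minimal Python sketch that serializes the same character under three encodings. The codec names ("latin-1", "cp1252", "utf-8") are Python's names for ISO 8859-1, Windows-1252, and UTF-8; the variable names are just for this example.

```python
# The same character, U+00E9, serialized under three encodings.
ch = "\u00e9"  # e with acute accent

# Map each codec name to the bytes it produces for this one character.
encoded = {name: ch.encode(name) for name in ("latin-1", "cp1252", "utf-8")}

for name, data in encoded.items():
    # Print the codec, the byte count, and the bytes in hex.
    print(f"{name:8} {len(data)} byte(s): {data.hex().upper()}")
```

The two single-byte encodings give E9, which happens to match the code point value; only UTF-8 gives the two-byte form.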

> Thus, é should be encoded in UTF-8 as:
> 
>   C3A9

Yes.
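
For anyone who wants to check the arithmetic, the two-byte UTF-8 form is 110xxxxx 10xxxxxx, with the code point's bits filling the x positions. A small sketch of that rule follows; the function name utf8_two_byte is made up for this example, and it only covers the two-byte range (U+0080 through U+07FF).

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the two-byte UTF-8 range by hand."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0xC0 | (code_point >> 6)     # 110xxxxx: the upper 5 bits
    trail = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the lower 6 bits
    return bytes([lead, trail])


result = utf8_two_byte(0xE9)
print(result.hex().upper())
```

The result can be compared against Python's built-in encoder, "\u00e9".encode("utf-8"), as a sanity check.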

> Something is wrong.  Here's what I think may be wrong:
> - the editor that I am using to display the hex values is displaying
> the code points and not the hex values. However, I have now tried two
> editors, and they both display the same thing (E9).

PSPad has 2 methods to invoke a hex view of a file, giving somewhat different results:

1. Open the file in the default Text Editor mode, then switch to View/Hex Edit Mode. Here, encoding conversions come into play when you switch views, so what you see is closer to the "bytes in memory."

2. Open the file directly in the Hex Editor, by selecting File/Open in Hex Editor. In this mode you get a better view of the "bytes on disk" without encoding conversions. When I come across encoding problems, this is the view that I use.

Perhaps the editors you've tried don't have the second type of hex view, which I think is what you want.
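
The difference between the two views can be sketched outside any editor. The following Python example is only an analogy, not a description of how PSPad works internally: it writes é as UTF-8 to a temporary file, then reads it back once in binary mode (bytes on disk) and once in text mode (decoded code points). The helper names and the file name sample.xml are made up for the illustration.

```python
import os
import tempfile


def bytes_on_disk(path: str) -> bytes:
    """Read the file in binary mode: no encoding conversion happens."""
    with open(path, "rb") as f:
        return f.read()


def code_points_in_memory(path: str, encoding: str) -> list[str]:
    """Decode the file, then list the code point of each character in hex."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    return [f"{ord(c):04X}" for c in text]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # hypothetical file name
    with open(path, "wb") as f:
        f.write("\u00e9".encode("utf-8"))   # store e-acute as UTF-8

    raw = bytes_on_disk(path)
    decoded = code_points_in_memory(path, "utf-8")

print("bytes on disk:", raw.hex().upper())
print("code points after decoding:", decoded)
```

A view built on the decoded text shows the code point E9, while a view of the raw file shows the stored bytes, which may explain seeing E9 even when the file really is UTF-8.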

Mike Waters

References:
- UTF-8 Question: e with acute accent should require two bytes, right?
  - From: "Costello, Roger L." <costello@mitre.org>
