Re: [xml-dev] There is a serious amount of character encoding conversions occurring inside our computers and on the Web
- From: Michael Sokolov <sokolov@ifactory.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Fri, 28 Dec 2012 11:34:52 -0500
On 12/28/2012 9:01 AM, Costello, Roger L. wrote:
> How did it find a match?
>
> The underlying byte sequence for the iso-8859-1 López is: 4C F3 70 65 7A (one byte -- F3 -- is used to encode ó).
>
> The underlying byte sequence for the UTF-8 López is: 4C C3 B3 70 65 7A (two bytes -- C3 B3 -- are used to encode ó).
>
> The search application cannot be doing a byte-for-byte match, else it would find no match.
>
> The Unicode codepoint for the ó character is U+00F3.
>
> Hey, iso-8859-1 uses F3 to encode ó.
>
> So perhaps the search application is converting the UTF-8 bytes to codepoints and then comparing those codepoints to the iso-8859-1 bytes. That would result in a match.
>
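[The codepoint-comparison idea Roger describes can be sketched in a few lines of Python -- a minimal illustration using the byte sequences from his message, not the actual logic of any particular search application:]

```python
# "López" as raw bytes in the two encodings quoted above.
latin1_bytes = bytes([0x4C, 0xF3, 0x70, 0x65, 0x7A])        # ISO-8859-1
utf8_bytes   = bytes([0x4C, 0xC3, 0xB3, 0x70, 0x65, 0x7A])  # UTF-8

# A byte-for-byte comparison finds no match:
assert latin1_bytes != utf8_bytes

# Decoding each sequence to Unicode codepoints makes them identical:
latin1_cps = [ord(c) for c in latin1_bytes.decode("iso-8859-1")]
utf8_cps   = [ord(c) for c in utf8_bytes.decode("utf-8")]
assert latin1_cps == utf8_cps == [0x4C, 0xF3, 0x70, 0x65, 0x7A]
```

[Note that the decoded codepoint for ó, 0xF3, equals its ISO-8859-1 byte value, which is why comparing codepoints against raw ISO-8859-1 bytes would also happen to match.]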
One point of comparison: Lucene used to use Java characters internally
(which are much like UTF-16), and now uses UTF-8 internally (not
codepoints). I think it's unlikely that your search application is
using iso-8859-1 internally, although it might be using codepoints, as
you suggest. Of course it's no accident that each iso-8859-1 byte value
equals the corresponding Unicode codepoint; that was one sensible thing
done by the character encoding gurus.
-Mike