Re: [xml-dev] There is a serious amount of character encoding conversions occurring inside our computers and on the Web
- From: Michael Sokolov <sokolov@ifactory.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Fri, 28 Dec 2012 11:34:52 -0500
On 12/28/2012 9:01 AM, Costello, Roger L. wrote:
> How did it find a match?
>
> The underlying byte sequence for the iso-8859-1 López is: 4C F3 70 65 7A (one byte -- F3 -- is used to encode ó).
>
> The underlying byte sequence for the UTF-8 López is: 4C C3 B3 70 65 7A (two bytes -- C3 B3 -- are used to encode ó).
>
> The search application cannot be doing a byte-for-byte match, else it would find no match.
>
> The Unicode codepoint for the ó character is U+00F3.
>
> Hey, iso-8859-1 uses F3 to encode ó.
>
> So perhaps the search application is converting the UTF-8 bytes to codepoints and then comparing those codepoints to the iso-8859-1 bytes. That would result in a match.
>
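[The codepoint-comparison idea Roger describes can be sketched in a few lines of Python -- a minimal illustration using the byte sequences from his message, not the actual logic of any particular search application:]

```python
# "López" as raw bytes in the two encodings quoted above.
latin1_bytes = bytes([0x4C, 0xF3, 0x70, 0x65, 0x7A])        # ISO-8859-1
utf8_bytes   = bytes([0x4C, 0xC3, 0xB3, 0x70, 0x65, 0x7A])  # UTF-8

# A byte-for-byte comparison finds no match:
assert latin1_bytes != utf8_bytes

# Decoding each sequence to Unicode codepoints makes them identical:
latin1_cps = [ord(c) for c in latin1_bytes.decode("iso-8859-1")]
utf8_cps   = [ord(c) for c in utf8_bytes.decode("utf-8")]
assert latin1_cps == utf8_cps == [0x4C, 0xF3, 0x70, 0x65, 0x7A]
```

[Note that the decoded codepoint for ó, 0xF3, equals its ISO-8859-1 byte value, which is why comparing codepoints against raw ISO-8859-1 bytes would also happen to match.]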
One point of comparison: Lucene used to use Java characters internally
(which are much like UTF-16), and now uses UTF-8 internally (not
codepoints). I think it's unlikely that your search application is
using iso-8859-1 internally, although it might be using codepoints, as
you suggest. Of course it's no accident that each iso-8859-1 byte value
equals the corresponding Unicode codepoint; that was one sensible thing
done by the character encoding gurus.
-Mike