From: "Long, Craig Z" <craig.long@eds.com>
> One of the engineers here translates the hex as: <BirthCity>Koln</BirthCity>
> is this correct?
When looking at UTF-8 codes, there are a few easy rules you can apply for ASCII:
1) All ASCII characters (i.e. the characters on a US keyboard) are represented
by the same bytes in UTF-8 as in ASCII. So an ASCII string has exactly the same
bytes if it is UTF-8.
2) Moreover, there is only one way of coding those ASCII characters. So < does
not have two different encodings, one with three bytes and one with just a single
byte. *
3) Every byte that is less than 0x80 is an ASCII character. Multi-byte code
sequences have all their bytes >= 0x80.
So three bytes that are all greater than 0x7F cannot be <.
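If you want to check these rules for yourself, here is a minimal sketch using Python's built-in UTF-8 codec. The sample string is just the translation quoted above, the euro sign is an arbitrary three-byte character I picked, and the bytes E0 80 BC are a hypothetical padded three-byte form of < that a conforming decoder should refuse.

```python
# Sketch only: checks the three ASCII rules with Python's built-in codecs.
ascii_text = "<BirthCity>Koln</BirthCity>"

# Rule 1: pure ASCII text has identical bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Rule 2: '<' has only the single-byte form 0x3C.
# E0 80 BC is a hypothetical padded three-byte form; the decoder rejects it.
overlong_lt = bytes([0xE0, 0x80, 0xBC])
try:
    overlong_lt.decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False

# Rule 3: every byte of a multi-byte sequence is >= 0x80.
euro = "\u20ac".encode("utf-8")  # euro sign, three bytes: E2 82 AC
all_high = all(b >= 0x80 for b in euro)

print(same_bytes, overlong_accepted, all_high)
```

The practical upshot is that when reading a hex dump by eye, any byte below 0x80 can be read straight off an ASCII table, and any run of bytes at or above 0x80 must be something other than ASCII.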
Now it is also a little strange that the example given is Koln, not Köln.
Has the data been transliterated (i.e. to remove umlauts)? If so, that is
the stage that may have introduced some problems. (I would have expected the
transliteration for Köln to be Koeln, if that is the German city.)
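For comparison against the hex you have, this small sketch prints the UTF-8 bytes of the three spellings. It assumes the data really is UTF-8 and that ö is stored as the single precomposed character U+00F6; the name hex_of is just mine for illustration.

```python
# Sketch only: UTF-8 bytes for the three candidate spellings of the city name.
# Assumes the precomposed o-umlaut (U+00F6), written as an escape for safety.
spellings = ("Koln", "K\u00f6ln", "Koeln")

# Map each spelling to its UTF-8 bytes as space-separated hex (Python 3.8+).
hex_of = {name: name.encode("utf-8").hex(" ") for name in spellings}

for name, dump in hex_of.items():
    print(name, dump)
```

The umlaut form is the only one with bytes at or above 0x80: ö becomes the two bytes C3 B6, while Koln and Koeln stay plain ASCII.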
Cheers
Rick Jelliffe
* (However, there could be other, non-ASCII characters which look similar.
And there is also a really odd thing called "normalization" which may have some
impact too, but probably not here.)