OASIS Mailing List Archives


   Re: [xml-dev] UTF-8 use with XML

From: "Long, Craig Z" <craig.long@eds.com>

> One of the engineers here translates the hex as: <BirthCity>Koln</BirthCity>
> is this correct? 

When looking at UTF-8 codes, there are a few easy rules you can apply for ASCII:

1) All ASCII characters (i.e. the characters on a US keyboard) are represented
by the same bytes in UTF-8 as in ASCII.  So an ASCII string has exactly the same
bytes if it is UTF-8.  

2) Moreover, there is only one way of coding those ASCII characters. So < does
not have two different encodings, one with three bytes and one with just a single
byte. *

3) Every byte that is less than 0x80 is an ASCII character. Every byte of a
multi-byte sequence is >= 0x80.  

So three bytes that are all >= 0x80 cannot be <.    
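
To make the three rules concrete, here is a minimal Python sketch. Python is
my choice for illustration, not something from the thread, and the helper name
is_valid_utf8 is made up for the example. It leans on Python's built-in strict
UTF-8 decoder, which refuses "overlong" forms such as a three-byte spelling of <.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Rule 1: an ASCII string has exactly the same bytes in UTF-8.
tag = "<BirthCity>"
same_bytes = tag.encode("ascii") == tag.encode("utf-8")

# Rule 2: there is only one encoding of '<' (the single byte 0x3C).
# E0 80 BC is the three-byte "overlong" bit pattern for 0x3C; it is not legal.
overlong_lt = bytes([0xE0, 0x80, 0xBC])
overlong_accepted = is_valid_utf8(overlong_lt)

# Rule 3: every byte of a multi-byte sequence is >= 0x80.
o_umlaut = "\u00f6".encode("utf-8")  # o with umlaut, bytes C3 B6
all_high = all(b >= 0x80 for b in o_umlaut)

print(same_bytes, overlong_accepted, all_high)  # prints: True False True
```

So if a hex dump shows three bytes >= 0x80 where the < was expected, some other
character is sitting there.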

Now it is also a little strange that the example given is Koln, not Köln. 
Has the data been transliterated (i.e. to remove umlauts)? If so, that is 
the stage that may have introduced some problems. (I would have expected the 
transliteration for Köln to be Koeln, if that is the German city.) 
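
One way to check which spelling is actually in the data is to compare the hex
dump against the UTF-8 bytes of each candidate. A small sketch, assuming the
data really is UTF-8; the helper name hex_bytes is hypothetical:

```python
def hex_bytes(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

# The three spellings discussed above; \u00f6 is o with umlaut.
candidates = ["Koln", "K\u00f6ln", "Koeln"]
dumps = {name: hex_bytes(name) for name in candidates}

for name, dump in dumps.items():
    print(f"{name!a:>10}  {dump}")
```

Only the umlaut spelling contains bytes >= 0x80 (c3 b6); the two transliterated
spellings are pure ASCII.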

Rick Jelliffe

* (However, there could be other, non-ASCII characters which look similar.
And there is also a really odd thing called "normalization" which may have some
impact too, but probably not here.)
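
For the curious, a short Python sketch of both footnote points, using the
standard unicodedata module. The fullwidth less-than sign (U+FF1C) is just one
look-alike I picked for illustration; nothing in the thread says it is the
character involved.

```python
import unicodedata

# Normalization: the same visible letter can be one code point or two.
precomposed = "\u00f6"   # o with umlaut as a single code point
decomposed = "o\u0308"   # plain o followed by a combining diaeresis
differ_raw = precomposed != decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Look-alikes: U+FF1C resembles '<' but is a different, non-ASCII character
# that takes three bytes in UTF-8, all of them >= 0x80.
fullwidth_lt = "\uff1c"
fullwidth_bytes = fullwidth_lt.encode("utf-8")
is_ascii_lt = fullwidth_lt == "<"

# Compatibility normalization (NFKC) folds it back to the ASCII '<'.
folded = unicodedata.normalize("NFKC", fullwidth_lt)

print(differ_raw, equal_after_nfc, fullwidth_bytes.hex(), is_ascii_lt, folded)
```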



Copyright 2001 XML.org. This site is hosted by OASIS