xml-dev - Re: [xml-dev] UTF-8 use with XML



To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] UTF-8 use with XML
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Sun, 15 Jun 2003 21:28:20 +1000
References: <1093E8BAEB6AD411997300508BDF0A7C14CA56EE@USHEM204>

From: "Long, Craig Z" <craig.long@eds.com>

> One of the engineers here translates the hex as: <BirthCity>Koln</BirthCity>
> is this correct? 

When looking at UTF-8 codes, there are a few easy rules you can apply for ASCII:

1) All ASCII characters (i.e. the characters on a US keyboard) are represented
by the same bytes in UTF-8 as in ASCII.  So an ASCII string has exactly the same
bytes if it is UTF-8.  

2) Moreover, there is only one way of coding those ASCII characters. So < does
not have two different encodings, one with three bytes and one with just a single
byte. *

3) Every byte that is less than 0x80 is an ASCII character. Multi-byte code
sequences have all their bytes >= 0x80.  

So three bytes that are all 0x80 or greater are not <.    
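
The three rules can be checked with a short Python sketch. This is only an
illustration: the helper name can_be_ascii_char and the sample strings are
made up for this example, and it relies on Python's built-in strict UTF-8
decoder rejecting overlong forms.

```python
def can_be_ascii_char(seq: bytes) -> bool:
    """Hypothetical helper: only a single byte below 0x80 can be an ASCII character."""
    return len(seq) == 1 and seq[0] < 0x80


# Rule 1: an ASCII string has the same bytes in ASCII and in UTF-8.
sample = "<BirthCity>Koln</BirthCity>"
same_bytes = sample.encode("ascii") == sample.encode("utf-8")

# Rule 2: '<' is only ever the single byte 0x3C. The "overlong" two- and
# three-byte forms of 0x3C are not valid UTF-8, so a strict decoder rejects them.
overlong_forms = [b"\xc0\xbc", b"\xe0\x80\xbc"]
rejected = []
for raw in overlong_forms:
    try:
        raw.decode("utf-8")
        rejected.append(False)
    except UnicodeDecodeError:
        rejected.append(True)

# Rule 3: in a multi-byte sequence every byte is >= 0x80.
koeln = "K\u00f6ln".encode("utf-8")          # o-umlaut becomes 0xC3 0xB6
high_bytes = [b for b in koeln if b >= 0x80]

print(same_bytes, rejected, koeln, [hex(b) for b in high_bytes])
print(can_be_ascii_char(b"<"), can_be_ascii_char(b"\xef\xbc\x9c"))
```

So if the hex dump shows three bytes that are all 0x80 or above where a < was
expected, the helper above would report that they cannot be an ASCII <.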

Now it is also a little strange that the example given is Koln, not Köln. 
Has the data been transliterated (i.e. to remove umlauts)? If so, that is 
the stage that may have introduced some problems. (I would have expected the 
transliteration for Köln to be Koeln, if that is the German city.) 


Cheers
Rick Jelliffe

* (However, there could be other, non-ASCII characters which look similar.
And there is also a really odd thing called "normalization" which may have some
impact too, but probably not here.)
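
To make the footnote concrete, here is a small Python sketch using the
standard unicodedata module. The choice of examples is mine, not taken from
the original data: U+FF1C (FULLWIDTH LESS-THAN SIGN) stands in for a
look-alike character, and Köln stands in for a string with two normalization
forms.

```python
import unicodedata

# A look-alike: FULLWIDTH LESS-THAN SIGN is a different character from '<'
# and takes three bytes in UTF-8, all of them >= 0x80.
fullwidth_lt = "\uff1c"
fullwidth_bytes = fullwidth_lt.encode("utf-8")      # b'\xef\xbc\x9c'

# Compatibility normalization (NFKC) folds it to the ASCII '<'.
folded = unicodedata.normalize("NFKC", fullwidth_lt)

# Normalization forms: o-umlaut as one code point (NFC) versus
# 'o' followed by COMBINING DIAERESIS U+0308 (NFD).
composed = "K\u00f6ln"
decomposed = unicodedata.normalize("NFD", composed)

print(fullwidth_bytes, folded == "<")
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two forms of Köln display the same but have different byte sequences, which
is why normalization can matter when comparing hex dumps. Plain ASCII text is
unchanged by any of the normalization forms.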
