On Tue, 2004-03-02 at 21:35, Andy Greener wrote:
> I'd appreciate some advice on the following issues...
> Being from the UK, we have a requirement to convey the UK pound-sterling
> character in XML documents (and validate those documents of course).
> The Unicode decimal value of pound sterling is 163 (0xA3), but of course
> the UTF-8 encoding is 0xC2A3.
> I'm ok with the fact that a UTF-8 encoded instance doc can contain the
> above two byte values directly (i.e. 0xC2 and 0xA3), but I'm getting
> conflicting opinion as to whether replacing those two bytes with the
> character entity &#163; is equivalent or not - I think not, so long as
> the document is UTF-8 encoded, though it would be correct to do this
You can always enter a character entity, no matter what the encoding.
However, an XML application, a parser, XML editor, or whatever, is free
to change the character entity to the equivalent UTF-8 value.
The character entity and the encoded value are considered equivalent
from the point of view of an XML application. They are both
representations of the same Unicode code point, i.e. 0xA3 in your case.
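A quick sketch of this equivalence, using Python's standard-library ElementTree parser (the element name `price` is just for illustration): the same document, written once with a numeric character reference and once with the raw UTF-8 bytes, parses to the identical string.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same document: a numeric character
# reference vs. the raw UTF-8 bytes 0xC2 0xA3.
doc_entity = b'<?xml version="1.0" encoding="UTF-8"?><price>&#163;5</price>'
doc_utf8 = b'<?xml version="1.0" encoding="UTF-8"?><price>\xc2\xa35</price>'

text_entity = ET.fromstring(doc_entity).text
text_utf8 = ET.fromstring(doc_utf8).text

# Both parse to the identical Unicode string.
print(text_entity == text_utf8)  # True
print(text_entity)               # £5
```

Any conforming parser should behave the same way: by the time your application sees the text, the distinction is gone.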
> if the encoding were "ISO-8859-1", as would inserting the actual pound
> character (ie the 8 bit value equivalent to 0xA3). However, I'm happy to
> be corrected.
If you use ISO-8859-1 then you must use character entities to represent
Unicode characters that are not present in the ISO-8859-1 character set.
If you use UTF-8 you can still enter character entities, but it is very
likely that your application will convert them to UTF-8 encoded values
at some point.
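You can see a serializer make that call for you. A sketch with Python's ElementTree (the element name `note` is invented for the example): when writing out as ISO-8859-1, a character the encoding can represent (the pound sign) is written as a raw byte, while one it cannot (the euro sign) is automatically escaped as a character reference.

```python
import xml.etree.ElementTree as ET

root = ET.Element("note")
# The pound sign (U+00A3) exists in ISO-8859-1;
# the euro sign (U+20AC) does not.
root.text = "\u00a3 and \u20ac"

out = ET.tostring(root, encoding="iso-8859-1")
print(out)
# The pound appears as the single byte 0xA3, while the
# euro is escaped as the character reference &#8364;.
```

The same document serialized as UTF-8 would contain neither reference, since UTF-8 can encode both characters directly.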
From your point of view, it should not matter how a character is
represented. As long as all the different pieces of software in your
application play by the rules, that is. They don't always do that...
> I guess the fundamental question is: how are character entities
> interpreted in relation to the document encoding (i.e. what's the
> order of evaluation)? If that's not the fundamental question then
> I'm missing something :-))
Character entities aren't interpreted in relation to the document
encoding, really. Character entities are mapped to Unicode code points.
Characters encoded in UTF-8 and other Unicode encodings are also mapped
to Unicode code points. This means that there is a relationship, of
course, but it is not a direct one. Rather, character entities and UTF-X
encoded characters share a common frame of reference, i.e. the Unicode
code points.
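That shared frame of reference can be made concrete in a couple of lines of Python: resolving the reference &#163; (which is what a parser does) and decoding the UTF-8 byte pair both arrive at the same code point.

```python
# The numeric character reference &#163; denotes the code point
# U+00A3; UTF-8 encodes that same code point as the byte pair
# 0xC2 0xA3. Both roads lead to the same character.
pound_from_entity = chr(163)                  # what a parser does with &#163;
pound_from_utf8 = b"\xc2\xa3".decode("utf-8")

print(pound_from_entity == pound_from_utf8)   # True
print(hex(ord(pound_from_utf8)))              # 0xa3
```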
I think a more important question is: why do you care? When your
application uses the two different representations of a pound sign, in
what way does the behavior of your application differ?
Provided that the applications you use work as they should, there should
not be any difference. You shouldn't have to bother with details of how
the pound sign, or any other Unicode character, is represented. It is
the domain of parsers, XML editors and other tools.
I have encountered two common causes of problems with Unicode:
The first is applications that are broken in some way. For example,
MSXML does not always play nice with encodings. (Or didn't. I haven't
used it in a while.)
The other cause is people trying to fix what isn't broken. When getting
a first look at Unicode it is quite common to reel back in shock at the
complexity, and assume that something that looks like that just has to
be broken. The natural reaction is then to try and fix it. The result is
that problems are created. I've been down that route myself, so I have a
great deal of sympathy for others who react the same way.
> A supplementary question: if I want to validate text containing pound
> sterling characters, and my Schemas are UTF-8 encoded, what do I put in
> the pattern facet: &#163; or the two-character UTF-8 encoding? And what
> will your average regular expression evaluator make of the latter?
It does not matter. Whatever your XML editor puts there is fine. If it
isn't, your editor is buggy.
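To illustrate why it doesn't matter (a sketch with Python's `re` module rather than an actual schema validator, and with an invented `amount` element): by the time a pattern facet is applied, the content has already been decoded to Unicode, so a pattern containing the pound character matches regardless of how the instance document spelled it.

```python
import re
import xml.etree.ElementTree as ET

# A pattern-facet-style regex for a pound amount. Once the document
# is parsed, the text is plain Unicode, so the regex never sees
# UTF-8 bytes or character references.
pattern = re.compile(r"\u00a3\d+(\.\d{2})?$")

matches = []
for doc in (b"<amount>&#163;12.50</amount>",     # character reference
            b"<amount>\xc2\xa312.50</amount>"):  # raw UTF-8 bytes
    text = ET.fromstring(doc).text
    matches.append(bool(pattern.match(text)))

print(matches)  # [True, True]
```

A schema validator works the same way: it sees code points, not the bytes or escapes of the source document.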