OASIS Mailing List Archives


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]
Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

I can assure you that I don't WANT to put these characters in.   What I'm asking about is the mapping from the ASCII substitution character to the Unicode one.

The issue seems to be that the collection of systems I'm involved with takes UTF-8 at one point and later converts it to 8859-1 when the data arrives at a system from an earlier generation of technology, which then wants to parse the 8859-1 string as XML.  Some characters end up with a value of hex 1a on the way into 8859-1 encoding (they are converted to the substitution character for the character set).  In practice these are dashes and some kinds of quotes, but potentially anything in a non-Latin character set gets turned into 1a.  Putting an encoding specification of 8859-1 on the XML does not appear to allow it to parse.  I suspect that the 8859-1 substitution character (1a) is not getting mapped to the (valid for XML) Unicode replacement character (U+FFFD) by the XML parser's transcoding.
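
A minimal sketch of the lossy conversion described above, in Python. The real transcoder in this pipeline is unknown, so the error handler name `ascii_sub` and the helper `to_latin1_lossy` are hypothetical stand-ins for whatever emits hex 1a. It illustrates that decoding the result as 8859-1 yields U+001A, not U+FFFD, because 8859-1 maps each byte to the code point with the same number.

```python
import codecs


def substitute_with_sub(err):
    """Hypothetical handler: emit SUB (0x1A) for each unencodable character."""
    if isinstance(err, UnicodeEncodeError):
        return ("\x1a" * (err.end - err.start), err.end)
    raise err


# "ascii_sub" is a name made up for this sketch, not a built-in handler.
codecs.register_error("ascii_sub", substitute_with_sub)


def to_latin1_lossy(utf8_bytes):
    """Convert UTF-8 bytes to ISO-8859-1, replacing unmappable characters with SUB."""
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-1", errors="ascii_sub")


# An en dash (U+2013) has no ISO-8859-1 equivalent.
original = "<p>a \u2013 b</p>".encode("utf-8")
converted = to_latin1_lossy(original)

# Decoding as ISO-8859-1 is a byte-for-byte mapping to code points 0-255.
round_trip = converted.decode("iso-8859-1")
```

Here `converted` contains the byte 0x1a where the dash was, and `round_trip` contains U+001A at that position, which is the character the XML parser then sees.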

The XML spec does not appear to address this specific issue, but I may have missed something.   It looks like the Unicode substitution character is legal but substitution characters from other character sets are not guaranteed to map to the Unicode one when the text is converted.

Unfortunately I don't have a development box to play with at the moment to work on this further.  I don't know whether I'm looking at a bug or correct behaviour.


On Fri, Mar 19, 2010 at 11:25 AM, Liam R E Quin <liam@w3.org> wrote:
On Fri, 2010-03-19 at 09:55 +1100, Greg Hunt wrote:
> Is a substitution character (x'1a' in many single byte character sets
> or 65533 in UTF-8) a legal character?  I have a case where x'1a'
> appears not to be legal in a document with an encoding specified as
> ISO-8859-1.

When the encoding is ISO 8859-1, individual bytes ("octets" as
standards people often say, in case someone starts making 9-bit
computers again) are read by the XML parser
and mapped from ISO 8859-1 into Unicode. Numeric character
references like &#x1a; are always taken as Unicode numbers.

Having said that, as others pointed out, 0x1a (decimal 26, ASCII SUB)
is never allowed as a literal character in an XML document, and you can
only use &#26; or &#x1a; in XML 1.1 -- but since its meaning is not
well-defined, you should not do this.
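
A small check of the XML 1.0 side of this, sketched in Python with the standard library's expat-based parser. expat implements XML 1.0 only, so the XML 1.1 case is not exercised here; the helper name `is_well_formed` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


DECL = b'<?xml version="1.0" encoding="ISO-8859-1"?>'

# The raw SUB byte, as produced by a lossy transcoder.
literal_sub = DECL + b"<p>a \x1a b</p>"

# The same character written as a numeric character reference.
char_ref_sub = DECL + b"<p>a &#x1a; b</p>"

# U+FFFD is a legal XML character, so a reference to it parses.
replacement = DECL + b"<p>a &#xFFFD; b</p>"

results = {
    "literal": is_well_formed(literal_sub),
    "char_ref": is_well_formed(char_ref_sub),
    "replacement": is_well_formed(replacement),
}
```

Under these assumptions only the third document is accepted, which suggests that if a placeholder is needed at all, the transcoding step would have to emit U+FFFD (for example as `&#xFFFD;`) instead of the byte 0x1a.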

The most common reason people think they want to do this :-) is that
they have in fact some other character set, such as one of the Windows
"code pages", using some of the characters between 0 and 32 for actual
characters, rather than as device control codes. In that case, you
need to set the encoding correctly, or to use a conversion utility
such as (on Linux) iconv.
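
As a rough Python equivalent of the conversion step, assuming for illustration that the data is really windows-1252 (the helper name `transcode` and the sample bytes are hypothetical); on the command line something like `iconv -f WINDOWS-1252 -t UTF-8` does the same job.

```python
import xml.etree.ElementTree as ET


def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes using their true encoding, then re-encode them."""
    return data.decode(source_encoding).encode(target_encoding)


# In windows-1252, byte 0x96 is an en dash (U+2013).
cp1252_bytes = b"<p>a \x96 b</p>"

# Mislabelled as ISO-8859-1, the same byte would decode to the control U+0096.
mislabelled = cp1252_bytes.decode("iso-8859-1")

# Decoded with the correct encoding, the dash survives the trip to UTF-8.
utf8_bytes = transcode(cp1252_bytes, "cp1252")

# With no encoding declaration, an XML parser assumes UTF-8 here.
element = ET.fromstring(utf8_bytes)
```

After the conversion `element.text` holds a real en dash, so nothing needs to be substituted.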

The other thing that can happen is that an HTTP server sends a
charset parameter e.g. of windows-1252, but the Web browser ignores
this, and does not pass it to its XML parser. The charset
parameter was originally (as Mike Kay mentioned) supposed to
override the encoding in the document, but this turns out to be
a disaster. For this reason, application/xml (which does not
allow an intermediate proxy to rewrite the data) is preferred
these days over text/xml for use with MIME in HTTP and email.


Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org


Copyright 1993-2007 XML.org. This site is hosted by OASIS