Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

On Fri, 2010-03-19 at 09:55 +1100, Greg Hunt wrote:
> Is a substitution character (x'1a' in many single byte character sets
> or 65533 in UTF-8) a legal character?  I have a case where x'1a'
> appears not to be legal in a document with an encoding specified as
> ISO-8859-1.

When the encoding is ISO 8859-1, individual bytes ("octets", as
standards people often say, in case someone starts making 9-bit
computers again) are read by the XML parser and mapped from
ISO 8859-1 into Unicode. Numeric character references like &#26;
are always taken as Unicode numbers, whatever the encoding.
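
To make that concrete, here is a small Python sketch (my own
illustration, not something from the thread). It assumes the standard
library's ElementTree parser; the helper names and the sample document
are made up for the example. The euro sign has no code in ISO 8859-1,
but it can still be written as a character reference because the
number is a Unicode code point.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: 0xE9 is e-acute as a raw ISO 8859-1 byte,
# &#233; is the same character as a reference, and &#8364; is the
# euro sign (U+20AC), which ISO 8859-1 itself cannot encode.
DOC = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
       b'<p>caf\xe9 &#233; &#8364;</p>')


def decode_latin1(raw: bytes) -> str:
    """Map each ISO 8859-1 byte to the Unicode code point of the same value."""
    return raw.decode("iso-8859-1")


def parse_text(doc: bytes) -> str:
    """Parse the document and return the text of the root element."""
    return ET.fromstring(doc).text


if __name__ == "__main__":
    print([hex(ord(c)) for c in decode_latin1(b"caf\xe9")])
    print([hex(ord(c)) for c in parse_text(DOC)])
```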

Having said that, as others pointed out, 0x1a (decimal 26, ASCII SUB)
is never allowed in an XML document unquoted, and you can only use
&#26; or &#x1a; in XML 1.1 -- but since its meaning is not well-defined,
you should not do this.
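
If you want to check a document's text for such characters before
handing it to a parser, something like the following Python sketch
might help. The function names are my own; the ranges are transcribed
from the Char production in XML 1.0 and the RestrictedChar production
in XML 1.1, so do check them against the specs before relying on this.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_restricted(cp: int) -> bool:
    """True if cp is an XML 1.1 RestrictedChar (reference-only)."""
    return (0x1 <= cp <= 0x8
            or 0xB <= cp <= 0xC
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F)


def find_illegal_xml10(text: str) -> list:
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [(i, ord(c)) for i, c in enumerate(text)
            if not is_xml10_char(ord(c))]


if __name__ == "__main__":
    print(find_illegal_xml10("ab\x1acd"))
```

By these rules 0x1a fails the XML 1.0 test and is a restricted
character in XML 1.1, while 65533 (U+FFFD) is an ordinary legal
character in both.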

The most common reason people think they want to do this :-) is that
they have in fact some other character set, such as one of the Windows
"code pages", using some of the characters between 0 and 32 for actual
characters, rather than as device control codes. In that case, you
need to set the encoding correctly, or to use a conversion utility
such as (on Linux) iconv.
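
Where iconv is not to hand, the same conversion can be sketched in a
few lines of Python. This is only an illustration: the function name
is made up, and windows-1252 is an assumed source encoding that you
would replace with whatever the data really uses.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if a byte has no meaning in the
    source encoding, which is itself a useful warning sign.
    """
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # Assumed sample: 0x93 and 0x94 are curly quotes in windows-1252.
    sample = b"\x93hi\x94"
    print(reencode_to_utf8(sample, "cp1252"))
    # Read as ISO 8859-1 instead, the same bytes become control codes.
    print([hex(ord(c)) for c in sample.decode("iso-8859-1")])
```

If the document carries an XML declaration with an encoding in it,
that needs updating to match after the conversion.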

The other thing that can happen is that an HTTP server sends a
charset parameter, e.g. windows-1252, but the Web browser ignores
it and does not pass it to its XML parser. The charset
parameter was originally (as Mike Kay mentioned) supposed to
override the encoding in the document, but this turns out to be
a disaster. For this reason, application/xml (which does not
allow an intermediate proxy to rewrite the data) is preferred
these days over text/xml for use with MIME in HTTP and email.
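
One way to spot this kind of mismatch is to compare the charset
parameter with the encoding named in the XML declaration. The Python
sketch below is an illustration under assumptions: the function names
are invented, the regular expression only handles ASCII-compatible
encodings (so not UTF-16), and encoding aliases are compared as plain
lower-cased strings rather than being normalized.

```python
import re
from email.message import Message

# Matches an XML declaration at the very start of the document and
# captures the encoding name; ASCII-compatible encodings only.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def http_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def declared_encoding(doc: bytes):
    """Return the encoding named in the XML declaration, or None."""
    m = _DECL.match(doc)
    return m.group(1).decode("ascii").lower() if m else None


def encodings_disagree(content_type: str, doc: bytes) -> bool:
    """True when both sources name an encoding and the names differ."""
    outer, inner = http_charset(content_type), declared_encoding(doc)
    return outer is not None and inner is not None and outer != inner


if __name__ == "__main__":
    doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
    print(encodings_disagree("text/xml; charset=windows-1252", doc))
```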

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org


