Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

On Fri, 2010-03-19 at 13:59 +1100, Greg Hunt wrote:
> Liam,
> I can assure you that I don't WANT to put these characters in.

I did put a smiley there :-)

>    What I'm asking about is the mapping from the ASCII substitution
> character to the Unicode one.


>   I suspect that the 8859 substitution character (1a) is not getting
> mapped to the (valid for XML) UTF-8 substitution character (FFFD) by
> the XML parser's transcoding.

I think that ASCII SUB isn't quite the same as Unicode Substitute:
SUB (which is also in Unicode) indicates that the following character
is from a different character set; Substitute appears to replace the
character altogether [1].

There is nothing like the SUB mechanism for XML directly, because it's
poorly defined (_which_ other character set?) and because in XML you'd
normally use named character entities in this circumstance... although
XML punts on the values of the replacement text. We thought we were
going to work on SGML-style "SDATA" entities shortly after XML was
published, more than a decade ago....

At any rate, XML does not allow such control characters.  I'd suggest
using an external tool to map them to the Unicode private use area,
using either an entity reference or a numeric character reference,
not the literal character, so that your XML is 8-bit clean and will
work in an ISO 8859-1 environment.

You could use "tr" or "sed" on a Unix or Linux system.
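
If a script is easier than "tr" or "sed", the same mapping can be
sketched in Python.  This is only an illustration: the choice of
U+E000 as the base of the private-use range and the name
escape_controls are assumptions of mine, not any established
convention, so pick whatever mapping your consumers agree on.

```python
import re

# XML 1.0 forbids the C0 control characters other than
# TAB (0x09), LF (0x0A) and CR (0x0D).
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# Assumed convention: control character N becomes U+E000 + N.
PUA_BASE = 0xE000


def escape_controls(text: str) -> str:
    """Replace forbidden control characters with numeric character
    references into the private use area, so the output contains
    no literal control or non-ASCII characters from the mapping."""
    return _FORBIDDEN.sub(
        lambda m: "&#x{:X};".format(PUA_BASE + ord(m.group())),
        text,
    )
```

Run this over the raw text before it is wrapped in markup; because the
result uses character references rather than literal characters, it
survives an ISO 8859-1 pipeline unchanged.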

> Unfortunately I don't have a development box to play with at the
> moment to work on this further.  I don't know whether I'm looking at a
> bug or correct behaviour.

I don't think software needs to change SUB in converting from UTF-8 to
ISO 8859-1, since it has the same meaning in both, so I don't think it's
a bug. I think it would probably be a mistake to convert it to
Substitute, but I'd need to delve into the Unicode report to give a
better answer.  At any rate this sort of chicanery is not expected in
XML files -- the XML answer is that you should use explicit markup.
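
For what it's worth, the behaviour can be checked with a few lines of
Python.  This is a minimal sketch using the standard library's codecs
and ElementTree parser (the helper names are mine, and other parsers
or transcoders may of course behave differently): it shows that octet
0x1A keeps the code point U+001A across both encodings, and that the
parser rejects the literal character while accepting U+FFFD.

```python
import xml.etree.ElementTree as ET


def transcode_latin1_to_utf8(data: bytes) -> bytes:
    """Decode ISO 8859-1 octets and re-encode them as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def is_well_formed(doc: str) -> bool:
    """Return True if ElementTree can parse the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# SUB is U+001A in both encodings, so transcoding leaves it alone;
# it does not turn into U+FFFD.
sub_survives = transcode_latin1_to_utf8(b"\x1a") == b"\x1a"

# A literal SUB is not a legal XML 1.0 character; U+FFFD is.
literal_sub_ok = is_well_formed("<r>\x1a</r>")
replacement_ok = is_well_formed("<r>&#xFFFD;</r>")
```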

[1] http://www.interfacebus.com/ASCII_Table.html has a short summary,
    although there's obviously a typo in the entry for SUB.
    SUB is actually a safer mechanism than shift-in/shift-out, because
    it only affects the single next character (octet).


Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org
