OASIS Mailing List Archives

Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

Unicode tech reports 22 and 36 both describe transcoding producing both 1A and FFFD characters as a result of character mismatches, depending on context and direction. It appears to me that 1A can be introduced when transcoding either into or out of Unicode, but this is not my area of specialisation.

Could you point me at where the XML standard says that transcoding problems that result in the introduction of substitution characters into transcoded text should "cause processing to report an error"? I had a look for exactly this earlier and must have missed it. The W3C document seems to leave transcoding issues to the Unicode standards. U+FFFD is apparently a valid XML character, so there should be no issue with processing it.
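
To make the "valid XML character" point concrete, here is a minimal sketch in Python that tests a code point against the Char production of XML 1.0 (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The function name is_xml10_char is my own, chosen for illustration; it is not part of any XML library.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # BMP below the surrogate block
        or 0xE000 <= cp <= 0xFFFD        # rest of the BMP, excluding FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


print(is_xml10_char(0xFFFD))  # True: REPLACEMENT CHARACTER is a legal Char
print(is_xml10_char(0x1A))    # False: SUB is a disallowed control character
```

So under XML 1.0 a literal U+001A in a document is itself an error, whereas U+FFFD passes the character check and would have to be caught, if at all, at the transcoding stage.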


On Fri, Mar 19, 2010 at 6:17 PM, Rick Jelliffe <rjelliffe@allette.com.au> wrote:
On Roger's initial question about an XML processor failing to report a non-ASCII code sequence, this is not at all impossible. In fact, most transcoders that were made before XML, or independently of consideration of XML's requirements, do not report wrong codes unless they get seriously into trouble. They may substitute some bogus character, or strip out the character, perhaps silently; sometimes they will actually use the default encoding of the platform (if it is an ASCII superset at the encoding level).
These kinds of transcoders are not sufficient for use in XML well-formedness (WF) detection. The general character set infrastructure of our software systems started off broken, and it is only by taking care that anything will work in this area: the standards must have good enough policies, the users must implement these policies in their markup/configuration, the transcoder libraries must be chosen to implement the policies, and other sources of information about bad encodings (e.g. the presence of disallowed control characters) must be utilized to try to fill in any gaps. The world is full of programmers determined to remain ignorant of basic working knowledge of character encoding issues and to complicate the life of people downstream.
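
As a rough illustration of the difference between a lenient transcoder and one that reports wrong codes, here is a small Python sketch using the standard bytes.decode error handlers. The sample bytes and the helper name decode_strict are assumptions made up for the example.

```python
# ISO-8859-1 bytes for "café", wrongly labelled as UTF-8 (hypothetical input).
data = b"caf\xe9"

# Lenient behaviours: the bad byte is replaced or dropped without any report.
lenient = data.decode("utf-8", errors="replace")   # "caf" + U+FFFD
stripped = data.decode("utf-8", errors="ignore")   # "caf"


def decode_strict(raw: bytes, encoding: str) -> str:
    """Decode raw, turning any invalid byte sequence into a reported error."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid {encoding}: {exc}") from exc


try:
    decode_strict(data, "utf-8")
except ValueError as err:
    print("rejected:", err)
```

Only the strict path gives the caller a chance to treat the mismatch as fatal; the other two produce text that looks plausible and hides the problem.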

On Greg's question about the ASCII SUB character: this is a control character intended to be used for transmission-level problems. The encoding relates to signals on wires when transmitting ASCII, not to transcoding mismatches, as I understand it. (The Wikipedia entry incorrectly states that this is to be used for signalling that the following character needs to have its 5th bit inverted as an escape. I think this may be the EBCDIC operation? Anyway, see
http://www.itscj.ipsj.or.jp/ISO-IR/001.pdf )

The correct Unicode character would not be U+001A SUB but U+FFFD REPLACEMENT CHARACTER. However, because of XML's rules, transcoding errors should cause processing to report an error. So if U+FFFD were to appear in a WF document, it should only be because there was some pre-existing text which had that character in it and was then marked up: in other words, the data correctly contains the REPLACEMENT CHARACTER due to some prior flaw. (See http://www.unicode.org/versions/Unicode5.2.0/ch16.pdf and search for FFFD.)

Note that Unicode does not define semantics for SUB and other control characters, but defers to implementations and other standards, such as ISO/IEC 6429:1992. You can see from the front matter at http://webstore.iec.ch/preview/info_isoiec6429%7Bed3.0%7Den.pdf that the scope of that standard (page 1) is intended to be used "in particular with character-imaging devices": think a Teletype printer's BEL and BS and, by a stretch, a modem's XON/XOFF flow control. It isn't for use in data exchange as part of the data, but for simple transmission protocols underneath the data.

Finally, Greg should note that the correct transcoding from UTF-8 to ISO 8859-1 is not to use any substitution characters, but (1) to replace the character with a numeric character reference when the item is in data content, and (2) to fail when the character is in markup. If you need more detailed transcoding than that, then it is not something that XML processors will provide, and you will have to make your own preprocessor.
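
The two-part rule above can be sketched in Python with the standard "xmlcharrefreplace" and "strict" error handlers of str.encode. The function serialize_element is a made-up helper for illustration only, and it assumes the content string already has & and < escaped; a real serializer would also have to handle attributes, comments, CDATA sections and so on.

```python
def serialize_element(name: str, content: str,
                      encoding: str = "iso-8859-1") -> bytes:
    """Encode one element, failing on unencodable characters in markup
    and using numeric character references in data content."""
    # Markup: a name cannot contain a character reference, so fail loudly
    # (raises UnicodeEncodeError if the name does not fit the encoding).
    tag = name.encode(encoding, errors="strict")
    # Data content: unencodable characters become &#NNNN; references.
    body = content.encode(encoding, errors="xmlcharrefreplace")
    return b"<" + tag + b">" + body + b"</" + tag + b">"


print(serialize_element("price", "10 \u20ac"))  # b'<price>10 &#8364;</price>'

try:
    serialize_element("pri\u20acce", "10")
except UnicodeEncodeError as err:
    print("cannot encode markup:", err.reason)
```

No substitution character is ever written: the data survives as a reference, and a name that cannot be represented stops the conversion.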

Now there have been formats that mix multiple character sets: indeed, RTF allows sections in different embedded encodings. The result is that if you want to use a text editor, it must be 8-bit clean (not do any transcoding), and you have to change the screen encoding to view different sections correctly. XML did not take this route.

Rick Jelliffe

P.S. The most common transcoding error I used to see is where there is a UTF-8 data stream and someone puts in the byte 0xA0, intending it to be the non-breaking space character. More common now is where there is a UTF-8 stream that has the UTF-16 Byte Order Mark converted to UTF-8 rather than stripped (this is not so much a code error as an operational error).
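
Both cases in the P.S. are easy to reproduce. This is a minimal Python sketch with made-up sample bytes; the "utf-8-sig" codec used at the end is Python's standard codec that drops a leading UTF-8-encoded BOM.

```python
# Case 1: a lone 0xA0 byte is NBSP in ISO-8859-1 but is not valid UTF-8.
bad = b"a\xa0b"
try:
    bad.decode("utf-8")
    lone_a0_accepted = True
except UnicodeDecodeError:
    lone_a0_accepted = False

# The correct UTF-8 encoding of U+00A0 is the two bytes C2 A0.
good = "a\u00a0b".encode("utf-8")        # b'a\xc2\xa0b'

# Case 2: a Byte Order Mark carried over into UTF-8 as EF BB BF.
with_bom = b"\xef\xbb\xbf<doc/>"
kept = with_bom.decode("utf-8")          # U+FEFF stays in the text
dropped = with_bom.decode("utf-8-sig")   # leading BOM is removed

print(lone_a0_accepted)        # False
print(kept.startswith("\ufeff"), dropped)
```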


XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

