XML.org
Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

> Rick,
> Unicode tech reports 22 and 36 both describe transcoding producing both 1A
> and FFFD characters as a result of character mismatches, depending on
> context and direction.  It appears to me that 1A can be introduced when
> transcoding either into or out of Unicode, but this is not my area of
> specialisation.

> Could you point me at where the XML standard says that transcoding
> problems that result in the introduction of substitution characters into
> transcoded text should "cause processing to report an error"? I had a look
> for exactly this earlier and must have missed it.  The W3C document seems
> to leave transcoding issues to the Unicode standards.  U+FFFD is
> apparently a valid XML character so there should be no issue with
> processing it.

There are two issues:

1) What should an XML processor do when faced with a bad byte sequence?

The answer is very clear: section 4.3.3.
"It is a fatal error if an XML entity is determined (via default,
encoding declaration, or higher-level protocol) to be in a certain
encoding but contains byte sequences that are not legal in that encoding."
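
To make the difference concrete, here is a minimal Python sketch of strict
versus lenient decoding. It is only an illustration, not how any particular
XML parser is implemented; the helper name decode_entity and the sample bytes
are assumptions made up for this example.

```python
def decode_entity(raw: bytes, encoding: str) -> str:
    """Decode an entity's bytes; any illegal byte sequence is treated as fatal.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # Report the problem instead of substituting a character.
        raise ValueError(
            f"fatal error: illegal byte sequence for {encoding} "
            f"at byte offset {exc.start}"
        ) from exc


# 0xE9 is "e acute" in Latin-1, but on its own it is not legal UTF-8.
bad = b"<a>caf\xe9</a>"

# Lenient decoding silently substitutes U+FFFD, which hides the error.
lenient = bad.decode("utf-8", errors="replace")
```

Calling decode_entity(bad, "utf-8") raises, which matches the "fatal error"
wording quoted above; the lenient variant yields text containing U+FFFD with
no sign that anything went wrong.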

2) Is the character FFFD allowed in data?

Again, the answer is very clear: section 2.2.

[2]   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
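
The production above can be transcribed directly into a small check. This is
a sketch in Python; the function name is_xml_char is an assumption for
illustration, and it covers only the XML 1.0 Char production as quoted.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production [2]."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


fffd_allowed = is_xml_char("\ufffd")  # U+FFFD REPLACEMENT CHARACTER
sub_allowed = is_xml_char("\x1a")     # U+001A SUBSTITUTE
```

Under this production U+FFFD is a legal character, while U+001A falls
outside every range listed.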

So, as I said, XML can have U+FFFD in data, but it should not be put there
by a transcoder to cover for a bad byte sequence.

So I don't think it is correct behaviour to fall back to any character,
including U+FFFD, especially silently. Silent failure undercuts the XML
approach.

(I will modify this: however, an implementation could choose to put in a
SUB or FFFD or any other signal anywhere it likes, as long as it is clear
that the DOM or stream or whatever is not WF XML and there has been a
fatal error. But this is not something "allowed" by XML or Unicode,
because by this stage you don't have XML.)
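
One way to picture that qualification, as a rough Python sketch: the decoder
may hand back text with U+FFFD substituted, but only alongside an explicit
flag saying a fatal error occurred. The DecodeResult type and the
decode_for_diagnostics function are hypothetical names invented for this
example, not part of any XML API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodeResult:
    text: str                    # decoded text, possibly with U+FFFD inserted
    well_formed: bool            # False once a fatal encoding error was seen
    error_offset: Optional[int]  # byte offset of the first bad sequence


def decode_for_diagnostics(raw: bytes, encoding: str) -> DecodeResult:
    """Decode raw; on a bad byte sequence, substitute but record the failure."""
    try:
        return DecodeResult(raw.decode(encoding), True, None)
    except UnicodeDecodeError as exc:
        # The substituted text is for diagnostics only; it is not XML any more.
        salvaged = raw.decode(encoding, errors="replace")
        return DecodeResult(salvaged, False, exc.start)


result = decode_for_diagnostics(b"<a>caf\xe9</a>", "utf-8")
```

The point of the flag is that a caller cannot mistake the salvaged text for
the output of a successful parse.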

On the issue of what to do if you are using some magical encoding that has
characters that are not in Unicode: it is a really specialist topic and
should not be confused with the general case. (There are a few CJK
dictionary character repertoires which have more characters than Unicode,
for example. However, these are not in any off-the-shelf transcoders, so
that is not the case here.)

Cheers
Rick Jelliffe



Copyright 1993-2007 XML.org. This site is hosted by OASIS