Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

From: Rick Jelliffe <rjelliffe@allette.com.au>
To: xml-dev@lists.xml.org
Date: Fri, 19 Mar 2010 18:17:37 +1100

On Roger's initial question about an XML processor failing to report a 
non-ASCII code sequence, this is not at all impossible. In fact, most 
transcoders that were made before XML, or independently of any 
consideration of XML's requirements, do not report wrong codes unless 
they get seriously into trouble. They may substitute some bogus 
character, or silently strip the character out; sometimes they will 
actually fall back to the default encoding of the platform (if it is an 
ASCII superset at the encoding level).
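
As an illustration of these behaviours, here is a minimal sketch using 
Python's built-in codecs. Python is only a convenient stand-in here, 
and the sample bytes are made up for the example. The same bytes are 
decoded leniently and strictly:

```python
# The bytes claim to be UTF-8, but 0xE9 is a lone ISO 8859-1 "e acute".
data = b"caf\xe9"

# Lenient modes hide the problem from the caller.
replaced = data.decode("utf-8", errors="replace")  # substitutes U+FFFD
stripped = data.decode("utf-8", errors="ignore")   # silently drops the byte

# Strict mode raises, which is what a well-formedness check needs.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Only the strict call gives the caller a chance to report the bad byte; 
the other two return a string that looks perfectly healthy.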

Transcoders of this kind are not sufficient for detecting XML 
well-formedness (WF) errors.  The general character set infrastructure 
of our software 
systems started off broken and it is only by taking care that anything 
will work in this area: the standards must have good enough policies, 
the users must implement these policies in their markup/configuration, 
the transcoder libraries must be chosen to implement the policies, and 
other sources of information about bad encodings (e.g. the presence of 
disallowed control characters) must be utilized to try to fill in any 
gaps. The world is full of programmers determined to remain ignorant of 
basic working knowledge of character encoding issues and to complicate 
the life of people downstream.
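
One of the "other sources of information" mentioned above, the presence 
of disallowed control characters, can be checked after decoding. The 
following is a rough sketch in Python; the function name 
find_disallowed is hypothetical, and the character class is the 
complement of the Char production in XML 1.0:

```python
import re

# Anything outside XML 1.0 Char:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_disallowed(text):
    """Return (offset, code point) pairs for characters outside XML 1.0 Char."""
    return [(m.start(), ord(m.group())) for m in _NOT_XML_CHAR.finditer(text)]

# U+001A (SUB) and U+0000 are both outside Char.
hits = find_disallowed("a\x1ab\x00")
```

A non-empty result on text that is supposed to be XML is a strong hint 
that something upstream mis-decoded or mangled the data.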

On Greg's question about the ASCII SUB character: this is a control 
character intended to be used for transmission-level problems: the 
encoding relates to signals on wires when transmitting ASCII, not to 
transcoding mismatches, as I understand it. (The Wikipedia entry 
incorrectly states that it is used to signal that the following 
character needs to have its 5th bit inverted, as an escape mechanism. I 
think this may be the EBCDIC operation?  Anyway, see
 http://www.itscj.ipsj.or.jp/ISO-IR/001.pdf  )

The correct Unicode character would not be U+001A SUB but U+FFFD 
REPLACEMENT CHARACTER. However, because of XML's rules, transcoding 
errors should cause the processor to report an error. So if U+FFFD 
appears in a WF document, it should only be because some pre-existing 
text already had that character in it when it was marked up: that is, 
the data correctly contains the REPLACEMENT CHARACTER due to some prior 
flaw.   (See 
http://www.unicode.org/versions/Unicode5.2.0/ch16.pdf  and search for 
FFFD.)

Note that Unicode does not define semantics for SUB and other control 
characters, but defers to implementations and other standards, such as 
ISO/IEC 6429:1992. You can see from the front matter at  
http://webstore.iec.ch/preview/info_isoiec6429%7Bed3.0%7Den.pdf  that 
the scope of that standard (page 1) is intended to be used "in 
particular with character-imaging devices": think of a Teletype 
printer's BEL and BS and, by a stretch, a modem's XON/XOFF flow control. 
It isn't for use in data exchange as part of the data but for simple 
transmission protocols underneath the data.

Finally, Greg should note that the correct transcoding from UTF-8 to 
ISO 8859-1 is not to use any substitution characters, but 1) to replace 
the character with a numeric character reference when the item is in 
data content, and 2) to fail when the character is in markup.  If you need 
more detailed transcoding than that, then it is not something that XML 
processors will provide, and you will have to make your own preprocessor.
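
The two rules can be sketched with Python's standard codec error 
handlers. This is only an illustration under assumptions: the function 
names are hypothetical, and the sketch does not parse XML, so the caller 
is assumed to know already which strings are data content and which are 
markup (element or attribute names, say):

```python
def to_latin1_content(text):
    """Encode data content; unencodable characters become &#N; references."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")

def to_latin1_markup(name):
    """Encode a markup name; raises UnicodeEncodeError if it does not fit."""
    return name.encode("iso-8859-1", errors="strict")

# The euro sign is not in ISO 8859-1, so it is written as a reference.
content = to_latin1_content("price: 5\u20ac")

# The same character in a name cannot be rescued by a reference.
try:
    to_latin1_markup("prix\u20ac")
    markup_failed = False
except UnicodeEncodeError:
    markup_failed = True
```

Characters that do exist in ISO 8859-1, such as U+00E9, pass through 
both functions as a single byte.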

Now there have been formats that allow multiple character sets: indeed, 
RTF allows sections in different embedded encodings. The result is that 
if you want to use a text editor, it must be 8-bit clean (not do any 
transcoding) and you have to change the screen encoding to view 
different sections correctly. XML did not take this route.

Cheers
Rick Jelliffe

P.S. The most common transcoding error I used to see is where there is a 
UTF-8 data stream and someone puts in the single byte 0xA0, intending it 
to be the non-breaking space character. More common now is where there is a 
UTF-8 stream that has the UTF-16 Byte Order Mark converted to UTF-8 
rather than stripped (this is not so much a code error as an operational 
error).
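
Both symptoms are easy to look for at the byte level. Here is a rough 
Python sketch; the function name diagnose and its message strings are 
made up for the example. It only flags candidates for inspection, since 
whether an encoded U+FEFF is actually a problem depends on where it sits 
and on what consumes the stream:

```python
import codecs

def diagnose(data):
    """Return a list of human-readable findings for a would-be UTF-8 stream."""
    problems = []

    # U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
    idx = data.find(codecs.BOM_UTF8)
    if idx != -1:
        problems.append(f"EF BB BF (U+FEFF as UTF-8) at byte {idx}")

    # A strict decode stops at the first invalid byte.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        bad = data[err.start]
        if bad == 0xA0:
            problems.append(
                f"lone 0xA0 at byte {err.start} (NBSP in UTF-8 is C2 A0)"
            )
        else:
            problems.append(f"invalid UTF-8 at byte {err.start}: 0x{bad:02X}")
    return problems
```

A correctly encoded non-breaking space (the two bytes C2 A0) produces no 
finding, while a bare 0xA0 is reported with its offset.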
