Re: [xml-dev] Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
- From: Rick Jelliffe <rjelliffe@allette.com.au>
- To: xml-dev@lists.xml.org
- Date: Fri, 19 Mar 2010 18:17:37 +1100
On Roger's initial question about an XML processor failing to report a
non-ASCII code sequence: this is not at all impossible. In fact, most
transcoders that were written before XML, or without consideration
of XML's requirements, do not report wrong codes unless they get
into serious trouble. They may substitute some bogus character, or
silently strip the character out; sometimes they will actually fall
back to the default encoding of the platform (if it is an ASCII
superset at the encoding level).
These kinds of transcoders are not sufficient for use in XML WF
detection. The general character set infrastructure of our software
systems started off broken and it is only by taking care that anything
will work in this area: the standards must have good enough policies,
the users must implement these policies in their markup/configuration,
the transcoder libraries must be chosen to implement the policies, and
other sources of information about bad encodings (e.g. the presence of
disallowed control characters) must be utilized to try to fill in any
gaps. The world is full of programmers determined to remain ignorant of
basic working knowledge of character encoding issues and to complicate
the life of people downstream.
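To make the difference between lenient and strict transcoding concrete, here is a minimal Python sketch. It is an illustration only, not what any particular XML processor does: the helper name decode_for_xml is hypothetical, and the character class is my reading of the characters excluded by the XML 1.0 Char production.

```python
import re

# Characters outside the XML 1.0 Char production: C0 controls other than
# TAB, LF and CR, lone surrogates, U+FFFE and U+FFFF.
_DISALLOWED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def decode_for_xml(raw: bytes, encoding: str) -> str:
    """Hypothetical helper: decode strictly, then reject disallowed characters."""
    # errors="strict" raises UnicodeDecodeError instead of guessing.
    text = raw.decode(encoding, errors="strict")
    match = _DISALLOWED.search(text)
    if match:
        code = ord(match.group())
        raise ValueError(f"disallowed character U+{code:04X} at offset {match.start()}")
    return text


# ISO-8859-1 bytes for "cafe" with e-acute, mislabelled as UTF-8.
bad = b"caf\xe9"

# The lenient behaviours described above:
lenient = bad.decode("utf-8", errors="replace")  # substitutes U+FFFD
stripped = bad.decode("utf-8", errors="ignore")  # silently drops the byte
```

With the strict helper, decode_for_xml(bad, "utf-8") raises UnicodeDecodeError, and a stray SUB (0x1A) in otherwise valid input raises ValueError, so in both cases the problem is reported rather than hidden.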
On Greg's question about the ASCII SUB character: this is a control
character intended to be used for transmission level problems: the
encoding relates to signals on wires when transmitting ASCII, not to
transcoding mismatches, as I understand it. (The Wikipedia entry
incorrectly states that SUB signals that the following character needs
to have its 5th bit inverted, as an escape mechanism. I think this may
be the EBCDIC operation? Anyway, see
http://www.itscj.ipsj.or.jp/ISO-IR/001.pdf )
The correct Unicode character would not be U+001A SUB but U+FFFD
REPLACEMENT CHARACTER. However, because of XML's rules, transcoding
errors should cause processing to report an error rather than
substitute anything. So if U+FFFD were to appear in a WF document, it
should only be because there was some pre-existing text which had that
character in it and was then marked up: the data correctly contains
the REPLACEMENT CHARACTER due to some prior flaw. (See
http://www.unicode.org/versions/Unicode5.2.0/ch16.pdf and search for
FFFD.)
Note that Unicode does not define semantics for SUB and other control
characters, but defers to implementations and other standards, such as
ISO/IEC 6429:1992. You can see from the front matter at
http://webstore.iec.ch/preview/info_isoiec6429%7Bed3.0%7Den.pdf that
the scope of that standard (page 1) is intended to be used "in
particular with character-imaging devices": think of a Teletype
printer's BEL and BS and, at a stretch, a modem's XON/XOFF flow
control. It isn't for use in data exchange as part of the data, but for
simple transmission protocols underneath the data.
Finally, Greg should note that the correct transcoding from UTF-8 to
ISO-8859-1 is not to use any substitution characters, but 1) to replace
the character with a numeric character reference when the item is in
data content, and 2) to fail when the character is in markup. If you need
more detailed transcoding than that, then it is not something that XML
processors will provide, and you will have to make your own preprocessor.
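As a rough sketch of those two rules in Python: the built-in "xmlcharrefreplace" error handler writes a numeric character reference for anything ISO-8859-1 cannot represent, while "strict" simply fails. The two function names are hypothetical, and the sketch assumes the caller already knows whether a string is data content or markup (an element name, say); it does no escaping of & or < itself.

```python
def to_latin1_content(text: str) -> bytes:
    """Data content: unrepresentable characters become &#NNNN; references."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def to_latin1_markup(name: str) -> bytes:
    """Markup (e.g. an element name): no reference is possible, so fail."""
    return name.encode("iso-8859-1", errors="strict")


# U+00E9 exists in ISO-8859-1 and is kept as one byte; U+20AC (euro sign)
# does not, so it is written as a decimal character reference.
content = to_latin1_content("caf\u00e9 5 \u20ac")
```

Here content is b"caf\xe9 5 &#8364;", whereas to_latin1_markup("pre\u0107") raises UnicodeEncodeError because U+0107 has no ISO-8859-1 code and a character reference is not allowed inside a name.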
Now there have been formats that allow multiple character sets: indeed,
RTF allows sections in different embedded encodings. The result is that
if you want to use a text editor, it must be 8-bit clean (not do any
transcoding) and you have to change the screen encoding to view
different sections correctly. XML did not take this route.
Cheers
Rick Jelliffe
P.S. The most common transcoding error I used to see is where there is a
UTF-8 data stream and someone puts in the byte 0xA0, intending it to be
the non-breaking space character. More common now is where there is a
UTF-8 stream that has the UTF-16 Byte Order Mark converted to UTF-8
rather than stripped (this is not so much a code error as an operational
error).
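Both mistakes are easy to reproduce in Python, as a small illustration (the variable names and the sample bytes are made up for the example):

```python
# A lone 0xA0 byte is the ISO-8859-1 non-breaking space, but it is not a
# valid UTF-8 sequence on its own, so a strict decoder rejects it.
latin1_nbsp = b"a\xa0b"
try:
    latin1_nbsp.decode("utf-8")
    nbsp_rejected = False
except UnicodeDecodeError:
    nbsp_rejected = True

# In UTF-8, U+00A0 NO-BREAK SPACE is the two bytes 0xC2 0xA0.
utf8_nbsp = "a\u00a0b".encode("utf-8")

# A Byte Order Mark (U+FEFF) converted to UTF-8 becomes the bytes EF BB BF.
bom = "\ufeff".encode("utf-8")
with_bom = bom + b"<doc/>"

# The plain "utf-8" codec keeps the BOM as a character in the text;
# "utf-8-sig" strips a single leading BOM if one is present.
kept = with_bom.decode("utf-8")
dropped = with_bom.decode("utf-8-sig")
```

Note that "utf-8-sig" only removes a BOM at the very start of the input; one that ends up in the middle of a stream, for example after concatenating files, still has to be handled separately.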