On a related matter: some people recently suggested that allowing C0
directly would make incorrect encoding labels harder to detect.
0x00 excepted, I am not aware of any case where this is so. In EBCDIC
and ASCII-family encodings, the bytes 0x01-0x1F do not (ever?) appear
as part of sequences representing other characters. I am interested in
any counterexamples people have uncovered in this regard.
Contrast this with the C1 characters: U+0080-U+009F. There
are many encodings which use the bytes 0x80-0x9F (alone or in sequence)
to represent non-control characters.
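As a sketch of that detection opportunity (my own illustration, not from this post): in windows-1252 the bytes 0x80-0x9F encode printable characters such as curly quotes, so their appearance in data labelled ISO-8859-1, where they would be C1 controls, is a strong hint that the label is wrong.

```python
# Heuristic sketch: bytes 0x80-0x9F are C1 controls in ISO-8859-1 but
# printable characters in windows-1252, so their presence in text
# labelled ISO-8859-1 usually signals a mislabelled encoding.
# (Function name is mine, for illustration only.)
def looks_mislabelled_latin1(data: bytes) -> bool:
    """Return True if data labelled ISO-8859-1 contains C1 bytes."""
    return any(0x80 <= b <= 0x9F for b in data)

# windows-1252 "smart quotes" (0x93/0x94) wrongly labelled ISO-8859-1:
sample = '\u201cquoted\u201d'.encode('windows-1252')
print(looks_mislabelled_latin1(sample))  # True
```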
So the argument against literal C0 characters is that in-band controls
are transmission artifacts that have no place appearing literally in
data. Use a character reference to get the character but escape the
control semantic.
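A minimal sketch of that advice (the function and its details are my own, not from any spec text here): rewrite literal C0 controls as numeric character references so the character survives while the in-band control semantic does not. Note that U+0000 remains forbidden in XML 1.1 even as a reference.

```python
def escape_c0(text: str) -> str:
    """Replace literal C0 controls (other than tab/LF/CR) with XML
    numeric character references, e.g. '\x01' -> '&#x1;'.
    Illustrative sketch only; U+0000 is not handled because XML 1.1
    forbids it even as a character reference."""
    keep = {'\t', '\n', '\r'}
    return ''.join(
        f'&#x{ord(ch):X};' if ord(ch) < 0x20 and ch not in keep else ch
        for ch in text
    )

print(escape_c0('field1\x01field2'))  # field1&#x1;field2
```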
The argument against literal C1 characters is that taking advantage of
code-point redundancy is our only opportunity to detect many
kinds of mislabelled encodings. The number of people making
documents who would benefit from this greatly outnumbers the
number of people who have to send data with literal C1 characters.
(Indeed, because there is no XML Infoset difference between a literal
character and a numeric reference, anyone who requires that certain
characters be represented literally goes beyond common XML
anyway.) Use references to get the character without foregoing the
chance to detect mislabelled encodings.
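The Infoset point is easy to check; a minimal sketch with Python's standard parser (a tab stands in for any referenced character): a literal character and its numeric reference parse to identical text.

```python
import xml.etree.ElementTree as ET

# A literal tab and the numeric reference &#x9; are indistinguishable
# after parsing: the Infoset records only the character itself.
literal = ET.fromstring('<a>x\ty</a>')
referenced = ET.fromstring('<a>x&#x9;y</a>')
print(literal.text == referenced.text)  # True
```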
XML 1.1 is better than XML 1.0 in these two cases, IMHO.
I believe the rationales for C0 and C1 to be good engineering, rather
than goodwill to Ethiopians (who surely need our goodwill.) XML
must fit in with IETF protocols concerning C0; the C0 controls may
indeed be obsolete, but the place to fix that is in the RFCs. Good
engineering also dictates that we detect problems as close as possible
to the source; hence the C1s.