From: "Alan Kent" <ajk@mds.rmit.edu.au>
> To separate the two issues - I have no opinion on name characters.
> PCDATA however is different. I read through your entire post twice
> and must admit I still don't quite understand what your point is
> exactly. I *think* you might be saying "it's good to specify the
> encoding because that way it's possible to make sure characters
> not valid in that encoding are rejected." (My reading of the XML spec
> is that 0x85 is legal in the Unicode character set - that is, it's
> not marked as UNUSED in the good old SGML jargon.)
Sorry, I am not being clear. I am saying that it is vital in practice
that there are enough characters that are UNUSED (and characters that
are NAME characters, SEPCHAR, etc.) to catch the most common
mislabellings of character encodings. More than "good": vital.
It is one of the best Software Engineering features of XML: it can
make several very difficult problems effectively disappear.
Every time someone complains "I cannot process this document because
the XML parser says I have an unexpected code point" it is a victory
for software quality and reliable data interchange. Encoding problems
must be detected and dealt with at source, not allowed to propagate
and corrupt distributed systems.
This is an area in which ASCII developers' judgements about the
tradeoffs may easily conflict with those of people from the rest of
the world.
If you only ever use ASCII, then having the C1 controls
(80 to 9F) available would probably cause you no grief.
Even if you only use ISO 8859-1, it is still important. The Euro=0x80
mistake will be increasingly common, and we need to make sure that
XML processors continue to catch this error.
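A small Python sketch of what I mean (purely illustrative, my own, not
from any existing tool): a Euro sign written with the Windows-1252
encoding but labelled ISO 8859-1 arrives as the C1 code point U+0080.

    euro_cp1252 = "\u20ac".encode("cp1252")       # the Euro as the single byte 0x80
    as_latin1 = euro_cp1252.decode("iso-8859-1")  # byte-for-byte, never complains
    print(hex(ord(as_latin1)))                    # 0x80: a C1 code point that a
                                                  # C1-restricting processor can flag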
Character encodings are hard. Programmers are not trained
to deal with them -- Computer Science classes teach what
a float is and what a byte is, but usually not what a character is,
and certainly not what a multibyte code is. APIs do not
expose the character set (Java is getting better at this).
DBMSs do not check that the encoding their locale claims
is in fact the one that any particular string actually uses.
The only place we can verify character encodings is at the
point of data interchange: as XML.
Before XML, the only way that people had to make
reliable systems for data interchange was to agree on a
common encoding. That is impractical for all sorts of reasons.
XML has given us an alternative: in many practical situations it can
be safe to use different encodings, because we have a coarse net that
exposes mislabelled encodings.
To allow x80 to x9F, and to allow silly characters in XML names, takes
us back to the 1980s, when there were no checks at any part of a system
that encodings were correct. Unrestricted ranges are not the future;
they are the past, and a past that failed miserably.
> If this is your point, then would it be possible to define a new
> encoding which permitted the full range of Unicode characters
> (including control characters which are valid in Unicode).
> Would that address your issues?
I don't believe the world is crying out for more encodings :-)
When we consider the solutions available, we do not have
the ability to force people to choose encodings or the ability
to make APIs which transmit the character encoding: not only
because we are not ISO or Microsoft, but because Pandora's
box is already open (or the horse has already bolted).
> But I must admit that I do not understand why allowing control
> characters in PCDATA results in "we won't actually increase the number
> of characters that can be reliably sent: we will just make non-ASCII
> characters suspect and unreliable." It may make translation between
> different character sets harder, but hey - how do I turn Unicode-encoded
> Chinese into plain ASCII? My point is that not permitting
> a small number of characters does not solve all such problems.
(Off the point, one can transmit Chinese in ASCII using numeric character
references.)
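Still off the point, a throwaway Python illustration (mine, purely for
show) of what such escaping looks like:

    text = "\u4e2d\u6587"           # two Chinese characters
    print("".join(c if ord(c) < 128 else "&#x%X;" % ord(c) for c in text))
    # prints: &#x4E2D;&#x6587;  -- pure ASCII, yet still Chinese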
UTF-8, VISCII and Big5 all use the byte values x80 to x9F.[1] Most
transcoding systems that read ISO 8859-1 into Unicode will merely
convert unsigned bytes to unsigned shorts to read the data in. So
merely labelling an XML file as ISO 8859-1 is enough to silence most
XML processors, in the absence of a restriction on the C1 control
characters. If x80 to x9F are errors, then it becomes a statistical
question of how many non-ASCII characters can occur in, say, a UTF-8,
VISCII or Big5 document labelled as ISO 8859-1 before the error is
detected.
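To make that concrete, here is a rough Python sketch (my own
illustration, not any real processor) of the coarse check that a C1
restriction gives us for free:

    def check_labelled_latin1(data):
        # ISO 8859-1 maps bytes straight to code points, so decoding
        # never fails; the check is whether any C1 code point appears.
        text = data.decode("iso-8859-1")
        for i, ch in enumerate(text):
            if 0x80 <= ord(ch) <= 0x9F:
                raise ValueError("C1 code point U+%04X at offset %d: "
                                 "probably a mislabelled encoding"
                                 % (ord(ch), i))

    # UTF-8 bytes wrongly labelled as ISO 8859-1 are caught almost at once:
    check_labelled_latin1("\u4e2d\u6587".encode("utf-8"))   # raises ValueError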
As far as the issue of reliability and "trust" goes, Alan is certainly
correct that disallowing x80 to x9F will not catch some errors
("solve all problems"), such as mislabelling ISO 8859-15 as ISO 8859-1
(especially if element names are all ASCII-repertoire characters). And
in some cases it may take a small statistical number of
characters before the problem is detected (e.g. a Chinese
Big5 character has a one-in-eight chance of having a second
byte in x80-x9F, ignoring the higher rates of some common characters
such as MA, so we can expect that the problem will be
detected for most documents with more than 8 Chinese
characters.)
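Putting rough numbers on that expectation (a back-of-the-envelope
Python sketch, assuming the one-in-eight figure above):

    p = 1.0 / 8                      # chance per character of a byte in x80-x9F
    for n in (1, 4, 8, 16, 32):
        # probability that at least one of n characters trips the check
        print(n, round(1 - (1 - p) ** n, 2))
    # 8 characters already give roughly a two-in-three chance of detection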
But XML does not need to "solve all problems". It just needs
to catch an adequate number of important problems in a straightforward
way, where users can have statistical expectations about the problems
detected.
There are some problems (such as whether a Japanese document
is using backslash or Yen) which cannot be detected by this method;
that is a pity for the people with those problems, not a sign
that the C1 restrictions are not worthwhile.
> If you are only talking about name characters (element names, attribute
> names etc), then that is a different matter.
Restricting the control characters catches some significant problems. Restricting the
name characters catches some more.
> But I think it's wrong to put too much trust into XML to protect
> against data corruption. This seems (to me) to be a poor rationale
> for omitting a small select number of characters.
In the abstract, sure. But since there is *nothing* else that catches
such errors, the abstract is irrelevant. It is not "data corruption" in
the sense of errors that creep in; it is data corruption by programmer
action. APIs usually just use the default encoding of the locale to
serialize text, so there is no way for programmers to become aware that
they have mislabelled their documents' encoding unless the XML
processor tells them.
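A tiny Python example (illustrative only, the file name is my own) of
the kind of trap I mean -- the declaration promises UTF-8, but the
bytes are whatever the locale's default encoding happens to be:

    import locale
    doc = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'
    with open("out.xml", "w") as f:        # no encoding given: locale default
        f.write(doc)
    print(locale.getpreferredencoding())   # e.g. cp1252 -- the label now lies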
Anyone who has had to populate a database with feeds arriving
in different character sets knows that keeping track of the
character encoding is vital. Without it, the database becomes
useless.
I would be interested to hear the people who say we should make
C1 available specify an alternative way to detect these
errors, or explain why the problem is not real. (I am sure
that developers from China, Japan, Korea and Vietnam would
be interested to learn that character encoding issues cause so few
problems that we should not have machine checks to help us.)
Cheers
Rick Jelliffe
[1] For more info, see my old GLUE transcoder project:
http://www.ascc.net/xml/en/utf-8/glue.html