On Wed, Dec 19, 2001 at 06:02:55PM +1100, Rick Jelliffe wrote:
> One big advantage of disallowing control characters from XML documents
> and silly characters from XML names is that it catches most common encoding
> errors.
> For example, the very common problem of data labelled ISO 8859-1 containing
> a 0x85 byte (for the Euro character).
> And that lies at the heart of the matter: if we allow control characters
> and silly name characters, we won't actually increase the number of
> characters that can be reliably sent: we will just make non-ASCII
> characters suspect and unreliable.
> Rick Jelliffe
To separate the two issues: I have no opinion on name characters.
PCDATA, however, is different. I read through your entire post twice
and must admit I still don't quite understand exactly what your point
is. I *think* you might be saying "it's good to specify the
encoding because that way it's possible to make sure characters
not valid in that encoding are rejected." (My reading of the XML spec
is that 0x85 is legal in the Unicode character set; that is, it's
not marked as UNUSED, in the good old SGML jargon.)
If this is your point, then would it be possible to define a new
encoding which permitted the full range of Unicode characters
(including control characters which are valid in Unicode)?
Would that address your issues?
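
To make my reading concrete, here is a rough Python sketch (my own
illustration, not anything from the spec's text beyond the Char
production of XML 1.0). The function names is_xml10_char and
c1_controls are hypothetical. It shows that a 0x85 byte labelled as
ISO 8859-1 decodes to U+0085, which the Char production accepts, while
the same byte read as windows-1252 is a printable character.

```python
def is_xml10_char(cp):
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def c1_controls(text):
    """Offsets of C1 control characters (U+0080..U+009F) in text."""
    return [i for i, ch in enumerate(text) if 0x80 <= ord(ch) <= 0x9F]


# The same bytes under two different encoding labels.
raw = b"price\x85"
as_latin1 = raw.decode("iso-8859-1")  # 0x85 becomes the control U+0085
as_cp1252 = raw.decode("cp1252")      # 0x85 becomes U+2026 (ellipsis)

print(is_xml10_char(0x85))     # legal under XML 1.0 Char
print(c1_controls(as_latin1))  # where the suspicious controls sit
print(repr(as_cp1252[-1]))
```

A check like c1_controls is only a heuristic: U+0085 is a legitimate
character in its own right (NEL), so finding it suggests, rather than
proves, that the encoding label is wrong.
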
But I must admit that I do not understand why allowing control
characters in PCDATA results in "we won't actually increase the number
of characters that can be reliably sent: we will just make non-ASCII
characters suspect and unreliable." It may make translation between
different character sets harder, but hey, how do I turn Unicode-encoded
Chinese into plain ASCII? My point is that not permitting
a small number of characters does not solve all such problems.
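
A small Python sketch of the translation problem, assuming nothing
beyond the standard codecs (the sample string is my own): a strict
conversion to ASCII simply fails for Chinese text, and the usual
workaround is to fall back to numeric character references.

```python
text = "\u4e2d\u6587"  # the two characters of "Chinese (language)"

# A strict conversion to ASCII cannot represent these characters.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Fallback: emit XML-style numeric character references instead.
as_refs = text.encode("ascii", errors="xmlcharrefreplace")

print(strict_ok)
print(as_refs)
```

Note that this escape hatch does not extend to most control
characters: in XML 1.0 a character reference must itself refer to a
character matching Char, so something like &#1; is not well-formed.
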
Or have I missed the whole point? (I have jumped into this
discussion late.) If so, sorry for muddying the waters.
If you are only talking about name characters (element names, attribute
names etc), then that is a different matter.
But I think it's wrong to put too much trust in XML to protect
against data corruption. This seems (to me) to be a poor rationale
for omitting a small, select number of characters. As I said, I
may have missed your point, but so far you have not made
a convincing argument to me (for PCDATA). Whether I count - well, that is
another matter! :-)