On Wednesday 19 December 2001 02:24 am, Alan Kent wrote:
> To separate the two issues - I have no opinion on name characters.
> PCDATA however is different. I read through your entire post twice
> and must admit I still don't quite understand what your point is
> exactly. I *think* you might be saying "it's good to specify the
> encoding because that way it's possible to make sure characters
> not valid in that encoding are rejected." (My reading of the XML spec
> is that 0x85 is legal in the Unicode character set - that is, it's
> not marked as UNUSED in the good old SGML jargon.)
>
> If this is your point, then would it be possible to define a new
> encoding which permitted the full range of Unicode characters
> (including control characters which are valid in Unicode)?
> Would that address your issues?

The point is that characters != bytes != encoding. If you start allowing
control characters (which are somewhat debatable *as* characters in the first
place), it becomes very easy to abuse that power and to build
application-specific uses of embedded encodings. This is effectively what Mr.
Rhys from MS wanted: the ability to store arbitrary binary streams inside
XML-encoded data.

The problem is that XML is *text*. It is made of *characters*, and
arbitrary binary strings have no place in it. Once you change that, you have
essentially ruined XML as a textual markup language.
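
To make the current rule concrete, here is a rough sketch of how an XML 1.0 parser treats these characters. It assumes Python's standard `xml.etree.ElementTree` (which sits on top of expat); the helper `parses` and the sample documents are made up for illustration only.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the document is accepted as well-formed XML 1.0."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# U+0085 falls inside the XML 1.0 Char production, so it is accepted.
nel_ok = parses("<a>\u0085</a>")

# U+0001 is a C0 control character outside the Char production: rejected,
# whether it appears literally or as a numeric character reference.
raw_control_ok = parses("<a>\u0001</a>")
ref_control_ok = parses("<a>&#1;</a>")

print(nel_ok, raw_control_ok, ref_control_ok)
```

The restriction is on characters, not on bytes: the same document is rejected no matter which encoding it was stored in.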
People could say that NUL et al. are still *characters* and so would be fine,
even in UTF-8 encoded documents, but I bet they'd be rather unhappy to find
their binary streams changing if I saved the document as UTF-16.
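
A small sketch of what that looks like in practice, assuming someone maps each byte of a binary payload to the code point with the same number (U+0000 to U+00FF). The payload values here are arbitrary and purely illustrative.

```python
# Hypothetical payload: four arbitrary bytes someone wants to embed.
payload = bytes([0x00, 0x01, 0x85, 0xFF])

# Treat each byte as the character with the same code point number.
# (Python's "latin-1" codec performs exactly this byte-to-code-point mapping.)
as_chars = payload.decode("latin-1")

# The same four characters, serialized in two different encodings.
utf8_bytes = as_chars.encode("utf-8")
utf16_bytes = as_chars.encode("utf-16-le")

print(payload.hex())      # 000185ff
print(utf8_bytes.hex())   # 0001c285c3bf
print(utf16_bytes.hex())  # 000001008500ff00
```

The characters are identical in both serializations, but neither byte sequence matches the original payload, and the two differ from each other. Anything that reads the raw bytes back out of the file gets something different depending on how the document was last saved.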

The point here is that binary data smuggled in as control characters is
unreliable: it does not survive a change of encoding.