From: "Michael Rys" <firstname.lastname@example.org>
> Well, that may have been the original XML 1.0 use, but looking at where
> XML is currently having the most traction (SOAP, Messaging, WebDav,
> database serialization etc), this has changed.
One big advantage of disallowing control characters from XML documents
and silly characters from XML names is that it catches most common encoding errors.
For example, the very common problem of data labelled ISO 8859-1 containing
a 0x80 byte (the Euro character in Windows-1252).
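A minimal sketch of that kind of check, in Python. The function name and the exact set of rejected code points are assumptions made for illustration: it rejects C0 controls other than tab, LF and CR, plus U+007F to U+009F, which is the stricter rule being argued for here (the XML 1.0 Char production itself still permits U+0080 to U+009F). The sample data uses the Windows-1252 Euro byte 0x80 as the mislabelled byte.

```python
def find_control_chars(raw, declared_encoding):
    """Decode raw bytes with the declared encoding and list any control characters.

    Returns a list of (index, "U+XXXX") pairs; an empty list means no controls found.
    """
    text = raw.decode(declared_encoding)
    allowed = {"\t", "\n", "\r"}  # whitespace controls that XML permits
    problems = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        is_c0 = cp < 0x20 and ch not in allowed
        is_del_or_c1 = 0x7F <= cp <= 0x9F
        if is_c0 or is_del_or_c1:
            problems.append((i, "U+%04X" % cp))
    return problems


# Windows-1252 data (Euro sign is byte 0x80) that has been labelled ISO 8859-1.
mislabelled = "price: \u20ac5".encode("cp1252")

print(find_control_chars(mislabelled, "iso-8859-1"))  # [(7, 'U+0080')]
print(find_control_chars(mislabelled, "cp1252"))      # []
```

With the wrong label the Euro byte surfaces as a control character and is flagged; with the right label the same bytes pass.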
At the moment XML provides the only disciplined point in the processing chain:
when data is in XML one *must* have the encoding correct. This may
cause some consternation to us programmers, who perhaps have lived in a fool's
paradise where encoding does not matter, but it is a fundamental point
of Quality Control for XML documents and exposes data corruption at the point
where it can be corrected.
To allow control characters would make us sink back into the horrible mess
that everyone familiar with working in multi-character set environments without
XML is well aware of (or, at least, becomes well aware of when everything comes
Most DBMSs do not perform any checking of encoding. So you
can store almost anything in, say, a DBMS expecting ISO 8859-1. With
a world full of data incorrectly labelled, there is no chance of good
interoperability without some basic checking. And those basic checks
are what XML's data character and naming rules provide.
Without them, sure, XML would be "simpler" and we could attempt to transmit
arbitrary strings around. But then encoding detection or repair would be
the problem of the recipient and not the sender: a responsible recipient
can have no faith that their non-ASCII data has not been corrupted.
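To illustrate what the recipient faces when nothing is checked, here is a small Python sketch (the helper name is hypothetical). UTF-8 bytes decoded under a wrong ISO 8859-1 label raise no error at all; the text simply arrives corrupted.

```python
def decode_as_labelled(raw, label):
    """Decode bytes exactly as the (possibly wrong) label says, with no checking."""
    return raw.decode(label)


original = "caf\u00e9"                  # "café"
wire_bytes = original.encode("utf-8")   # what the sender actually wrote

# The label says ISO 8859-1, so the recipient decodes accordingly.
received = decode_as_labelled(wire_bytes, "iso-8859-1")

print(received == original)             # False
print(len(original), len(received))     # 4 5
```

No exception is raised, so nothing tells the recipient that the last character has turned into two wrong ones.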
And that lies at the heart of the matter: if we allow control characters
and silly name characters, we won't actually increase the number of
characters that can be reliably sent: we will just make non-ASCII
characters suspect and unreliable.