From: "Thomas B. Passin" <firstname.lastname@example.org>
> [Karl Stubsjoen]
>>Here is an outline of my current problem then:
> > 1. original data submitted - unicode "TM" submitted as part of data
> > 2. server side XML generated and encoded as ISO-8859-1
> > 3. ixmlhttprequest made for XML data - which is *blindly* downloaded and
> > encoded as UTF-8
> > 4. MSXML3 chokes when attempting to load xml, error is "Invalid
> > characters..."
> I've really been surprised at all the places that Microsoft is either
> non-conforming or simply does things in a way that can be unworkable in
> certain situations. I've seen it in .NET web services, and SQL server
> querying an xml file for query parameters, and now this.
Actually, I would not be too hard on Microsoft here. (I am happy to supply
other reasons :-)
Throughout the computing world transcoders (the software that converts text
between encodings) typically do not provide proper facilities to cope with
missing characters in the output encoding. If you are lucky, the transcoder
will fail and tell you there is something wrong. But typically transcoders
will just strip the missing character or substitute '?' for it.
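To make the two behaviours concrete, here is a small sketch using Python's built-in codecs (my choice for illustration; it is not what MSXML or the server in the original problem uses). The sample string is made up. The same export is run once in a lossy mode and once in a strict mode:

```python
# Sketch: one export, two error-handling modes.
text = "Acme\u2122 widgets"  # U+2122 TRADE MARK SIGN is not in ISO-8859-1

# Lossy mode: the missing character is silently replaced by '?'.
lossy = text.encode("iso-8859-1", errors="replace")

# Strict mode (Python's default): the transcoder fails and reports where.
try:
    text.encode("iso-8859-1", errors="strict")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    failure = (exc.start, exc.object[exc.start])  # offset and offending character
```

In the lossy case nothing downstream can tell that a trademark sign was ever there, which is the real danger.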
It is not just Microsoft but the state of play in our computing infrastructure.
When you are working with data in different encodings and Unicode
infrastructure, importing from different encodings is safe but exporting
is not safe. At least, you need to take especial care.
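The asymmetry can be seen in a few lines. This is only a sketch, again with Python's codecs, and `can_export` is a hypothetical helper name of mine. Every ISO-8859-1 byte has a Unicode code point, so the import direction cannot lose anything, but the export direction depends on the text:

```python
# Import: all 256 ISO-8859-1 byte values decode to Unicode and round-trip.
all_bytes = bytes(range(256))
imported = all_bytes.decode("iso-8859-1")
round_trip_ok = imported.encode("iso-8859-1") == all_bytes


def can_export(text, encoding):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)  # strict by default
        return True
    except UnicodeEncodeError:
        return False


trademark_to_latin1 = can_export("\u2122", "iso-8859-1")
trademark_to_utf8 = can_export("\u2122", "utf-8")
```

A check like this before writing data out is one form of the special care meant above.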
How could API vendors help in this? For a start, they should offer
a mode for all text export so that an encoding error can cause the
export to fail. Even better would be to offer "smart transcoders"
which would allow characters not in the output character encoding
to be replaced by numeric character references (e.g. \uHHHH or
&#xHHHH; ) of various kinds.
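As a rough illustration of what such a smart transcoder might look like, Python's codec machinery already has hooks for it. The built-in "xmlcharrefreplace" handler emits decimal references and "backslashreplace" emits \uHHHH escapes. The handler name "xml-hexref" and the function `export_for_xml` below are my own inventions for this sketch, added to get the hexadecimal form:

```python
import codecs


def hex_charref(exc):
    """Replace unencodable characters with &#xHHHH; references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(f"&#x{ord(ch):X};" for ch in bad)
    return replacement, exc.end  # resume encoding after the bad run


# "xml-hexref" is a made-up handler name for this example.
codecs.register_error("xml-hexref", hex_charref)


def export_for_xml(text, encoding):
    """Encode text, keeping missing characters as numeric references."""
    return text.encode(encoding, errors="xml-hexref")


hex_form = export_for_xml("Acme\u2122", "iso-8859-1")
decimal_form = "Acme\u2122".encode("iso-8859-1", errors="xmlcharrefreplace")
escape_form = "Acme\u2122".encode("ascii", errors="backslashreplace")
```

Note that this is only lossless where the receiver will interpret the references, e.g. in XML character data and attribute values; it does not help inside element names or comments.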
A couple of years ago I created a couple of lossless transcoders:
AT&T's licensing of tcs put the kibosh on the tcs-based version.
Actually, I believe that the general way we think about character
encodings is faulty: we need to think in terms of coping with
variants. The GLUE project (GLUE Loses User Encodings!)
at http://www.ascc.net/xml/en/utf-8/glue.html was an attempt
to move in a different direction, but we dropped it in favour of
Mark Davis' ICU effort which looked promising.
The other culprit is C and byte-based DBMS. The generation
of programmers who grew up expecting a character to be
8 bits (or expecting that all strings will be in their local encoding)
-- which is my generation -- have made an infrastructure that breaks
easily. The more recent APIs from Java, .NET, Apple etc. are
much better in this, but we still have a lot of older code floating
about, and code written by private individuals and contributed
to open source is often really bad in this regard.
Even HTTP has not been immune to this: when you send a
request, what encoding is used? Until recently it was up
in the air.
That is why XML is so strict and definite about encodings:
you have to know every step of the chain. Ultimately many
programmers will conclude that it is simpler to mandate
UTF-8 at every part of their processing chain, wherever
Furthermore, this is why it is important that XML keep
enough characters unused to be able to detect encoding
errors. XML 2.0 should ban all non-whitespace control
for more on that.
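Here is a sketch of how reserved control characters let a receiver detect a mislabelled encoding. The scenario is an assumption of mine: data that is really Windows-1252 but labelled ISO-8859-1, where the trademark sign (byte 0x99) turns into the C1 control U+0099. The function name `suspicious_controls` is hypothetical:

```python
def suspicious_controls(text):
    """List (offset, code point) for control characters other than tab, LF, CR."""
    allowed = {"\t", "\n", "\r"}
    return [
        (i, hex(ord(ch)))
        for i, ch in enumerate(text)
        if (ord(ch) < 0x20 and ch not in allowed) or 0x7F <= ord(ch) <= 0x9F
    ]


# The sender really used Windows-1252, where the trademark sign is byte 0x99.
wire = "Acme\u2122".encode("cp1252")

# The receiver trusts a wrong "ISO-8859-1" label: decoding succeeds, but the
# byte 0x99 becomes U+0099, a C1 control character.
mislabelled = wire.decode("iso-8859-1")

found_when_wrong = suspicious_controls(mislabelled)
found_when_right = suspicious_controls(wire.decode("cp1252"))
```

If those control characters are legal in documents, the wrong label passes silently; if they are banned, the parser can reject the document at once.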