Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
- From: "Rick Jelliffe" <rjelliffe@allette.com.au>
- To: "Philippe Poulard" <philippe.poulard@sophia.inria.fr>
- Date: Fri, 21 Sep 2007 11:30:08 +1000 (EST)
Philippe Poulard said:
>
> I guess some parsers have additional heuristics for reading successfully
> the sequence <?xml encoding="blah-blah"?> ; maybe some try-catch to
> apply with the set of charset they know ?
I hope they don't, unless they are specific tools for repairing broken
documents.
Guessing encoding is the *opposite* of the XML approach and should be
strongly resisted. The XML approach is based on explicit labeling as the
only approach that is reliable (which is not the same as not-stuff-up-able
of course).
There are many problems with guessing:
* most platforms provide hundreds of character sets
* most character sets belong to families which are ASCII or EBCDIC
supersets, so there is not enough redundant (in the engineering-theoretic
sense) information or orthogonality to know which specific set is
actually being used
* most transcoders don't actually generate exceptions when an unknown
byte sequence is found: older ones just ignore the sequence, others
replace it with "?" or some other character, and only some more recent
transcoders are a little better, so you cannot know whether a trial
decoding really succeeded
* detecting encoding from statistical patterns in the text relies on the
document conforming to the corpus, to a certain extent, and may even be
skewed by the use of native language markup
* guessing prevents error detection
* guessing can corrupt the database
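To make the second and third points concrete, here is a minimal Python sketch. The sample name and the list of candidate charsets are my own illustration, not anything from the thread. It shows the same bytes decoding "successfully" under several ASCII-superset charsets, and a lenient decoder hiding the failure that a strict one would report.

```python
# A name containing the Hungarian letter o-with-double-acute (U+0151),
# encoded as ISO-8859-2. The sample name is an illustrative assumption.
data = "Erdős".encode("iso-8859-2")  # b'Erd\xf5s'

# Several ASCII-superset charsets accept these bytes without complaint.
# No exception is raised, so a try/except loop cannot tell them apart.
candidates = ["iso-8859-1", "iso-8859-2", "cp1252", "koi8-r"]
decoded = {name: data.decode(name) for name in candidates}

# A lenient transcoder hides the problem instead of reporting it.
replaced = data.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = data.decode("utf-8", errors="ignore")    # bad byte silently dropped

# Only strict decoding against a stated encoding detects the mismatch.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

for name, text in decoded.items():
    print(name, text)
print(repr(replaced), repr(ignored), strict_failed)
```

Only the ISO-8859-2 result is the intended name. The other decodings are equally "valid" as far as the decoder is concerned, which is why trial decoding is not evidence of anything.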
So the XML system is then based on solving the problem "How do we read
that label reliably?" The UTF-8 default is just low-hanging fruit,
because it also accepts ISO646-US (ASCII), but again the default is not
in any sense a guess.
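As a rough sketch of "read the label, then decode strictly", something like the following Python would do. The function names declared_encoding and read_xml_text are hypothetical, and the sketch assumes an ASCII-compatible encoding with at most a UTF-8 byte order mark. A real parser also has to handle UTF-16, EBCDIC and the other cases covered by the autodetection appendix of the XML Recommendation.

```python
import re

# Encoding pseudo-attribute of an XML declaration at the very start of the
# document. Simplification: only works for ASCII-compatible encodings.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
_UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(raw: bytes) -> str:
    """Return the labelled encoding, or the UTF-8 default if there is no label."""
    if raw.startswith(_UTF8_BOM):
        raw = raw[len(_UTF8_BOM):]
    match = _DECL.match(raw)
    return match.group(1).decode("ascii") if match else "utf-8"


def read_xml_text(raw: bytes) -> str:
    """Decode strictly with the labelled encoding; never fall back to a guess."""
    encoding = declared_encoding(raw)
    if raw.startswith(_UTF8_BOM):
        raw = raw[len(_UTF8_BOM):]
    # Strict mode: a wrong label or a bad byte raises instead of being hidden.
    # An encoding name Python does not know raises LookupError.
    return raw.decode(encoding)


labelled = b'<?xml version="1.0" encoding="ISO-8859-2"?><name>Erd\xf5s</name>'
unlabelled = b'<?xml version="1.0"?><name>Erd\xf5s</name>'

print(read_xml_text(labelled))
try:
    read_xml_text(unlabelled)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The unlabelled document is rejected rather than repaired. That is the point: the error is detected at the boundary instead of being stored as corrupt data.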
A system that guesses encoding is unsuitable for critical data. In a
hospital record, you don't want your name to be rejected because it has
some Hungarian character but you are in a German hospital, etc.
Cheers
Rick Jelliffe