Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

Philippe Poulard said:
>
> I guess some parsers have additional heuristics for reading successfully
> the sequence <?xml encoding="blah-blah"?> ; maybe some try-catch to
> apply with the set of charset they know ?

I hope they don't, unless they are specific tools for repairing broken
documents.

Guessing encoding is the *opposite* of the XML approach and should be
strongly resisted. The XML approach is based on explicit labeling as the
only approach that is reliable (which is not the same as not-stuff-up-able
of course).

There are many problems with guessing:

 * most platforms provide hundreds of character sets
 * most character sets belong to families that are ASCII or EBCDIC
supersets, so there is not enough redundant (in the engineering-theoretic
sense) information or orthogonality to know which specific set is
actually being used
 * most transcoders don't actually generate exceptions when an unknown
byte sequence is found: older ones just ignore the sequence, others
replace it with "?" or some other character, and some more recent
transcoders are a little better, so you cannot know whether a trial
decode really succeeded
 * detecting encoding from statistical patterns in the text relies on the
document conforming to the corpus, to a certain extent, and may even be
skewed by the use of native language markup
 * guessing prevents error detection
 * guessing can corrupt the database
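
To make the second and third points concrete, here is a minimal Python
sketch. The sample word and the candidate list are my own illustrative
choices, not anything taken from a real parser. It shows one byte sequence
decoding "successfully" under several ASCII supersets, and a lenient
decode that hides the failure instead of reporting it.

```python
# Bytes for "kő" (Hungarian for "stone") as written by an ISO-8859-2 producer.
raw = "kő".encode("iso-8859-2")

# Hypothetical candidate list a guesser might try; all are ASCII supersets.
candidates = ["iso-8859-1", "iso-8859-2", "iso-8859-15", "cp1252", "koi8-r"]


def decode_all(data, encodings):
    """Return {encoding: text} for every encoding that decodes without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this candidate was ruled out
    return results


# Every candidate "works", so a try-catch loop rules nothing out.
readings = decode_all(raw, candidates)
for name, text in readings.items():
    print(f"{name:12} -> {text!r}")

# A lenient transcoder does not fail either: the bad byte is silently
# replaced with U+FFFD and the original character is lost.
lenient = raw.decode("utf-8", errors="replace")
print("lenient utf-8 ->", repr(lenient))
```

Only one of those readings is the text the author wrote, and nothing in
the bytes themselves says which.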

So the XML system is then based on solving the problem "How do we read
that label reliably?"  The UTF-8 default is just low-hanging fruit,
because it also accepts ISO646-US (ASCII), but again it is not in any
sense guessed.

A system that guesses encoding is unsuitable for critical data. In a
hospital record, you don't want your name to be rejected because it has
some Hungarian character but you are in a German hospital, etc.
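
The hospital case can be sketched in a few lines of Python. The store and
load helpers and the sample name are hypothetical, made up for this
illustration. A system that silently assumes the local charset has to
reject the name, while a record that carries an explicit label round-trips
it intact.

```python
name = "Szőke"  # contains ő (U+0151), which ISO-8859-1 cannot represent


def store(text, charset):
    """Encode strictly and keep the label next to the bytes."""
    return charset, text.encode(charset)


def load(record):
    """Decode using the stored label, never a guess."""
    label, data = record
    return data.decode(label)


# A system hard-wired to the local Western European charset rejects the name.
try:
    store(name, "iso-8859-1")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# With an explicit label the same name survives unchanged.
roundtrip = load(store(name, "utf-8"))
print(rejected, roundtrip)
```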

Cheers
Rick Jelliffe

