Re: SAX InputSource and character streams
- From: Rick Jelliffe <ricko@allette.com.au>
- To: xml-dev@lists.xml.org
- Date: Wed, 21 Feb 2001 17:35:50 +0800
From: Rob Lugt <roblugt@elcel.com>
>... So, in
>this case, by deciding to pass the SAX Parser a character stream, the
>application has taken on part of the responsibility of an XML processor -
>namely the responsibility of dealing with any encoding issues, thereby
>relieving the SAX processor of any need, indeed any right, to have an
>opinion on how the encoding is performed.
For encoding, the general rule is that (reliable) information provided by a
higher-level protocol has preference over the header. So if the XML header
says the entity is shift-JIS, but a Japanese transcoding proxy has converted
the entity to Japanese EUC encoding and rewritten the MIME header
accordingly, then the MIME header should be used.
So a processor needs some mechanism to over-ride auto-detection.
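
As an illustration of that precedence rule, here is a minimal Python sketch. The function names (declared_encoding, effective_encoding, decode_entity) are hypothetical, not part of SAX or any library, and the sketch assumes an ASCII-compatible encoding with no byte-order mark, so it is not a substitute for full auto-detection.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration.
# Assumption: the entity starts with "<?xml" in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else None

def effective_encoding(data, external=None, default="utf-8"):
    """External (e.g. MIME charset) information wins over the declaration."""
    return external or declared_encoding(data) or default

def decode_entity(data, external=None):
    """Decode the entity's bytes using the effective encoding."""
    return data.decode(effective_encoding(data, external))
```

With an entity whose declaration still says Shift_JIS but whose bytes have been transcoded to EUC-JP, decode_entity(data, external="euc_jp") decodes the bytes using the external label rather than the stale declaration.
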
However, because transcoding proxies are only an issue in a few (one?)
countries and perhaps for gateway-ing EBCDIC onto the WWW, and because
there may be a legitimate expectation that application/xml should never be
transcoded anyway, in effect a lot of applications will not override the XML
declaration.
(If the proxy has not rewritten the MIME header, then parsing the entity
should fail at the first occurrence of a code sequence that could not be
shift-JIS. If the entity is saved to a file without fixing up the XML
header then parsing that file should fail at the first occurrence of a code
sequence that could not be shift-JIS. If the chain of information breaks,
the entity is lost; fair enough.)
Actually, I think there is nothing stopping a processor from being very
strict, and rejecting application/*xml entities if the XML header and
the MIME header disagree: this would rule out transcoding proxies that do
not rewrite the XML header. I think that is a perfectly appropriate
approach, but it may go beyond what XML specifies. (For text/*xml,
transcoding or line-break fiddling is a desirable feature.)
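
A strict check of that kind might look like the following Python sketch. The function name check_labels_agree and its parameters are hypothetical, and the sketch assumes an ASCII-compatible XML declaration and charset names that Python's codecs module recognises (an unknown name raises LookupError).

```python
import codecs
import re

# Assumption: the declaration is readable as ASCII at the start of the entity.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def check_labels_agree(media_type, charset, data):
    """Raise ValueError if an application/*xml entity is labelled inconsistently."""
    strict = media_type.startswith("application/") and media_type.endswith("xml")
    match = _DECL.match(data)
    if not (strict and charset and match):
        return  # nothing to compare, or a type where transcoding is tolerated
    declared = match.group(1).decode("ascii")
    # codecs.lookup() normalises case and known aliases to one canonical name.
    if codecs.lookup(charset).name != codecs.lookup(declared).name:
        raise ValueError(
            "MIME charset %r disagrees with XML declaration %r" % (charset, declared)
        )
```

Under this sketch, text/*xml entities pass through unchecked, which matches the view that transcoding is acceptable there.
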
Personally, I think the MIME headers are an inappropriate way to specify
character encoding. This is because setting system defaults is a
system-administrator task, and even setting local Apache .htaccess
directory defaults is too much for normal users. Is XML software, when
writing out an entity, supposed to also rewrite any .htaccess file? What
about the config file formats for other webservers? The encoding
information should not be labelled in-band with the entity: Apple's late
lamented resource forks would have been fine for this. So the technology is
not in place for reliable end-to-end out-of-band signalling, though it can
be done if you have control over every step in the chain. It only works
because most people use a single encoding for all their work or for a
particular language; that is not the case for many people (e.g. Singapore has
four languages and three scripts in common use). At my multilingual site, I
ended up reserving directories for different encodings.
And, ultimately, the solution is for people to use UTF-8 for all web
transmissions and data files and to set their servers to provide the correct
information. (For people concerned with Chinese/Japanese/Korean file
blowout with UTF-8, the answer is that compressed UTF-8 is about the same
size as compressed UTF-16, compressed Big5, etc.)
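
The size claim is easy to measure on your own data. The Python sketch below is one way to do so; the function name encoded_sizes is hypothetical, and zlib stands in for whatever compression your server or archive format actually uses.

```python
import zlib

def encoded_sizes(text, encodings):
    """Map each encoding name to (raw byte count, zlib-compressed byte count)."""
    sizes = {}
    for name in encodings:
        raw = text.encode(name)  # raises UnicodeEncodeError if not representable
        sizes[name] = (len(raw), len(zlib.compress(raw, 9)))
    return sizes
```

For example, encoded_sizes(open("sample.txt", encoding="utf-8").read(), ["utf-8", "utf-16-le", "big5"]) compares the three encodings for one file, where sample.txt is a placeholder for a representative document. Results depend heavily on the text, so a realistic corpus is needed for a meaningful comparison.
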
Cheers
Rick Jelliffe
Taipei