RE: [xml-dev] Text/xml with omitted charset parameter

> From: Bjoern Hoehrmann [mailto:derhoermi@gmx.net]
> Sent: Thursday, October 25, 2001 6:07 PM
> To: ietf-xml-mime@imc.org
> Cc: xml-dev@lists.xml.org
> Subject: [xml-dev] Text/xml with omitted charset parameter
> Hi,
> Quoting RFC 3023, section 8.5:
> | 8.5 Text/xml with Omitted Charset
> | 
> |    Content-type: text/xml
> | 
> |    {BOM}<?xml version="1.0" encoding="utf-16"?>
> | 
> |    or
> | 
> |    {BOM}<?xml version="1.0"?>
> | 
> |    This example shows text/xml with the charset parameter omitted.  In
> |    this case, MIME and XML processors MUST assume the charset is "us-
> |    ascii",
> ... and issue a fatal error, no BOM in US-ASCII. Mentioning UTF-16 in
> this example is absurd, XML documents labeled as text/xml without
> charset parameter can never ever be UTF-16 encoded. So, who tells me I
> am wrong and text/xml documents without charset parameter may still be
> UTF-8 encoded (and use non-ASCII characters)? Apache uses text/xml as
> default type for .xml documents, are they asking for interoperability
> problems or what?

Mentioning UTF-16 in this example is not absurd at all. It describes a
scenario that could easily arise in the real world -- a UTF-16 encoded XML
document encapsulated in a MIME envelope in which the Content-Type header
does not include a charset parameter. This RFC states that in such a
scenario, compliant processors must treat the document as being US-ASCII,
which as you correctly point out would lead to a processing error. The key
point is that for the text/xml media type, the charset parameter is
authoritative. Omitting it whenever the document uses any character encoding
other than US-ASCII is an error, regardless of any BOM or encoding
declaration in the XML document.
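To make the rule concrete, here is a minimal sketch (mine, not from the RFC)
of how a compliant processor would resolve the encoding of a text/xml
entity, using Python's standard library header parser. The point is that
only the Content-Type header is consulted; the BOM and the XML encoding
declaration play no part:

```python
from email.message import Message

def charset_for_text_xml(content_type_header: str) -> str:
    """Return the authoritative charset for a text/xml entity.

    Per RFC 3023, the charset parameter on text/xml is authoritative;
    if it is absent, the processor must assume us-ascii. Any BOM or
    XML encoding declaration in the document body is ignored.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    charset = msg.get_param("charset")
    return charset if charset else "us-ascii"

print(charset_for_text_xml("text/xml; charset=utf-8"))  # utf-8
print(charset_for_text_xml("text/xml"))                 # us-ascii
```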

And yes, just serving up "text/xml" with no charset parameter is asking for
interoperability problems. But to be honest, this RFC is widely violated by
many software packages and products on the market. Many products ignore the
headers and just go by the encoding declaration in the XML (or assume UTF-8
if that is not present). So serving up XML documents using US-ASCII
character encoding and omitting the charset parameter would also be asking
for interoperability problems, even though it complies with this RFC.
Unfortunately, there is a hell of a lot of software out there that just uses
"text/xml" with no charset parameter. Apache certainly isn't the only
offender. So when you want to write software that can accept XML via HTTP or
within MIME envelopes, you are going to encounter interoperability headaches
no matter what.

The best way to ensure interoperability is to:
* Always use UTF-8
* Always include the appropriate charset parameter
* If, for some reason, you must use another character encoding, include the
appropriate charset parameter as well as a redundant encoding declaration in
the XML
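The checklist above can be sketched as a small helper (hypothetical, not
tied to any particular server or library) that builds a response whose
Content-Type charset parameter and XML encoding declaration are guaranteed
to agree:

```python
def prepare_xml_response(doc: str, encoding: str = "utf-8"):
    """Return (headers, body) for serving an XML document.

    The charset parameter and the redundant XML encoding declaration
    are generated from the same value, so they cannot disagree.
    """
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    body = (declaration + doc).encode(encoding)
    headers = {"Content-Type": f"text/xml; charset={encoding}"}
    return headers, body

headers, body = prepare_xml_response("<root>h\u00e9llo</root>")
print(headers["Content-Type"])  # text/xml; charset=utf-8
```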

Of course, building in a default media type with an appropriate charset
parameter in a web server product poses an obvious challenge: how is the
product supposed to know what character encoding is used for text documents
on the server upon which it will be installed? I suppose the product could
special case XML documents and parse them each time before serving them up
to check for an encoding declaration or auto-detect the encoding. But it's
probably better for the server administrator to ensure a consistent
character encoding is used for the documents, and that the XML media type
configured on the server includes the charset parameter.
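For completeness, the auto-detection mentioned above is well defined: XML
1.0 Appendix F describes how to guess the encoding from the first bytes of
the document. A rough sketch covering only the common BOM cases (a real
detector also inspects the raw bytes of the "<?xml" declaration):

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding from its leading bytes,
    following the BOM heuristics of XML 1.0 Appendix F."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"       # UTF-8 BOM
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # UTF-16 little-endian BOM
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"   # UTF-16 big-endian BOM
    return "utf-8"           # no BOM: fall back to the XML default

print(sniff_xml_encoding("\ufeff<?xml version='1.0'?>".encode("utf-16-le")))
# utf-16-le
```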