Re: [xml-dev] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
- From: Jonathan Robie <jonathan.robie@redhat.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Wed, 19 Sep 2007 08:56:56 -0400
Costello, Roger L. wrote:
> Typically XML and HTML documents are exchanged on the Internet using
> the HTTP protocol.
When they are, software that sends an existing XML document can use the
encoding declaration to determine how to set the charset parameter of
the MIME type. But XML documents live in many other places. They may be
stored in repositories or on hard disks, for instance, where they are
not accompanied by a MIME type.
Also, XML parsers generally don't have access to the MIME type. They do
have access to the document.
Of course, many parsers also manage to parse XML documents that don't
declare their encoding just fine, at least for the expected character
sets. The prolog is not required to have an XML declaration, and the XML
declaration is not required to have an encoding declaration:
[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
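To illustrate that both parts are optional, here is a minimal sketch
using Python's standard xml.etree.ElementTree. The sample documents are
made up for this example. The first two carry no encoding declaration
and are treated as UTF-8; the third declares ISO-8859-1 and is stored in
that encoding.

```python
import xml.etree.ElementTree as ET

# No XML declaration at all: the parser falls back to UTF-8.
no_decl = "<doc>café</doc>".encode("utf-8")

# XML declaration without an encoding declaration: still UTF-8.
version_only = '<?xml version="1.0"?><doc>café</doc>'.encode("utf-8")

# Encoding declaration present: the bytes are ISO-8859-1, not UTF-8.
latin1 = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><doc>café</doc>'
).encode("iso-8859-1")

# Parse the raw bytes; the parser works out the encoding on its own.
texts = [ET.fromstring(raw).text for raw in (no_decl, version_only, latin1)]
print(texts)
```

All three documents should yield the same text, even though the bytes on
disk differ.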
> But that raises an intriguing question: in order to read the document
> you need to know what its encoding is, but to know what the encoding is
> you must read the document!
>
Autodetection of character encodings in XML documents is discussed in
some detail here:
http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing
> These are all ASCII characters.
The XML encoding declaration is restricted to characters taken from the
ASCII repertoire specifically to make this kind of character encoding
guessing easier, as discussed in the appendix referenced above.
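A rough sketch of the approach that appendix describes, in Python: look
for a byte order mark, then look at how the bytes of "<?xm" are laid
out, and only then read the encoding declaration as ASCII. The function
names and the returned labels are my own, and this is a simplification
(it does not reconcile a BOM with a conflicting declaration, and it does
not read the declaration in the UTF-16, UTF-32 or EBCDIC cases), not a
conforming implementation.

```python
import re

# Byte order marks; the 4-byte marks must be tested before their prefixes.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Without a BOM, the byte layout of "<?xm" reveals the encoding family.
_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii-compatible"),
    (b"\x4c\x6f\xa7\x94", "ebcdic"),
]

# The declaration itself is pure ASCII, so it can be matched as bytes.
_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def guess_family(data: bytes) -> str:
    """Return a label for the encoding family suggested by the first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in _PATTERNS:
        if data.startswith(prefix):
            return name
    return "utf-8"  # no BOM, no declaration: the XML default


def sniff_encoding(data: bytes) -> str:
    """Return the declared encoding if readable, else the guessed family."""
    family = guess_family(data)
    if family != "ascii-compatible":
        return family
    match = _DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else "utf-8"
```

For example, sniff_encoding(b'<?xml version="1.0" encoding="EUC-JP"?><a/>')
returns the declared name, while a document with no declaration and no
BOM falls through to "utf-8".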
> From then on, it interprets the rest of the document
> using the encoding it found in the XML declaration.
>
Yes.
> Likewise, all HTML documents must begin with a header section:
>
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;
> charset=UTF-8" />
>
Here's a useful excerpt from the XHTML spec:
C.9. Character Encoding

Historically, the character encoding of an HTML document is either
specified by a web server via the charset parameter of the HTTP
Content-Type header, or via a meta element in the document itself. In
an XML document, the character encoding of the document is specified on
the XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In
order to portably present documents with specific character encodings,
the best approach is to ensure that the web server provides the correct
headers. If this is not possible, a document that wants to set its
character encoding explicitly must include both the XML declaration
[with] an encoding declaration and a meta http-equiv statement (e.g.,
<meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />).
In XHTML-conforming user agents, the value of the encoding declaration
of the XML declaration takes precedence.

Note: be aware that if a document must include the character encoding
declaration in a meta http-equiv statement, that document may always be
interpreted by HTTP servers and/or user agents as being of the internet
media type defined in that statement. If a document is to be served as
multiple media types, the HTTP server must be used to set the encoding
of the document.

Hope this is helpful!
Jonathan