Re: [xml-dev] Why is Encoding Metadata (e.g. encoding="UTF-8) put Inside the XML Document?

From: Jonathan Robie <jonathan.robie@redhat.com>
To: "Costello, Roger L." <costello@mitre.org>
Date: Wed, 19 Sep 2007 08:56:56 -0400

Costello, Roger L. wrote:
> Typically XML and HTML documents are exchanged on the Internet using
> the HTTP protocol.  

When they are, software that sends an existing XML document can use the 
encoding declaration to determine how to set the charset parameter of 
the MIME type. But XML documents live in many other places: they may be 
stored in repositories or on hard disks, for instance, where they are 
not accompanied by a MIME type.

Also, XML parsers generally don't have access to the MIME type. They do 
have access to the document.

Of course, parsers also handle XML documents that don't declare their 
encoding just fine, as long as the document is in UTF-8 or UTF-16, the 
two encodings every XML processor must support and the ones assumed 
when nothing is declared. The prolog is not required to have an XML 
declaration, and the XML declaration is not required to have an 
encoding declaration:

[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
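
A minimal sketch of both cases, assuming Python's standard
xml.etree.ElementTree module (the element name and sample text are made
up for illustration). The first document has no XML declaration at all
and is read as UTF-8; the second is stored in ISO-8859-1 and says so in
its encoding declaration:

```python
import xml.etree.ElementTree as ET

# No XML declaration at all: the parser falls back to UTF-8.
no_decl = "<greeting>caf\u00e9</greeting>".encode("utf-8")

# Stored in ISO-8859-1, so the encoding declaration is needed here.
with_decl = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<greeting>caf\u00e9</greeting>"
).encode("iso-8859-1")

# Both inputs are raw bytes; the parser works out the encoding itself.
text_no_decl = ET.fromstring(no_decl).text
text_with_decl = ET.fromstring(with_decl).text

print(text_no_decl == text_with_decl)
```

The parser is handed bytes in both cases, never a MIME type, which is
the situation described above.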

> But that raises an intriguing question: in order to read the document
> you need to know what its encoding is, but to know what the encoding is
> you must read the document! 
>   

Autodetection of character encodings in XML documents is discussed in 
some detail here:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing

> These are all ASCII characters.  

The XML encoding declaration is restricted to characters taken from the 
ASCII repertoire specifically to make this kind of character encoding 
guessing easier, as discussed in the appendix referenced above.
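
The two-step idea can be sketched roughly as follows. This is a
simplified illustration of the approach in that appendix, not a
conforming implementation: the function name sniff_xml_encoding, the
200-byte window, and the regular expression are assumptions made for the
example, and EBCDIC and the unusual UTF-32 byte orders are left out.
Step one guesses the encoding family from the first few bytes; step two
decodes just enough to read the ASCII-only encoding declaration.

```python
import re

# Byte-order marks, longest first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# How the leading "<" or "<?" looks in each family when there is no BOM.
_NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii"),  # UTF-8, ISO-8859-x, EUC-JP, ... all look alike here
]

# The encoding name itself is restricted to ASCII letters, digits, ".", "_", "-".
_DECL = re.compile(
    r"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    family = None
    for bom, name in _BOMS:
        if data.startswith(bom):
            data = data[len(bom):]
            family = name
            break
    if family is None:
        for prefix, name in _NO_BOM:
            if data.startswith(prefix):
                family = name
                break
    if family is None:
        return "utf-8"  # no BOM, no declaration: UTF-8 is the default

    # The family is enough to decode the declaration, which is pure ASCII.
    head = data[:200].decode(family, errors="replace")
    match = _DECL.match(head)
    if match:
        return match.group(1)
    return "utf-8" if family == "ascii" else family
```

For example, sniff_xml_encoding(b'<?xml version="1.0"
encoding="EUC-JP"?><a/>') reads the declaration as ASCII and reports
EUC-JP, after which the rest of the document would be decoded with that
encoding.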

> From then on, it interprets the rest of the document
> using the encoding it found in the XML declaration.
>   

Yes.

> Likewise, all HTML documents must begin with a header section:
>
> <html>
>     <head>
>         <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
>   

Here's a useful excerpt from the XHTML spec:

C.9. Character Encoding

Historically, the character encoding of an HTML document is either
specified by a web server via the charset parameter of the HTTP
Content-Type header, or via a meta element in the document itself. In an
XML document, the character encoding of the document is specified on the
XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In order
to portably present documents with specific character encodings, the best
approach is to ensure that the web server provides the correct headers.
If this is not possible, a document that wants to set its character
encoding explicitly must include both the XML declaration an encoding
declaration and a meta http-equiv statement (e.g.,
<meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />).
In XHTML-conforming user agents, the value of the encoding declaration of
the XML declaration takes precedence.

Note: be aware that if a document must include the character encoding
declaration in a meta http-equiv statement, that document may always be
interpreted by HTTP servers and/or user agents as being of the internet
media type defined in that statement. If a document is to be served as
multiple media types, the HTTP server must be used to set the encoding of
the document.

Hope this is helpful!

Jonathan

References:
- Why is Encoding Metadata (e.g. encoding="UTF-8) put Inside the XML Document?
  - From: "Costello, Roger L." <costello@mitre.org>
