Re: [xml-dev] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

The encoding declaration exists in XML because using anything else is a
fantasy and, as is unfortunately common with fantasies, unreliable.

The fantasy is that you can have a processing chain based on text and
existing APIs where information on the appropriate encoding is passed
out-of-band, alongside the document. However, the person or process who
creates a file is not necessarily the person or process that serves the
file. And default save and read APIs use various kinds of defaults. If
everyone had adopted the old Mac "resource fork" files, then there would
be some hope.

So XML took the only sensible route, which is to say that the standards
on character encoding, including HTTP, are in fact wrong and broken. XML
uses very good wording for this: you use the encoding header unless you
know the encoding from elsewhere. Since the MIME/HTTP content-type
charset is not reliable in practice, people can use the XML encoding
header in preference to MIME/HTTP without breaking the XML spec. That
they may go against the words of the MIME/HTTP spec is just a sign of
its brokenness: the brokenness of trying to pretend that metadata exists
which systems and APIs don't provide.

By the mid 1990s it was clear that HTTP/MIME was broken in this area.
XML created a good system that has worked well. It is a real pity that
the developers of other text systems have not had the comprehension to
pick up on it: you can have the equivalent of the Annex F algorithm
using different comment (or PI) delimiters for other text formats.

But it makes some people uneasy to be so dismissive of MIME/HTTP like
this. They don't like it that one standard provides a different way to
do things than another standard that they think should be driving
affairs. Surely there should be only one way of doing things, even if
it doesn't work, seems to be the hope.

And, indeed, HTTP/MIME does provide a better way out. That way is in the
RFC for MIME content types for XML. You never ever serve text/xml. You
serve application/xml. 
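To illustrate the difference (my sketch of the RFC 3023 rules, not part of the original argument): the two media types treat the document's own encoding declaration very differently when no charset parameter is sent.

```
Content-Type: text/xml
    (no charset parameter: the default is us-ascii; the XML
     encoding declaration is not authoritative)

Content-Type: application/xml
    (no charset parameter: the document's own BOM and encoding
     declaration apply, as in Annex F)
```

So serving application/xml lets the in-band XML mechanism do its job, while text/xml actively overrides it.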

As for HTML, browsers do a fairly complex kind of sniffing to figure out
the charset. They try to make a sensible choice given the locale of the
system, the charset in the HTML document, the charset in the MIME
header, the most recent handful of encodings used by documents, and byte
signatures in the file. (See the Mozilla charset sniffing code for an
example.) What they don't do is just use the MIME charset as the RFC
directs, because it is inadequate.

It wasn't my understanding that HTML browsers use the charset in the
same way as XML systems do, and treating it as if it were the same as
XML, in practice or in theory, gives the impression that HTML's handling
of encoding is not broken (from the reliability aspect). In XML, the
Annex F algorithm gives a clear sequence for when the encoding is not
known: look for the BOM, read the first PI (= first line now), parse
that as ASCII or EBCDIC, and use the encoding pseudo-attribute. Then
read in the file using that encoding (preferably not treating the XML
header as a PI). So you have only quite a small buffer to populate. With
HTML, there could be all sorts of data before the charset parameter, if
any, and you don't know whether it corresponds to the document anyway
(because an intermediate system can transcode text/* into other
character sets if it wants to). Certainly HTML systems have to restart
parsing the HTML file from the beginning if they find a charset
parameter: there is no guarantee that there are only ASCII characters
prior to the charset parameter.
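The Annex F sequence just described can be sketched in a few lines. This is an illustrative simplification only (it skips the EBCDIC and BOM-less UTF-16/UTF-32 byte-signature cases), not a conformant detector:

```python
# Minimal sketch of the XML Annex F idea: detect the encoding from the
# BOM if present, otherwise decode a small prefix as ASCII and pull out
# the encoding pseudo-attribute from the XML declaration.
import re

def sniff_xml_encoding(data: bytes) -> str:
    # 1. Byte-order marks give an unambiguous answer.
    boms = [
        (b"\xef\xbb\xbf", "utf-8"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, enc in boms:
        if data.startswith(bom):
            return enc
    # 2. No BOM: if the file starts with "<?xml", the declaration itself
    #    is restricted to ASCII characters, so only a small buffer need
    #    be decoded and searched for encoding="...".
    if data.startswith(b"<?xml"):
        prolog = data[:data.index(b">") + 1].decode("ascii")
        m = re.search(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
                      prolog)
        if m:
            return m.group(1).lower()
    # 3. Fall back to the spec default.
    return "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'))
# prints: iso-8859-1
```

Note how step 2 only ever touches the bytes up to the first ">", which is exactly the "small buffer" point above; an HTML sniffer has no such bound.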

The inexorable rise of UTF-8 also makes the explicit labelling of the
charset otiose (whether by MIME header, meta charset, XML header, etc.),
but UTF-8 detection too is just another example of in-band charset
signalling.

Cheers
Rick Jelliffe




On Tue, 2007-09-18 at 19:53 -0400, Costello, Roger L. wrote:
> Hi Folks,
> 
> Below I describe my understanding of:
> 1. Why the indication of how an XML document is encoded is placed
> "within" the document, and
> 2. How an XML parser is able to parse an XML document before it even
> knows its encoding.
> 
> I would appreciate any comments on where I err.  
> 
> ------------------------------------------
> 
> It is considered best practice to embed within your document an
> indication of the encoding used to create the document.
> 
> For example, in XML documents you put encoding information in the XML
> declaration:
> 
>      <?xml version="1.0" encoding="UTF-8"?>
> 
> In HTML documents you put encoding information in the header section:
> 
> <html>
>     <head>
>         <meta http-equiv="Content-Type" content="text/html;
> charset=UTF-8" /> 
> 
> Why? Shouldn't metadata be external to a document?
> 
> Typically XML and HTML documents are exchanged on the Internet using
> the HTTP protocol.  The HTTP header has a property to indicate the
> charset (encoding) of its payload (i.e. the XML document or the HTML
> document), e.g.
> 
>     Content-Type: text/xml; charset="UTF-8"
> 
> Isn't the HTTP header sufficient to specify a document's encoding?
> 
> Suppose you have a big web server with lots of sites and hundreds of
> pages, contributed by lots of people in lots of different languages.
> The web server wouldn't know the encoding of each document.  
> 
> So it is considered best practice to specify the encoding within the
> document itself.
> 
> But that raises an intriguing question: in order to read the document
> you need to know what its encoding is, but to know what the encoding is
> you must read the document! 
> 
> Stated differently, for an XML parser to know how to interpret the bit
> strings in a document it must know the encoding, but to know the
> encoding it must read the document!
> 
> We seem to have a chicken-and-egg situation.  How is this handled?
> 
> Here's how: all XML documents must begin with this XML declaration:
> 
>     <?xml version="1.0" encoding="..."?>
> 
> These are all ASCII characters.  Thus, an XML parser opens the
> document, interprets the bit strings as ASCII characters up to the
> first ">" symbol.  From then on, it interprets the rest of the document
> using the encoding it found in the XML declaration.
> 
> Likewise, all HTML documents must begin with a header section:
> 
> <html>
>     <head>
>         <meta http-equiv="Content-Type" content="text/html;
> charset=UTF-8" />
> 
> These are all ASCII characters.  Thus, an HTML parser opens the
> document, interprets the bit strings as ASCII characters up to the end
> of the header section.  From then on, it interprets the rest of the
> document using the encoding it found in the meta tag.
> 
> ---------------------
> 
> Do you agree?  /Roger
> 
> _______________________________________________________________________
> 
> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
> to support XML implementation and development. To minimize
> spam in the archives, you must subscribe before posting.
> 
> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
> subscribe: xml-dev-subscribe@lists.xml.org
> List archive: http://lists.xml.org/archives/xml-dev/
> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
> 




