RE: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

From: "Costello, Roger L." <costello@mitre.org>
To: <xml-dev@lists.xml.org>
Date: Thu, 20 Sep 2007 09:52:46 -0400


Very interesting!

Here's the description I now have:

Assuming external information did not decide the encoding ...

An XML Parser will make an initial "guess" of the encoding based upon
the presence or absence of a Byte Order Mark (BOM). The XML parser then
interprets the bit strings using that guess up to the first ">"
character (the end of the XML declaration).  Now that it knows the
"real" encoding it interprets the rest of the document using the
encoding it found in the XML declaration.

Do I have it correct?
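
The description above, written out as a rough sketch. It only covers
the BOM-based guess described in this message, not the full detection
table in the XML specification. The helper name sniff_encoding, the
400-byte prefix, and the regular expression are assumptions made for
illustration, not part of any parser.

```python
import codecs
import re

# BOMs checked longest-first so a UTF-32 BOM is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Illustrative pattern for the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Hypothetical helper: guess from the BOM, then read the declaration."""
    guess, skip = "utf-8", 0  # no BOM: start from UTF-8
    for bom, name in _BOMS:
        if data.startswith(bom):
            guess, skip = name, len(bom)
            break
    # Decode only a prefix with the guess, enough to cover the declaration.
    head = data[skip:skip + 400].decode(guess, errors="replace")
    end = head.find(">")
    match = _DECL.match(head[:end + 1]) if end != -1 else None
    # Fall back to the initial guess when no encoding is declared.
    return match.group(1).lower() if match else guess
```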

/Roger


-----Original Message-----
From: David Carlisle [mailto:davidc@nag.co.uk] 
Sent: Thursday, September 20, 2007 9:08 AM
To: Costello, Roger L.
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g.
encoding="UTF-8") put Inside the XML Document?



> These are all ASCII characters. Thus, an XML parser opens the
> document, interprets the bit strings as ASCII characters up to the
> first ">" 

No. As was said earlier, the first few bytes of the file do not need to
be read as ASCII (and must not be for several popular encodings, such as
UTF-16).

It's true that the characters that appear in an encoding declaration
are characters that do have an ASCII encoding, but there is no
requirement that the byte sequence that represents the encoding
declaration uses the ASCII encoding.
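
A small illustration of that point, assuming a declaration that names
UTF-16 (the sample string is made up for this sketch): the same
characters give different byte sequences in ASCII and in big-endian
UTF-16, and reading the UTF-16 bytes as ASCII does not give back the
declaration.

```python
decl = '<?xml version="1.0" encoding="UTF-16"?>'

ascii_bytes = decl.encode("ascii")
utf16_bytes = decl.encode("utf-16-be")  # big-endian, no BOM added

print(ascii_bytes[:5])   # b'<?xml'
print(utf16_bytes[:10])  # b'\x00<\x00?\x00x\x00m\x00l'

# Every byte here is below 0x80, so decoding as ASCII "succeeds",
# but the result has NUL characters interleaved with the text.
misread = utf16_bytes.decode("ascii")
print(misread == decl)                       # False
print(utf16_bytes.decode("utf-16-be") == decl)  # True
```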

  These are all ASCII characters. Thus, an XML parser opens the
  document, interprets the bit strings as ASCII characters up to the
  first ">" character. From then on, it interprets the rest of the
  document using the encoding it finds in the XML declaration.

The entire document, including the encoding declaration, is read
using the same encoding.



> Algorithm for Detection of the Character Encoding when there is no
> Internal Encoding Label

That isn't the same as the algorithm given in the XML specification.
There, if there is no external metadata and no XML declaration, the file
has to be in UTF-16 or UTF-8, and the BOM is optional for UTF-8. So if
the file has no BOM, the parser does not "give up": the file is treated
as if UTF-8 were specified.
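
As a minimal sketch of that rule, covering only the case with no
external metadata and no XML declaration (the function name
default_encoding is hypothetical):

```python
import codecs


def default_encoding(data: bytes) -> str:
    """Hypothetical helper for the no-metadata, no-declaration case."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # A UTF-8 BOM is optional; with or without it the answer is UTF-8.
    return "utf-8"
```

With this sketch, default_encoding(b"<a/>") returns "utf-8" rather than
failing, which is the "does not give up" behaviour described above.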

Recommendation 3
  HTTP Header: specifying the encoding in an HTTP header is
  unreliable. When exchanging XML or HTML documents using the HTTP
  protocol, don't specify the Content-Type in the HTTP header. This
  will force applications to look inside the document for encoding
  information.

is explicitly the opposite of the RFC that defines the XML MIME types,
so while there are arguments on both sides, I think it's dangerous to
state it as such a clear recommendation. In the case of text/* MIME
types (at least), I believe that the default charset is Latin-1, so
effectively you _can't_ omit the charset: even if you don't specify it
explicitly, the receiver is supposed to act as if iso8859-1 is specified
(which will mean that if you don't specify a charset in the MIME
headers, then any UTF-8 document that has a non-ASCII character in it
will be parsed as iso8859-1 and generate a fatal encoding error).
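
To make the mismatch concrete, here is a short sketch with a made-up
sample string. It only shows what the characters look like when UTF-8
bytes are decoded as iso8859-1; how a particular parser reports the
problem is not something the snippet demonstrates.

```python
text = "<p>caf\u00e9</p>"          # contains one non-ASCII character
wire = text.encode("utf-8")        # bytes as sent by the server

# Receiver assumes the default charset instead of UTF-8.
as_latin1 = wire.decode("iso-8859-1")

print(as_latin1)          # <p>cafÃ©</p>
print(as_latin1 == text)  # False
```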

David



