Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

From: Jim Ancona <jim@anconafamily.com>
To: "Costello, Roger L." <costello@mitre.org>
Date: Thu, 20 Sep 2007 09:43:29 -0400

Hi Roger,

Thanks for distilling this kind of information.

Costello, Roger L. wrote:
> I have incorporated your comments.  Please let me know if I am missing
> anything, or have incorrectly interpreted your comments:
> 
> http://www.xfront.com/specifying-encoding/
> 
> I am particularly interested in hearing if you agree with the
> recommendations that I list.

When discussing encoding detection, you write:

     If the external information is unreliable or unavailable then a
     parser examines the first 4 bytes of the document. XML and HTML
     documents optionally have a Byte Order Mark (BOM) in the first 4
     bytes. The BOM may indicate the encoding. So if the document has a
     BOM then the parser may be able to determine the document's
     encoding.

This is not technically correct, because a BOM is not required for the 
auto-detection algorithm to work. See [1], which describes cases both 
with and without a BOM, for encodings including UCS-4 and UTF-16 
(big-endian and little-endian), EBCDIC, as well as UTF-8, ISO 646, 
ASCII, and other encodings that have the ASCII characters in their 
normal positions.
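
To make the point concrete, here is a rough sketch of the kind of check Appendix F describes, covering both the BOM and the no-BOM cases. The function name guess_encoding_family and the returned labels are my own illustration, not anything defined by the spec. It only narrows things down to an encoding family; a real parser would go on to read the encoding declaration to settle the exact encoding.

```python
def guess_encoding_family(head: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML entity.

    Hypothetical helper following the byte patterns listed in
    XML 1.0 Appendix F. Labels are illustrative, not standard names.
    """
    first = head[:4]

    # Cases with a Byte Order Mark. The 4-byte UCS-4 marks are tested
    # before the 2-byte UTF-16 marks because FF FE 00 00 also starts
    # with the UTF-16 little-endian mark FF FE.
    with_bom = [
        (b"\x00\x00\xfe\xff", "UCS-4 big-endian"),
        (b"\xff\xfe\x00\x00", "UCS-4 little-endian"),
        (b"\x00\x00\xff\xfe", "UCS-4 unusual byte order (2143)"),
        (b"\xfe\xff\x00\x00", "UCS-4 unusual byte order (3412)"),
        (b"\xef\xbb\xbf", "UTF-8"),
        (b"\xfe\xff", "UTF-16 big-endian"),
        (b"\xff\xfe", "UTF-16 little-endian"),
    ]

    # Cases without a BOM: these rely on the entity starting with
    # "<?xml", so the bytes for "<", "?", "x", "m" reveal the family.
    without_bom = [
        (b"\x00\x00\x00\x3c", "UCS-4 big-endian"),
        (b"\x3c\x00\x00\x00", "UCS-4 little-endian"),
        (b"\x00\x00\x3c\x00", "UCS-4 unusual byte order (2143)"),
        (b"\x00\x3c\x00\x00", "UCS-4 unusual byte order (3412)"),
        (b"\x00\x3c\x00\x3f", "UTF-16 big-endian"),
        (b"\x3c\x00\x3f\x00", "UTF-16 little-endian"),
        (b"\x3c\x3f\x78\x6d", "ASCII-compatible (UTF-8, ISO 646, ASCII, ...)"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC"),
    ]

    for prefix, label in with_bom + without_bom:
        if first.startswith(prefix):
            return label

    # No BOM and no recognizable "<?xml": Appendix F treats this as
    # UTF-8 without an encoding declaration (or a malformed entity).
    return "UTF-8 (assumed, no declaration)"
```

For example, guess_encoding_family(b"\x00<\x00?") reports UTF-16 big-endian even though no BOM is present, which is exactly the case the quoted paragraph leaves out.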

See also David Carlisle's comments, which cover some of the same issues, 
and arrived while I was composing this message.

It would also be useful for your document to link to relevant specs, for 
example [1].

Jim Ancona

[1] XML 1.0 Recommendation, Appendix F: Autodetection of Character 
Encodings, http://www.w3.org/TR/REC-xml/#sec-guessing
