Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
- From: Jim Ancona <jim@anconafamily.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Thu, 20 Sep 2007 09:43:29 -0400
Hi Roger,
Thanks for distilling this kind of information.
Costello, Roger L. wrote:
> I have incorporated your comments. Please let me know if I am missing
> anything, or have incorrectly interpreted your comments:
>
> http://www.xfront.com/specifying-encoding/
>
> I am particularly interested in hearing if you agree with the
> recommendations that I list.
When discussing encoding detection, you write:
    If the external information is unreliable or unavailable then a
    parser examines the first 4 bytes of the document. XML and HTML
    documents optionally have a Byte Order Mark (BOM) in the first 4
    bytes. The BOM may indicate the encoding. So if the document has a
    BOM then the parser may be able to determine the document's
    encoding.
This is not technically correct, because a BOM is not required for the
auto-detection algorithm to work. See [1], which describes cases both
with and without a BOM, for encodings including UCS-4 and UTF-16
(big-endian and little-endian), EBCDIC, as well as UTF-8, ISO 646,
ASCII, and other encodings that have the ASCII characters in their
normal positions.
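To make the cases without a BOM concrete, here is a minimal Python sketch of the first-four-bytes check as I read Appendix F [1]. The function name guess_encoding_family and the label strings are made up for illustration. The check only narrows the document down to an encoding family; a real parser would then still read the encoding declaration to pick the exact encoding.

```python
# Hypothetical helper: the names and the returned labels are illustrative only.
_SIGNATURES = [
    # With a BOM. The 4-byte UCS-4 marks come first because FF FE 00 00
    # would otherwise match the 2-byte UTF-16 little-endian mark.
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian (BOM)"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian (BOM)"),
    (b"\x00\x00\xff\xfe", "UCS-4 unusual byte order (BOM)"),
    (b"\xfe\xff\x00\x00", "UCS-4 unusual byte order (BOM)"),
    (b"\xfe\xff", "UTF-16 big-endian (BOM)"),
    (b"\xff\xfe", "UTF-16 little-endian (BOM)"),
    (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
    # Without a BOM: the bytes of "<" or "<?xm" as they appear in each family.
    (b"\x00\x00\x00\x3c", "UCS-4 big-endian (no BOM)"),
    (b"\x3c\x00\x00\x00", "UCS-4 little-endian (no BOM)"),
    (b"\x00\x00\x3c\x00", "UCS-4 unusual byte order (no BOM)"),
    (b"\x00\x3c\x00\x00", "UCS-4 unusual byte order (no BOM)"),
    (b"\x00\x3c\x00\x3f", "UTF-16 big-endian (no BOM)"),
    (b"\x3c\x00\x3f\x00", "UTF-16 little-endian (no BOM)"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible (no BOM)"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC (no BOM)"),
]


def guess_encoding_family(data: bytes) -> str:
    """Return a label for the encoding family suggested by the leading bytes."""
    for signature, label in _SIGNATURES:
        if data.startswith(signature):
            return label
    # Appendix F treats this case as UTF-8 without an encoding declaration,
    # or else as mislabeled, corrupt, or fragmentary data.
    return "no match (UTF-8 without declaration, or bad data)"
```

For example, a document that begins with the bytes 3C 00 3F 00 is reported as UTF-16 little-endian even though it carries no BOM.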
See also David Carlisle's comments, which cover some of the same issues,
and arrived while I was composing this message.
It would also be useful for your document to link to relevant specs, for
example [1].
Jim Ancona
[1] XML 1.0 Recommendation, Appendix F: Autodetection of Character
Encodings, http://www.w3.org/TR/REC-xml/#sec-guessing