Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

> These are all ASCII characters. Thus, an XML parser opens the
> document, interprets the bit strings as ASCII characters up to the
> first ">" 

No, as was said earlier, the first few bytes of the file do not need to be
read as ASCII (and must not be for several popular encodings, such as
UTF-16).

It's true that the characters that appear in an encoding declaration
are characters that do have an ASCII encoding, but there is no
requirement that the byte sequence that represents the encoding
declaration uses the ASCII encoding.
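
To make that concrete, here is a minimal sketch (not taken from any
parser) that serializes the same declaration text in three encodings and
prints the first four bytes of each. The helper name declaration_bytes
and the label "IBM037" are illustrative assumptions; the codec names are
the ones Python's standard library provides.

```python
# Template for the declaration; only the encoding label varies.
DECL = '<?xml version="1.0" encoding="{}"?>'


def declaration_bytes(codec: str, label: str) -> bytes:
    """Return the declaration naming `label`, serialized with `codec`.

    Hypothetical helper for illustration only.
    """
    return DECL.format(label).encode(codec)


utf8 = declaration_bytes("utf-8", "UTF-8")
utf16 = declaration_bytes("utf-16-be", "UTF-16")
ebcdic = declaration_bytes("cp037", "IBM037")  # an EBCDIC code page

# The first four characters are always "<?xm", but the bytes differ.
print(utf8[:4].hex())    # 3c3f786d
print(utf16[:4].hex())   # 003c003f
print(ebcdic[:4].hex())  # 4c6fa794
```

Only the first of these byte sequences would be readable as ASCII, even
though all three represent the same characters.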

  These are all ASCII characters. Thus, an XML parser opens the document,
  interprets the bit strings as ASCII characters up to the first ">"
  character. From then on, it interprets the rest of the document using
  the encoding it finds in the XML declaration. 

The entire document, including the encoding declaration, is read
using the same encoding.

> Algorithm for Detection of the Character Encoding when there is no
> Internal Encoding Label

That isn't the same as the algorithm given in the XML specification.
There, if there is no external metadata and no XML declaration, the file has
to be in UTF-16 or UTF-8, and the BOM is optional for UTF-8. So if the file
has no BOM, the parser does not "give up": the file is treated as if
UTF-8 had been specified.
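
A rough sketch of that fallback, assuming there is no external metadata
and no XML declaration. The function name guess_encoding is hypothetical,
and the sketch is deliberately simplified: it ignores UTF-32 and the
declaration-sniffing cases that a real parser also has to handle.

```python
import codecs


def guess_encoding(head: bytes) -> str:
    """Guess the encoding of an XML entity from its leading bytes.

    Simplified illustration: only UTF-8 and UTF-16 are considered,
    and no encoding declaration is assumed to be present.
    """
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # No BOM: do not give up, treat the entity as UTF-8.
    return "utf-8"


print(guess_encoding(b"<doc/>"))                # utf-8
print(guess_encoding(b"\xff\xfe<\x00d\x00"))    # utf-16-le
```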

Recommendation 3
  HTTP Header: specifying the encoding in an HTTP header is
  unreliable. When exchanging XML or HTML documents using the HTTP
  protocol, don't specify the Content-Type in the HTTP header. This will
  force applications to look inside the document for encoding

is explicitly the opposite of the RFC that defines the XML MIME
types, so while there are arguments on both sides, I think it's dangerous
to state it as such a clear recommendation. In the case of text/* MIME
types (at least), I believe that the default charset is Latin-1, so
effectively you _can't_ omit the charset: even if you don't specify it
explicitly, the receiver is supposed to act as if ISO-8859-1 is specified
(which will mean that if you don't specify a charset in the MIME headers,
then any UTF-8 document that has a non-ASCII character in it will be
parsed as ISO-8859-1 and generate a fatal encoding error).
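
As a small illustration of the byte-level problem (the sample text is
made up, and this only shows the misreading itself, not how any
particular XML parser would report it), here is what happens when UTF-8
bytes are decoded under a Latin-1 default:

```python
# A document containing one non-ASCII character.
text = "<p>caf\u00e9</p>"            # "<p>café</p>"

# What the sender puts on the wire.
wire = text.encode("utf-8")

# What a receiver sees if it assumes the ISO-8859-1 default.
misread = wire.decode("iso-8859-1")

print(misread)                       # <p>cafÃ©</p>
print(misread == text)               # False
```

The two-byte UTF-8 sequence for the accented letter is read as two
separate Latin-1 characters, so the receiver no longer has the document
the sender wrote.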


