Re: [xml-dev] BOM and encodings questions

From: richard@inf.ed.ac.uk (Richard Tobin)
To: xml-dev@lists.xml.org
Date: Thu, 8 Mar 2007 22:01:12 +0000 (GMT)

In article <B546C312A37C12438A22154026CDC7E011ED9B16@exchfive.olympus.f5net.com> you write:

>If an XML document starts with the FF FE BOM (UTF-16, little endian) but
>the encoding is set to "UTF-8" in the prolog, what is the expected
>behavior of the Parser?

The BOM says that the document is in UTF-16.  If it isn't in UTF-16,
then it's broken at the encoding level, and this is a fatal error.

If it *is* in UTF-16, the encoding declaration is wrong.  This is a fatal
error unless there was some external indication (e.g. from HTTP) that
the document is supposed to be in UTF-16.
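
A rough Python sketch of those two cases follows. The function name
and the verdict strings are made up for illustration and are not taken
from any real parser. It only looks at UTF-16 BOMs, ignores UTF-32, and
uses a crude "starts with <" test to decide whether the bytes really
are UTF-16.

```python
import codecs
import re

# Matches the encoding name in an XML declaration at the very start of the text.
_XML_DECL = re.compile(
    r'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def check_utf16_bom(data, external_encoding=None):
    """Hypothetical checker for a byte string that may start with a UTF-16 BOM."""
    for bom, codec in ((codecs.BOM_UTF16_LE, "utf-16-le"),
                       (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            break
    else:
        return "no-bom"

    # Case 1: the BOM promises UTF-16, so the rest must decode as UTF-16.
    try:
        text = data[len(bom):].decode(codec)
    except UnicodeDecodeError:
        return "fatal-encoding"
    # Heuristic (an assumption of this sketch): genuine UTF-16 XML starts
    # with "<" after optional whitespace; anything else is treated as garbage.
    if not text.lstrip().startswith("<"):
        return "fatal-encoding"

    # Case 2: it is UTF-16, so compare against the declaration, if any.
    match = _XML_DECL.match(text)
    declared = match.group(1).lower() if match else "utf-16"
    if declared.startswith("utf-16"):
        return "ok"
    # External information (e.g. an HTTP charset) can override the declaration.
    if external_encoding and external_encoding.lower().startswith("utf-16"):
        return "ok"
    return "fatal-declaration"
```

With a document that declares encoding="UTF-8", UTF-16 bytes behind the
BOM give "fatal-declaration" (or "ok" when external_encoding="UTF-16"
is passed), while UTF-8 bytes behind the BOM give "fatal-encoding".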

>I think that the parser should respect the BOM, read the prolog assuming
>it is encoded in UTF-16 little endian and then process the remainder of
>the XML document in UTF-8 as the prolog says.

No.  Each XML entity must be in a single encoding.  (The spec doesn't say
this explicitly, but it is clear that that's what's intended.)

>Is an XML parser expected to process a document in alternating
>encodings? I mean, is there a way to signal the parser that from a
>certain point on the encoding changes to some other encoding? If so,
>how?

An XML document can be made up of multiple entities which may have
different encodings.  There's no way to mix encodings in a single
entity.
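
To illustrate the per-entity point, here is a minimal Python sketch in
which one codec is chosen for each entity and applied to the whole of
it. The helper name decode_entity is hypothetical, and the detection is
deliberately simplified: a UTF-16 BOM, else an ASCII-readable encoding
declaration, else UTF-8.

```python
import codecs
import re

# Encoding name in an XML or text declaration, read from ASCII-compatible bytes.
_ENTITY_DECL = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def decode_entity(data):
    """Decode one entity with a single encoding chosen from its first bytes."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16")
    match = _ENTITY_DECL.match(data)
    codec = match.group(1).decode("ascii") if match else "utf-8"
    return data.decode(codec)


# Document entity in UTF-16, external entity in ISO-8859-1.
main = codecs.BOM_UTF16_LE + (
    '<?xml version="1.0" encoding="UTF-16"?><doc>&ext;</doc>'.encode("utf-16-le")
)
ext = '<?xml encoding="ISO-8859-1"?><p>café</p>'.encode("iso-8859-1")

main_text = decode_entity(main)
ext_text = decode_entity(ext)
```

Each call settles on one encoding before any content is read, and
nothing inside an entity can switch it part-way through.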

>Is there a way to express the expected encoding of the XML document in
>the XML Schema? If so, how?

No.  The schema is applied after the document has been parsed, by which
point the encoding has already been dealt with.

-- Richard
-- 
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

References:
- BOM and encodings questions
  - From: "Shlomo Yona" <S.Yona@F5.com>
