RE: [xml-dev] BOM and encodings questions
- From: "Shlomo Yona" <S.Yona@F5.com>
- To: "Philippe Poulard" <Philippe.Poulard@sophia.inria.fr>
- Date: Thu, 8 Mar 2007 09:41:19 -0800
Hello,
Why would a BOM contradict a UTF-8 encoding in the same XML document? Appendix E.1 of the XML 1.1 standard explains how to "guess" the encoding using the BOM.
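To make the "guessing" concrete, here is a minimal Python sketch of the BOM part of that detection. The function name `sniff_bom` and the returned labels are my own choices for illustration, and the sketch ignores the no-BOM cases, which the appendix handles by looking at the bytes of "<?xml". Note that UTF-8 has its own BOM (EF BB BF).

```python
import codecs

# Longer BOMs come first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

For the document in question 1, `sniff_bom` would report UTF-16LE regardless of what the encoding declaration says.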
I also didn't find any case other than external entities, but I can understand how someone would create an XML document in encoding X where the data of some element <foo> is in encoding Y, because it is an excerpt from a text file in some other encoding. It is fairly easy to implement a parser that can handle alternating encodings and support such cases, but I couldn't find this mentioned anywhere in the standard(s). I see a lot of XML documents that contain alternating encodings. Are they not well formed? If so, then well-formedness is probably very much misunderstood when it comes to character encodings... in my opinion.
Shlomo.
-----Original Message-----
From: Philippe Poulard [mailto:Philippe.Poulard@sophia.inria.fr]
Sent: Thursday, 08 March 2007 19:22
To: Shlomo Yona
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] BOM and encodings questions
Shlomo Yona wrote:
> .1.
>
> If an XML document starts with the FF FE BOM (UTF-16, little endian) but
> the encoding is set to “UTF-8” in the prolog, what is the expected
> behavior of the Parser?
>
> I think that the parser should respect the BOM, read the prolog assuming
> it is encoded in UTF-16 little endian and then process the remaining of
> the XML document in UTF-8 as the prolog says.
>
> Is this correct?
I'm not sure, but a UTF-16 BOM can't be combined with a UTF-8 declaration, so the parser should
fail to decode the prolog, as the characters are expected to be UTF-16
encoded: a UTF-8 "<?xml " (6 bytes) would be interpreted as 3 UTF-16 code units
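
A small Python sketch of that failure mode, assuming the bytes after the FF FE BOM really are UTF-8 (the helper name `decode_after_utf16le_bom` is made up for this illustration):

```python
import codecs


def decode_after_utf16le_bom(data: bytes) -> str:
    """Decode the bytes following an FF FE BOM the way a BOM-trusting parser would."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("expected a UTF-16LE BOM (FF FE)")
    return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")


# FF FE followed by the UTF-8 bytes of '<?xml ' (3C 3F 78 6D 6C 20)
mixed = codecs.BOM_UTF16_LE + "<?xml ".encode("utf-8")
decoded = decode_after_utf16le_bom(mixed)
print(len(decoded), [hex(ord(c)) for c in decoded])
# prints: 3 ['0x3f3c', '0x6d78', '0x206c']
```

None of those three characters is "<", so the parser never sees the start of an XML declaration.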
>
> .2.
>
> Is an XML parser expected to process a document in alternating
> encodings? I mean, is there a way to signal the parser that from a
> certain point on the encoding changes to some other encoding? If so, how?
The only case I know of is external entities: each can have its own
encoding, which may differ from the document's
>
> .3.
>
> Is there a way to express the expected encoding of the XML document in
> the XML Schema? If so, how?
Too late: XML Schema works at the logical level, after the parser has already decoded the document.
I don't know why you want to force an incoming document to be encoded
in a given encoding; let the parser do its job and fail normally if the
encoding is not supported
However, a SAX parser can supply information about the encoding of a
document, so you can write a filter like this:
if encoding != THE_ENCODING
then fail_for_an_obscure_reason()
endif
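
One possible concrete version of that filter, sketched in Python. It uses the standard library's expat binding rather than a SAX filter, since expat's `XmlDeclHandler` reports the declared encoding directly. The names `declared_encoding`, `require_encoding` and `UnexpectedEncoding` are hypothetical, and treating a missing declaration as a failure is an assumption of this sketch, not something XML requires.

```python
import xml.parsers.expat


class UnexpectedEncoding(ValueError):
    """Raised when the document does not declare the required encoding."""


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls this handler with (version, encoding, standalone)
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: found.append(encoding)
    )
    parser.Parse(data, True)
    return found[0] if found else None


def require_encoding(data: bytes, expected: str = "UTF-8") -> None:
    """Fail unless the document explicitly declares the expected encoding."""
    actual = declared_encoding(data)
    # Assumption: a missing declaration counts as a mismatch here.
    if actual is None or actual.lower() != expected.lower():
        raise UnexpectedEncoding(
            f"expected {expected}, document declares {actual}"
        )
```

Calling `require_encoding(doc_bytes)` returns silently for a document declaring UTF-8 and raises `UnexpectedEncoding` otherwise.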
--
Cordialement,
///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !