XML.org: OASIS Mailing List Archives

Re: [xml-dev] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

In article <B8415163A689094689542C617ECA036601E3A42B@IMCSRV5.MITRE.ORG> you write:
>Typically XML and HTML documents are exchanged on the Internet using
>the HTTP protocol.

If you mean "of XML documents exchanged on the internet, most are
exchanged using the HTTP protocol", this may well be true.  But if
you mean "most uses of XML documents involve exchange on the internet",
I am more doubtful.

Most of my XML processing is of local documents, and having the encoding
data embedded (and maintained by XML tools) is a big advantage compared
with plain text.

>Here's how: all XML documents must begin with this XML declaration:
>
>    <?xml version="1.0" encoding="..."?>

It would be more accurate to say that every XML document must either be
encoded in UTF-8 (or in UTF-16 with a byte order mark) or carry an
encoding declaration.  The encoding may also be supplied by external
means.  For example, a document served over HTTP need not have an
encoding declaration, because the HTTP Content-Type header can give the
encoding.  Of course, if the HTTP server gets it from a file on disk
we're back in the situation you describe.
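
As a rough illustration of the "external means" case, here is a minimal
Python sketch that pulls the charset out of a Content-Type header value
using the standard library's email.message module.  The function names
and the precedence rule (external information first, then the
declaration, then UTF-8) are assumptions made for the example, and the
fallback ignores byte order marks.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lower-cases the value and returns None
    # when the header has no charset parameter.
    return msg.get_content_charset()


def effective_encoding(content_type: Optional[str],
                       declared: Optional[str]) -> str:
    """Pick an encoding: external info, then declaration, then UTF-8.

    Hypothetical helper; the precedence is an assumption for this sketch.
    """
    external = charset_from_content_type(content_type) if content_type else None
    return external or declared or "utf-8"


print(charset_from_content_type('application/xml; charset="ISO-8859-1"'))
print(effective_encoding("application/xml", "ISO-8859-1"))
```

With no charset parameter in the header, the sketch falls back to
whatever the document itself declares.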

>These are all ASCII characters.  Thus, an XML parser opens the
>document, interprets the bit strings as ASCII characters up to the
>first ">" symbol.

No!  The characters are all ones present in the ASCII character set,
but the declaration must be in the same encoding as the file.  A
UTF-16 file has its XML declaration in UTF-16, not ASCII.  What you
say is only true for ASCII supersets like UTF-8 and Latin-*.
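
To make the difference concrete, this small Python sketch encodes the
same declaration three ways and prints the leading bytes.  The sample
declaration is just an example, and bytes.hex() with a separator needs
Python 3.8 or later.

```python
decl = '<?xml version="1.0"?>'

ascii_bytes = decl.encode("ascii")
utf8_bytes = decl.encode("utf-8")
utf16_bytes = decl.encode("utf-16-be")  # big-endian, no byte order mark

# For an ASCII superset the bytes are exactly the ASCII bytes.
print(utf8_bytes == ascii_bytes)   # True

# In UTF-16 every character of the declaration takes two bytes, so a
# reader that assumes ASCII sees a zero byte before each character.
print(utf8_bytes[:4].hex(" "))     # 3c 3f 78 6d
print(utf16_bytes[:8].hex(" "))    # 00 3c 00 3f 00 78 00 6d
```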

XML parsers typically examine the first few bytes to determine the
encoding sufficiently to read the declaration; if there is a
declaration the first two characters must be less-than, question-mark
so this is fairly straightforward.  This will be enough to decide
whether it's an ASCII superset, UTF-16 or -32 (and determine the byte
order), or even EBCDIC.  It's a well-formedness error if the encoding
specified by the declaration isn't compatible with the encoding of the
declaration itself.
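
The first-few-bytes check can be sketched roughly as follows.  This is
not any particular parser's code; the byte patterns follow the
non-normative autodetection appendix of the XML 1.0 Recommendation, and
the function name and the returned labels are made up for the example.

```python
def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML entity."""
    signatures = [
        # Byte order marks.  The UTF-32 ones must be tried before the
        # UTF-16 ones, because FF FE 00 00 also starts with FF FE.
        (b"\x00\x00\xfe\xff", "UTF-32 big-endian"),
        (b"\xff\xfe\x00\x00", "UTF-32 little-endian"),
        (b"\xfe\xff", "UTF-16 big-endian"),
        (b"\xff\xfe", "UTF-16 little-endian"),
        (b"\xef\xbb\xbf", "UTF-8"),
        # No byte order mark: look for "<?" (or "<") in each family.
        (b"\x00\x00\x00<", "UTF-32 big-endian"),
        (b"<\x00\x00\x00", "UTF-32 little-endian"),
        (b"\x00<\x00?", "UTF-16 big-endian"),
        (b"<\x00?\x00", "UTF-16 little-endian"),
        (b"<?xm", "ASCII superset"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC"),  # "<?xm" in EBCDIC
    ]
    for prefix, family in signatures:
        if head.startswith(prefix):
            return family
    # Neither a byte order mark nor a declaration: UTF-8 is the default.
    return "UTF-8"


sample = '<?xml version="1.0" encoding="UTF-16"?><doc/>'
print(sniff_xml_encoding(sample.encode("utf-16-le")))
print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?>'))
```

Once the family is known, the parser can decode far enough to read the
declaration and then check that the named encoding belongs to that
family.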

-- Richard
-- 
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.



Copyright 1993-2007 XML.org. This site is hosted by OASIS