XML.org: OASIS Mailing List Archives
Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

Hi Folks,

Below I describe my understanding of:
1. Why the indication of how an XML document is encoded is placed
"within" the document, and
2. How an XML parser is able to parse an XML document before it even
knows its encoding.

I would appreciate any comments on where I err.  

------------------------------------------

It is considered best practice to embed within your document an
indication of the encoding used to create the document.

For example, in XML documents you put encoding information in the XML
declaration:

     <?xml version="1.0" encoding="UTF-8"?>

In HTML documents you put encoding information in the header section:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Why? Shouldn't metadata be external to a document?

Typically XML and HTML documents are exchanged on the Internet using
HTTP.  The HTTP Content-Type header has a charset parameter to indicate
the encoding of its payload (i.e. the XML document or the HTML
document), e.g.

    Content-Type: text/xml; charset=UTF-8
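
To make the header-based approach concrete, here is a minimal Python
sketch of how a receiver might pull the charset parameter out of a
Content-Type value.  The function name charset_from_content_type is
hypothetical; the parsing relies on the standard library's
email.message.Message, which normalizes the result to lower case.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper: it borrows the MIME header parser from the
    standard library rather than splitting the string by hand.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() unquotes the value and lower-cases it;
    # it returns None when no charset parameter is present.
    return msg.get_content_charset()


with_charset = charset_from_content_type('text/xml; charset="UTF-8"')
without_charset = charset_from_content_type("text/xml")
```

When the server sends no charset parameter the helper returns None,
which is exactly the case where an indication inside the document
becomes useful.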

Isn't the HTTP header sufficient to specify a document's encoding?

Suppose you have a big web server with lots of sites and hundreds of
pages, contributed by lots of people in lots of different languages.
The web server wouldn't know the encoding of each document.  

So it is considered best practice to specify the encoding within the
document itself.

But that raises an intriguing question: in order to read the document
you need to know what its encoding is, but to know what the encoding is
you must read the document! 

Stated differently, for an XML parser to know how to interpret the bit
strings in a document it must know the encoding, but to know the
encoding it must read the document!
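
A small Python illustration of why the encoding matters: the same two
bytes yield different characters depending on which encoding the reader
assumes.  The variable names are only for illustration.

```python
# U+00E9 (e with acute accent) encoded as UTF-8 is two bytes: 0xC3 0xA9.
data = "\u00e9".encode("utf-8")

# Read back with the right encoding: one character.
as_utf8 = data.decode("utf-8")

# Read back with the wrong encoding: each byte becomes its own character.
as_latin1 = data.decode("iso-8859-1")

print(len(data), repr(as_utf8), repr(as_latin1))
```

Neither decoding raises an error, so a reader that guesses wrong gets
garbled text without any warning.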

We seem to have a chicken-and-egg situation.  How is this handled?

Here's how: an XML document that declares its encoding must begin with
this XML declaration:

    <?xml version="1.0" encoding="..."?>

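These are all ASCII characters.  Thus, an XML parser opens the
document and interprets the bytes as ASCII characters up to the
closing "?>".  (That works for UTF-8 and other ASCII-compatible
encodings; for UTF-16 the parser must first look at the leading bytes,
such as a byte order mark.)  From then on, it interprets the rest of
the document using the encoding it found in the XML declaration.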
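
The procedure just described can be sketched in Python as follows.
This is only a sketch under stated assumptions: the names
sniff_xml_encoding and decode_xml are hypothetical, the document is
assumed to be in an ASCII-compatible encoding with no byte order mark,
and UTF-8 is used as the fallback when no encoding is declared.

```python
import re

# Matches encoding="..." or encoding='...' inside the XML declaration.
_DECL_ENCODING = re.compile(
    rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(raw, default="UTF-8"):
    """Return the encoding named in the XML declaration, else `default`.

    Assumption: `raw` is in an ASCII-compatible encoding, so the
    declaration can be scanned as plain bytes before decoding anything.
    """
    if not raw.startswith(b"<?xml"):
        return default            # no declaration at the very start
    end = raw.find(b"?>")
    if end == -1:
        return default            # declaration never closes
    match = _DECL_ENCODING.search(raw[:end])
    return match.group(1).decode("ascii") if match else default


def decode_xml(raw):
    """Decode the whole document using the sniffed encoding."""
    return raw.decode(sniff_xml_encoding(raw))


sample = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\u00e9</a>'.encode(
    "iso-8859-1"
)
text = decode_xml(sample)
```

Only the bytes of the declaration are inspected before an encoding is
chosen; the rest of the document is never interpreted until the
encoding is known.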

Likewise, all HTML documents must begin with a header section:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

These are all ASCII characters.  Thus, an HTML parser opens the
document, interprets the bit string as ASCII characters up to the end
of the header section.  From then on, it interprets the rest of the
document using the encoding it found in the meta tag.
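
The HTML case can be sketched the same way in Python.  Again this is
only an illustration, not how any particular browser works: the name
sniff_html_encoding, the ISO-8859-1 fallback, and the 1024-byte scan
limit are all assumptions made for the example, and the document is
assumed to be in an ASCII-compatible encoding.

```python
import re

# Finds charset=... inside a <meta> tag, with or without quotes.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._-]+)""",
    re.IGNORECASE,
)


def sniff_html_encoding(raw, default="ISO-8859-1", scan_limit=1024):
    """Return the charset named in an early <meta> tag, else `default`.

    Only the first `scan_limit` bytes are examined, on the assumption
    that the header section sits at the start of the document.
    """
    match = _META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii") if match else default


page = (
    b"<html>\n    <head>\n"
    b'        <meta http-equiv="Content-Type" '
    b'content="text/html; charset=UTF-8" />\n'
)
encoding = sniff_html_encoding(page)
```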

---------------------

Do you agree?  /Roger


Copyright 1993-2007 XML.org. This site is hosted by OASIS