RE: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

From: "Rudick, Tom" <tmrudick@mitre.org>
To: <xml-dev@lists.xml.org>
Date: Thu, 20 Sep 2007 11:36:23 -0400

Hello All,
 
I have been following this thread regarding XML documents and their
character encodings.  I still don't quite understand how to tell what
the encoding of an XML document is when there is no external
information to go on.  
 
As discussed, you can either specify an encoding via HTTP headers
(externally), or in the XML document instead (internally).
 
If the HTTP headers do not indicate what the encoding of the document
is, you must read the document (at least the first line) and figure out
what the encoding is.  However, how is this accomplished?  If you don't
know the encoding of the document to begin with, how can you read even
the first line?
 
After reading http://www.w3.org/TR/REC-xml/#sec-guessing, it seems
that instead of reading what <?xml encoding="utf-8"?> has to say,
parsers simply look at the first few octets of the document and compare
them to several known encodings of the text <?xml.  Then they just
continue to read the rest of the document.  If parsers never actually
use the encoding attribute, is there any reason to have it other than
human-readability?
 
Are there any encodings that have the same encoding of <?xml but
completely different encodings for other characters?

Does anyone have any further information on how exactly XML parsers
auto-detect character encodings within XML documents?
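
A rough sketch of the kind of detection that appendix describes, written in Python. As far as the appendix goes, the first octets only narrow the choice to a family of encodings; the declaration is then read with a provisional codec from that family to pick the specific one. The names provisional_codec and declared_encoding are made up for illustration, the table covers only part of the appendix, and using latin-1 and cp037 as stand-ins for the ASCII-compatible and EBCDIC families is an assumption of this sketch, not something a particular parser is known to do.

```python
import re

# (leading octets, provisional Python codec). Four-byte patterns come first so
# that the UCS-4 BOM FF FE 00 00 is not mistaken for the UTF-16 BOM FF FE.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),  # BOM, UCS-4 big-endian
    (b"\xff\xfe\x00\x00", "utf-32-le"),  # BOM, UCS-4 little-endian
    (b"\xef\xbb\xbf", "utf-8"),          # BOM, UTF-8
    (b"\xfe\xff", "utf-16-be"),          # BOM, UTF-16 big-endian
    (b"\xff\xfe", "utf-16-le"),          # BOM, UTF-16 little-endian
    (b"\x00\x00\x00\x3c", "utf-32-be"),  # no BOM, "<" in four bytes
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # no BOM, "<?" in two bytes each
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "latin-1"),    # "<?xm" in an ASCII-compatible family
    (b"\x4c\x6f\xa7\x94", "cp037"),      # "<?xm" in an EBCDIC family
]

# Matches the encoding name inside an XML declaration at the start of the text.
DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def provisional_codec(data: bytes) -> str:
    """Guess a codec good enough to read the XML declaration."""
    for prefix, codec in SIGNATURES:
        if data.startswith(prefix):
            return codec
    return "utf-8"  # no BOM and no declaration: the XML default


def declared_encoding(data: bytes, codec: str):
    """Read the declaration with the provisional codec; None if absent."""
    head = data[:200].decode(codec, errors="replace").lstrip("\ufeff")
    match = DECL.match(head)
    return match.group(1) if match else None


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
codec = provisional_codec(doc)
print(codec, declared_encoding(doc, codec))
```

In this sketch the first four octets alone cannot tell ISO-8859-1 from UTF-8, since both encode <?xml identically; the name in the declaration is what separates them.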
 
Thanks,
-Tom

-----Original Message-----
From: David Carlisle [mailto:davidc@nag.co.uk] 
Sent: Thursday, September 20, 2007 10:03 AM
To: Costello, Roger L.
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g.
encoding="UTF-8") put Inside the XML Document?



> 
> An XML Parser will make an initial "guess" of the encoding based upon
> the presence or absence of a Byte Order Mark (BOM). The XML parser then
> interprets the bit strings using that guess up to the first ">"
> character (the end of the XML declaration).
> 

If the encoding isn't known in advance then (in theory) you don't know
where the first > is (as you don't know how > is encoded).


> Now that it knows the "real" encoding it interprets the rest of the
> document using the encoding it found in the XML declaration.

That still makes it sound as if the encoding declaration is read using a
different encoding from the rest of the document. Once an encoding has
been determined, the encoding declaration line itself must be
consistent with that encoding. You can't use one-byte-per-character ASCII
<?xml version="1.0" encoding="utf-16"?>
and then read the rest of the file using two (or four) bytes per
character.
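
To make the byte-level point concrete, here is a small Python sketch comparing the same declaration encoded as ASCII and as UTF-16 (big-endian, no BOM, chosen only for illustration). The variable names are hypothetical.

```python
decl = '<?xml version="1.0" encoding="utf-16"?>'

ascii_bytes = decl.encode("ascii")      # one byte per character
utf16_bytes = decl.encode("utf-16-be")  # two bytes per character, no BOM

# The first five characters, "<?xml", as they appear on the wire.
print(ascii_bytes[:5].hex(" "))   # 3c 3f 78 6d 6c
print(utf16_bytes[:10].hex(" "))  # 00 3c 00 3f 00 78 00 6d 00 6c

# Reading the one-byte-per-character form as UTF-16 does not give back the
# declaration, so a file cannot start in one form and continue in the other.
misread = ascii_bytes.decode("utf-16-be", errors="replace")
print(misread == decl)  # False
```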

Suppose I have an encoding "my-encoding" that's the same as ASCII
except that > and < are swapped round. Then the following is a
well-formed document:

>?xml version="1.0" encoding="my-encoding"?<
>foo<hello>/foo<


The parser knows it's been handed an XML file and can tell that it's not
going to parse as UTF-8, so there must be an XML declaration, so the first
few bytes must encode "<?xml". It looks at the bytes it sees, and the only
encoding it knows about in which that sequence encodes "<?xml" is the
"my-encoding" encoding, so it proceeds on that basis, which means it
successfully finds encoding="my-encoding" and knows all is well...
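
The reasoning above can be sketched in Python. "my-encoding" is the hypothetical encoding from this message, and the helper names (my_encode, my_decode, detect) and the two-entry candidate table are assumptions made for the example, not how any real parser is organised.

```python
# "my-encoding": ASCII with the bytes for "<" and ">" swapped.
SWAP = bytes.maketrans(b"<>", b"><")


def my_encode(text: str) -> bytes:
    return text.encode("ascii").translate(SWAP)


def my_decode(data: bytes) -> str:
    return data.translate(SWAP).decode("ascii")


# Encodings this toy parser knows about: name -> (encoder, decoder).
CANDIDATES = {
    "ascii": (lambda t: t.encode("ascii"), lambda b: b.decode("ascii")),
    "my-encoding": (my_encode, my_decode),
}


def detect(data: bytes):
    """Pick the known encoding under which the leading bytes spell '<?xml'."""
    for name, (encode, decode) in CANDIDATES.items():
        if data.startswith(encode("<?xml")):
            return name, decode(data)
    raise ValueError("no known encoding makes these bytes start with <?xml")


doc = '<?xml version="1.0" encoding="my-encoding"?>\n<foo>hello</foo>\n'
raw = my_encode(doc)

name, text = detect(raw)
print(name)                                   # my-encoding
print('encoding="my-encoding"' in text)       # True
```

The declaration is decoded with the same encoding as the rest of the file; the leading bytes merely tell the parser which decoder to try.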

David



