RE: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

From: "Rudick, Tom" <tmrudick@mitre.org>
To: <xml-dev@lists.xml.org>
Date: Thu, 20 Sep 2007 13:05:36 -0400

So we know that the first character in an XML document must be <,
which has the ASCII value 60 (hex 3C).

So a parser will keep reading bytes until it reaches the value 60.

ASCII is 00111100 
UCS-2 is 00000000 00111100

So with ASCII (or UTF-8), we encounter 60 in the first byte. After
that, characters are treated as one byte long until we read the
encoding declaration.

With UCS-2, we read up to 60, see that it took two bytes, and from then
on treat all characters as two bytes long.

Is this correct?

Thanks again,
-Tom

-----Original Message-----
From: Philippe Poulard [mailto:philippe.poulard@sophia.inria.fr] 
Sent: Thursday, September 20, 2007 12:00 PM
To: Rudick, Tom
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g.
encoding="UTF-8") put Inside the XML Document?

Rudick, Tom wrote:
> If the HTTP headers do not indicate what the encoding of the document
> is, you must read the document (at least the first line) and figure
> out what the encoding is.  However, how is this accomplished?  If you
> don't know the encoding of the document to begin with, how can you
> read even the first line?
>
> After reading this http://www.w3.org/TR/REC-xml/#sec-guessing, it
> seems that instead of reading what <?xml encoding="utf-8"?> has to
> say, parsers simply look at the first few octets of the document and
> compare it to several known encodings of the text <?xml.  Then, they
> just continue to read the rest of the document.

Not exactly: the first few octets indicate whether <?xml
encoding="blah-blah"?> is coded on 1, 2, or even 4 bytes per character
(for UCS). The character set of the sequence <?xml
encoding="blah-blah"?> is limited to 7-bit ASCII, which is fortunately
compatible with UTF-8, ISO-8859-1 and some others. It is also easy to
decode when coded on 2 or 4 bytes, because each character keeps its
7-bit ASCII value whatever the number of bytes (zero-extension). For
example:
Bits  Encoding     Hex  Dec  Char                               Binary
   7  US-ASCII     41   65   A                                 1000001
   8  ASCII 8bits  41   65   A                                01000001
  16  UCS-2        41   65   A                       00000000 01000001
  32  UCS-4        41   65   A     00000000 00000000 00000000 01000001
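
The zero-extension in the table above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `encoded_bits` is made up here, and Python's big-endian `utf-16-be` and `utf-32-be` codecs stand in for UCS-2 and UCS-4, which give the same bytes for a character such as A.

```python
# Encode one character at several code-unit widths and show the bits.
# Assumption: utf-16-be / utf-32-be stand in for big-endian UCS-2 / UCS-4.
def encoded_bits(ch, codec):
    """Return the encoded bytes of ch as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in ch.encode(codec))

for codec in ("ascii", "utf-16-be", "utf-32-be"):
    print(f"{codec:10} {encoded_bits('A', codec)}")
```

The low-order byte is the same in every case; the wider encodings only add leading zero bytes.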

So the encoding declaration, if there is one, can be read.
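
A rough sketch of that two-step process, in Python: first guess the code-unit width from the first four octets, then decode just enough of the document to read the declared encoding. The byte patterns follow the table in the section of the XML recommendation linked above, restricted to the cases without a byte order mark (the BOM and EBCDIC cases are left out). The names `sniff_family` and `declared_encoding`, the regular expression, and the 200-byte limit are assumptions made for illustration, not taken from any particular parser.

```python
import re

# Leading octets of "<?xml" (no byte order mark) and a provisional codec
# that is good enough to decode the ASCII-only XML declaration.
PATTERNS = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),  # UCS-4, big-endian
    (b"\x3c\x00\x00\x00", "utf-32-le"),  # UCS-4, little-endian
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # 2 bytes per character, big-endian
    (b"\x3c\x00\x3f\x00", "utf-16-le"),  # 2 bytes per character, little-endian
    (b"\x3c\x3f\x78\x6d", "ascii"),      # UTF-8, ISO-8859-x, other ASCII supersets
]

# Hypothetical pattern for the encoding pseudo-attribute of the declaration.
DECL = re.compile(r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")

def sniff_family(data):
    """Guess a provisional codec from the first four octets, or None."""
    for prefix, codec in PATTERNS:
        if data.startswith(prefix):
            return codec
    return None

def declared_encoding(data, limit=200):
    """Return the encoding named in the XML declaration, or None."""
    codec = sniff_family(data)
    if codec is None:
        return None
    # Only the start of the document is decoded; non-ASCII bytes further
    # on are replaced rather than raising an error.
    head = data[:limit].decode(codec, errors="replace")
    match = DECL.search(head)
    return match.group(1) if match else None
```

For example, `declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')` returns `"ISO-8859-1"`, and a real parser would then switch to that encoding for the rest of the document.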

I guess some parsers have additional heuristics for successfully
reading the sequence <?xml encoding="blah-blah"?>; maybe a try-catch
applied over the set of charsets they know?

-- 
Regards,

               ///
              (. .)
  --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
  -----------------------------
  http://reflex.gforge.inria.fr/
        Have the RefleX !
