Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
- From: "Pete Cordell" <petexmldev@tech-know-ware.com>
- To: "Rudick, Tom" <tmrudick@mitre.org>, <xml-dev@lists.xml.org>
- Date: Thu, 20 Sep 2007 19:29:47 +0100
Not quite I'm afraid :-) You can also get little endian UCS-2 and little
endian UCS-4, UTF-16 little endian and various permutations thereof. e.g.
UCS-2 LE is 00111100 00000000
UCS-4 LE is 00111100 00000000 00000000 00000000
(although I think support for UCS-4 is optional.)
In our implementation we basically take the tables contained in
http://www.w3.org/TR/REC-xml/#sec-guessing and convert them into an if-else
based decision tree, so that we can read a byte at a time and make
successive deductions about the encoding in use. This is an implementation
issue though, and grabbing the first 4 bytes is also likely to work (subject
to there being 4 bytes available!).
Note also, that the prolog (the bit that may contain the xml-decl -
http://www.w3.org/TR/REC-xml/#NT-prolog) may just consist of white space,
hence the opening character may be a whitespace character also.
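
A minimal sketch of the "grab the first 4 bytes" approach, written in Python
for illustration. It is not the Codalogic implementation. It follows the
tables in http://www.w3.org/TR/REC-xml/#sec-guessing but leaves out the
unusual UCS-4 byte orders (2143 and 3412). The function name and the returned
labels are made up for this example.

```python
def guess_encoding_family(head: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML entity."""
    # With a byte order mark. The 4-byte UCS-4 marks are tested first
    # because FF FE 00 00 also begins with the UTF-16 LE mark FF FE.
    if head.startswith(b"\x00\x00\xfe\xff"):
        return "UCS-4 BE"
    if head.startswith(b"\xff\xfe\x00\x00"):
        return "UCS-4 LE"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16 BE"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16 LE"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    # Without a byte order mark: look for '<' (0x3C), possibly
    # followed by '?' (0x3F), in its zero-extended forms.
    if head.startswith(b"\x00\x00\x00\x3c"):
        return "UCS-4 BE"
    if head.startswith(b"\x3c\x00\x00\x00"):
        return "UCS-4 LE"
    if head.startswith(b"\x00\x3c\x00\x3f"):
        return "UTF-16 BE"
    if head.startswith(b"\x3c\x00\x3f\x00"):
        return "UTF-16 LE"
    if head.startswith(b"\x3c\x3f\x78\x6d"):  # "<?xm"
        return "ASCII-compatible"
    if head.startswith(b"\x4c\x6f\xa7\x94"):  # "<?xm" in EBCDIC
        return "EBCDIC"
    # Anything else (including leading white space, or fewer than
    # 4 bytes available) falls back to the default.
    return "UTF-8 (default)"


print(guess_encoding_family(b"<?xml version='1.0'?>"))
print(guess_encoding_family("<?xml".encode("utf-16-le")))
```

The result only identifies a family of encodings. For the ASCII-compatible
and EBCDIC cases the encoding declaration still has to be read to find the
exact encoding.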
HTH,
Pete.
--
=============================================
Pete Cordell
Codalogic
for XML Schema to C++ data binding visit
http://www.codalogic.com/lmx/
=============================================
----- Original Message -----
From: "Rudick, Tom" <tmrudick@mitre.org>
To: <xml-dev@lists.xml.org>
Sent: Thursday, September 20, 2007 6:05 PM
Subject: RE: [xml-dev] [Summary] Why is Encoding Metadata (e.g.
encoding="UTF-8") put Inside the XML Document?
So we know that the first character in an xml document must be <.
Which has the ASCII value of 60.
So a parser will keep reading in bytes until it gets up to 60.
ASCII is 00111100
UCS-2 is 00000000 00111100
So with ASCII (or UTF-8), we encounter 60 which is in the first byte.
After that characters will be considered to be one-byte long until we
read in the correct encoding attribute.
With UCS-2, read up to 60, see that it took two bytes, and now all
characters are two-bytes long.
Is this correct?
Thanks again,
-Tom
-----Original Message-----
From: Philippe Poulard [mailto:philippe.poulard@sophia.inria.fr]
Sent: Thursday, September 20, 2007 12:00 PM
To: Rudick, Tom
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] [Summary] Why is Encoding Metadata (e.g.
encoding="UTF-8") put Inside the XML Document?
Rudick, Tom wrote:
> If the HTTP headers do not indicate what the encoding of the document
> is, you must read the document (at least the first line) and figure out
> what the encoding is. However, how is this accomplished? If you don't
> know the encoding of the document to begin with, how can you read even
> the first line?
>
> After reading this http://www.w3.org/TR/REC-xml/#sec-guessing, it seems
> that instead of reading what <?xml encoding="utf-8"?> has to say,
> parsers simply look at the first few octets of the document and compare
> it to several known encodings of the text <?xml. Then, they just
> continue to read the rest of the document.
Not exactly: the first few octets will indicate whether <?xml
encoding="blah-blah"?> is coded on 1, 2, or even 4 bytes per character (for
UCS). The character set of the sequence <?xml encoding="blah-blah"?> is
limited to 7-bit ASCII, which is fortunately compatible with UTF-8,
ISO-8859-1 and some others. It is also easily decodable if coded on 2 or 4
bytes, because the same sequence is mapped to the 7-bit ASCII values
whatever the number of bytes (zero-extension). For example:
Bits  Encoding     Hex  Dec  Char  Binary
7     US-ASCII     41   65   A     1000001
8     ASCII 8bits  41   65   A     01000001
16    UCS-2        41   65   A     00000000 01000001
32    UCS-4        41   65   A     00000000 00000000 00000000 01000001
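
The zero-extension shown in the table can be checked with a few lines of
Python. This is only an illustration: Python's UTF-16 and UTF-32 codecs stand
in for UCS-2 and UCS-4 (the byte patterns are the same for ASCII characters),
and the helper name bits is made up for this example.

```python
def bits(data: bytes) -> str:
    """Render each byte as 8 binary digits, separated by spaces."""
    return " ".join(f"{b:08b}" for b in data)


# Encode 'A' (0x41) and '<' (0x3C) at 1, 2 and 4 bytes per character,
# in both byte orders.
for char in ("A", "<"):
    for codec in ("ascii", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(f"{char}  {codec:10} {bits(char.encode(codec))}")
```

In every case the only non-zero byte is the ASCII value itself. The byte
order just decides which side the zero bytes land on.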
So, the encoding can be read (if any).
I guess some parsers have additional heuristics for successfully reading
the sequence <?xml encoding="blah-blah"?>; maybe some try-catch applied
over the set of charsets they know?
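
A rough sketch of that reading step in Python, assuming the number of bytes
per character has already been worked out from the first few octets. The
function name declared_encoding and the provisional_codec parameter are
hypothetical, and the regular expression is a simplification of the XML
declaration grammar, not a full parser.

```python
import re

# Simplified pattern: an XML declaration at the very start, with an
# encoding pseudo-attribute somewhere before the closing "?>".
_DECL = re.compile(
    r'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(raw: bytes, provisional_codec: str = "ascii"):
    """Return the name in the encoding declaration, or None if absent.

    provisional_codec is whatever the first few octets suggested, e.g.
    "ascii" for 1 byte per character or "utf-16-be" for 2 bytes.
    """
    # The declaration only uses ASCII characters, so bytes that do not
    # decode are replaced rather than treated as errors.
    head = raw[:200].decode(provisional_codec, errors="replace")
    match = _DECL.match(head)
    return match.group(1) if match else None


print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```

A parser would then switch to the declared encoding for the rest of the
document, or keep the default if nothing was declared.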
--
Regards,
///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !