Re: Document encodings
- From: Rick Jelliffe <firstname.lastname@example.org>
- To: email@example.com
- Date: Fri, 06 Jul 2001 19:20:48 +0800
Yes. A succession of features is examined, one after another, until a
definite result is determined.
1) EXTERNAL: Information sent in the MIME header
2) BOM: Presence or absence of a Byte Order Mark (BOM), which is a
   Unicode signal that allows you to know if you are using
   16 or 32 bit characters, and the "endianness"
3) FAMILY SIGNATURE: Presence of the expected codes for "<?xml" at the
   beginning of the file (enough to know whether 8 bit codes are
   used, and if they are ASCII-based or EBCDIC-based)
4) ENCODING: Knowing the family signature is enough to read
   the encoding parameter of the XML header.
5) DEFAULT: Otherwise UTF-8 (which also encompasses ASCII)
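
As a rough illustration of the five steps above, here is a minimal Python
sketch. It is not taken from any real parser: the names detect_encoding,
BOMS, SIGNATURES and DECL are made up for this example, the byte patterns
are the usual ones for "<?xml" in each encoding family, and cp037 is assumed
as a stand-in for the EBCDIC family.

```python
import re
from typing import Optional

# Step 2: Byte Order Marks. The 4-byte UTF-32 marks come first so that
# UTF-32 LE (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Step 3: how the start of "<?xml" looks in each family when there is no BOM.
# The codec named here is only used to read the declaration itself.
SIGNATURES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii"),               # any ASCII-based 8 bit encoding
    (b"\x4c\x6f\xa7\x94", "cp037"),   # "<?xm" in EBCDIC (cp037 is an assumption)
]

# Step 4: pull the encoding parameter out of the decoded declaration.
DECL = re.compile(r"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']")


def detect_encoding(data: bytes, external: Optional[str] = None) -> str:
    # Step 1: externally supplied information (e.g. a MIME charset) wins.
    if external:
        return external
    # Step 2: a BOM settles the question.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Steps 3 and 4: the family signature is enough to read the declaration.
    for signature, family in SIGNATURES:
        if data.startswith(signature):
            head = data[:200].decode(family, errors="replace")
            match = DECL.match(head)
            if match:
                return match.group(1)
            # No encoding parameter: fall back within the detected family.
            return "utf-8" if family == "ascii" else family
    # Step 5: the default.
    return "utf-8"
```

For example, detect_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
returns the declared name "ISO-8859-1", and a file with no BOM and no
declaration falls through to "utf-8". A real processor would go on to check
that it actually supports the name it found and report an error otherwise.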
The important thing is that this is not guesswork. There is no scope for
one parser determining one encoding and another parser determining another
encoding: all XML processors should be able to say "Yes I can handle this
entity" or "no I cannot handle this entity".
All processors are required to support UTF-8 and UTF-16 encodings.
A few character sets have some instability about them (see
http://www.w3.org/TR/japanese-xml/), but these are the exception.
----- Original Message -----
From: "Phil Ruelle" <firstname.lastname@example.org>
Sent: Friday, 6 July 2001 PM 04:16
Subject: Document encodings
> A quick question:
> How do parsers work out what encoding an XML document is in
> (i.e. how is it able to read the 'encoding' attribute of the
> I'm guessing that all the encodings XML supports have a common
> 'root' so the XML declaration can always be read using the 'base'
> character set. Is this correct or am I way off the mark?
> Many thanks,
> Phil Ruelle