xml-dev - Character Encoding and the XML PR (was Re: PR.xml)

From: David Megginson <ak117@freenet.carleton.ca>
To: xml-dev@ic.ac.uk
Date: Fri, 16 Jan 1998 11:38:21 -0500

Peter Murray-Rust writes:

 > Thanks. I am also aware of it now :-).  Can I make the assumption that:
 > 
 > 	- ISO-8859-1 and UTF-8 look identical to not-very-experienced humans.

They look identical to most English speakers, because the two
encodings agree on the ASCII range, but they encode accented
characters (> 0x7f) differently, so French and German speakers will
probably notice.
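
To make the difference concrete, here is a small Python sketch (not
part of the original exchange; the sample strings are just
illustrations).  It encodes the same text both ways and shows that
only the accented character differs, and that ISO-8859-1 bytes are not
necessarily valid UTF-8:

```python
# Encode the same text in both encodings and compare the bytes.
ascii_text = "XML"
accented_text = "café"

# ASCII-only text is byte-for-byte identical in both encodings.
same_for_ascii = ascii_text.encode("iso-8859-1") == ascii_text.encode("utf-8")

latin1_bytes = accented_text.encode("iso-8859-1")  # é is the single byte 0xE9
utf8_bytes = accented_text.encode("utf-8")         # é is the two bytes 0xC3 0xA9

print(latin1_bytes.hex(" "))  # 63 61 66 e9
print(utf8_bytes.hex(" "))    # 63 61 66 c3 a9

# Reading ISO-8859-1 bytes as if they were UTF-8 fails on the accented
# character, which is the kind of error a parser assuming UTF-8 would hit.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```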

 > 	- in principle I should be able to sort this by adding something like
 > 
 > <?xml version="1.0" encoding="ISO-8859-1"?>
 > 	to the top of the document

Correct.  The other alternative is to configure your web server to
send the encoding ISO-8859-1 in the HTTP header for this document if
the text/xml MIME type is approved, but the problem will reappear if
you download the file and then parse it on your own system.

 > 	- in practice this fails because by the time it gets to the encoding
 > declaration it has already assumed the encoding is UTF-8 and has crashed :-)

It should not fail with AElfred -- I just downloaded the PR and added
your XML declaration to the top, and AElfred reported no errors.  

In fact, the XML declaration is guaranteed to use only ASCII
characters, which are the same in UTF-8 and ISO-8859-*.  AElfred is
very careful not to read too far into the document until it
has discovered whether there is an explicit encoding declaration.
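
The idea can be sketched in a few lines of Python.  This is only an
illustration of the approach, not AElfred's actual code: the helper
name declared_encoding and the 200-byte look-ahead are assumptions of
mine, and the sketch handles only ASCII-compatible encodings such as
UTF-8 and ISO-8859-*, not UCS-2 or UCS-4.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute.  The pattern works on raw
# bytes because the declaration itself contains only ASCII characters.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the declared encoding name, or the default if none is declared.

    Hypothetical helper for illustration only.
    """
    head = data[:200]  # look only at the start; do not decode the body yet
    match = _DECL.match(head)
    if match is None:
        return default
    return match.group(1).decode("ascii")

# Usage: find the encoding first, then decode the whole document with it.
document = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<doc>caf\xe9</doc>'
text = document.decode(declared_encoding(document))
```

The point of the design is the ordering: nothing beyond the ASCII-only
declaration is interpreted until the encoding is known.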

 > I am not quite clear why we need this problem. Do different tools emit
 > different encodings? If so, what should I work with? Can I convert this
 > document? 

ISO-8859-1, which is used for most web pages, contains characters only
for Western European languages.  UTF-8 can encode any Unicode
character up to 0xffff (and higher with surrogates), so it can
handle Kanji, Han Chinese, Arabic, etc.  The PR rightly specifies that
any entity that begins with neither an encoding declaration nor a
byte-order mark (for UCS-2) should be assumed to be encoded in UTF-8.
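
As a rough Python sketch of the byte-order-mark half of that rule
(the function name guess_encoding_from_bom is made up for this
example, it deliberately ignores the encoding declaration and UCS-4,
and it returns "UTF-16" because that is the name Python's codec uses
for BOM-prefixed 16-bit data):

```python
import codecs

def guess_encoding_from_bom(data: bytes, default: str = "UTF-8") -> str:
    """Pick an encoding from a leading byte-order mark, else use the default.

    Hypothetical helper for illustration; it does not look for an
    encoding declaration and does not distinguish UCS-4.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # Python's "utf-16" codec reads the mark and picks the byte order.
        return "UTF-16"
    return default

# Usage: a 16-bit document with a byte-order mark, and a plain one without.
with_bom = "<doc/>".encode("utf-16")   # Python prepends a byte-order mark
without_bom = b"<doc/>"

decoded_with_bom = with_bom.decode(guess_encoding_from_bom(with_bom))
decoded_without_bom = without_bom.decode(guess_encoding_from_bom(without_bom))
```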

Conversion should be fairly simple -- take a look at the AElfred
source to see how the different encodings are constructed.  Just for
the record, AElfred accepts the following encodings, and to my
knowledge, supports them completely and correctly to the extent
allowed by Java's 16-bit characters and by surrogates:

- UTF-8
- ISO-10646-UCS-2 (both byte orders)
- ISO-10646-UCS-4 (four byte orders)
- UTF-16
- ISO-8859-1
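
For the ISO-8859-1 to UTF-8 case, the conversion can be sketched in
Python as follows.  This is my own illustration rather than anything
taken from the AElfred source, and latin1_to_utf8 is a hypothetical
name.  It relies on the fact that every ISO-8859-1 byte value is also
the Unicode code point of the character it represents:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO-8859-1 bytes to UTF-8 by hand (illustrative sketch)."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            # ASCII range: identical in both encodings.
            out.append(byte)
        else:
            # 0x80-0xFF become two bytes: 110xxxxx 10xxxxxx.
            out.append(0xC0 | (byte >> 6))
            out.append(0x80 | (byte & 0x3F))
    return bytes(out)

# The built-in codecs do the same job in one line.
sample = b"caf\xe9"
converted = latin1_to_utf8(sample)
converted_with_codecs = sample.decode("iso-8859-1").encode("utf-8")
```

If the document carries an encoding declaration, that declaration
would also need to be changed (or removed) after conversion, since it
would otherwise still claim ISO-8859-1.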

All the best,

David

-- 
David Megginson                 ak117@freenet.carleton.ca
Microstar Software Ltd.         dmeggins@microstar.com
      http://home.sprynet.com/sprynet/dmeggins/
