xml-dev - Character Encoding Detection


From: Chris Hubick <maillist@chris.hubick.com>
To: xml-dev@ic.ac.uk
Date: Thu, 7 May 1998 22:38:08 +0000 (GMT)


	I am new to character encodings, and am trying to implement
support for them in my XML parser.

As I understand it, UCS has two forms, UCS-2 and UCS-4, either of which
can optionally have a UCS transformation format (UTF) applied to it.  It is
my understanding that you could author an XML document in either of these
without applying a transformation.

The UTF-16 spec at:
	http://www.stonehand.com/unicode/standard/wg2n1035.html
states:
	"In UTF-16, any UCS character from the BMP shall be represented by
its UCS-2 coded representation."

Now in UCS-2:
	'<' is 00 3C
	'?' is 00 3F

So the start of a big-endian UCS-2 or UTF-16 encoded XML document (with
no Byte Order Mark) would be 00 3C 00 3F.
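
The byte sequence above can be checked with a minimal Python sketch. It assumes Python's built-in "utf-16-be" codec, which writes big-endian code units and no Byte Order Mark; the variable names are only illustrative.

```python
# Encode the first two characters of an XML declaration as big-endian
# UTF-16 without a Byte Order Mark. Both characters are in the BMP, so
# each becomes a single 16-bit code unit.
text = "<?"
encoded = text.encode("utf-16-be")

# Format the bytes as space-separated uppercase hex pairs.
hex_bytes = " ".join(f"{b:02X}" for b in encoded)
print(hex_bytes)  # prints: 00 3C 00 3F
```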

In the section on autodetection of character encodings the XML spec
states "00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus,
strictly speaking, in error)"
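
A rough sketch of how a parser might apply this kind of autodetection follows. Only the 00 3C 00 3F entry is taken from the passage quoted above; the other byte patterns are listed as I recall them from the same section of the spec and should be checked against it. The function name guess_encoding_family and the label strings are hypothetical.

```python
def guess_encoding_family(prefix: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML entity.

    This only narrows down the family; a real parser would go on to
    read the encoding declaration to pick the exact encoding.
    """
    # (signature, label) pairs, checked in the order listed.
    # Assumption: apart from 00 3C 00 3F, these patterns are recalled
    # from the spec's autodetection section, not quoted from it.
    patterns = [
        (b"\x00\x00\x00\x3C", "UCS-4, big-endian"),
        (b"\x3C\x00\x00\x00", "UCS-4, little-endian"),
        (b"\xFE\xFF", "UTF-16, big-endian (BOM present)"),
        (b"\xFF\xFE", "UTF-16, little-endian (BOM present)"),
        (b"\x00\x3C\x00\x3F", "UTF-16, big-endian, no BOM"),
        (b"\x3C\x00\x3F\x00", "UTF-16, little-endian, no BOM"),
        (b"\x3C\x3F\x78\x6D", "UTF-8 or another ASCII-compatible encoding"),
    ]
    for signature, label in patterns:
        if prefix.startswith(signature):
            return label
    # Nothing matched: fall back to treating the entity as UTF-8.
    return "unknown; assume UTF-8"


# A document starting with "<?xml" in big-endian UTF-16 and no BOM.
sample = "<?xml".encode("utf-16-be")
print(guess_encoding_family(sample))
```

Note that the sketch cannot tell untransformed UCS-2 from UTF-16 in this case, since for BMP characters the two produce the same bytes.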

	My question is, why is this an error rather than a perfectly
acceptable untransformed UCS-2 document?

<?xml version="1.0" encoding="ISO-10646-UCS-2"?>


---
Chris Hubick
mailto:chris@hubick.com
http://www.hubick.com/



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)

Follow-Ups:
- Re: Character Encoding Detection
  - From: Chris Maden <crism@ora.com>
