xml-dev - Re: Character Encoding Detection

From: Chris Maden <crism@ora.com>
To: xml-dev@ic.ac.uk
Date: Fri, 8 May 1998 14:10:05 -0400 (EDT)

[Chris Hubick]
> In the section on autodetection of character encodings the XML spec
> states "00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and
> thus, strictly speaking, in error)"
> 
> 	My question is, why is this an error rather than a perfectly
> acceptable untransformed UCS-2 document?

The XML spec states, by fiat, in section 4.3.3, that "Entities encoded in
UTF-16 must begin with the Byte Order Mark".  So the example is an
error simply because the spec says so.

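For characters in the Basic Multilingual Plane, UCS-2 is octet-for-octet
identical to UTF-16 (UTF-16 only adds surrogate pairs for characters
outside the BMP), and so it is subject (presumably) to the same rule.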
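
To make the rule concrete, here is a rough sketch of the UTF-16 part of
the autodetection table, assuming the processor has the first four
octets of the entity in hand.  The function name sniff_utf16 and its
return values are my own invention for illustration, not anything the
spec defines, and the other rows of the table (UCS-4, UTF-8, EBCDIC) are
deliberately left out.

```python
import codecs

def sniff_utf16(prefix: bytes):
    """Classify the leading octets of an entity (hypothetical helper).

    Returns (byte_order, has_bom) for the UTF-16 cases discussed in the
    thread, or None when the prefix matches none of them.
    """
    if prefix.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return ("big-endian", True)
    if prefix.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return ("little-endian", True)
    if prefix.startswith(b"\x00\x3c\x00\x3f"):    # "<?" in 16-bit units, no BOM
        return ("big-endian", False)
    if prefix.startswith(b"\x3c\x00\x3f\x00"):    # same, little-endian
        return ("little-endian", False)
    return None

# The case from the question: "<?xml" written as big-endian 16-bit units
# without a Byte Order Mark starts with 00 3C 00 3F.
order, has_bom = sniff_utf16("<?xml".encode("utf-16-be")[:4])
if not has_bom:
    print(order, "UTF-16 without a BOM: an error under 4.3.3")
```

A processor can still guess the byte order in the no-BOM case, which is
why the table lists it at all; the has_bom flag is what tells you the
entity breaks the 4.3.3 requirement.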

As a side note, I was unsure until just now whether they were
equivalent, but I finally found ISO 10646-1 clause 8:

   Plane 00 of Group 00 shall be the Basic Multilingual Plane (BMP).
   The BMP can be used as a two-octet coded character set in which
   case it shall be called UCS-2.

From:
   Linkname: ISO/IEC 10646-1 including AMD 1 thru 4
        URL: http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/N1396.doc
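
As a small illustration of that clause (a sketch only; the helper name
fits_in_ucs2 is made up for this example), text made entirely of BMP
characters takes exactly two octets per character, and those octets are
the same whether you call the result UCS-2 or UTF-16.  A character
outside the BMP cannot be written in UCS-2 at all, and UTF-16 spends two
16-bit units on it.

```python
def fits_in_ucs2(text: str) -> bool:
    """True when every character lies in the BMP (code point <= 0xFFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)

bmp_text = "<?xml"
octets = bmp_text.encode("utf-16-be")

# BMP-only text: two octets per character, so UCS-2 and UTF-16 coincide.
assert fits_in_ucs2(bmp_text)
assert len(octets) == 2 * len(bmp_text)

# U+10000 is the first code point outside the BMP; UTF-16 encodes it as
# the surrogate pair D800 DC00, which has no UCS-2 counterpart.
outside = "\U00010000"
assert not fits_in_ucs2(outside)
assert outside.encode("utf-16-be") == b"\xd8\x00\xdc\x00"
```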

-Chris
-- 
<!NOTATION SGML.Geek PUBLIC "-//Anonymous//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//O'Reilly//NONSGML Christopher R. Maden//EN"
"<URL>http://www.oreilly.com/people/staff/crism/ <TEL>+1.617.499.7487
<USMAIL>90 Sherman Street, Cambridge, MA 02140 USA" NDATA SGML.Geek>
