   RE: Encoding detection again ...

  • From: Miles Sabin <msabin@cromwellmedia.co.uk>
  • To: 'David Brownell' <db@Eng.Sun.COM>
  • Date: Wed, 3 Mar 1999 12:03:45 -0000

David Brownell wrote,
> Put it this way:  if you assume UTF-16, you're
> safe either way because UTF-16 is a superset.

Err ... is that true?

Maybe I'm being a bit obsessive about my 
interpretation of the various standards docs, but 
as far as I can see UCS-2 isn't a subset of
UTF-16. The BMP S-zone codes (D800-DFFF) are 
undefined but reserved in UCS-2, and so should 
not occur in a purportedly UCS-2 stream. I would 
expect a processor which encountered such codes
to either,

1. Spit out an error and give up, or

2. Quietly ignore them and continue processing 
   with the next 2 octets.

Obviously these codes are defined and legal
in UTF-16, so an incorrect assumption of UTF-16
when the stream was in fact broken UCS-2 would
produce unpredictably incorrect behaviour (i.e.
the processor might continue processing a broken
doc in an indeterminate way).
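
A minimal Python sketch of the strict behaviour described above (option 1,
give up on the first S-zone code). The function names and the `big_endian`
parameter are assumptions made for illustration, not part of any standard
or existing parser.

```python
def surrogate_offsets(data, big_endian=True):
    """Return the octet offsets of 2-octet units in the S-zone (D800-DFFF)."""
    if len(data) % 2 != 0:
        raise ValueError("a 2-octet stream must have an even length")
    order = "big" if big_endian else "little"
    offsets = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], order)
        if 0xD800 <= unit <= 0xDFFF:
            offsets.append(i)
    return offsets


def decode_strict_ucs2(data, big_endian=True):
    """Decode as UCS-2, rejecting any S-zone code instead of pairing it."""
    bad = surrogate_offsets(data, big_endian)
    if bad:
        raise ValueError(f"S-zone code at octet offset {bad[0]}")
    # With no S-zone codes present, UCS-2 and UTF-16 decode identically.
    return data.decode("utf-16-be" if big_endian else "utf-16-le")
```

A UTF-16 processor would instead accept a well-formed pair of these codes
as a single character, which is where the two readings of the same stream
diverge.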

In any case, on a less finickety note, I'd quite
like to be able to compute string lengths UCS-2
style where that's appropriate, because
byte-length/2 is a bit simpler than the UTF-16
equivalent ;-)
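
To make the length comparison concrete, here is a rough Python sketch.
Both function names are hypothetical, and the UTF-16 version assumes that
an unpaired S-zone code counts as one character rather than as an error.

```python
def ucs2_length(data):
    """Character count of a UCS-2 stream: every character is 2 octets."""
    return len(data) // 2


def utf16_length(data, big_endian=True):
    """Character count of a UTF-16 stream: a surrogate pair counts once."""
    order = "big" if big_endian else "little"
    count = 0
    i = 0
    while i + 2 <= len(data):
        unit = int.from_bytes(data[i:i + 2], order)
        step = 2
        # A high surrogate followed by a low surrogate is one character.
        if 0xD800 <= unit <= 0xDBFF and i + 4 <= len(data):
            nxt = int.from_bytes(data[i + 2:i + 4], order)
            if 0xDC00 <= nxt <= 0xDFFF:
                step = 4
        i += step
        count += 1
    return count
```

The UCS-2 count is a single division of the octet length, whereas the
UTF-16 count has to scan the whole stream.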

Anyway, here's a slightly updated version of a 
proposal I mailed to Tim Bray yesterday ...

In the absence of an appropriate MIME header
the octet sequences,

1. FE FF 
2. FF FE
3. 00 3C 00 3F
4. 3C 00 3F 00

may be inferred to be,

1. big-endian indeterminately encoded 2 octet
   characters.

2. little-endian indeterminately encoded 2 octet
   characters.

3. BOM-less big-endian indeterminately encoded 2 
   octet characters.

4. BOM-less little-endian indeterminately encoded 
   2 octet characters.
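
The four inferences above could be sketched in Python roughly as follows.
The function name and the `(byte_order, has_bom)` return shape are
assumptions for illustration only. The sketch covers just these four
prefixes and makes no attempt to recognise any other encoding.

```python
def sniff_two_octet(prefix):
    """Return (byte_order, has_bom) for the four prefixes, else None."""
    if prefix.startswith(b"\xFE\xFF"):
        return ("big", True)        # case 1: big-endian BOM
    if prefix.startswith(b"\xFF\xFE"):
        return ("little", True)     # case 2: little-endian BOM
    if prefix.startswith(b"\x00\x3C\x00\x3F"):
        return ("big", False)       # case 3: BOM-less "<?" big-endian
    if prefix.startswith(b"\x3C\x00\x3F\x00"):
        return ("little", False)    # case 4: BOM-less "<?" little-endian
    return None
```

At this point only the octet order is known. Whether the stream is UCS-2
or UTF-16 is still undetermined.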

If either of the following PIs is found,

  <?xml version="1.0" ?>
  <?xml version="1.0" encoding="UTF-16"?>

or, in cases (1) and (2), if *no* PI is found,
then encoding is resolved to UTF-16. Otherwise, if

  <?xml version="1.0" encoding="ISO-10646-UCS-2"?>

is found, then encoding is resolved to UCS-2.
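
A rough Python sketch of this resolution step, assuming the octet order
and the presence of a BOM have already been inferred from the leading
octets. The function name, its parameters, the 200-octet lookahead and
the `None` result for anything the proposal does not cover are all
assumptions. The pattern only accepts the double-quoted forms shown above.

```python
import re

# Matches the declarations shown above; group 1 is the encoding, if any.
_DECL = re.compile(
    r'<\?xml\s+version="1\.0"(?:\s+encoding="([^"]+)")?\s*\?>'
)


def resolve_encoding(data, byte_order, has_bom):
    """Return "UTF-16", "UCS-2", or None if the proposal does not say."""
    body = data[2:] if has_bom else data
    codec = "utf-16-be" if byte_order == "big" else "utf-16-le"
    # Only a bounded prefix is needed to find the declaration.
    text = body[:200].decode(codec, errors="replace")
    match = _DECL.match(text)
    if match is None:
        # No declaration: UTF-16 only when a BOM was seen (cases 1 and 2).
        return "UTF-16" if has_bom else None
    declared = match.group(1)
    if declared is None or declared.upper() == "UTF-16":
        return "UTF-16"
    if declared.upper() == "ISO-10646-UCS-2":
        return "UCS-2"
    return None
```

For example, a BOM-less little-endian stream that opens with the
ISO-10646-UCS-2 declaration resolves to UCS-2, while a stream with a BOM
and no declaration at all resolves to UTF-16.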

This isn't very complicated, and isn't a zillion
miles away from the current handling of UTF-8 vs.
ISO 8859-x.



Miles Sabin                          Cromwell Media
Internet Systems Architect           5/6 Glenthorne Mews
+44 (0)181 410 2230                  London, W6 0LJ
msabin@cromwellmedia.co.uk           England

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)

