- From: Miles Sabin <msabin@cromwellmedia.co.uk>
- To: 'David Brownell' <db@Eng.Sun.COM>
- Date: Wed, 3 Mar 1999 12:03:45 -0000
David Brownell wrote,
> Put it this way: if you assume UTF-16, you're
> safe either way because UTF-16 is a superset.
Err ... is that true?
Maybe I'm being a bit obsessive about my
interpretation of the various standards docs, but
as far as I can see UCS-2 isn't a subset of
UTF-16. The BMP S-zone codes (D800-DFFF) are
undefined but reserved in UCS-2, and so should
not occur in a purportedly UCS-2 stream. I would
expect a processor which encountered such codes to
either,
1. Spit out an error and give up.
or,
2. Quietly ignore them and continue processing
with the next 2 octets.
Obviously these codes are defined and legal
in UTF-16, so an incorrect assumption of UTF-16
when the stream was in fact broken UCS-2 would
produce unpredictably incorrect behaviour (i.e.
the processor might continue processing a broken
doc in an indeterminate way).
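To make the S-zone check concrete, here is a minimal sketch of how a processor might spot D800-DFFF codes in a stream it believes to be UCS-2. The function names are hypothetical, and only option 1 (report an error and give up) is shown; nothing here comes from a real parser.

```python
from typing import List


def find_s_zone_units(data: bytes, big_endian: bool = True) -> List[int]:
    """Return the octet offsets of 2-octet units that fall in D800-DFFF."""
    byte_order = "big" if big_endian else "little"
    offsets = []
    # Step through the stream two octets at a time; a trailing odd octet is ignored.
    for i in range(0, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], byte_order)
        if 0xD800 <= unit <= 0xDFFF:
            offsets.append(i)
    return offsets


def check_ucs2(data: bytes, big_endian: bool = True) -> None:
    """Option 1: raise an error on the first S-zone code in a UCS-2 stream."""
    bad = find_s_zone_units(data, big_endian)
    if bad:
        raise ValueError("S-zone code at octet offset %d in a UCS-2 stream" % bad[0])
```

A processor that assumed UTF-16 instead would accept the same octets as a surrogate pair, which is exactly the difference in behaviour at issue.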
In any case, on a less finickety note, I'd quite
like to be able to compute string lengths UCS-2
style where that's appropriate, because 2*byte-
length is a bit simpler than the UTF-16
equivalent ;-)
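A rough illustration of the difference, with hypothetical function names: under UCS-2 the character count is just the octet count divided by two, whereas under UTF-16 the stream has to be scanned so that each surrogate pair counts once. The UTF-16 version below assumes well-formed input (every low surrogate follows a high surrogate).

```python
def ucs2_length(data: bytes) -> int:
    """Character count of a UCS-2 stream: one character per 2 octets."""
    if len(data) % 2 != 0:
        raise ValueError("odd number of octets")
    return len(data) // 2


def utf16_length(data: bytes, big_endian: bool = True) -> int:
    """Character count of a well-formed UTF-16 stream."""
    if len(data) % 2 != 0:
        raise ValueError("odd number of octets")
    byte_order = "big" if big_endian else "little"
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], byte_order)
        # A low surrogate (DC00-DFFF) is the second half of a pair, so it is
        # not counted; its high surrogate has already been counted.
        if not (0xDC00 <= unit <= 0xDFFF):
            count += 1
    return count
```

For a stream holding "A" followed by one surrogate pair, the first function reports 3 and the second reports 2.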
Anyway, here's a slightly updated version of a
proposal I mailed to Tim Bray yesterday ...
In the absence of an appropriate MIME header
the octet sequences,
1. FE FF
2. FF FE
3. 00 3C 00 3F
4. 3C 00 3F 00
may be inferred to be,
1. big-endian indeterminately encoded 2 octet
characters.
2. little-endian indeterminately encoded 2 octet
characters.
3. BOM-less big-endian indeterminately encoded 2
octet characters.
4. BOM-less little-endian indeterminately encoded
2 octet characters.
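The four cases above could be sketched as a simple prefix test. The function name and the returned tuple shape are assumptions made for illustration; the octet patterns are the ones listed in the proposal.

```python
from typing import Optional, Tuple


def sniff_two_octet(prefix: bytes) -> Optional[Tuple[str, bool]]:
    """Infer (byte order, has_bom) from the first octets of a stream.

    Returns None when none of the four patterns matches.
    """
    if prefix.startswith(b"\xfe\xff"):
        return ("big", True)        # case 1: BOM, big-endian
    if prefix.startswith(b"\xff\xfe"):
        return ("little", True)     # case 2: BOM, little-endian
    if prefix.startswith(b"\x00\x3c\x00\x3f"):
        return ("big", False)       # case 3: "<?" with no BOM, big-endian
    if prefix.startswith(b"\x3c\x00\x3f\x00"):
        return ("little", False)    # case 4: "<?" with no BOM, little-endian
    return None
```

At this stage only the byte order and character width are known; whether the characters are UCS-2 or UTF-16 is still undetermined.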
If either of the following PIs is found,
<?xml version="1.0" ?>
<?xml version="1.0" encoding="UTF-16"?>
or, in cases (1) and (2), if *no* PI is found,
then encoding is resolved to UTF-16. Otherwise
if,
<?xml version="1.0" encoding="ISO-10646-UCS-2"?>
is found then encoding is resolved to UCS-2.
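The resolution rules can be written down as a small decision function. This is only a sketch of the proposal as stated: the function name and parameters are hypothetical, the case-insensitive comparison of encoding names is an added assumption, and combinations the proposal does not cover return None rather than guessing.

```python
from typing import Optional


def resolve_encoding(has_bom: bool, pi_found: bool,
                     declared: Optional[str]) -> Optional[str]:
    """Resolve a 2-octet stream to "UTF-16" or "UCS-2".

    has_bom:  True for cases (1) and (2), False for cases (3) and (4).
    pi_found: True if an <?xml ...?> PI was found at the start.
    declared: value of the encoding attribute, or None if it is absent.
    """
    if pi_found:
        if declared is None:
            return "UTF-16"         # <?xml version="1.0" ?>
        name = declared.upper()     # assumption: names compared case-insensitively
        if name == "UTF-16":
            return "UTF-16"
        if name == "ISO-10646-UCS-2":
            return "UCS-2"
        return None                 # some other encoding: not covered here
    if has_bom:
        return "UTF-16"             # cases (1) and (2) with no PI at all
    return None                     # no BOM and no PI: not covered here
```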
This isn't very complicated and isn't a zillion miles away
from the current handling of UTF-8 vs. ISO 8859-x
vs. US-ASCII.
Cheers,
Miles
--
Miles Sabin Cromwell Media
Internet Systems Architect 5/6 Glenthorne Mews
+44 (0)181 410 2230 London, W6 0LJ
msabin@cromwellmedia.co.uk England
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)