xml-dev - Re: Unicode confusion

From: "Peter S. Housel" <housel@home.com>
To: <xml-dev@ic.ac.uk>
Date: Tue, 4 Jan 2000 12:02:26 -0800

> No one's disagreeing with the use of Unicode; we're talking about
> which character encoding we'll use to represent it.  You can represent
> Unicode in variable-width 8-bit or 16-bit encodings or in fixed-width
> 32-bit encodings.

My reading of the Unicode 2.x standard is that the above isn't strictly
correct.  It is correct if you change "Unicode" to "the ISO 10646 Universal
Character Set" though.

> Note that Java uses UTF-16, which isn't quite fixed-width, though no
> one really notices.

It seems to me that Java uses Unicode, which maintains the semantics that 16
bits equals one character.  Surrogates are characters in Unicode, whereas
those code points are not legal UCS characters, but only artifacts of the
UTF-16 encoding.

Unicode looks like UTF-16, but the semantics are slightly different.  So a
file using UTF-16 encoding containing a single "astral plane" character of
the UCS would be interpreted by Unicode as a file containing *two* surrogate
characters.  (I think it's a strange tack to take, but it seems fairly clear
to me that this was their position as of Unicode 2.x.  I haven't looked at
3.0 yet, so things may have changed since then.)
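
The difference between the two readings can be made concrete with a
minimal sketch. It assumes Python 3 (not something the thread itself
uses), picks U+10000 as an arbitrary "astral plane" code point, and uses
two hypothetical helper names, utf16_code_units and combine_surrogates,
that exist only for this illustration.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of the UTF-16 (big-endian) encoding of text."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def combine_surrogates(high, low):
    """Map a high/low surrogate pair back to the single UCS code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Arbitrary example: the first code point beyond the 16-bit range.
astral = chr(0x10000)
units = utf16_code_units(astral)

# UCS reading: one character.
print(len(astral))                      # 1
# "16 bits equals one character" reading: two surrogate code units.
print([hex(u) for u in units])          # ['0xd800', '0xdc00']
# Treating the pair as an encoding artifact recovers the one UCS character.
print(hex(combine_surrogates(*units)))  # 0x10000
```

Under the 16-bit reading the file holds the two values D800 and DC00;
under the UCS reading the same bytes hold the single character U+10000.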

The XML character set is the UCS, not Unicode.

Cheers,
-Peter-    housel@acm.org



