Re: Where does a parser get the replacement text for a character reference?
- From: David Brownell <firstname.lastname@example.org>
- To: xml-dev <email@example.com>
- Date: Wed, 04 Jul 2001 19:39:14 -0700
I think Lars and I are agreeing ... at which point this thread can become
a digression about "private use" characters and other ways that Unicode
wants to extend itself, or perhaps a discussion about how confusing the
word "character" can be.
> | Using the original U+E311 private-use character as an example, it
> | could be natural to have some component transcode it to the local
> | character set. That may be preferred for Klingon, or for other
> | characters that don't have code points in Unicode.
> That is true, though one would assume that this would not necessarily
> be possible. If the character could be expressed in the local
> character encoding, why was it encoded with a character reference in
> the first place?
If the text were encoded in UTF-8 for interchange purposes, then any
given local system might use different encodings ... there must be some
convention to establish agreement on what a given private-use character
means. Presumably folk who work with systems using those characters
could describe how they work. A few years back, I heard questions
about how such conventions ought to be structured.
> * Lars Marius Garshol
> | Character references always refer to Unicode characters.
> * David Brownell
> | Or surrogate pairs
> No. Surrogate pairs are an artifact of the UTF-16 character encoding
> and conceptually they do not exist outside it.
More or less; the Unicode spec defines surrogates, and what pairing them
means. But if one equates Unicode with UTF-16, to match common usage
(clearly not wearing my pedantic hat :), that point is not going to
be widely understood, because ...
> In other words
> &#x10416; does not refer to a surrogate pair; it refers to U+10416,
> DESERET CAPITAL LETTER JEE.
... that is _represented_ as a "surrogate pair" in Java and many other
programming environments: two Java "char" values are needed to
represent a single (up one level) "character".
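As an illustrative sketch (not from the original post), that pairing can be seen directly in Java, where `Character.toChars` expands a code point into its UTF-16 code units:

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        int codePoint = 0x10416; // DESERET CAPITAL LETTER JEE

        // One "character" (code point) becomes two Java char values.
        char[] units = Character.toChars(codePoint);
        System.out.println(units.length);                 // 2
        System.out.printf("U+%04X U+%04X%n",
                (int) units[0], (int) units[1]);          // U+D801 U+DC16

        // Recombining the pair recovers the original code point.
        int recovered = Character.toCodePoint(units[0], units[1]);
        System.out.printf("U+%X%n", recovered);           // U+10416
    }
}
```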
> | -- they refer to ISO-10646 characters, which can be represented in
> | Unicode as one or two 16-byte units.
("they" being expanded character refs ... there are ISO-10646 code
points that can't be represented in UTF-16, such as those needing 5 and
6 byte UTF-8 encodings ...)
> They can be represented in UTF-16 as one or two 16-byte units, but
> UTF-16 and Unicode are not the same. Unicode is the character set,
> UTF-16 is one of its (too) many encodings.
But a "char"acter in Java (or wchar_t on Win32) is a 16-bit (not byte :)
unit, hence the semantic confusion when you talk about a "character".
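A small sketch of exactly that confusion (my example, not from the post): Java's `String.length()` counts 16-bit `char` units, while `codePointCount` counts the characters one actually means:

```java
public class CharVsCharacter {
    public static void main(String[] args) {
        // "\uD801\uDC16" is the surrogate pair for U+10416.
        String jee = "\uD801\uDC16";

        // length() counts 16-bit char units, not characters ...
        System.out.println(jee.length());                        // 2

        // ... while codePointCount() counts actual code points.
        System.out.println(jee.codePointCount(0, jee.length())); // 1

        // charAt(0) hands back an unpaired high surrogate,
        // which is not a character at all on its own.
        System.out.println(Character.isHighSurrogate(jee.charAt(0))); // true
    }
}
```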
And it doesn't stop there ... :)