Re: Where does a parser get the replacement text for a character reference?

I think Lars and I are agreeing ... at which point this thread can become
a digression about "private use" characters and other ways that Unicode
wants to extend itself, or perhaps a discussion about how confusing the
word "character" can be.

> | Using the original U+E311 private-use character as an example, it
> | could be natural to have some component transcode it to the local
> | character set.  That may be preferred for Klingon, or for other
> | characters that don't have code points in Unicode.
> That is true, though one would assume that this would not necessarily
> be possible. If the character could be expressed in the local
> character encoding, why was it encoded with a character reference in
> the first place?

If the text were encoded in UTF-8 for interchange purposes, then any
given local system might use different encodings ... there must be some
convention to establish agreement on what a given private-use character
means.  Presumably folk who work with systems using those characters
could describe how they work.  A few years back, I heard questions
about how such conventions ought to be structured.
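As a rough illustration of why such a convention is needed: software can
tell that a code point is private-use, but Unicode itself says nothing about
what it means. The sketch below assumes a hypothetical agreement table
(PRIVATE_USE_AGREEMENT and its single entry are made up for illustration;
they are not taken from any real registry).

```python
import unicodedata


def is_private_use(code_point):
    """True if Unicode classifies the code point as private use (category Co)."""
    return unicodedata.category(chr(code_point)) == "Co"


# Hypothetical agreement between sender and receiver. The label below is
# invented for this example; Unicode assigns no meaning to U+E311.
PRIVATE_USE_AGREEMENT = {
    0xE311: "example-glyph-name",
}


def describe(code_point):
    """Describe a code point, consulting the private agreement when needed."""
    if not is_private_use(code_point):
        return unicodedata.name(chr(code_point), "unnamed")
    return PRIVATE_USE_AGREEMENT.get(code_point, "private use, meaning not agreed")


print(describe(0x41))
print(describe(0xE311))
print(describe(0xE312))
```

Without the table, the receiving system can only report that the character
is private-use; everything else has to come from the out-of-band agreement.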

> * Lars Marius Garshol
> |
> | Character references always refer to Unicode characters.
> * David Brownell
> |
> | Or surrogate pairs
> No. Surrogate pairs are an artifact of the UTF-16 character encoding
> and conceptually they do not exist outside it.

More or less; the Unicode spec defines surrogates, and what pairing them
means.  But common usage equates Unicode with UTF-16 (and clearly I'm not
wearing my pedantic hat here :), so that point is not going to be
understood very widely, because ...

>     In other words
> &#x10416; does not refer to a surrogate pair; it refers to U+10416,

... that is _represented_ as a "surrogate pair" in Java and many other
programming environments:  two Java "char" values are needed to
represent a single (up one level) "character".
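To make the two levels concrete, here is a minimal sketch of both steps:
resolving the numeric character reference to a code point, and then working
out the 16-bit code units that UTF-16 (and hence a Java "char" array) would
need for it. The function names are my own, and the parser handles only the
simple "&#...;" and "&#x...;" forms with no error checking.

```python
def parse_char_ref(ref):
    """Return the code point named by a numeric character reference."""
    body = ref[2:-1]  # strip the leading "&#" and the trailing ";"
    if body.startswith("x"):
        return int(body[1:], 16)
    return int(body, 10)


def utf16_code_units(code_point):
    """Return the 16-bit code units that UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]  # fits in a single 16-bit unit
    offset = code_point - 0x10000  # 20 bits, split into two 10-bit halves
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


cp = parse_char_ref("&#x10416;")
units = utf16_code_units(cp)
print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
# prints: U+10416 -> D801 DC16
```

The reference names one character, U+10416; the pair D801 DC16 only shows up
once an encoding with 16-bit units is chosen.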

> | -- they refer to ISO-10646 characters, which can be represented in
> | Unicode as one or two 16-byte units.  

("they" being expanded character refs ... there are 10646 code points
that can't be represented in UTF-16, such as those using 5 and 6 byte
UTF-8 encodings ...)
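A small sketch of that boundary, assuming the original UTF-8 definition
(RFC 2279), which allowed sequences of up to six bytes for 31-bit ISO-10646
values. Surrogate pairs reach only as far as U+10FFFF, so anything needing
five or six bytes is out of range for UTF-16. The function names are
illustrative only.

```python
UTF16_MAX = 0x10FFFF  # highest code point reachable with a surrogate pair

# Upper limit of each sequence length in the original (RFC 2279) UTF-8 scheme.
RFC2279_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length_rfc2279(code_point):
    """Number of bytes the original 1-to-6 byte UTF-8 scheme would use."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for length, limit in enumerate(RFC2279_LIMITS, start=1):
        if code_point <= limit:
            return length
    raise ValueError("beyond the 31-bit ISO-10646 range")


def fits_in_utf16(code_point):
    """True if the code point can be written as one or two UTF-16 units."""
    if 0xD800 <= code_point <= 0xDFFF:
        return False  # surrogate values are not characters themselves
    return 0 <= code_point <= UTF16_MAX


for cp in (0x10416, 0x10FFFF, 0x200000, 0x4000000):
    print(f"{cp:#x}: {utf8_length_rfc2279(cp)} bytes, UTF-16 ok: {fits_in_utf16(cp)}")
```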

> They can be represented in UTF-16 as one or two 16-byte units, but
> UTF-16 and Unicode are not the same. Unicode is the character set,
> UTF-16 is one of its (too) many encodings.

But a "char"acter in Java (or wchar_t on Win32) is a 16-bit (not byte :)
unit, hence the semantic confusion when you talk about a "character".
And it doesn't stop there ... :)
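One way to see the confusion is to count the same string two ways. The
sketch below uses Python, where a string is indexed by code point, and
derives the UTF-16 unit count by encoding; the second number is what a
length measured in 16-bit "char" units (as with Java's String.length())
would report. The helper names are my own.

```python
def count_code_points(text):
    """Length in characters, in the 'up one level' sense (code points)."""
    return len(text)


def count_utf16_units(text):
    """Length in 16-bit units, as an environment with 16-bit chars sees it."""
    return len(text.encode("utf-16-be")) // 2


text = "A" + chr(0x10416)
print(count_code_points(text), "code points")
print(count_utf16_units(text), "UTF-16 code units")
```

For this two-character string the counts are 2 and 3, so any code that
treats "one 16-bit unit" as "one character" is off by one here.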

- Dave