OASIS Mailing List Archives

Re: Where does a parser get the replacement text for a character reference?

* David Brownell
| I think Lars and I are agreeing ... 

It does sound suspiciously like it, yes. No reason to be
disappointed, though. I'm sure we can find something we do disagree
on. :-)

* Lars Marius Garshol
| That is true, though one would assume that this would not necessarily
| be possible. If the character could be expressed in the local
| character encoding, why was it encoded with a character reference in
| the first place?
* David Brownell
| If the text were encoded in UTF-8 for interchange purposes, then any
| given local system might use different encodings ... 

It might, and indeed I've written code to decode UTF-8 into local
encodings several times. When doing this, however, one always runs the
risk that there will be characters in the input that cannot be
represented in the output.
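A quick sketch of that risk in Python, whose codecs stand in here for whatever hand-written converter one might use (the sample string is of course hypothetical):

```python
# Decoding UTF-8 and re-encoding into a narrower local encoding
# can lose characters the target charset cannot represent.
utf8_bytes = "caf\u00e9 \U00010416".encode("utf-8")  # U+10416 DESERET CAPITAL LETTER JEE
text = utf8_bytes.decode("utf-8")

try:
    text.encode("iso-8859-1")  # Latin-1 has no slot for U+10416
except UnicodeEncodeError as e:
    print("unrepresentable:", hex(ord(e.object[e.start])))

# The usual workaround is a lossy substitution:
lossy = text.encode("iso-8859-1", errors="replace")
print(lossy)  # b'caf\xe9 ?'
```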

| there must be some convention to establish agreement on what a given
| private-use character means.  Presumably folk who work with systems
| using those characters could describe how they work.  A few years
| back, I heard questions about how such conventions ought to be
| structured.

The Unicode standard does subdivide the private use area into
different parts for different uses, but I don't know enough about this
to say much more.
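What can at least be verified mechanically is where those private-use code points live; a small sketch using Python's `unicodedata`, whose category `Co` marks private use:

```python
import unicodedata

# Unicode reserves three private-use ranges: U+E000..U+F8FF in the BMP,
# plus the whole of planes 15 and 16 (U+F0000..U+FFFFD, U+100000..U+10FFFD).
for cp in (0xE000, 0xF8FF, 0xF0000, 0x100000):
    assert unicodedata.category(chr(cp)) == "Co"  # 'Co' = private use

# An ordinary letter, for contrast:
assert unicodedata.category("A") == "Lu"
```

What any of those code points *means* is still entirely a matter of private agreement, which is exactly the convention problem David raises.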

* Lars Marius Garshol
| No. Surrogate pairs are an artifact of the UTF-16 character encoding
| and conceptually they do not exist outside it.
* David Brownell
| More or less; the Unicode spec defines surrogates, and what pairing them
| means.  

The definition of the UTF-16 encoding does, yes. Surrogates are not
Unicode characters, however, and encoding a pair of them using UTF-8
or UTF-32 is not (AFAIR) legal, much less meaningful.

The recent UTF-8S proposal requires using surrogates instead of
encoding code points directly, but this is controversial for several
reasons, one of which is that this is simply importing the problems
with UTF-16 into UTF-8, which previously did not have them.
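Python's strict UTF-8 codec happens to illustrate the point, since it rejects surrogate code points outright (a sketch, not a statement about every implementation):

```python
# A surrogate is not a Unicode scalar value, so a strict UTF-8 encoder
# refuses it -- even when it forms a well-matched pair:
try:
    "\ud801\udc16".encode("utf-8")  # the pair for U+10416, as two code points
except UnicodeEncodeError as e:
    print("rejected:", e.reason)

# The code point itself is what gets encoded, directly, as four bytes:
assert "\U00010416".encode("utf-8") == b"\xf0\x90\x90\x96"
```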

* David Brownell
| But equating Unicode with UTF-16, to match common usage (and clearly
| not wearing my pedantic hat :) that point is not going to be
| understood very widely, because ...

* Lars Marius Garshol
| In other words &#x10416; does not refer to a surrogate pair; it
| refers to U+10416, DESERET CAPITAL LETTER JEE.
* David Brownell
| ... that is _represented_ as a "surrogate pair" in Java and many other
| programming environments:  two Java "char" values are needed to
| represent a single (up one level) "character".

I agree that most people thoroughly confuse UTF-16, UCS-2 and Unicode,
and I think that dates from the time when the Unicode people
themselves did not distinguish between the encodings and the character
set. Probably the lack of a need for such a distinction when working
with western encodings has contributed to the problem.
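The two levels are easy to show side by side in Python, where a string is a sequence of code points rather than of UTF-16 units as in Java (a sketch of the distinction, not of any particular API):

```python
jee = "\U00010416"  # U+10416 DESERET CAPITAL LETTER JEE

# At the character-set level: one character, one code point.
assert len(jee) == 1

# Encoded as UTF-16 (big-endian, no BOM): two 16-bit code units,
# the surrogate pair D801 DC16 -- what a Java char[] would hold.
assert jee.encode("utf-16-be") == b"\xd8\x01\xdc\x16"

# Encoded as UTF-32: a single 32-bit unit holding the code point itself.
assert jee.encode("utf-32-be") == b"\x00\x01\x04\x16"
```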

This is the very reason I responded to your message, though, since I
think that confusion needs to be corrected.

* Lars Marius Garshol
| They can be represented in UTF-16 as one or two 16-byte units, but
| UTF-16 and Unicode are not the same. Unicode is the character set,
| UTF-16 is one of its (too) many encodings.
* David Brownell
| But a "char"acter in Java (or wchar_t on Win32) is a 16-bit (not
| byte :) unit, hence the semantic confusion when you talk about a
| "character".

It is a source of confusion, I agree, and all the more reason to clear
it up. :-)

--Lars M.