Re: Where does a parser get the replacement text for a character reference?
- From: Lars Marius Garshol <larsga@garshol.priv.no>
- To: xml-dev <xml-dev@lists.xml.org>
- Date: Thu, 05 Jul 2001 11:00:24 +0200
* David Brownell
|
| I think Lars and I are agreeing ...
It does sound suspiciously like it, yes. No reason to be
disappointed, though. I'm sure we can find something we do disagree
on. :-)
* Lars Marius Garshol
|
| That is true, though one would assume that this would not necessarily
| be possible. If the character could be expressed in the local
| character encoding, why was it encoded with a character reference in
| the first place?
* David Brownell
|
| If the text were encoded in UTF-8 for interchange purposes, then any
| given local system might use different encodings ...
It might, and indeed I've written code to decode UTF-8 into local
encodings several times. When doing this, however, one always runs the
risk that there will be characters in the input that cannot be
represented in the output.
| there must be some convention to establish agreement on what a given
| private-use character means. Presumably folk who work with systems
| using those characters could describe how they work. A few years
| back, I heard questions about how such conventions ought to be
| structured.
The Unicode standard does subdivide the private use area into
different parts for different uses, but I don't know enough about this
to say much more.
* Lars Marius Garshol
|
| No. Surrogate pairs are an artifact of the UTF-16 character encoding
| and conceptually they do not exist outside it.
* David Brownell
|
| More or less; the Unicode spec defines surrogates, and what pairing them
| means.
The definition of the UTF-16 encoding does, yes. Surrogates are not
Unicode characters, however, and encoding a pair of them using UTF-8
or UTF-32 is not (AFAIR) legal, much less meaningful.
The recent UTF-8S proposal requires using surrogates instead of
encoding code points directly, but it is controversial for several
reasons, one of which is that it simply imports the problems of UTF-16
into UTF-8, which previously did not have them.
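To see what that means in bytes, take U+10416 (which comes up again
below). Proper UTF-8 encodes the code point directly in four bytes; a
surrogate-oriented scheme first splits it into the UTF-16 pair
D801/DC16 and then encodes each half as a separate three-byte
sequence. A small illustrative sketch in Java (the byte layouts are
the standard ones, worked out by hand here):

    // Illustrative sketch: correct UTF-8 versus the surrogate-based
    // encoding that UTF-8S would use for code points above U+FFFF.
    public class Utf8VersusSurrogates {

        // Correct UTF-8 for a code point in U+10000..U+10FFFF:
        // the code point itself is spread over four bytes.
        static byte[] utf8(int cp) {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }

        // Three-byte sequence for a single 16-bit unit; this is what a
        // surrogate-oriented encoding applies to each half of the pair.
        static byte[] threeByte(int unit) {
            return new byte[] {
                (byte) (0xE0 | (unit >> 12)),
                (byte) (0x80 | ((unit >> 6) & 0x3F)),
                (byte) (0x80 | (unit & 0x3F))
            };
        }

        static String hex(byte[] b) {
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < b.length; i++)
                sb.append(Integer.toHexString(b[i] & 0xFF)).append(' ');
            return sb.toString().trim();
        }

        public static void main(String[] args) {
            int cp   = 0x10416;               // DESERET CAPITAL LETTER JEE
            int v    = cp - 0x10000;
            int high = 0xD800 + (v >> 10);    // 0xD801
            int low  = 0xDC00 + (v & 0x3FF);  // 0xDC16

            System.out.println("UTF-8:  " + hex(utf8(cp)));
            // -> f0 90 90 96 (four bytes, the code point itself)
            System.out.println("UTF-8S: " + hex(threeByte(high)) + " "
                               + hex(threeByte(low)));
            // -> ed a0 81 ed b0 96 (six bytes, illegal as real UTF-8)
        }
    }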
* David Brownell
|
| But equating Unicode with UTF-16, to match common usage (and clearly
| not wearing my pedantic hat :) that point is not going to be
| understood very widely, because ...
* Lars Marius Garshol
|
| In other words &#x10416; does not refer to a surrogate pair; it
| refers to U+10416, DESERET CAPITAL LETTER JEE.
* David Brownell
|
| ... that is _represented_ as a "surrogate pair" in Java and many other
| programming environments: two Java "char" values are needed to
| represent a single (up one level) "character".
I agree that most people thoroughly confuse UTF-16, UCS-2 and Unicode,
and I think that dates from the time when the Unicode people
themselves did not distinguish between the encodings and the character
set. Probably the lack of a need for such a distinction when working
with western encodings has contributed to the problem.
This is the very reason I responded to your message, though, since I
think that confusion needs to be corrected.
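To tie this back to the subject of the thread: when a parser expands
&#x10416; the replacement text is the single character U+10416. Only
when that character is stored as UTF-16 code units, which is what a
Java String holds, does the surrogate pair appear. A rough sketch of
the arithmetic (illustrative only, not taken from any particular
parser):

    public class CharRefExpansion {
        public static void main(String[] args) {
            // The number between "&#x" and ";" in the character reference.
            int cp = Integer.parseInt("10416", 16);

            // The replacement text is this one character; below U+10000
            // it fits in a single UTF-16 code unit.
            if (cp < 0x10000) {
                System.out.println("one code unit: " + Integer.toHexString(cp));
            } else {
                // Above U+FFFF, UTF-16 (and hence a Java String) needs a
                // surrogate pair: two code units, still one character.
                int v   = cp - 0x10000;
                char hi = (char) (0xD800 + (v >> 10));    // 0xD801
                char lo = (char) (0xDC00 + (v & 0x3FF));  // 0xDC16
                String replacement = new String(new char[] { hi, lo });
                System.out.println("code units: " + Integer.toHexString(hi)
                                   + " " + Integer.toHexString(lo)
                                   + ", String length = " + replacement.length());
            }
        }
    }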
* Lars Marius Garshol
|
| They can be represented in UTF-16 as one or two 16-byte units, but
| UTF-16 and Unicode are not the same. Unicode is the character set,
| UTF-16 is one of its (too) many encodings.
* David Brownell
|
| But a "char"acter in Java (or wchar_t on Win32) is a 16-bit (not
| byte :) unit, hence the semantic confusion when you talk about a
| "character".
It is a source of confusion, I agree, and all the more reason to clear
it up. :-)
--Lars M.