OASIS Mailing List Archives

 



Re: Where does a parser get the replacement text for a character reference?




* Ben Ryan
|
| I assume that it would depend on what encoding the xml that you are
| parsing has.

* Lars Marius Garshol
|
| Actually, no.

* David Brownell
| 
| More like:  "sort of yes".  Java developers tend to assume Unicode is
| the universal way to represent character data, but folk working in other
| languages may not be so fortunate.  

That is true. I must admit that I've worked enough with Unicode to
have brainwashed myself into thinking that Unicode is the one true way
to represent text.

| Parser APIs aren't required to transcode into a UTF (UTF-8, UTF-16,
| UTF-32); they may deliver characters in other encodings, including
| the input encoding.

They may. The interpretation of the character reference is determined
by Unicode, however, and is completely independent of the input
encoding of the document. So in that sense my statement stands. You
are of course right that this does not necessarily mean that your
application will receive this character encoded as a Unicode character.
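
A minimal sketch of this point, assuming Python's standard
xml.etree.ElementTree parser (not a tool mentioned in this thread). The
helper name parse_text and the sample documents are made up for
illustration. The same character reference appears in documents declared
with four different encodings, and the parser should report the same code
point each time:

```python
import xml.etree.ElementTree as ET


def parse_text(document: bytes) -> str:
    """Parse an XML document given as bytes and return the root element's text."""
    return ET.fromstring(document).text


# The same reference, &#x20AC; (EURO SIGN), in documents with different
# declared encodings. ISO-8859-1 and US-ASCII cannot represent U+20AC
# directly, so a character reference is the only way to write it there.
body = '<a>&#x20AC;</a>'
documents = [
    ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode('iso-8859-1'),
    ('<?xml version="1.0" encoding="US-ASCII"?>' + body).encode('ascii'),
    ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode('utf-8'),
    ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode('utf-16'),
]

# One code point per document; the input encoding plays no part in the result.
code_points = [ord(parse_text(doc)) for doc in documents]
```

Python happens to hand the character back as a Unicode string; a parser in
another environment could deliver it in some other form, as David notes.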
 
| Using the original U+E311 private-use character as an example, it
| could be natural to have some component transcode it to the local
| character set.  That may be preferred for Klingon, or for other
| characters that don't have code points in Unicode.

That is true, though one would assume that this would not necessarily
be possible. If the character could be expressed in the local
character encoding, why was it encoded with a character reference in
the first place?

| (A while back, I think Taiwan needed to use that approach; dunno if
| that's less of an issue in 3.1 Unicode.)

One would assume so, given the addition of more than 40,000 new
Chinese characters in Unicode 3.1. :-)

This issue is not likely to ever go away completely for living
ideographic scripts, however, since new characters keep being created
all the time, although at a slow pace.
 
* Lars Marius Garshol
|
| Character references always refer to Unicode characters.
 
* David Brownell
|
| Or surrogate pairs

No. Surrogate pairs are an artifact of the UTF-16 character encoding
and conceptually they do not exist outside it. In other words
&#x10416; does not refer to a surrogate pair; it refers to U+10416,
DESERET CAPITAL LETTER JEE.
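
As a small check of this, assuming Python's xml.etree.ElementTree and
unicodedata modules (neither is part of the original discussion), a
reference to U+10416 can be parsed and the result inspected. Python
strings are sequences of code points, so the length below counts
characters, not UTF-16 code units:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Parse a document whose only content is a reference to U+10416.
text = ET.fromstring('<a>&#x10416;</a>').text

count = len(text)               # number of code points in the parsed text
code_point = ord(text)          # numeric value of that single code point
name = unicodedata.name(text)   # its name in the Unicode character database
```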

| -- they refer to ISO-10646 characters, which can be represented in
| Unicode as one or two 16-bit units.

They can be represented in UTF-16 as one or two 16-bit units, but
UTF-16 and Unicode are not the same. Unicode is the character set,
UTF-16 is one of its (too) many encodings.
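
To illustrate the distinction between the character set and its
encodings, here is a short sketch using Python's built-in codecs (an
assumption of this example, not something from the thread). One code
point, U+10416, is serialized three ways and decoded back to the same
character each time:

```python
# One abstract character: U+10416.
ch = chr(0x10416)

# Three different byte serializations of that one character.
# The "-be" variants are big-endian and emit no byte order mark.
encoded = {name: ch.encode(name) for name in ('utf-8', 'utf-16-be', 'utf-32-be')}

# Decoding each serialization recovers the same single character.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```

The byte sequences differ in length and content, but they all name the
same code point.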

| It's explicitly illegal to have references to surrogate pairs, 

I guess that by this you mean that "it's explicitly illegal to refer
to characters as a pair of character references each referring to a
surrogate".

That is so because it does not make sense to import the UTF-16 kluge
that surrogate pairs are into XML when one can refer directly to the
code point instead.
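
A hedged sketch of how a parser treats the two spellings, assuming
Python's xml.etree.ElementTree, which is built on expat. The helper name
is_well_formed is hypothetical. The direct reference should parse, while
the pair of surrogate references should be rejected as not well-formed:

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# A direct reference to the code point U+10416.
direct = is_well_formed('<a>&#x10416;</a>')

# The same character spelled as two references to its UTF-16 surrogates.
as_surrogates = is_well_formed('<a>&#xD801;&#xDC16;</a>')
```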

| but characters in the "Astral Planes" expand to two UTF-16
| characters

No, they are single characters, in UTF-16 represented by a pair of
16-bit code units.

| (or one UTF-32).

They are represented as a single 32-bit code unit in UTF-32, yes.
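
The code-unit counts can be worked out by hand. The sketch below is plain
Python with a hypothetical helper, utf16_code_units, that applies the
standard UTF-16 surrogate arithmetic to a code point. It is an
illustration only, not a replacement for a real codec:

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code units that represent a code point in UTF-16."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError('not a Unicode scalar value')
    if code_point < 0x10000:
        # Characters in the Basic Multilingual Plane need one code unit.
        return [code_point]
    # Characters beyond it are split into a high and a low surrogate,
    # each carrying 10 bits of the offset from U+10000.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


ch = chr(0x10416)
characters = len(ch)                              # code points in the string
utf16_units = utf16_code_units(0x10416)           # two 16-bit code units
utf32_units = len(ch.encode('utf-32-be')) // 4    # 32-bit code units
```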

--Lars M.