Re: Where does a parser get the replacement text for a character reference?
- From: David Brownell <david-b@pacbell.net>
- To: Lars Marius Garshol <larsga@garshol.priv.no>,xml-dev <xml-dev@lists.xml.org>
- Date: Wed, 04 Jul 2001 18:33:39 -0700
> | I assume that it would depend on what encoding the xml that you are
> | parsing has.
>
> Actually, no.
More like: "sort of yes". Java developers tend to assume Unicode is
the universal way to represent character data, but folk working in other
languages may not be so fortunate. Parser APIs aren't required to
transcode into a UTF (UTF-8, UTF-16, UTF-32); they may deliver
characters in other encodings, including the input encoding.
Using the original U+E311 private-use character as an example,
it could be natural to have some component transcode it to the
local character set. That may be preferred for Klingon, or for
other characters that don't have code points in Unicode. (A while
back, I think Taiwan needed to use that approach; dunno if that's
less of an issue in Unicode 3.1.)
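To make that concrete, here is a minimal Python sketch, not any real parser's API. The names `expand_char_refs`, `deliver`, and `local_map` are made up for illustration, and the byte chosen for U+E311 in the usage note is an arbitrary assumption. It shows the reference itself always naming a code point, while the bytes handed to the application depend on the delivery encoding.

```python
import re

# Numeric character references: decimal (&#233;) or hex (&#xE9;).
_CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")

def expand_char_refs(text):
    """Replace numeric character references with the characters they name."""
    def repl(match):
        body = match.group(1)
        code_point = int(body[1:], 16) if body.startswith("x") else int(body)
        return chr(code_point)
    return _CHAR_REF.sub(repl, text)

def deliver(text, encoding, local_map=None):
    """Transcode expanded text to the encoding the application wants.

    local_map is a hypothetical table from code points (for example
    private-use ones) to byte sequences in the local character set.
    Encoding one character at a time is only safe for stateless codecs
    such as ISO-8859-1 or UTF-8; a real transcoder would be more careful.
    """
    out = bytearray()
    for ch in expand_char_refs(text):
        if local_map and ord(ch) in local_map:
            out += local_map[ord(ch)]
        else:
            # Raises UnicodeEncodeError if the codec has no mapping.
            out += ch.encode(encoding)
    return bytes(out)
```

With this sketch, `deliver("caf&#233;", "iso-8859-1")` gives `b"caf\xe9"` and `deliver("caf&#233;", "utf-8")` gives `b"caf\xc3\xa9"`: same reference, different bytes. `deliver("&#xE311;", "iso-8859-1")` raises `UnicodeEncodeError` unless some component supplies a mapping, e.g. `local_map={0xE311: b"\xa1"}`.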
> Character references always refer to Unicode characters.
Or surrogate pairs -- they refer to ISO-10646 characters, which can
be represented in Unicode as one or two 16-bit units. It's explicitly
illegal to have references to the surrogate code points themselves, but
characters in the "Astral Planes" expand to two UTF-16 code units (or
one UTF-32 unit).
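A rough Python sketch of that expansion, assuming the standard UTF-16 surrogate arithmetic; the function name `to_utf16_units` is hypothetical and not taken from any parser:

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for the character a reference names."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the ISO-10646/Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        # A reference such as &#xD800; names a surrogate, not a character.
        raise ValueError("surrogate code points are not legal characters")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit unit.
        return [code_point]
    # Supplementary ("astral") planes: split the 20-bit offset into a pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]
```

So U+E311 stays a single unit, `[0xE311]`, while a reference like `&#x1D11E;` becomes the pair `[0xD834, 0xDD1E]`.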
- Dave