Re: Blueberry is not "closed"
- From: Toby Speight <firstname.lastname@example.org>
- To: XML developers' list <email@example.com>
- Date: Wed, 25 Jul 2001 13:48:01 +0100
0> In article <firstname.lastname@example.org>,
0> Tim Bray <URL:mailto:email@example.com> ("Tim") wrote:
Tim> Ouch, it's worse than I thought. One of the "nice" things about
Tim> the UTF16 surrogate system is that if you don't have the apparatus
Tim> around to deal with astral-plane chars, you can just obliviously
Tim> treat 'em as pairs of characters you don't know.
Except that you have to be careful about how you count "characters".
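To illustrate the counting problem, here is a minimal sketch. It assumes
a Java release that has String.codePointCount (Java 5 or later); the
class and method names are made up for the example.

```java
// Sketch: counting UTF-16 code units versus Unicode code points.
// Assumes Java 5 or later for String.codePointCount.
final class CountDemo {
    // Number of 16-bit code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters, counting each surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+10000 is written as the surrogate pair D800 DC00.
        String s = "a\uD800\uDC00b";
        System.out.println(codeUnits(s) + " code units, "
                + codePoints(s) + " code points");
    }
}
```

Code that treats each surrogate as "a character you don't know" gets the
first count; anything that reports lengths or offsets in real characters
needs the second.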
Tim> But XML carefully rules out that possibility, prod [2] for "Char"
Tim> rules excludes surrogate blocks. In retrospect, maybe that was
In a Java environment, it's sensible to pass around surrogates in String
objects - think of it as using UTF-16 as the internal representation,
which is trivial if the input is UTF-16 and (potentially) less trivial
Production [2] doesn't say anything about what happens internally, of
course, as this is external syntax - it rules out numeric character
references to the surrogate area, or surrogate characters in UCS-2,
etc. This actually makes things easier for a Java implementation,
since whenever you see a character from the surrogate area, you know
it's being used as one half of a surrogate pair.
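As a rough sketch of that distinction: the Char production in XML 1.0
(numbered [2] in the spec) can be tested against whole code points,
while the Java chars that carry them remain UTF-16 code units. The
class and method names below are hypothetical, not part of any parser.

```java
final class XmlChar {
    // XML 1.0 Char production, applied to a full code point.
    // The surrogate range D800 to DFFF is deliberately absent.
    static boolean isChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Combine a high and a low surrogate into the code point they encode.
    // Callers are assumed to have checked that the arguments really are
    // a high surrogate followed by a low surrogate.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

A lone surrogate value fails isChar, but the code point obtained by
combining a pair passes, which is why surrogates inside a String need
not conflict with the production.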
Tim> Which means in effect that Dave's right, basically you just totally
Tim> can't use Java's String or char in dealing with Blueberry docs.
Tim> Or am I missing something... please?
It seems that you might need to at least temporarily combine surrogates
whilst parsing (or write your parser such that UTF-16 state is taken
into account), but I don't think the parser would need to retain the
UCS-4 form, and it seems okay to pass UTF-16 to downstream components
(as long as you don't split surrogate pairs!).
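One way this might look, as a sketch only: the parser keeps just enough
state to check that surrogates arrive in well-formed pairs, and leaves
the text as UTF-16 for downstream consumers. The helper names are
invented for illustration, and Character.isHighSurrogate and
Character.isLowSurrogate assume Java 5 or later.

```java
final class SurrogateScan {
    // Returns the index of the first unpaired surrogate in s, or -1 if
    // every surrogate is part of a well-formed high/low pair.
    static int firstLoneSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half; the pair is one character
                } else {
                    return i; // high surrogate with no low half after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no preceding high half
            }
        }
        return -1;
    }

    // Moves a proposed split index back by one if it would fall between
    // the two halves of a surrogate pair.
    static int safeSplit(CharSequence s, int index) {
        if (index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index))) {
            return index - 1;
        }
        return index;
    }
}
```

A buffer-based parser could call something like safeSplit before handing
a chunk of text to the next component, so that no chunk ends on a high
surrogate.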
Tim> Or re-open the door to the UTF-16 hack by putting the surrogate
Tim> blocks back into [2] as part of the Blueberry update.
I knew a 16-bit char type would be a nuisance before too long!