OASIS Mailing List Archives

 



Re: Blueberry is not "closed"



0> In article <5.1.0.14.2.20010724225913.020f5760@pop.intergate.ca>,
0> Tim Bray <URL:mailto:tbray@textuality.com> ("Tim") wrote:

Tim> Ouch, it's worse than I thought.  One of the "nice" things about
Tim> the UTF16 surrogate system is that if you don't have the apparatus
Tim> around to deal with astral-plane chars, you can just obliviously
Tim> treat 'em as pairs of characters you don't know.

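Except that you have to be careful about how you count "characters":
a surrogate pair is two 16-bit code units but only one character, so
lengths and offsets no longer agree.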
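
A minimal sketch of the counting problem, assuming plain range checks
on Java's 16-bit char type. The class and method names (Utf16Count,
countCharacters) are hypothetical, chosen only for illustration:

```java
public class Utf16Count {
    // Counts characters in a UTF-16 string, treating each well-formed
    // surrogate pair (high half followed by low half) as one character.
    // Unpaired halves are simply counted as one unit each.
    public static int countCharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isHigh = c >= 0xD800 && c <= 0xDBFF;
            if (isHigh && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++; // skip the low half of the pair
                }
            }
            count++;
        }
        return count;
    }
}
```

For a string holding "a", one astral-plane character, and "b",
String.length() reports 4 code units while countCharacters reports 3.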


Tim> But XML carefully rules out that possibility, prod [2] for "Char"
Tim> rules excludes surrogate blocks.  In retrospect, maybe that was
Tim> dumb?

In a Java environment, it's sensible to pass around surrogates in String
objects - think of it as using UTF-16 as the internal representation,
which is trivial if the input is UTF-16 and (potentially) less trivial
otherwise.

Production [2] doesn't say anything about what happens internally, of
course, as this is external syntax - it rules out numeric character
references to the surrogate area, or surrogate characters in UCS-2,
etc.  This actually makes things easier for a Java implementation,
since whenever you see a character from the surrogate area, you know
it's being used as one half of a surrogate pair.
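
As a rough illustration of what production [2] admits, here is a
sketch of the XML 1.0 "Char" ranges applied to a UCS-4 value, such as
the value of a numeric character reference. The class and method names
are hypothetical:

```java
public class XmlCharProduction {
    // XML 1.0 production [2]:
    //   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // The gap between #xD7FF and #xE000 is the surrogate area.
    public static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

Under this check a reference such as &#xD800; is rejected, while
&#x10000; is accepted, which is why any surrogate seen in the internal
UTF-16 form can only be one half of a pair.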


Tim> Which means in effect that Dave's right, basically you just totally
Tim> can't use a java's String or char in dealing with Blueberry docs.
Tim> Or am I missing something... please?

It seems that you might need to combine surrogates, at least
temporarily, whilst parsing (or write your parser so that it keeps
track of UTF-16 surrogate state), but I don't think the parser would
need to retain the UCS-4 form, and it seems okay to pass UTF-16 to
downstream components (as long as you don't split surrogate pairs!).
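
A sketch of the temporary combining step, assuming the parser holds its
input as a Java String. The names (Utf16Decoder, toCodePoints) are
hypothetical, and rejecting unpaired halves with an exception is one
possible policy rather than anything the spec dictates:

```java
public class Utf16Decoder {
    // Combines each surrogate pair into a single UCS-4 value using the
    // standard UTF-16 arithmetic; all other units pass through as-is.
    public static int[] toCodePoints(String s) {
        int[] buffer = new int[s.length()];
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High half: the next unit must be a low half.
                if (i + 1 >= s.length()) {
                    throw new IllegalArgumentException(
                        "unpaired high surrogate at index " + i);
                }
                char low = s.charAt(i + 1);
                if (low < 0xDC00 || low > 0xDFFF) {
                    throw new IllegalArgumentException(
                        "unpaired high surrogate at index " + i);
                }
                buffer[n++] = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                i++; // consume the low half as well
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                throw new IllegalArgumentException(
                    "unpaired low surrogate at index " + i);
            } else {
                buffer[n++] = c;
            }
        }
        int[] result = new int[n];
        System.arraycopy(buffer, 0, result, 0, n);
        return result;
    }
}
```

The combined values only need to live long enough to be checked; the
original String can still be handed downstream unchanged.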


Tim> Or re-open the door to the UTF-16 hack by putting the surrogate
Tim> blocks back into [2] as part of the Blueberry update.

Ugh!


I knew a 16-bit char type would be a nuisance before too long!
