Re: Java/Unicode brain damage
- From: David Brownell <david-b@pacbell.net>
- To: xml-dev@lists.xml.org
- Date: Fri, 27 Jul 2001 17:41:52 -0700
I think Tim's response was closest to hitting the issue
that I was thinking about. There are lots of senses of
"character". Merging some of Tim's and John's input,
at least (!) these senses are in common use:
- jchar (Java "char") ... ~UCS-2 character, which in
  very early days seems to have meant "Unicode" (1.0?);
- xchar (XML Character) ... ~Unicode character,
  one or two "jchar" (Miles called this "uchar");
- grapheme (typographic/display) ... 1-to-N xchar.
John's examples of complex graphemes (some European
scripts, Korean, Indic, Tibetan, ...) are probably worth
looking at in the current Unicode book, for anyone who
hasn't seen that already ... :)
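To make the three senses concrete, here is a minimal Java sketch
that counts the same string three ways. It leans on APIs that were
added to Java after this message was written (String.codePointCount
arrived in Java 5), and it treats java.text.BreakIterator's
"character" boundaries as an approximation of graphemes. The class
and method names (CharSenses, jcharCount, xcharCount, graphemeCount)
are made up for illustration.

```java
import java.text.BreakIterator;

// Hypothetical helper: counts one string in the three senses of "character".
final class CharSenses {

    // jchar count: number of 16-bit Java char values.
    static int jcharCount(String s) {
        return s.length();
    }

    // xchar count: number of Unicode code points (a surrogate pair counts once).
    static int xcharCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Grapheme count, approximated by BreakIterator's character boundaries.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String combining = "e\u0301";      // 'e' followed by COMBINING ACUTE ACCENT
        String nonBmp = "\uD835\uDD04";    // U+1D504, stored as a surrogate pair
        for (String s : new String[] { combining, nonBmp }) {
            System.out.println(jcharCount(s) + " jchar, "
                    + xcharCount(s) + " xchar, "
                    + graphemeCount(s) + " grapheme(s)");
        }
    }
}
```

The combining sequence is two jchars and two xchars but one
grapheme; the non-BMP character is two jchars but one xchar.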
"jchar" arrays (including java.lang.String) clearly aren't
sequences of single "characters" unless you mean "character"
in the restricted sense of "jchar" (or Win32's version
of the C/C++ 'wchar_t'), or are content with:
> The worrying thing is that for 99.9999999999% of all
> real-world XML processing, if you pretend that a jchar
> represents an xchar, you won't get in any trouble.
Twelve-nines ... whoa! :)
That's probably true for graphemes as well, unless
you're working in scripts such as those which John
mentioned. I'm not sure I'd buy twelve-nines though!
> So
> I bet there's a huge amount of java code out there right
> now that makes this assumption. I don't think we have
> much understanding now as to what flavor of breakage is
> apt to occur when (if) non-BMP data starts flowing
> through such code. -Tim
Depending on how much work that code does with those
jchars, and what kind, maybe no breakage at all: code
that only copies or compares jchar sequences passes
two-jchar xchars through intact.
Just don't assume that "character" and "jchar" are ever
going to be the same. People dealing with graphemes
(details of display/output) are likely very conscious of
such issues already.
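As a rough illustration of where breakage can and cannot show up,
the sketch below contrasts a truncation that cuts at an arbitrary
jchar index with one that backs off when the cut would split a
two-jchar xchar. It is only a sketch under stated assumptions: the
class and method names are hypothetical, maxJchars is assumed to be
non-negative, and Character.isHighSurrogate/isLowSurrogate are
Java 5 additions that postdate this message.

```java
// Hypothetical helpers showing one flavor of non-BMP breakage.
final class SurrogateSafety {

    // Cuts at a jchar index; may leave half of a surrogate pair behind.
    static String naiveTruncate(String s, int maxJchars) {
        return s.length() <= maxJchars ? s : s.substring(0, maxJchars);
    }

    // Same limit, but never ends on a high surrogate whose partner was cut off.
    // Assumes maxJchars >= 0.
    static String safeTruncate(String s, int maxJchars) {
        if (s.length() <= maxJchars) {
            return s;
        }
        int end = maxJchars;
        if (end > 0 && Character.isHighSurrogate(s.charAt(end - 1))) {
            end--; // drop the dangling first half of the pair
        }
        return s.substring(0, end);
    }

    // True if every surrogate in s is part of a properly ordered pair.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate just checked
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD04b"; // 'a', U+1D504 (two jchars), 'b'
        System.out.println(isWellFormed(new String(s.toCharArray()))); // plain copy
        System.out.println(isWellFormed(naiveTruncate(s, 2)));
        System.out.println(isWellFormed(safeTruncate(s, 2)));
    }
}
```

Plain copying leaves the data well-formed; it is the index-based
cut that can produce a lone surrogate, which an XML parser further
down the line would be entitled to reject.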
- Dave