Java/Unicode brain damage
- From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
- To: xml-dev@lists.xml.org
- Date: Wed, 25 Jul 2001 10:22:40 -0400
At 11:13 PM -0700 7/24/01, Tim Bray wrote:
>Which means in effect that Dave's right, basically you just
>totally can't use a java's String or char in dealing with
>Blueberry docs. Or am I missing something... please? Or
>re-open the door to the UTF-16 hack by putting the
>surrogate blocks back into [2] as part of the Blueberry
>update.
>
>Er, is anyone in the Java language team on top of what
>Unicode's up to? This is a real problem.
>
>Somebody ship some Prozac over to Elliote before he goes
>critical... -Tim
I'm afraid the Java mess sent me over the edge a long time ago. :-) I've actually given quite a lot of thought to that problem in other forums, and for a while I was even arguing that JDOM needed to replace the String class in order to be XML compliant. However, on further reflection I decided maybe the problem wasn't quite that bad. It's still pretty bad, but it's not insurmountable.
The Java way to handle this is to stop thinking of a Java char as representing a Unicode character. It doesn't. A Java char represents a UTF-16 code unit, which may be one half of a surrogate pair. The public API to java.lang.String is essentially a UTF-16 API. For example, the length() method of a string does not return the number of Unicode characters in the string. Rather, it returns the number of UTF-16 code units. A string containing a single Plane-1 character has length 2 in Java.
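To make the length() point concrete, here is a minimal sketch. The class name Utf16Length and the helper countCharacters() are made up for illustration; only String.length() and String.charAt() are real Java API. The helper assumes that an unpaired surrogate should count as one character, which is a choice on my part and not something any spec mandates.

```java
public class Utf16Length {

    // Hypothetical helper: counts Unicode characters by treating each
    // well-formed surrogate pair as one character. An unpaired surrogate
    // is counted as one character (an assumption made for this sketch).
    static int countCharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isHigh = c >= 0xD800 && c <= 0xDBFF;
            if (isHigh && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++; // skip the low surrogate that completes the pair
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+10000, the first Plane-1 character, written as a surrogate pair
        String s = "\uD800\uDC00";
        System.out.println(s.length());         // prints 2
        System.out.println(countCharacters(s)); // prints 1
    }
}
```

The string holds one Unicode character, yet length() reports 2 because it counts chars, not characters.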
This is inconvenient as all get out, but as long as you realize what's going on and code carefully, it's not necessarily wrong. Java is just providing a less than ideal representation of strings. For example, when a parser or other method is checking a string to see if it's a legal XML Blueberry name, it cannot simply pass each char in the String to an isBlueberryNameCharacter() method. Instead, it has to look at the whole string in toto and do its own decoding of surrogate pairs into Unicode characters before checking. The logic is much more complex, but it is doable, and it does work with existing Java APIs for processing XML.
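A rough sketch of that kind of check might look like the following. Everything named here is hypothetical: NameChecker, decodePair(), isLegalName(), and especially isNameCharacter(), which is a stand-in predicate I invented (ASCII letters, digits, a few punctuation marks, and the range U+10000 to U+EFFFF) and not the actual Blueberry name rule. The surrogate arithmetic itself is the standard UTF-16 decoding.

```java
public class NameChecker {

    // Stand-in predicate, NOT the real Blueberry rule: accepts ASCII
    // letters, digits, '_', '-', '.', and anything in U+10000..U+EFFFF.
    static boolean isNameCharacter(int codePoint) {
        if (codePoint >= 'a' && codePoint <= 'z') return true;
        if (codePoint >= 'A' && codePoint <= 'Z') return true;
        if (codePoint >= '0' && codePoint <= '9') return true;
        if (codePoint == '_' || codePoint == '-' || codePoint == '.') return true;
        return codePoint >= 0x10000 && codePoint <= 0xEFFFF;
    }

    // Standard UTF-16 decoding of a high/low surrogate pair.
    static int decodePair(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Walks the whole string, decoding surrogate pairs before applying
    // the predicate. Unpaired surrogates make the name illegal.
    static boolean isLegalName(String s) {
        if (s.length() == 0) return false;
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int codePoint;
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High surrogate: a low surrogate must follow.
                if (i + 1 >= s.length()) return false;
                char low = s.charAt(i + 1);
                if (low < 0xDC00 || low > 0xDFFF) return false;
                codePoint = decodePair(c, low);
                i += 2;
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                return false; // low surrogate with no preceding high
            } else {
                codePoint = c;
                i++;
            }
            if (!isNameCharacter(codePoint)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLegalName("a\uD800\uDC00b")); // prints true
        System.out.println(isLegalName("a\uD800"));        // prints false
    }
}
```

The point is only that the predicate sees decoded characters, never raw chars; a real implementation would swap in the actual name-character tables.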
FYI, I deliberately didn't bring this up previously, because even though Blueberry makes the problem worse, the problem still exists for element content and attribute values in XML 1.0. Furthermore, I think Java is broken enough here that Java needs to change. I don't think XML should be limited by this brain damage in Java. One silver lining to the Blueberry cloud might be that it could convince Sun to use a four-byte char like they should have back in 1995.
Although Java's the only language I'm intimately familiar with these days, I do think it would be informative to see how other languages handle these issues. Would anyone care to address the handling of non-BMP text in Python, Perl, C, C++, Fortran, AppleScript, Rexx, Delphi, Visual Basic, etc.?
--
+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
| The XML Bible, 2nd Edition (Hungry Minds, 2001) |
| http://www.ibiblio.org/xml/books/bible2/ |
| http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/ |
+----------------------------------+---------------------------------+
| Read Cafe au Lait for Java News: http://www.cafeaulait.org/ |
| Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/ |
+----------------------------------+---------------------------------+