OASIS Mailing List Archives

Re: Java/Unicode brain damage

Inquiring minds and Elliotte Rusty Harold want to know:


> The Java way to handle this is to stop thinking of a Java char as
> representing a Unicode character. It doesn't. A Java char represents
> a UTF-16 code point, which may be a surrogate.


> Although Java's the only language I'm intimately familiar with these
> days, I do think it would be informative to see how other languages
> handle these issues. Would anyone care to address the handling of
> non-BMP text in Python, Perl, C, C++, Fortran, AppleScript, Rexx,
> Delphi, Visual Basic, etc?

Here's what I know about it:

(Hint -- somebody who really knows, correct me!)

CoBOL, as I understand it, hides (from) the problem by not telling anyone
what is happening underneath. Where you need anything beyond 7-bit, you use
"NATIONAL" characters, and you get whatever functionality the system gives
you. String functions are predefined; you just use them.

I once knew something about ForTran.

Delphi has wide characters that are (presently) 16 bits. If we try to deal
with anything beyond the BMP, we usually use surrogate pairs. For some
intermediate operations, we do convert to UTF-32 (with Unicode 3.1, it's
official now!). For file I/O, we usually convert between UTF-16 Unicode and
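
To make the surrogate-pair arithmetic concrete, here is a minimal sketch in
Python (not Delphi). The function names are my own for illustration, not
from any library, and unpaired surrogates are simply passed through, which
is an assumption rather than required behavior:

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                      # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf16_to_utf32(units):
    """Widen a sequence of 16-bit code units to a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine the two 10-bit halves and restore the 0x10000 offset.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            out.append(u)
            i += 1
    return out
```

For example, U+1D11E splits into the pair 0xD834, 0xDD1E, and widening
[0x41, 0xD834, 0xDD1E] gives back [0x41, 0x1D11E].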

The char in C is a byte, and most C libraries assume strings are built of
bytes, so C tends to use variable-width characters. (Read that as UTF-8 for
Unicode.) You can't back up safely with Shift-JIS, so you sometimes dump
things temporarily to fixed-width buffers when you need random access.
Although you can back up safely with UTF-8, it's still sometimes convenient
to temporarily dump a UTF-8 string to a constant-width buffer. Since these
buffers are rather local in nature (can't be worked on by most of the
standard libraries at this time), widening them to 32 bits when 16 bits had
been used does not usually cause any ripples. Note that UTF-32 was not
official Unicode until 3.1, so the typing and other machinery for the 32-bit
temporary buffers is somewhat ad hoc, not that it matters much.
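
A rough sketch of both ideas, in Python rather than C for brevity. The
helper names prev_char_start and widen are hypothetical, and the input is
assumed to be well-formed UTF-8 with i >= 1:

```python
def prev_char_start(buf, i):
    """Return the index of the first byte of the character ending before i."""
    i -= 1
    # Continuation bytes look like 10xxxxxx; skip back over them.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def widen(buf):
    """Dump UTF-8 bytes into a fixed-width list, one code point per slot."""
    return [ord(ch) for ch in buf.decode("utf-8")]


# "a", e-acute (2 bytes), and U+1D11E (4 bytes): 7 bytes in total.
buf = "a\u00e9\U0001D11E".encode("utf-8")
last = prev_char_start(buf, len(buf))   # start of the 4-byte sequence
wide = widen(buf)                       # random access by character index
```

Backing up works because a UTF-8 trail byte can never be mistaken for a
lead byte. Shift-JIS gives no such guarantee, which is where the
fixed-width buffer earns its keep.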

I think I have seen 16-bit character string classes in C++. But these are

The manuals for Objective-C say that NSString conversion to UTF-8 is just a
copy. Apparently the general assumption is UTF-8.

Perl and Ruby (the language) both declare that you don't really know what a
string looks like inside, but they both are built on C and make heavy use of
pre-existing RE code, so we can assume they are presently handling things at
the byte string level. What I have heard in the Perl forums indicates they
are going with UTF-8 internally in most of the Unicode support, but I may be
wrong. One thing about Perl: you can get just about anything you want.

BTW, variable width byte strings fit naturally with building character
classification tables in small chunks, which is useful for eliminating
redundant subtables.
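
Here is a minimal sketch of the subtable idea in Python. It is only one way
to do it: the table is indexed by code point rather than directly by UTF-8
bytes, the block size of 64 is my assumption (chosen to match the 6 payload
bits of a UTF-8 trail byte), and build, lookup, and is_alpha are
illustrative names:

```python
BLOCK = 64  # assumption: one subtable per 6 bits of code point


def build(classify, limit=0x10000):
    """Return (index, blocks); identical subtables are stored only once."""
    blocks = []   # the unique subtables
    seen = {}     # subtable contents -> position in blocks
    index = []    # block number -> position in blocks
    for start in range(0, limit, BLOCK):
        block = tuple(classify(cp) for cp in range(start, start + BLOCK))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Two-step lookup: pick the shared subtable, then the entry in it."""
    return blocks[index[cp // BLOCK]][cp % BLOCK]


def is_alpha(cp):
    # Stand-in classifier; any per-code-point predicate would do.
    return chr(cp).isalpha()


index, blocks = build(is_alpha)
```

Long runs that classify the same way (unassigned ranges, the surrogates, big
stretches of CJK) collapse onto a handful of shared subtables, so blocks
ends up much shorter than index.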


Joel Rees
programmer -- rees@mediafusion.co.jp
To be a tree supporting all information,
  giving root to the chaos
    and branches to the trivia,
      information breathing anew --
        This is the aim of Yggdrasill.
============================XML as Best Solution===
Media Fusion Co. ,Ltd.  株式会社メディアフュージョン
Amagasaki  TEL 81-6-6415-2560    FAX 81-6-6415-2556
    Tokyo TEL 81-3-3516-2566   FAX 81-3-3516-2567