
Re: Java/Unicode brain damage

From: Joel Rees <rees@server.mediafusion.co.jp>
To: xml-dev@lists.xml.org
Date: Fri, 27 Jul 2001 19:33:49 +0900

(A little off topic, but)
Duane Nickull asked for clarification:


[snipped]

> Do you know if the C++ STL operates in a similar fashion?  It is usually
> a pain to write portable C and C++ programs supporting UTF-16. After the
> last Unicode conference, I saw papers suggesting a C/C++ language
> extension to support portable programs using UTF-16.
>
> One of the main problems discussed pertained to literal strings.
> While it is apparently not rocket science to compose portable C and C++
> programs using a fixed 16-bit (unsigned) integral data type as the
> character, it often means that you cannot use literal strings or the
> platform's runtime libraries.

I don't know much about the STL or other C++ gadgetry. C++ and I don't get
along very well.

But it would be fairly easy to implement a full UTF-32 character class in
C++. Burying it under an existing class library base like the STL would be
tricky (and time consuming), but should not be impossible. Neither should it
be that difficult (even if time consuming) to extend existing Unicode 2
classes to recognize surrogate pairs, which I would tend to think of as the
preferred option.
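
To make the surrogate-pair point concrete, here is a minimal sketch in plain C of the decoding step such an extended class would need. The function name utf16_decode and its return convention are made up for illustration; they are not from any existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from a buffer of UTF-16
 * code units. Returns the number of units consumed (1 or 2), or 0 if
 * the buffer is empty or starts with an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    uint16_t hi, lo;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP code unit */
        *cp = hi;
        return 1;
    }
    if (hi >= 0xDC00)                   /* low surrogate with no lead */
        return 0;
    if (len < 2)                        /* lead surrogate cut off */
        return 0;

    lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* lead not followed by a trail */
        return 0;

    /* Each surrogate carries 10 bits; the pair is offset by 0x10000. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                     | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

A caller would loop over the buffer, advancing by the returned count and treating 0 as an error.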

Hooking this to string literals would require the class initializers to
actively (but maybe at compile-time?) convert from the assumed (probably
UTF-8) encoding/transform for string literals to UTF-16. In other words, it
doesn't make much sense to extend the language itself to handle UTF-16
string literals.
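
As a rough illustration of the conversion an initializer would have to perform, here is a run-time sketch in C, assuming the literal really is stored as UTF-8. The name utf8_to_utf16 and the error convention are hypothetical, and a real class might do this work differently (or at compile time).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to UTF-16.
 * Writes at most cap code units to out (no terminator is written).
 * Returns the number of units written, or (size_t)-1 if the input is
 * ill-formed or out is too small. */
size_t utf8_to_utf16(const char *src, uint16_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    while (*p) {
        uint32_t cp;
        int extra, i;

        /* The lead byte tells us how many continuation bytes follow. */
        if (*p < 0x80)                      { cp = *p;        extra = 0; }
        else if (*p >= 0xC2 && *p <= 0xDF)  { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0)       { cp = *p & 0x0F; extra = 2; }
        else if (*p >= 0xF0 && *p <= 0xF4)  { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;
        p++;

        for (i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)        /* also catches early NUL */
                return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if ((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
            cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (n + 1 > cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;
        } else {                            /* needs a surrogate pair */
            if (n + 2 > cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 + (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    return n;
}
```

With something like this in a library, the literal stays an ordinary char string in the source and the language itself does not need to change.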

And, like I tried to say, the usual idea in C is to assume UTF-8 for
Unicode. This also means that there is no need to mess around with the
standard itself, or with the hidden pseudo-standards that lie underneath.
All you need is an editor that allows you to edit the full range of
characters and save in UTF-8, and then you need to write or borrow a set of
extension libraries that work with the parts of Unicode you need, including
conversion to/from UTF-16 if you need that. All of the standard libraries
still work as advertised for the C locale. Again, no need to touch the
standard.
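
A small sketch of what "extension library" means here, in C. The standard strlen still works on UTF-8 data but counts bytes, so a separate helper is needed to count characters. The name utf8_count is made up, and the helper assumes its input is already well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string
 * that is assumed to be well-formed UTF-8. Continuation bytes have the
 * bit pattern 10xxxxxx, so every other byte starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* The standard library keeps doing its byte-oriented job unchanged:
 * this returns the storage size in bytes, not the character count. */
size_t utf8_bytes(const char *s)
{
    return strlen(s);
}
```

For a string holding "caf" followed by U+00E9 encoded as the bytes C3 A9, utf8_bytes reports the five bytes of storage while utf8_count reports four characters.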

Wait. CodeWarrior's standard libraries on the Mac define '\xc8' as a space
in the default C locale, and don't seem to automatically build in a Unicode
locale in the version I have handy, so you'd want to recompile the ctype
library when compiling for the pre-Mac OS X environment. It looks like Be
may have a similar problem. That's relatively trivial, although it requires
programmer intervention. Still no need to mess with the standard.
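
If recompiling the ctype library is not an option, a different workaround (not the one described above) is to avoid isspace on UTF-8 bytes altogether and use an ASCII-only test. This is only a sketch, the name ascii_isspace is made up, and it assumes an ASCII-based execution character set.

```c
/* Hypothetical helper: classify a byte as whitespace using only the
 * six ASCII whitespace characters, ignoring the current locale.
 * Assumes ASCII, where '\t', '\n', '\v', '\f', '\r' are codes 9..13.
 * Bytes >= 0x80 (UTF-8 lead and continuation bytes) are never spaces. */
int ascii_isspace(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}
```

With this, a byte such as 0xC8 inside a UTF-8 sequence is never mistaken for a space, whatever the platform's ctype tables say.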

Rewriting the standard C library to handle UTF-8 in a Unicode C locale
might be fun. No. It _would_ be fun, if I could get someone to pay me for
it. Time consuming, yes, but fun. Aren't they already doing something like
this over at SourceForge? Okay, I just searched for "unicode library" and
got something a little more comprehensive (C++), but not the basic
libraries.

<interesting-thought>If you really want to work internally in UTF-16 or
UTF-32 in C, get yourself a processor with a 16- or 32-bit byte and a
Standard C for it.<truly-absurd>Or just artificially define your computer's
byte to be 16 or 32 bits wide, fudge your definition of pointers to ignore
the least significant bit(s), and then re-implement.</truly-absurd> (Sorry
about that.) If you go for a 32-bit char, then you need to widen your short
to at least 32, as well, but that's no problem.</interesting-thought>

HTH

Joel Rees
programmer -- rees@mediafusion.co.jp
----------------------------------------------------
To be a tree supporting all information,
  giving root to the chaos
    and branches to the trivia,
      information breathing anew --
        This is the aim of Yggdrasill.
============================XML as Best Solution===
Media Fusion Co. ,Ltd.  株式会社メディアフュージョン
Amagasaki  TEL 81-6-6415-2560    FAX 81-6-6415-2556
    Tokyo　TEL 81-3-3516-2566  　FAX 81-3-3516-2567
                       http://www.mediafusion.co.jp
===================================================
