Re: Closing Blueberry

From: Joel Rees <rees@server.mediafusion.co.jp>
To: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Date: Mon, 23 Jul 2001 17:13:14 +0900

Elliotte,

You've written a pretty good summary (of the visible tip of the iceberg) of
the long term problem. You are correct that fixing the character set to a
specific version of UNICODE is going to continually force the problem to be
re-visited.

About new words in Japanese, borrowed words are usually written in katakana,
as you seem to be aware. (This is current practice, which reverses practice
around the turn of the last century, which was also a reversal of practice.
Interesting history there.) Some borrowed words, like "double" {daburu} and
"software bug" {baggu} become so accepted that native speakers write them in
hiragana (and/or begin to conjugate them as Japanese). If the level of
acceptance is high, these words can be written with existing Kanji, implying
a new reading to the Kanji. In some cases, new Kanji are invented for these
words, but that is rather rare, at least at this time, and such Kanji
definitely don't survive into generally available computerized documents.

It might interest you to know that some native Japanese words have never
been assigned Kanji, at least, not according to the dictionaries. However,
with the proliferation of word processors, it has become much easier to use
Kanji that one doesn't really remember how to write by hand. This means that
many words that have been traditionally written in hiragana, except by
professors and people putting on airs, are now commonly typed as Kanji in
word processor documents.

But the bulk of new ideographs, as I understand it, are in specialty fields,
highly technical terms for which phonetic kana would simply not carry the
semantic load. (This is similar to our turning to Latin for names of newly
discovered species and diseases.) Most people do not need these new
characters, but the people who need them really do need them.

Current work-arounds for the new technical Kanji are all proprietary. The
researchers have some gaiji ("foreign character"=private use character)
editor, their group all use the same word processor, and they pass around
their file of gaiji. As you might imagine, this drives the choice of word
processor for the research group, and it tends to discourage use of personal
equipment in research. When they publish, their printing company has to
build a one-shot gaiji font.

I think a real solution to the creativity problem requires a fundamental
shift in the method of encoding ideographs. At a minimum, we have to be able
to define new characters on the fly, complete with parsing information. We
have some of the technology for on-the-fly character image definition and
rendering, but it's expensive. We don't have the technology to handle adding
parsing information to characters on the fly, but I think XML/SGML and
UNICODE are finally giving us the tools to take it on. It might not be that
hard. Encoding the on-the-fly characters for transmission will require a
base of known characters and a common method for attaching the definitions
of the non-common characters used.
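
To illustrate that last point, here is a minimal Python sketch. It assumes
gaiji are carried as Unicode private use code points and that a document
travels with a table of definitions for them. The names GaijiDefinition,
private_use_chars, and missing_definitions are hypothetical, and the
components field is only a placeholder for whatever parsing information a
real scheme would carry.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class GaijiDefinition:
    """Hypothetical record attached to a document for one non-common character."""
    description: str        # human-readable gloss
    components: tuple = ()  # placeholder for parsing information (radicals, reading, ...)


def private_use_chars(text):
    """Return the set of private use characters (general category 'Co') in text."""
    return {ch for ch in text if unicodedata.category(ch) == "Co"}


def missing_definitions(text, definitions):
    """Return private use characters in text that have no attached definition."""
    return private_use_chars(text) - set(definitions)


# U+E000 is the first code point of the BMP Private Use Area.
doc = "新しい字 \ue000 と \ue001"
defs = {"\ue000": GaijiDefinition("example technical ideograph", ("木", "水"))}

# Lists the code points a receiver could not interpret from the attached table.
print(sorted(hex(ord(c)) for c in missing_definitions(doc, defs)))
```

A receiver could run a check like this before rendering or parsing, and
refuse or flag a document whose non-common characters arrive without
definitions.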

(UNICODE might provide the base set, but it feels like a poor fit to me --
too many universally defined characters, among other things. Moving the
burden of the on-the-fly characters onto XML, as some have suggested, would
of course add another parse layer, and would definitely require a new
version number.)

But I think Blueberry shouldn't need to wait for on-the-fly character
creativity.

I wonder if the version number or XML declaration could be modified to
include a field for specific UNICODE version number reference, as has been
alluded to on the list. A simple linear progression, tying character issues
to version number, seems too limited.

I'm sure this has been considered, but what would the arguments be against
declaring the UNICODE version number in the encoding clause?

<?xml version="1.0.1" encoding="UNICODE-3.1"?>

No, we are going to want to be able to do something like

<?xml version="1.0.1" encoding="mojikyo" encoding-reference="UNICODE-3.1"?>
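
As a rough sketch of how a processor might act on such a declaration, the
Python below reads the pseudo-attributes and compares the declared UNICODE
version with the character database the processor has. This is only an
illustration under stated assumptions: the encoding-reference
pseudo-attribute and the UNICODE-x.y label are the proposal above, not part
of XML 1.0; parse_declaration and unicode_version are hypothetical names;
and the declaration is written with the standard <?xml ... ?> delimiters.

```python
import re
import unicodedata

# Matches name="value" pairs inside a declaration.
PSEUDO_ATTR = re.compile(r'([A-Za-z][\w.-]*)\s*=\s*"([^"]*)"')
# Hypothetical label format for a pinned Unicode version, e.g. UNICODE-3.1.
UNICODE_LABEL = re.compile(r"UNICODE-(\d+)\.(\d+)$")


def parse_declaration(decl):
    """Return the pseudo-attributes of a declaration as a dict."""
    return dict(PSEUDO_ATTR.findall(decl))


def unicode_version(attrs):
    """Return (major, minor) of the declared UNICODE version, or None if absent.

    Looks at the hypothetical encoding-reference first, then at encoding.
    """
    for key in ("encoding-reference", "encoding"):
        m = UNICODE_LABEL.match(attrs.get(key, ""))
        if m:
            return int(m.group(1)), int(m.group(2))
    return None


decl = '<?xml version="1.0.1" encoding="mojikyo" encoding-reference="UNICODE-3.1"?>'
declared = unicode_version(parse_declaration(decl))

# unicodedata.unidata_version is the Unicode version of Python's own database.
available = tuple(int(p) for p in unicodedata.unidata_version.split(".")[:2])

supported = declared is not None and declared <= available
print(declared, "supported" if supported else "not supported")
```

Keeping the Unicode version in its own field, rather than folding it into
the XML version number, lets the two advance independently.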

Joel Rees
programmer -- rees@mediafusion.co.jp
----------------------------------------------------
To be a tree supporting all information,
  giving root to the chaos
    and branches to the trivia,
      information breathing anew --
        This is the aim of Yggdrasill.
============================XML as Best Solution===
Media Fusion Co. ,Ltd.  株式会社メディアフュージョン
Amagasaki  TEL 81-6-6415-2560    FAX 81-6-6415-2556
    Tokyo　TEL 81-3-3516-2566  　FAX 81-3-3516-2567
                       http://www.mediafusion.co.jp
===================================================


Elliotte Rusty Harold continued the conversation:

> At 9:10 AM +0900 7/20/01, Murata Makoto wrote:
>
> >As for the Japanese language, I believe that I have demonstrated
> >reasons: changes of unification and made-in-Japan Kanji require
> >non-BMP name characters.  If Unicode becomes popular and we
> >continue to use XML 1.0, disallowed CJK ideographics will become
> >traps.
> >
>
> I'm starting to realize there may be a deeper issue here. Languages
> evolve. It's the nature of things. There are dozens of new words
> every year, some to describe new technologies like fax machines
> and e-mail, some that get adapted from other languages (glasnost
> in English, le weekend in French), others that just arise. ("Doh" just
> made it into the Oxford English Dictionary.) I'd be surprised if
> Japanese and Chinese are any different in this respect.
>
> In alphabetic languages like English and Russian, new words are no
> big deal. They fit right into Unicode and XML with no hassle. But
> what happens in ideographic languages? I know Japanese uses
> Katakana for some of these words. Is it all of them? How many
> new ideographs come into use each year? And in Chinese? I
> suspect it's even worse, but I would appreciate hearing from the
> Chinese speakers on the list.
>
> For the sake of argument, say we had perfect knowledge and
> could fix XML and Unicode so that it did cover all current
> ideographs used today in Chinese and Japanese, what do we do
> next year? and the year after that? and the one after that?
> and every year for the next ten thousand years?
>
> Unicode's answer is that someone fills out the right forms, proves
> that the characters are being used, and then they're added.
> There's plenty enough space in Unicode to handle several hundred
> new characters a year for the next ten thousand years. As I think
> Simon originally suggested we could just tie XML to Unicode and
> leave it at that.
>
> However, any fixed solution along the lines of XML 1.0 is
> guaranteed to fail, especially if the criterion for success is that all
> characters anyone wants to use must be available for use in XML
> names. We and our descendants will be revisiting these arguments
> every five years for the next few millennia.
> --

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+