Re: XML Blueberry (long response on CJK background)

From: Rick Jelliffe <ricko@allette.com.au>
To: xml-dev@lists.xml.org
Date: Tue, 10 Jul 2001 00:58:03 +0800
From: "Murata Makoto" <mura034@attglobal.net>

> Rick Jelliffe wrote:
>
> >Of these, most are CJK Unified Ideographs Extension B.
> >These are characters which must be considered bad practise
> >for use in markup, perhaps with some exceptions.   They are mostly
> > characters which readers may easily find confusing,
> >being archaic, regional, variant, uncommon or non-interoperable.
>
> This is completely different from what I have heard from CJK experts.
> Do you  have any supporting evidence?

1) To answer a question with a question first, have these experts also given
any indication of how many of the approx 71,000 Han ideographs in Unicode
3.1 are in *current* common use (not being personal names or place names)?

If we allow 12,000 characters in common use (i.e. where a substantial
proportion of the population could read and write them) in Taiwan, Japan,
and Korea each (surely a rather large figure) and no overlap (very
generous), that would still make 50% of the characters uncommon, merely on
rule-of-thumb.
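
The rule-of-thumb arithmetic above can be written out as a short sketch. The inputs are the round figures from the preceding paragraphs, not measured data, and the name uncommon_share is purely illustrative.

```python
# Rule-of-thumb estimate; every input is a round figure from the post.
TOTAL_HAN = 71_000          # approx. Han ideographs in Unicode 3.1
COMMON_PER_REGION = 12_000  # generous allowance for each region
REGIONS = ("Taiwan", "Japan", "Korea")


def uncommon_share(total, per_region, n_regions):
    """Fraction of characters left over once every region's
    common set is counted, assuming the sets do not overlap."""
    common = per_region * n_regions
    return (total - common) / total


share = uncommon_share(TOTAL_HAN, COMMON_PER_REGION, len(REGIONS))
print(f"uncommon share: {share:.0%}")
```

Any overlap between the three regional sets only shrinks the common total, so the uncommon share computed this way is a lower bound.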

I do have a number, at least for getting an inkling for Chinese use.
CCCII classifies:
    4,808 common Chinese characters
   17,032 less common Chinese characters
   20,583 rare Chinese characters (mostly variants?)
   11,517 simplified Chinese characters
which gives about 59,000 characters. (However, I believe this reflects usage
in the historical sources, not current usage.)

It is not impossible that the IRG has found an extra 10,000 Han characters
in common current use. But that still leaves, from the CCCII classification
at least, perhaps 20,000 to 40,000 characters that are less common or rare.


2) The IRG's unification principles do not include anything to remove
characters based on their rarity. A rare or archaic character included in a
source set will be included under the round-tripping rule.  Where the source
sets are small, there will be fewer uncommon characters.  If large
sets, constructed on historic "catch-all" principles, are included, then
there must be archaic, uncommon, regional, etc. characters.

If an expert is saying that archaic characters or uncommon characters are
not used, are they being removed by some undocumented protocol, or are
source sets with no archaic characters being considered, or is the expert
saying that there are no archaic or variant characters at all as some kind
of categorical statement?


3) In Unicode 3.1, an extra 42,711 Han characters are being added.

Of these, (all numbers +/- 2 counting error)
   30,713 are found in Taiwanese sources (CNS 11643 in particular)
   30,529 are found in mainland Chinese sources, most typically the two
          major lexicons (the KangXi and the HanyuDaZidian)
    4,775 are from Vietnam
    1,088 are from Hong Kong
      303 are from Japanese sources
      160 are from South Korea
    5,760 are from North Korea
(These sources are not mutually exclusive.)

Let's look at these in more detail.

Hong Kong
-------------
I was told by a staff person of the Hong Kong government (who had some
involvement with GCCS) that most of the Hong Kong characters are connected
with place or personal names. I have not verified it, but that is what I was
told. These kinds of characters are unlikely to be used as element or
attribute names.  Hence the comment about "regional" characters.

Mainland Chinese
---------------------
There is obviously a lot of overlap between the mainland and Taiwanese
sources. I cannot count them readily, but at least 18,520 are the same
(and at most all of them).  (18486 is also about the same number as are
sourced from the KangXi, but this looks to be coincidence.) About 28922 of
the characters are sourced from the HanyuDaZidian.

Nevertheless, as Mainland China does not use traditional characters, and
limits the characters it does use, characters that come only from the
mainland Chinese dictionary sources, and not from Taiwan, Japan, Korea or
Vietnam, must be considered archaic.  This could be up to 10,000 of the
characters, on the numbers above. Hence the mention of "archaic".

Taiwan
--------
In the Taiwan sources, there are about 350 (?) characters of which
http://www.unicode.org/unicode/reports/tr27/ states:
  "CJK Compatibility Ideographs Supplement: U+2F800-U+2FA1D
  This block consists of additional compatibility ideographs required for
  round-trip compatibility with CNS 11643-1992, planes 3, 4, 5, 6, 7,
  and 15. They should not be used for any other purpose"
Presumably, use-in-XML-Names is such an "other purpose". These characters
are probably considered variants or mistakes.  Hence the mention of
"variants".
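
As a rough illustration of how a tool might flag such characters in a markup name, here is a minimal sketch. The compatibility range is the one quoted from TR27 above. The Extension B range (U+20000-U+2A6D6) is an added assumption based on Unicode 3.1, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Code point ranges as assigned in Unicode 3.1. The compatibility range
# is quoted in the post; the Extension B range is an added assumption.
BLOCKS = {
    "CJK Unified Ideographs Extension B": range(0x20000, 0x2A6D6 + 1),
    "CJK Compatibility Ideographs Supplement": range(0x2F800, 0x2FA1D + 1),
}


def supplementary_han_block(ch: str) -> Optional[str]:
    """Return the block name for a single character, or None."""
    cp = ord(ch)
    for block_name, code_points in BLOCKS.items():
        if cp in code_points:
            return block_name
    return None


def flag_name(name: str) -> List[Tuple[str, str]]:
    """List (character, block) pairs for characters of a markup
    name that fall in either of the two blocks."""
    flagged = []
    for ch in name:
        block = supplementary_han_block(ch)
        if block is not None:
            flagged.append((ch, block))
    return flagged


print(flag_name("zoo"))             # no supplementary Han characters
print(flag_name("a\U0002F800b"))    # one compatibility ideograph
```

A check like this only reports which block a character sits in. Whether a given character is actually archaic, regional or a variant is a judgment the block ranges alone cannot make.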

Vietnamese
--------------
It seems many (most? all?) of the Vietnamese characters are also found in
CNS (or in the Korean characters).

Japanese, Korean
---------------------
I leave the Japanese and Korean characters out. Most of the North Korean
characters are also found in CNS or a lexicon.

Comment
------------
We can attribute at least 30,000 of the characters in Unicode 3.1 as
characters which were considered variants or secondary by Unicode 2.0 and
3.0: the CNS characters.

These are characters which Unicode 3
http://www.unicode.org/unicode/uni2book/ch10.pdf
says could not be included because CNS (etc) used unification rules that
were "substantially different" from Unicode's.

So what has made these characters suddenly not dismissable variants but
needed characters? The paragraph  from
http://www.unicode.org/unicode/reports/tr27/ quoted above seems to hold the
answer: CNS11643 is now included in the list of round-trippable characters.

Even though Unicode 3.1 says that the same unification principles are being
applied as with Unicode 3.0 and Unicode 2.0, and even though 3.0 (and I
think 2.0) promised that no more characters would come in by the
round-tripping rule (p.259), in fact it looks like over 30,000 characters
have come in en masse.

(Strictly, we can say that only the few hundred CJK Compatibility
Ideographs Supplement characters have come in for round-tripping against
previously announced policy.)

It looks rather like an embarrassing change of an announced policy, with
some face-saving wording.  Nevertheless, I would not question that it is the
best policy: having grappled with the issue for so long, I am sure that the
IRG would only have made this change if they felt it was warranted.  I
am not questioning that they are correct in their decision.

But I see nothing to go against my original statement: that with some
exceptions (e.g. the modest, additional Japan-sourced characters) it seems
that the CJK Unified Ideographs Extension B must indeed contain a
preponderance of uncommon, archaic, regional, and variant characters.

The function of markup is not to preserve historical characters, but to
communicate using common language, including common "terms of
art" which may well include otherwise unusual characters, to some extent.


5) But even in regard to jargon, there is a strong tendency in XML to
name things generically and to use attribute values to subclass (i.e.
"generic identifiers"). So it is more likely that we will have
  <zoo>
        <primate type="mandrill" />
  </zoo>
rather than
  <zoo>
        <mandrill class="primate" />
  </zoo>

Generic terms for the things we typically will use in markup are probably
quite a small set, and certainly made up of current and common words.
So even where there are uncommon characters used for terms of art, if they
are specific rather than generic they may still not be good for use in
markup (as element names and attribute names.)


6) The other aspect is the question "does the absence of these characters
prevent any markup?"  Given that mainland Chinese won't use them, Japanese
can use kana or variants, and traditional Chinese can spell the words using
the customary methods, it seems that not-having these characters does not
prevent native-language markup: it just makes it marginally less
satisfactory.  (Indeed, for Japanese it may be that an uncommon term of art
may be better understood by lay programmers when spelled out in kana than
written using an obscure kanji.)


7) Furthermore, to restate a previous comment, the greatest need for
native-script is to allow end-users to use native-language NMTOKENs
(enumerations) or IDs.  The availability of XML Schemas datatypes and richer
datatyping removes a lot of the commercial imperative for native-script
element names and attribute names.

I don't believe that many people will accept that the extra characters are
required by pressing need or enormous benefit or blatant inequity.

But I don't think people necessarily have to accept the argument of benefit.
The extra characters can also be justified on their low cost instead: it
does not mean we all have to introduce 32-bit characters, just that
surrogates get used. It can be implemented largely by removing some
constraints on which UTF-16 16-bit code units are allowed, in those systems
which currently prevent surrogates in names.
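
To make the surrogate mechanism concrete, here is a minimal sketch of how one code point above U+FFFF maps to two 16-bit UTF-16 code units. The arithmetic is the standard UTF-16 encoding; the function name to_surrogate_pair is illustrative, and U+20000 is picked only as an example of a supplementary-plane Han code point.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its UTF-16 high and low surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U00020000".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

A system that already stores names as 16-bit units therefore needs no wider character type. It only has to stop rejecting units in the D800-DFFF range where they form a valid pair.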

Just because I, as a Westerner, cannot see much benefit is no reason why the
3.1 changes should not be adopted. I naturally want to err on the side of
conserving what we have, but perhaps it is better for us to err on the side
of respect.

Cheers
Rick Jelliffe
Follow-Ups:
- Re: XML Blueberry (long response on CJK background)
  - From: John Cowan <jcowan@reutershealth.com>
References:
- Re: XML Blueberry
  - From: Murata Makoto <mura034@attglobal.net>