Re: XML Blueberry (non-ASCII name characters in Japan)
- From: Joel Rees <firstname.lastname@example.org>
- To: email@example.com
- Date: Mon, 09 Jul 2001 15:17:48 +0900
I'm not a guru, but I agree with Mr. Murata's post. UNICODE still needs to
evolve, and XML must also evolve.
I'll offer a gaijin's pov on what he said. (I hope he doesn't mind.)
The comments concerning the use of Japanese in XML tags are not off the
mark. Some native engineers here do claim that Japanese should never be used
for identifiers. A typical argument might be the following: "A variable name
is a variable name. Uniqueness is all that matters. Its real purpose should
be noted in the documentation. An engineer who doesn't read the
documentation is not an engineer. Besides, all Japanese people read English,
and any programmer worth his salt is fluent in English."
In my observations, most Japanese people, even good engineers, try to avoid
having to read English. Identifiers composed of strings of Latin characters
that form words in English are to the average Japanese programmer only
slightly less opaque than arbitrary strings of alphanumeric characters.
The character KATAKANA MIDDLE DOT can be used to connect (or visually
separate for the sake of clarity) sub-strings of kana or Kanji. Such use is
relatively modern, but it is very much a corollary (although not an
equivalent) of the hyphen. If hyphen is acceptable as a name character in
XML, the middle dot should also be.
(I am not sure whether Korean or Chinese use the middle dot at all.)
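The middle-dot point can be made concrete with a small sketch. The table below is only a hand-copied excerpt of the XML 1.0 NameChar ranges (as defined before the Fifth Edition) that matter for Japanese names; the names XML10_NAME_CHAR_EXCERPT and in_excerpt are invented for this illustration and are not part of any parser. Under that assumption, the hyphen and the Latin MIDDLE DOT pass, while KATAKANA MIDDLE DOT (U+30FB) falls into the gap between the katakana letters and the extenders.

```python
# Partial, hand-copied excerpt of the XML 1.0 NameChar ranges (pre-Fifth-Edition).
# This is an assumption-laden sketch, not the full production from the spec.
XML10_NAME_CHAR_EXCERPT = [
    (0x002D, 0x002E),  # HYPHEN-MINUS and FULL STOP
    (0x00B7, 0x00B7),  # MIDDLE DOT (listed under Extender)
    (0x3041, 0x3094),  # hiragana (BaseChar)
    (0x30A1, 0x30FA),  # katakana (BaseChar)
    (0x30FC, 0x30FE),  # prolonged sound mark and iteration marks (Extender)
    (0x4E00, 0x9FA5),  # CJK unified ideographs (Ideographic)
]


def in_excerpt(ch):
    """Return True if ch falls inside one of the excerpted ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML10_NAME_CHAR_EXCERPT)


# Hyphen, MIDDLE DOT, KATAKANA LETTER A, KATAKANA MIDDLE DOT, prolonged sound mark
for ch in ["-", "\u00B7", "\u30A2", "\u30FB", "\u30FC"]:
    print(f"U+{ord(ch):04X}", in_excerpt(ch))
```

Only U+30FB is rejected by this excerpt, which is the asymmetry with the hyphen described above.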
Concerning the additions of UNICODE 3.1, maybe I can repeat an example?
Suppose that attributes such as "mellifluous" or "mellyfluous", or tags such
as <fluere> were arbitrarily rejected by your parser. (Yeah, I admit, I had
to dig around in my thesaurus and on www.m-w.com for about five minutes to
find these.) Suppose that, with UNICODE 3.0, "mellifluous" were accepted,
but not "mellyfluous" or "fluere". And then, with UNICODE 3.1, "mellyfluous"
and "fluere" are potentially acceptable (but not "mellifluus", and
definitely not "mellyfluus"). This is similar to the situation with the CJKV
As to whether an intermediate XML 1.1 is justifiable or not, I have no
opinion. But I would strongly encourage including the extensions for UNICODE
3.1 in XML 2.0, and there are some minor details, like the middle-dot issue,
where the UNICODE specification should not just simply be incorporated as
----- Original Message -----
From: "Murata Makoto" <firstname.lastname@example.org>
Sent: Sunday, July 08, 2001 1:40 AM
Subject: Re: XML Blueberry (non-ASCII name characters in Japan)
> > > So I think it would be appropriate, in this discussion,
> > > to have some people in the mainframe trenches give us
> > > a briefing on the scale and the difficulty of the problems
> > > they face, and for some of our i18n gurus to highlight
> > > the problems faced by an XML language designer who wants
> > > to use one of the newly-added languages.
> > I second this.
> Summary: Japanese characters have been heavily used for tag names
> and they have been very useful. Addition of more characters
> (CJK ideographics introduced in Unicode 3.1, etc.) is intensely
> 1. Current Status
> XML 1.0 provides name characters for the Japanese language. Since the
> inception of XML 1.0, people have used Japanese name characters
> for XML. I believe that such use is very common.
> Some people use Japanese name characters wherever possible. Reasons: (1)
> the Japanese language is natural for Japanese, (2) translation to English
> is sometimes impossible because of cultural differences, and (3) some
> fields (e.g., Buddhism research) are specific to Japan or Asia.
> For example, an XML-based language for medical information uses
> Japanese name characters. This language has been designed by doctors
> who read and write English well. Nevertheless, they have chosen Japanese
> names because some terms simply cannot be translated to English.
> Buddhism researchers have created a few DTDs which heavily use non-ASCII
> name characters. Such names are very difficult to translate to English.
> Even when such translation is possible, these researchers want to use
> non-ASCII names very much.
> One of my DTDs is used for data interchange between two companies. This
> application is not experimental but already plays a very important role in
> their main business. All tag names in this DTD use Japanese characters.
> As far as I know, they have not caused any problems. To the contrary,
> they are helpful in debugging, etc.
> Others discourage use of Japanese name characters. The reason is that
> some XML tools (e.g., CSS of Microsoft IE5.5) fail to support non-ASCII
> markup characters. I think that such XML tools are broken and we should
> try to change this situation.
> 2. Useful Additions.
> To my regret, KATAKANA MIDDLE DOT (which is used to connect two
> names) is missing in the list of name characters of Unicode 2.0 and
> thus it is also missing in XML 1.0. As a result, quite a few Japanese
> users have complained about this omission. Addition of this character
> will make a lot of Japanese users happier. To me, this is already
> a good enough reason to create XML 1.1.
> Unicode 3.1 allows so many CJK ideographics. Quite a few people expect
> that these characters will also be allowed as name characters.
> Unlike Rick Jelliffe, I don't agree that newly introduced CJK ideographics
> are archaic. First, national standards (e.g., JIS and CNS) have revisited
> unification: what was unified as a single character has occasionally
> become two characters. One of the two characters has become a non-BMP
> character. Second, quite a few Chu Nom characters are non-BMP characters.
> Some of the compatibility ideographics, namely U+FA0E, U+FA0F, U+FA11,
> U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29,
> have become normal ideographics AFTER XML 1.0 was created. Addition of
> these characters is very useful.
> MURATA Makoto
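
As a rough check on the last point in the quoted message, the twelve compatibility ideographs can be compared against the Ideographic production of XML 1.0. The ranges below are copied by hand from that one production (pre-Fifth-Edition), and the names XML10_IDEOGRAPHIC, COMPAT_IDEOGRAPHS and is_xml10_ideographic are hypothetical helpers for this sketch only. It does not consult the other name-character productions.

```python
# Ideographic production of XML 1.0 (pre-Fifth-Edition), copied by hand.
XML10_IDEOGRAPHIC = [
    (0x4E00, 0x9FA5),
    (0x3007, 0x3007),
    (0x3021, 0x3029),
]

# The twelve code points named in the quoted message.
COMPAT_IDEOGRAPHS = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]


def is_xml10_ideographic(cp):
    """Return True if the code point lies in the excerpted Ideographic ranges."""
    return any(lo <= cp <= hi for lo, hi in XML10_IDEOGRAPHIC)


# Collect the listed code points that the Ideographic production does not cover.
outside = [cp for cp in COMPAT_IDEOGRAPHS if not is_xml10_ideographic(cp)]
print(len(outside), "of", len(COMPAT_IDEOGRAPHS), "fall outside the Ideographic ranges")

# U+20000 is the first code point of the non-BMP ideograph block added in Unicode 3.1.
print("U+20000 covered:", is_xml10_ideographic(0x20000))
```

None of the listed code points, nor the non-BMP ideographs, are covered by these ranges, which is why adding them requires a change to the name-character rules themselves.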