OASIS Mailing List Archives



Re: XML Blueberry (non-ASCII name characters in Japan)

John Cowan wrote:

>The solid reason is that there are people in this world who
>cannot write XML documents in their native language and writing

How many? You imply that numbers don't matter; that if there are two people whose native tongue is not represented (and whose native tongue will die with them) then we need to revise XML to support them. I'm not that much of a fanatic. I think that the number of people who speak a language, and who don't have a reasonable alternative, does matter. I do not accept a blanket claim that all living languages must be supported no matter how few people speak them and use them with computers. 

>> What words can be used that are not now used that people
>> would actually need to use in markup?
>Do you expect someone to generate a list of all the nouns, verbs,
>and adjectives in Amharic, Burmese, Canadian aboriginal languages, Cherokee, Dhivehi, Khmer, Oromo, Syriac, Tigre, and Yi?

No, I was specifically thinking about the additional Han ideographs for Japanese and Chinese. For the scripts you mention it would be enough to list the languages along with some documentation of the number of people who speak them and prefer to write code in their native tongues. (Note that it's important to distinguish between writing code and writing text. Most people do not and will not write markup in any language. The distinction keeps getting glossed over, but it's not as if Tigre, Yi, Khmer, and all these others can't be used today. They can be.)
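To make the markup/text distinction concrete, here is a minimal Python sketch, assuming the standard library's xml.etree.ElementTree parser. The element names (book, title) and the Khmer sample string are made up for illustration and come from no real document. The names are plain ASCII, while the character data is Khmer, which XML 1.0 already allows in text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: Khmer character data, written with escapes so the
# source file itself stays ASCII.
khmer_text = "\u1797\u17b6\u179f\u17b6\u1781\u17d2\u1798\u17c2\u179a"

# The markup (element names) is ASCII; only the text content is Khmer.
document = "<book><title>" + khmer_text + "</title></book>"

# Parse the document and read the text back out.
root = ET.fromstring(document)
title_text = root.find("title").text

# The Khmer text survives parsing unchanged.
print(title_text == khmer_text)
```

This only shows that text in these scripts is usable today. Whether the same characters may appear in element and attribute names is the separate question under discussion.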

>> Of the scripts and languages in question,
>> the only one that gives me pause is Ethiopic because that's the
>> only one that has a large user community that is not yet adequately
>> (though perhaps imperfectly) addressed.
>What makes them superior in this respect to Burmese, Dhivehi, Khmer,
>or Yi?

Burmese is my mistake. I thought it was in Unicode 2.0, but apparently not. Ditto for Khmer. Dhivehi I've never heard of, and it doesn't seem to be in Unicode 3.0. I can't find it in any of my references, at least under that name. Is it new in 3.1? Wait, I just found it on the Internet. Its script is called Thaana in Unicode, and the language is spoken in the Maldives by about 250,000 people. It might or might not have an established Roman transliteration. The web sites I looked at were unclear on this point.

Yi is definitely different. There is an established Roman-based alphabet for it. It may not be the preferred script for all Yi speakers, but it's adequate for markup.

In any case, Burmese and Khmer are genuinely different scripts that don't seem to have accepted mappings into any other scripts. However, they are both relatively small scripts that can fit into the upper half of a one-byte font even if the purported character set is something completely different, like 8859-1. In fact, I suspect that's how they're used today. I know that's how the Ethiopic languages are used, though in the case of Amharic there actually are a few more characters than can fit into one byte, which probably explains why Amharic fonts are so painful to work with today.

Let's try to put some numbers on this. For Burmese, Dhivehi, Khmer, and the Ethiopic languages we're probably talking in the ballpark of 100 million people (source: Kenneth Katzner, Languages of the World). Of these 100 million people, how many are likely to write markup? Again, I'm talking about markup, not text. Of those, how many would prefer to use their native language as opposed to Chinese, Arabic, English, or something else?

I suggest that a reasonable means of answering at least the latter of these questions would be to investigate the computer science programs in the countries and regions where these languages are spoken. If they're taught primarily in Burmese, Dhivehi, Khmer, etc., then I think it's plausible to assume that markup writers in these languages would use their native tongue. On the other hand, if it turns out that some other language is the accepted language for technical communication within these countries, then I propose that support for these scripts is not necessary in XML markup.

| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      | 
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |