OASIS Mailing List Archives

Re: XML Blueberry (non-ASCII name characters in Japan)

At 4:01 PM -0400 7/9/01, John Cowan wrote:
>Elliotte Rusty Harold wrote:
>> No, I was specifically thinking about the additional Han ideographs
>> for Japanese and Chinese.
>I admit that the case for including these specifically is much
>weaker than for the languages with previously unencoded scripts.
>However, in for a penny, in for a pound (see below); it will
>not cost much more to handle them all.

I agree. If you can convince me that there's a clear need to break XML 1.0, then there's no reason not to fix all the other encodings as well. It's just that so far I haven't seen that one clear reason to break it. (I do think NEL should die on its own demerits irrespective of the other encodings.)

>> For the scripts you mention it would be
>> enough to list the languages along with some documentation of the
>> number of people who speak them, and prefer to write code in their
>> native tongues.
>Until now, many of these people have been unable to write code
>in their native tongues.  Further, I don't think that writing markup
>is to be identified with writing code, though I admit that it is
>different from writing plain text.  Plenty of people can and do mark up
>documents structurally who are utterly innocent of programming.

Granted. Mainly I'm trying to make the distinction between writing and writing markup. Writing markup is not the same thing as programming, but it is a specialized technical activity that is performed by a small minority of users. Even HTML, the most successful markup language ever, is typed by hand by fewer and fewer users every day; more and more of them use tools like Dreamweaver instead. XML will be no different in this respect. Most users will interact with it through GUIs, and never need to see their markup. That makes native language markup a lot less important than it otherwise would be.

>> It might or
>> might not have an established Roman transliteration. The web sites
>> I looked at were unclear on this point.
>> Yi is definitely different. There is an established Roman based
>> alphabet for it.  It may not be the preferred script for all Yi
>> speakers, but it's adequate for markup.
>That is no argument.  There are several established Latin-script
>transliterations for Greek, and every educated Greek-speaker knows and
>uses the Latin script, but Greeks want to write Greek in the Greek
>script.  Wherefore it is encoded in Unicode and other encodings,
>and allowed in XML names.

But these are not the same thing. Greek was in Unicode 2.0, and therefore could be included in XML names without significant cost. Yi is not in Unicode 2.0 and therefore cannot be included in XML names without significant cost. Nobody is arguing that Yi should be kept out on its own demerits, or that it would have been kept out if Unicode 3.0 had been finished before XML 1.0. But the question we have to answer today is whether there is sufficient benefit to adding the Yi script today to justify breaking the entire existing XML infrastructure and introducing more incompatibility into the XML world. Given that the Yi language can be used in XML markup today, even if the Yi script can't, I don't think the possible benefits outweigh the costs.
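
To make the Greek/Yi difference concrete, here is a rough sketch of the kind of range check a parser performs on the first character of a name. The table is a deliberately tiny excerpt of the XML 1.0 name-start ranges, not the full production, and the names XML10_NAME_START_EXCERPT and in_excerpt are made up for this illustration.

```python
# Illustrative only: a small, incomplete excerpt of the XML 1.0
# name-start ranges (BaseChar and Ideographic), which were derived
# from Unicode 2.0. A real parser carries the complete table.
XML10_NAME_START_EXCERPT = [
    (0x0041, 0x005A),  # Latin capital letters A-Z
    (0x0061, 0x007A),  # Latin small letters a-z
    (0x03A3, 0x03CE),  # part of the Greek block
    (0x4E00, 0x9FA5),  # Han ideographs encoded in Unicode 2.0
]


def in_excerpt(ch: str) -> bool:
    """Return True if ch falls inside one of the excerpted ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML10_NAME_START_EXCERPT)


if __name__ == "__main__":
    samples = {
        "Greek small alpha": "\u03b1",
        "Han U+4E2D": "\u4e2d",
        "Han Extension A U+3400 (Unicode 3.0)": "\u3400",
        "Yi syllable U+A000 (Unicode 3.0)": "\ua000",
    }
    for label, ch in samples.items():
        print(f"{label}: {in_excerpt(ch)}")
```

Changing the answer for the last two samples means changing the table, and every deployed parser carries its own copy of that table. That is the cost I keep pointing at.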

>> In any case, Burmese and Khmer are genuinely different scripts that
>> don't seem to have accepted mappings into any other scripts.
>> However, they are both relatively small scripts that can fit into
>> the upper half of a one-byte font even if the purported character
>> set is something else completely like 8859-1. In fact, I suspect
>> that's how they're used today.
>I don't understand the relevance of this.  Adding so much as one
>Unicode character breaks compatibility just as much as adding
>40,000 plus.  Why are relatively small scripts to be privileged
>in this process? Or are these users to be stuck with font-kludge
>encodings forever?  (More accurately, are they to use Unicode
>for plain-text documents, but not for marked-up ones?)

You keep ignoring the costs. This is not a win-win situation. It's a win-lose. By modifying XML to support these scripts you gain some benefit for some class of users. But you incur costs on some other, possibly overlapping class of users. As yet the class that benefits is entirely theoretical. Nobody has yet managed to show me even a single person who wants and needs to use these characters in markup, much less a large group of such people. Before the investment is made, I want a reasonable estimate of the expected return backed up by some hard data.

One of the reasons these scripts were not in Unicode 2.0 is that the Unicode Consortium will not encode a script without the participation of and communication with experts in the script in question. Sometimes this means it takes longer to encode the script, especially for less computerized societies, but it does mean they avoid a lot of mistakes. Aside from Japanese and Chinese (for which the weakest cases have been made), we don't seem to have heard from experts in any of the languages in question about this.

I'd be a lot more convinced by one professor of comp sci at Phnom Penh University telling me her students needed to write markup in Khmer or one IT staff person at the Addis Ketema telling me that they employed monolingual secretaries who needed to mark up documents in Amharic than I would by all the postings from all the Anglophones we've seen so far.

>> Let's try and put some numbers on this. For Burmese, Dhivehi, Khmer,
>> and the Ethiopic languages we're probably talking in the ballpark
>> of 100 million people. (source: Kenneth Katzner, Languages of the
>> World). Of these 100 million people how many of them are likely to
>> write markup?
>Quien sabe?  If you build it, they will come.

Do you really think that all 800 million Spanish speakers are going to start writing markup in Spanish? That all one billion Chinese speakers are going to start writing markup in Chinese? It's ridiculous to assume that more than a tiny minority of speakers are going to write markup in any language, no matter how well supported. 

>> I suggest that a reasonable means of answering at least the latter
>> of these questions would be to investigate the computer science
>> programs in the countries and regions where these languages are
>> spoken. If they're taught primarily in Burmese, Dhivehi, Khmer,
>> etc. then I think it's plausible to assume that markup writers in
>> these languages would use their native tongue.
>Again, I think this connection between markup and programming
>is unwarranted.  I can throw a stone from where I am sitting
>now and hit several persons who can do markup perfectly well
>but cannot program at all.

OK. So propose an alternative. How do you suggest proving that there's a genuine need to write markup in these scripts? Don't ask me to accept it on faith. I don't. 

| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      | 
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |