OASIS Mailing List Archives


Re: [xml-dev] Correct xml:lang value for Pinyin Chinese vs Simplified Chinese

On Tue, Feb 28, 2012 at 3:55 AM, John Cowan <cowan@mercury.ccil.org> wrote:
> Rick Jelliffe scripsit:

>> And it usually has no accents. (If it has accents, in particular macrons,
>> it may not be standard Pinyin, which is not to say that it might not
>> be an old or extended Pinyin.)
> Standard Hànyǔ pīnyīn (汉语拼音) as used by the PRC, Singapore,
> and ROC governments, and standardized as ISO 7098:1982, definitely does
> have accents: one for each syllable (except for the toneless syllables),
> as shown in this sentence.

Yes, though as Wikipedia says "The tone-marking diacritics are commonly
omitted in popular news stories and even in scholarly works. An
unfortunate effect of this is the ambiguity that results as to which
words are being represented."
I said "usually", though someone could count occurrences (in the
press?) to resolve the issue better.

>> Language codes are in flux: the three letter codes and the two letter
>> codes have different approaches.
> Three-letter codes are never used for languages that have two-letter codes.
> Chinese as a whole has the two-letter code "zh", whereas Mandarin proper
> has the three-letter code "cmn".  For backward compatibility, "zh-cmn"
> also designates Mandarin.

The official language of the PRC is Mandarin. The official script is
Simplified.  zh-CN means Chinese as used in the PRC: non-official and
regional languages require disambiguation, but I don't see why zh-CN
does, apart from cheese-paring.

http://www.w3.org/International/articles/language-tags/ says "Avoid
region, script or other subtags except where they add useful
distinguishing information."  IIRC, Han Unification did not unify
characters with different stroke counts (in the original source
standards), so using zh with -Hans may not actually provide much
information useful to a renderer, though it may be more interesting
for bibliographic categorization.

says "If your application identified Mandarin Chinese in the past
using the language tag zh-CN (Chinese as used in Mainland China), or
even just zh, you can continue to use zh in this way. Using cmn or
cmn-CN may cause serious compatibility problems if the software or
users expect a tag such as zh."
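
To make the compatibility point concrete, here is a minimal sketch of
prefix-style matching in the manner of RFC 4647 basic filtering. The
function name basic_filter and the sample tag list are my own
illustration, not taken from any particular library. It shows why
software asking for "zh" would not pick up content tagged cmn or cmn-CN.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Return True if tag matches language_range under basic filtering.

    A range matches a tag when the two are equal, or when the range is a
    prefix of the tag and the next character in the tag is "-".
    Comparison is case-insensitive; "*" matches everything.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


# Hypothetical set of tags found on existing content.
tags = ["zh", "zh-CN", "zh-Hans", "zh-cmn", "cmn", "cmn-CN"]

# Software that expects "zh" only sees the zh-prefixed tags.
matched = [t for t in tags if basic_filter("zh", t)]
print(matched)  # ['zh', 'zh-CN', 'zh-Hans', 'zh-cmn']
```

Note that zh-cmn still matches a request for "zh", while the bare cmn
forms do not.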

>> Note that there is (or should be) no need to specify anything about
>> the script if you are just marking up existing text. @xml:lang
>> specifies the language, and the script only indirectly because a
>> language+region often has a standard or characteristic orthography:
>> the general script being used is obvious from the characters
>> themselves.
> You're out of date here.  xml:lang definitely can specify script, though
> it is not required to.

Yes I often am out-of-date!   s/specifies/typically specifies/
But the correct information needed in a language attribute depends on
the intent of the markup.  The region should not be dismissed as the
primary information of interest in marking up Chinese language, even
now that the borders are more open and computing is being done by
non-Mandarin speakers.
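
As a rough illustration of how language, script and region sit side by
side in one tag, the sketch below splits a tag into subtags by their
shape alone (length and character class). The function name split_tag
is hypothetical, and this is not a full BCP 47 parser: it does no
registry lookup and ignores extensions, private-use subtags and
grandfathered tags.

```python
def split_tag(tag: str) -> dict:
    """Classify the subtags of a language tag by shape only (a sketch)."""
    subtags = tag.split("-")
    result = {
        "language": subtags[0].lower(),
        "extlang": None,
        "script": None,
        "region": None,
        "variants": [],
    }
    for s in subtags[1:]:
        # Nothing after the language/extlang position has been seen yet.
        early = (result["script"] is None and result["region"] is None
                 and not result["variants"])
        no_region_or_variant = result["region"] is None and not result["variants"]
        if len(s) == 3 and s.isalpha() and early and result["extlang"] is None:
            result["extlang"] = s.lower()      # e.g. "cmn" in zh-cmn
        elif (len(s) == 4 and s.isalpha() and result["script"] is None
              and no_region_or_variant):
            result["script"] = s.title()       # e.g. "Hans", "Latn"
        elif ((len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit())) \
                and no_region_or_variant:
            result["region"] = s.upper()       # e.g. "CN", "TW"
        else:
            result["variants"].append(s.lower())  # e.g. "pinyin"
    return result


print(split_tag("zh-CN"))
print(split_tag("zh-Latn-CN-pinyin"))
```

For zh-CN only the region slot is filled; for zh-Latn-CN-pinyin the
script, region and variant slots are all filled.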

>> So you could use  xml:lang="zh-CN"  for all the three cases you
>> mention. If you wanted to give more of a hint, you could try
>> xml:lang="zh-CN-pinyin" or  "zh-Latn-CN-pinyin"  for the standard
>> pinyin,  and  xml:lang="zh-CN-pinyin-adhoc" or "zh-Latn-CN-adhoc" for
>> the non-standard one (where "adhoc" is some phrase you pick to
>> indicate an extended pinyin or mystery format.)

The problem I heard about tagging text as pinyin (with no language
code) is that proper names (people, places) are often transcribed
phonetically from the speaker's local language, rather than being read
as Mandarin. Consequently, the further your text moves from straight
Mandarin as written, say, by a Beijinger, the less satisfactory zh,
zh-CN, and zh-Hans will be.

"Since 1976, place names throughout China have been transliterated
into Pinyin so that they can be pronounced by local non-Mandarin
speakers. Thus, in Mongolia and Tibet, for example, Pinyin is the
system employed for spelling localities in phonetic form."

So the first thing to do is to determine whether the text has
region-specific features and, if it does, make sure that the
language code reflects those: SG, TW, HK, etc.  The closer it is to
official Beijing-style Mandarin with no phonetic proper names from
other languages or dialects, the more adequate plain zh, zh-CN,
zh-Hans and zh-pinyin would be, is my understanding.
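
The rule of thumb above could be sketched roughly as follows. The
function suggest_tag and its parameters are purely hypothetical, and
the choice to add Latn alongside the pinyin variant is an assumption
that follows the "zh-Latn-CN-pinyin" form suggested earlier in this
thread; treat it as an illustration, not a recommendation.

```python
from typing import Optional


def suggest_tag(region: Optional[str] = None, romanized: bool = False) -> str:
    """Build a candidate xml:lang value for Chinese text (a sketch).

    region    -- two-letter region code (e.g. "SG", "TW", "HK") if the text
                 has region-specific features, else None
    romanized -- True if the text is written in pinyin rather than Han
                 characters
    """
    parts = ["zh"]
    if romanized:
        parts.append("Latn")        # script subtag precedes the region
    if region:
        parts.append(region.upper())
    if romanized:
        parts.append("pinyin")      # variant subtag comes last
    return "-".join(parts)


print(suggest_tag())                        # zh
print(suggest_tag("sg"))                    # zh-SG
print(suggest_tag("CN", romanized=True))    # zh-Latn-CN-pinyin
```

Deciding whether a text actually has region-specific features is the
hard part, and that still needs a human reader.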

