(I smell a troll.)
Henri Sivonen wrote:
> BTW, is there any actual research about the demand for non-ASCII
> element names? XML 1.0 allows a large chunk of non-ASCII on element
> names. Is any real-world XML vocabulary actually exercising the
> freedom to go beyond ASCII in element and attribute names (except
> perhaps some vocabulary that is only used in Japan)?
What the **** does that question mean? That element names only used in
one country should not be supported in a standard designed to suit the
whole world? It is a simple fact that ASCII transliterations of many
languages, in particular those with tonal pronunciation, homophones and
ideographic scripts, can frequently be incomprehensible. (Add to this
that there are regional concepts (e.g. in addresses) for which there may
be no English analog.) The most direct way of putting the question is
"Why should W3C put out a standard that arbitrarily makes things easier
for white people than for yellow people?" A space can easily be replaced
by a "_": what should the ideograph for a mountain be replaced by: the
sound, the meaning, a translation? How does a reader reconstruct the
ideograph?
XML's name rules are important precisely because they don't adopt the
bogus minimalist approach. I am not saying that anyone who wants
ASCII-only markup is a greedy, lazy, selfish, unjust, uncaring,
clock-back-turning, unpragmatic racist or Western supremacist; on the
contrary, there are lots of reasons why an organization or individual
*should* use ASCII for Western and international document types. But ISO
standards like SGML must support international requirements, and W3C
profiles like XML must support world-wide adoption.
A less inflammatory response is that the importance of names in markup
is not that they are easy to write, but that they are meaningful to
read. The better analogy to make isn't the inconvenience of making you
write in ASCII, but the inconvenience if you had to write using, say, Greek
characters. You probably could do it, but it would add a layer of
inconvenience that would probably make you avoid using the technology
where you had a choice.
<boring_old_geezer_mode>I designed the original naming rules that XML
1.0 adopted pretty much intact. In 1994-5 or so, I had been given a
project by Allette Systems in Australia to figure out why adoption of
SGML was slow in East Asian countries. During this time I visited
several Asian countries (I had learned SGML while living in Japan
working in publishing) and made contacts with many people in publishing.
I made up something called the ERCS (Extended Reference Concrete
Syntax), with input from many people: Gavin Nicol, Tony Graham and James
Clark are three Western names who have posted to XML-DEV, for example.
Included in ERCS were some features that were adopted by XML (for
example, that any Unicode character should be available via hex
character references, regardless of the document's encoding), in
particular to support "native language markup". (Note: not "natural
language markup".) ERCS was
adopted by the SPREAD (Standardization Project Regarding East Asian
Documents) of the CJK DOCP (China/Japan/Korea Document Processing
Experts Group) which was a liaison group between industry, academia and
standards bodies. That gave ERCS enough credibility that it was
already a pretty workable package by the time XML came along. (The
SPREAD entities occasionally crop up but are obsoleted by XML; the W3C Charmod
spec mentions them though, which is nice.)
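For concreteness, here is a throwaway fragment made up for this mail
(it is not from ERCS or any real vocabulary): the element and
attribute names are written directly in the native script, and the
second element reaches the same content through hex character
references, which work whatever the document's encoding is:

   <?xml version="1.0" encoding="UTF-8"?>
   <山一覧>
     <!-- 山一覧 "list of mountains", 山 "mountain", 高さ "height";
          the second 山 element writes 富士山 (Mount Fuji) using hex
          character references instead of the raw characters -->
     <山 高さ="3776">富士山</山>
     <山 高さ="3776">&#x5BCC;&#x58EB;&#x5C71;</山>
   </山一覧>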
More info, if anyone cares, is at
http://xml.ascc.net/en/utf-8/ercsretro.html
Here is what the ERCS document, now 10 years old, says about Native
Language Markup:
"Much of the value of using SGML markup, especially for structure-based searches in hypertext, is that the tag names and other markup can have meaning to the
user rather than being cryptic mnemonics. This is most true for SGML documents that contain fielded data. So the provision of native-language tagging is a key
facility that SGML will need to supply to be successful.
"So the best concrete syntax for a given character set is one that does not artificially or gratuitously restrict what characters are available for use as
markup. In the absence of other factors, if a character appears in words in the native language, it should be available for use in NAMEs. And similarly, if a
symbol character is readily available from the keyboard, it should be available for use in short references.
In particular, it is important to recognize that NAMEs in XML/SGML are
not just used for element and attribute names, but also for IDs. An ID
is often taken from the value of content, which will usually be in
some native language. If you have looked at C programs written by
Chinese or Japanese programmers, for example, you will see that
people like using their native language (and script) for writing the
names of things.
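To make that concrete, here is another fragment invented for this
mail; assume the id and ref attributes have been declared as ID and
IDREF in the DTD. The ID value is simply taken from the chapter's
native-language title, which XML's name rules permit:

   <!-- 章 "chapter", 題 "title", 序論 "introduction",
        参照 "cross-reference" -->
   <章 id="序論">
     <題>序論</題>
   </章>
   <参照 ref="序論"/>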
So the XML 1.0 naming rules are the end of a chain that began by looking
at how to solve an actual problem with the acceptability of ASCII-only
markup. A solution was worked out in consultation with people from East Asia
and the ASCII West. It was adopted by standards bodies, and influenced
SGML, HTML and most directly XML. XML's enormous popularity has often
been attributed (notably by Tim Bray) in large part to its good
bottom-line for internationalization.
(XML 1.1 removes the checks on particular characters, but in the
direction of more openness, not more restrictiveness, which is good from
this perspective at least. But it is interesting to see someone saying
the world is not ready for names above the first seven bits of Unicode,
while XML 1.1 had discussion about whether the world was ready for names
above the first 16 bits of Unicode :-)
Murata Makoto is speaking at XTech 2005 in a fortnight's time on the
Japanese Government's adoption of XML. It will be interesting to see to
what extent they use ASCII.
Cheers
Rick Jelliffe