xml-dev - Re: Foreign Names

From: Rick JELLIFFE <ricko@geotempo.com>
To: "XML Developers' list" <xml-dev@xml.org>
Date: Wed, 19 Apr 2000 13:43:26 +0800

Apart from alphanumerics, there are no characters that can be guaranteed
safe for any use: even ".", "-" and "_" are unsafe in a certain sense--
that sense is that if one tries to use XML identifiers as identifiers in
some target system (e.g. classnames, variable names, link names) then
the data is constrained by the lexical specs of the target language. 
XML nods in this direction by not allowing 0-9 as the first
character in a name, but even alphanumerics are not safe because a
processing language may have length rules.  SGML has a rule that, in its
default syntax, a name cannot be longer than 8 characters for that
reason. (This was a default that everyone immediately overrode. They had
to do this using the SGML Declaration, which is a kind of 'features
manifest'. Figuring out this declaration was what gave people
migraines about SGML: it made things too abstract and variable.)
Similarly, case is dangerous, because a target system may fold case or
treat names case-insensitively.
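
As a rough illustration of the target-language constraint, here is a
minimal sketch that checks whether an XML name could be reused directly
as an identifier in Python. Python is only one possible target system,
and the function name usable_as_python_identifier is made up for this
example.

```python
import keyword


def usable_as_python_identifier(xml_name: str) -> bool:
    """Return True if xml_name could be used unchanged as a Python name.

    This checks Python's lexical rules only; other target languages
    have their own rules (length limits, case folding, and so on).
    """
    return xml_name.isidentifier() and not keyword.iskeyword(xml_name)


# Names that are legal in XML but clash with Python's lexical rules.
for name in ["sonsDog", "son_dog", "son-dog", "son.dog", "dc:title", "class"]:
    print(name, usable_as_python_identifier(name))
```

Here "son-dog", "son.dog" and "dc:title" fail on punctuation, and
"class" fails because it is a reserved word in the target language.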

So the only way to "guarantee" potential interoperability with unknown
software is to use the default SGML rules, and use only one case: so the
names must be formed by (picking lowercase)  [a-z][a-z0-9]{7}    I
think this is a prudent template to use for automatically generated IDs
in particular. 
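
One way to generate IDs that fit this template is to encode a counter
in base 36 and pad it to seven characters after a fixed first letter.
This is only a sketch: the function names and the default prefix "x"
are assumptions for illustration, not part of any standard.

```python
import re
import string

# The conservative template: a lowercase letter followed by seven
# lowercase letters or digits (8 characters in total).
SAFE_ID = re.compile(r"[a-z][a-z0-9]{7}")

# Base-36 digits: 0-9 then a-z.
ALPHABET = string.digits + string.ascii_lowercase


def make_safe_id(n: int, prefix: str = "x") -> str:
    """Return an 8-character ID for counter value n (hypothetical helper)."""
    if len(prefix) != 1 or prefix not in string.ascii_lowercase:
        raise ValueError("prefix must be a single lowercase ASCII letter")
    if not 0 <= n < 36 ** 7:
        raise ValueError("counter out of range for seven base-36 digits")
    digits = []
    for _ in range(7):
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return prefix + "".join(reversed(digits))


def is_safe_id(name: str) -> bool:
    """Check a name against the conservative template."""
    return SAFE_ID.fullmatch(name) is not None
```

For example, make_safe_id(0) gives "x0000000" and make_safe_id(36)
gives "x0000010". A counter keeps the IDs unique within one run, which
random generation would not guarantee.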

When one uses namespace prefixes, then most bets are off for simple
translatability of XML names->target-language identifiers: the ":"
character or the two-part names both require some intermediate mapping
stage or some rearrangement of the design (i.e. passing the name as an
argument to a function rather than trying to evaluate it as the name of
a function).
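
The rearrangement described above, passing the name as an argument
rather than evaluating it as a function name, might look something like
the following sketch. The "dc" prefix, the handler functions and the
HANDLERS table are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def split_qname(qname: str) -> Tuple[str, str]:
    """Split 'prefix:local' into (prefix, local); prefix is '' if absent."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        return "", prefix
    return prefix, local


def handle_title(qname: str, text: str) -> str:
    return "title=" + text


def handle_default(qname: str, text: str) -> str:
    # The XML name travels as plain data, so it never needs to be a
    # legal identifier in the processing language.
    return qname + ": " + text


# Intermediate mapping stage: two-part names are keys in a table.
HANDLERS: Dict[Tuple[str, str], Callable[[str, str], str]] = {
    ("dc", "title"): handle_title,
}


def dispatch(qname: str, text: str) -> str:
    handler = HANDLERS.get(split_qname(qname), handle_default)
    return handler(qname, text)
```

With this design, dispatch("dc:title", "Beaches") reaches handle_title
even though "dc:title" could never be a function name itself.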

Markup languages, as an approach, place a high premium on human
readability. But this is not the same as literate programming. The
identifier rules are not perfectly able to allow written identifiers in
the normal conventions of natural language (except perhaps for Chinese
written with no spaces or punctuation).  For example, in English we use
spaces to separate adjectives and nouns: in XML we have to use camel
case or "_", "-" or ".".  (German is better for this.) In French, they may use
apostrophes inside words, and in English too: instead of <family><son's
dog/></family> we have to use <family><sonsDog/></family> or some
convention like that for the XML. This is why I try to use the term
"native language markup" rather than "natural language markup", to
emphasize that the identifiers will be to some extent artificial (i.e.,
not what the person would prefer to write or speak) but that the
important thing is allowing native expression.  The term Natural
Language Markup ropes in an entirely different set of concerns to Native
Language Markup.  In literate programming, we would expect that the
document forms sentences which make sense when read in the conventional
ways one might read literature: I don't think markup needs to promise
that. 

Instead, what Native Language Markup gives us is directness. If I am
making a DTD for use in sending data between surf-lifesaving clubs to
describe beaches, I can make an element <rip> without worrying that
perhaps some other country would use <undertow>. And if I am Malaysian
and doing the same, I don't need to worry about what English says, I can
just use the Bahasa Malaysia equivalent (their language: Bahasa Malaysia
and Bahasa Indonesia are both written using the Latin alphabet).  And if
I am an Okinawan programmer, I can use the Japanese term (or the
Okinawan word if I am brave) using kanji (or kana).  It all depends on
the perceived usage of the document.  For temporary inhouse data where
the receiving application uses hashing or paths to find data (rather
than instantiating classes or calling functions with the same name as
the element) it would be more prudent to make maximum use of Native
Language Markup rather than restrict it. 
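
To make the hashing-or-paths case concrete, here is a small sketch
using Python's standard xml.etree.ElementTree. The document and its
Japanese element names are invented for illustration. The element names
are only ever used as dictionary keys or path strings, so the lexical
rules of the processing language never come into play.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-house beach report with native-language element names.
DOC = "<浜><離岸流>あり</離岸流></浜>"


def fields_by_name(xml_text: str) -> dict:
    """Index the root's child elements by tag name (plain hashing)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


# Lookup by hashing: the element name is just a string key.
by_hash = fields_by_name(DOC)["離岸流"]

# Lookup by path: the element name is just a step in a path string.
by_path = ET.fromstring(DOC).findtext("離岸流")
```

Neither lookup requires a class or function named after the element.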

But conservatism does have its place. Indeed, if one's model of how the
universe works includes the idea that XML markup will be hidden from
everyone, who will only see it through some UI, then XML could use any
characters for names (i.e. anything in the range from the most
conservative to the most bizarre).   But those extremes have been tried,
and XML is a response to them. Limited names were SGML's default, and
this caused SGML to fail in countries that do not use the Latin
alphabet. Unstandardized binary was tried too, and it sometimes does not
meet certain lifetime-related requirements, such as when the
documentation is lost and people try to figure out what the format
means.  Still, we can predict that there will always be attempts
to reduce XML back into a minimal ASCII format or bloat it into a binary
format.

Rick Jelliffe
