xml-dev - Re: About sml and internationalization

From: nisse@lysator.liu.se (Niels Möller)
To: Sean McGrath <digitome@iol.ie>
Date: 29 Nov 1999 17:06:18 +0100

Sean McGrath <digitome@iol.ie> writes:

> I am thinking about the issue to do with allowing/disallowing
> sets of Unicode characters in element type names as per XML
> 1.0.
> 
> If SML has very few special tokens
> e.g. "<", "&" and whitespace, what would happen
> if any character outside this teeny weeny set is
> allowed in an element type name?

I would say this is the way to go. And I have seen it done before,
both with eight-bit charsets like latin1 and with unicode.

It gives people the ability to shoot themselves in the foot by using
strange characters (my favourite is using non-breakable space in
variable names in emacs lisp). But I still think it is the way to go:
The parser and language can define a small set of characters as
special, and just pass on whatever is between those special characters
to the application.
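
A minimal sketch of that idea in Python, as an illustration only and not
any actual SML parser. The names SPECIAL and split_tokens are
hypothetical, and the exact set of special characters chosen here is an
assumption:

```python
# Hypothetical set of special characters; anything else is opaque to the parser.
SPECIAL = {"<", ">", "&", " ", "\n"}

def split_tokens(text):
    """Yield each special character on its own, and each run of
    non-special characters as a single uninterpreted token."""
    run = []
    for ch in text:
        if ch in SPECIAL:
            if run:
                yield "".join(run)
                run = []
            yield ch
        else:
            run.append(ch)
    if run:
        yield "".join(run)

# A non-breakable space (U+00A0) inside a name is passed through untouched.
print(list(split_tokens("<a\u00a0b>text")))
```

The parser never inspects the runs it passes on, so whether a strange
character in a name is acceptable becomes the application's decision.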

If you think about it this way, most of the charset considerations can
be removed from the parser. Treat the input as a sequence of
non-negative integers (which may be 7, 8 or 36 bits wide, depending on
the application; if you think in C++, the parser could be a template
parameterized on the character type). If an application needs to
handle several charsets, it can use something like a
"content-type: text/sml; charset=iso-8859-2" header to decide how to
convert the input into unicode before feeding it into the parser.
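
As a rough sketch of that division of labour, in Python rather than as
a C++ template: the application decodes the bytes using the declared
charset, and the parser only ever sees integers. The function names
to_code_points and element_names are hypothetical, and treating ">" as
the closing delimiter is an assumption made for the example:

```python
def to_code_points(raw, charset):
    """Application side: decode bytes with the declared charset and
    hand the parser a plain sequence of non-negative integers."""
    return [ord(ch) for ch in raw.decode(charset)]

LT, GT = 0x3C, 0x3E  # ASCII values of "<" and ">"

def element_names(code_points):
    """Parser side: collect the integers between each "<" and the
    next ">", without interpreting them in any way."""
    names, current = [], None
    for cp in code_points:
        if cp == LT:
            current = []
        elif cp == GT and current is not None:
            names.append(current)
            current = None
        elif current is not None:
            current.append(cp)
    return names

raw = "<\u0142\u00f3d\u017a>".encode("iso-8859-2")
print(element_names(to_code_points(raw, "iso-8859-2")))
```

The parser compares against only two integer constants, so it works
the same way whether the integers came from latin-2, unicode, or
anything else that embeds ASCII.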

One could define the special characters more abstractly, and leave it
to the application to tell the parser how an "<" is represented today,
but I think that's overabstracting things. Using plain ascii values
(possibly embedded into an ascii superset like unicode or latin-2)
should be good enough.

This line of thinking also means that "whitespace", as far as the
parser is concerned, should be limited to a few ascii characters. SPC
and NL ought to be enough. To keep with tradition, perhaps TAB and CR
as well. Having the parser recognize all unicode whitespace characters
as whitespace adds some complexity. (There are 5 spacing control
characters in traditional ASCII, plus ordinary space, non-breakable
space (in latin-x and unicode), and an additional 18 in the rest of
unicode. I.e. 25 in all).
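
To illustrate how small the parser's notion of whitespace can be, here
is a sketch in Python. PARSER_WHITESPACE and is_parser_whitespace are
hypothetical names, and including TAB and CR is the "keep with
tradition" option rather than a requirement:

```python
# SPC, NL, plus TAB and CR for tradition. Nothing outside ASCII.
PARSER_WHITESPACE = frozenset({0x20, 0x0A, 0x09, 0x0D})

def is_parser_whitespace(cp):
    """True only for the few ASCII code points the parser treats as
    whitespace; every other code point is ordinary content."""
    return cp in PARSER_WHITESPACE

# Non-breakable space (U+00A0) is whitespace to Python's str.isspace(),
# but it is ordinary content as far as this parser is concerned.
print(is_parser_whitespace(0x20), is_parser_whitespace(0xA0), "\u00a0".isspace())
```

A fixed four-element set needs no character database, which is the
complexity the parser avoids by not following unicode here.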

/Niels
