xml-dev - Re: [xml-dev] Tokenizer question


To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Tokenizer question
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Sun, 14 Jul 2002 20:13:48 +1000
References: <2C61CCE8A870D211A523080009B94E430752B663@HQ5> <000e01c22b12$bc9237a0$483aea0c@attbi.com>

From: "zhengyu" <zhengyu@attbi.com>


> I was reading W3C documents early today. Boy, how complicated the
> character-set definitions are!!

> I can't help but wonder: does anyone really bother implementing all these
> in their tokenizer at all? If they
> really do, how incredibly slow is it going to be?

Speed is not the only criterion for what makes a good markup language.

In the XML rules, you only need to look for whitespace or delimiters
to scan incoming text to find a name.  You never need to check the
characters of an end-tag: you just need to match them against the
characters of the start-tag.  Most documents use only ASCII or
Latin-1 characters for markup, so these need only a range test
(<= 0xFF) and a lookup of a single entry in a 256-entry table to classify,
and chances are much of the table will fit into the CPU's cache and so
not really cost that much.  It is prudent to disallow characters that
can be used as delimiters in other languages (of course <, >, &, %, ", ', ?, /
for XML, and = for URLs, though the horse has bolted on -, :, . and _)
and to treat digits specially, so you have to test for those characters in
the ASCII range anyway.
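
To make the range test plus table lookup concrete, here is a minimal
sketch in Python.  The flag names, the table contents and the helper names
(build_table, is_name_char, scan_name) are my own for illustration, and the
table only approximates the XML 1.0 name rules for the Latin-1 range.  The
path for characters above 0xFF is a stand-in (str.isalpha), not the real
XML tables.

```python
# Illustrative sketch: all names here are hypothetical, and the table only
# approximates the XML 1.0 name rules for code points 0x00-0xFF.
NAME_START = 1  # character may begin a name
NAME_CHAR = 2   # character may appear after the first position


def build_table():
    table = [0] * 256
    for lo, hi in ((ord("a"), ord("z")), (ord("A"), ord("Z"))):
        for cp in range(lo, hi + 1):
            table[cp] = NAME_START | NAME_CHAR
    for cp in b"_:":
        table[cp] = NAME_START | NAME_CHAR
    for cp in b"0123456789-.":
        table[cp] = NAME_CHAR  # digits, '-' and '.' cannot start a name
    # Latin-1 letters, skipping the multiplication and division signs.
    for cp in range(0xC0, 0x100):
        if cp not in (0xD7, 0xF7):
            table[cp] = NAME_START | NAME_CHAR
    table[0xB7] = NAME_CHAR  # middle dot
    return table


TABLE = build_table()


def is_name_char(ch, first):
    cp = ord(ch)
    if cp <= 0xFF:
        # Fast path: one range test and one table entry.
        return bool(TABLE[cp] & (NAME_START if first else NAME_CHAR))
    # Assumption: a real parser would consult the full XML tables here;
    # str.isalpha() is only a placeholder for that slower path.
    return ch.isalpha()


def scan_name(text, pos=0):
    """Return (name, index just past the name), or raise ValueError."""
    if pos >= len(text) or not is_name_char(text[pos], True):
        raise ValueError("no name at position %d" % pos)
    end = pos + 1
    while end < len(text) and is_name_char(text[end], False):
        end += 1
    return text[pos:end], end
```

Whitespace and delimiters simply have a zero entry in the table, so the
same lookup that validates a character also finds the end of the name.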

So actually the XML 1.0 name rules need cause no performance penalty
for people who are just using ASCII or Latin-1 characters in names.
If they do cause one, that is an implementation decision.
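
As one illustration of such an implementation decision, the earlier point
that an end-tag name need only be compared with the start-tag name could
be sketched like this in Python.  The function name and its conventions
(pos points just past "</", -1 means no match) are assumptions of mine,
not anything from the XML specification.

```python
def end_tag_matches(text, pos, open_name):
    """Compare the end-tag name at text[pos:] with the open element's name.

    Hypothetical helper: pos is assumed to point just past '</'.  Returns
    the index just past the name on a match, or -1 otherwise.  No
    per-character name classification is done here, only a comparison.
    """
    if not text.startswith(open_name, pos):
        return -1
    end = pos + len(open_name)
    # The name must be followed by whitespace or '>', so that an open
    # element "para" is not matched by "</parameter>".
    if end < len(text) and text[end] in " \t\r\n>":
        return end
    return -1
```

Since the start-tag name was already checked when it was scanned, a
mismatch here is a well-formedness error whatever characters the end-tag
contains.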

And for people using characters outside that range: if they are
using Chinese characters, then they are probably using about half
the number of characters anyway, so the performance impact
of testing each character is relatively smaller.

What do you gain by these tests?

Here are five things:

1) Robustness by detecting some kinds of encoding errors
   - see http://www.topologi.com/public/XML_Naming_Rules.html

2) Baseline readability
   - no non-graphical characters are allowed, so you won't need a
     hex editor to view what your names actually are. (Normalization
     is also appropriate for XML documents for the same reason.)

3) Near compatibility with the Unicode Consortium's guidelines on
   characters suitable for identifiers.  As programming languages implement
   these guidelines more, XML names can be used as tokens in
   programming languages.

4) Accessibility.  Symbols and marks typically have no "reading" in
   speech synthesizers or Braille readers, so allowing such characters
   creates a disability where none needs to exist.

5) A clear message to implementers that if they do not accept
   characters outside ASCII in XML names, they do not conform.

So the rules provide a safety net, and then best practices can be
followed for the particular names chosen: for example, using
names taken from a single natural language.

Cheers
Rick Jelliffe
