
   Re: [xml-dev] Tokenizer question


From: "zhengyu" <zhengyu@attbi.com>


> I was reading W3C documents earlier today. Boy, how complicated the
> character-set definitions are!!

> I can't help but wonder: does anyone really bother implementing all of
> these in their tokenizer? If they really do, how incredibly slow is it
> going to be?

Speed is not the only criterion for what makes a good markup language.

In the XML rules, you only need to look for whitespace or delimiters
to scan incoming text to find a name.  You never need to check the
characters of an end-tag: you just need to match them against the
characters of the start-tag.  Most documents use only ASCII or
Latin-1 characters for markup, so these need only a range test
(<= xFF) and a lookup of a single entry in a 256-entry table,
and chances are much of the table will fit into the CPU's cache and so
not really cost that much.  It is prudent to disallow characters that
can be used as delimiters in other languages (of course <, >, &, %, ", ', ?, /
for XML, and = for URLs, though the horse has bolted on -, :, . and _),
and likewise leading digits, so you have to test for those characters
in the ASCII range anyway.
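
As a rough illustration of the table lookup described above, here is a
minimal Python sketch.  The function names are made up for this example,
the Latin-1 ranges are my reading of the XML 1.0 Letter, Digit and
Extender productions, and slow_is_name_char is only a stand-in for the
full tables needed above xFF, not a conforming check.

```python
# Sketch of a table-driven XML 1.0 name scanner for the Latin-1 range.
# All names here are illustrative, not taken from any real parser.

def _build_tables():
    """Build two 256-entry tables: name-start characters and name characters."""
    start = [False] * 256
    # Letters in the Latin-1 range (xD7 and xF7 are symbols, not letters).
    for lo, hi in ((0x41, 0x5A), (0x61, 0x7A),
                   (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)):
        for cp in range(lo, hi + 1):
            start[cp] = True
    start[ord("_")] = True
    start[ord(":")] = True
    part = list(start)
    for cp in range(0x30, 0x3A):      # digits: allowed, but not first
        part[cp] = True
    for ch in ".-\u00b7":             # full stop, hyphen, middle dot
        part[ord(ch)] = True
    return start, part

NAME_START, NAME_PART = _build_tables()

def slow_is_name_char(ch, first):
    # ASSUMPTION: placeholder for the full XML 1.0 tables above xFF.
    # str.isalpha/isalnum only approximate those tables.
    return ch.isalpha() if first else ch.isalnum()

def is_name_char(ch, first=False):
    cp = ord(ch)
    if cp <= 0xFF:                    # the range test
        return (NAME_START if first else NAME_PART)[cp]   # one table lookup
    return slow_is_name_char(ch, first)

def scan_name(text, pos=0):
    """Return the index just past the name at pos, or pos if none starts there."""
    if pos >= len(text) or not is_name_char(text[pos], first=True):
        return pos
    end = pos + 1
    while end < len(text) and is_name_char(text[end]):
        end += 1
    return end

def end_tag_matches(text, pos, start_name):
    """Match an end-tag name against the start-tag name without re-validating it."""
    after = pos + len(start_name)
    return (text.startswith(start_name, pos)
            and after < len(text) and text[after] in " \t\r\n>")
```

For ASCII and Latin-1 names the per-character cost is one comparison and
one indexed load; only characters above xFF reach the slower path.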

So actually the XML 1.0 name rules need cause no performance penalty
for people who are just using ASCII or Latin-1 characters in names.
If there is a penalty, it comes from an implementation decision.

And for people using characters outside that range, if they are
using Chinese characters, then they are probably using half
the number of characters anyway, so the performance impact
of testing characters is relatively less.

What do you gain by these tests?

Here are five things:

1) Robustness by detecting some kinds of encoding errors
   - see http://www.topologi.com/public/XML_Naming_Rules.html

2) Baseline readability
   - no non-graphical characters are allowed, so you won't need a 
  hex editor to view what your names actually are. (Normalization
  is also appropriate for XML documents for the same reason.)

3) Near compatibility with the Unicode Consortium's guidelines on
   characters suitable for identifiers.  As programming languages implement
   these guidelines more, XML names can be used as tokens in
   programming languages.

4)  Accessibility.  Symbols and marks typically have no "reading" in
  speech synthesizers or Braille readers, so allowing such characters
  creates a disability where none needs to exist.  

5) A clear message to implementers that if they do not accept
characters outside ASCII in XML names, they do not conform.
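
To make point 1 concrete, here is a small Python sketch of one kind of
encoding error the name rules can catch: a UTF-8 name misread as Latin-1.
The helper names are invented for this example, and the check covers
only the Latin-1 range (anything above xFF is simply rejected here) and
ignores the separate first-character rule.

```python
def is_latin1_name_char(ch):
    """Name-character test for code points up to xFF only (illustrative)."""
    cp = ord(ch)
    return (ch in ".-_:\u00b7"
            or 0x30 <= cp <= 0x39            # digits
            or 0x41 <= cp <= 0x5A            # A-Z
            or 0x61 <= cp <= 0x7A            # a-z
            or 0xC0 <= cp <= 0xD6            # accented letters, skipping
            or 0xD8 <= cp <= 0xF6            # the multiplication sign xD7
            or 0xF8 <= cp <= 0xFF)           # and the division sign xF7

def first_illegal(name):
    """Return the index of the first non-name character, or -1 if none."""
    for i, ch in enumerate(name):
        if not is_latin1_name_char(ch):
            return i
    return -1

intended = "caf\u00e9"                                   # "café"
# The same bytes written as UTF-8 but read back as Latin-1:
misread = intended.encode("utf-8").decode("latin-1")     # "cafÃ©"

print(first_illegal(intended))   # no illegal character found
print(first_illegal(misread))    # flags the copyright sign (xA9)
```

The second byte of the UTF-8 sequence lands on a symbol rather than a
letter, so the name is rejected instead of being silently accepted in a
mangled form.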

So the rules provide a safety net, and then best practices can be
followed for the particular names chosen: for example to
use names taken from a single natural language.

Cheers
Rick Jelliffe




 
