From: "zhengyu" <zhengyu@attbi.com>
> I was reading W3C documents early today. Boy, how complicated the
> character-set definitions are!!
> I can't help but wonder: does anyone really bother implementing all of these
> in their tokenizer at all? If they
> really do, how incredibly slow is it going to be?
Speed is not the only criterion for what makes a good markup language.
In the XML rules, you only need to look for whitespace or delimiters
to scan incoming text to find a name. You never need to check the
characters of an end-tag: you just need to match them against the
characters for the start-tag. Most documents use only ASCII or
Latin-1 characters for markup, so each character needs just a range test
(<= xFF) and a lookup in a single 256-entry table to classify it,
and chances are much of the table will fit into the CPU's cache and so
not really cost that much. It is prudent to disallow characters that
can be used as delimiters in other languages (of course <, >, &, %, ", ', ?, /
for XML, and = for URLs, though the horse has bolted on ., :, - and _)
and for digits, so you have to test for those characters in the ASCII range
anyway.
So the XML 1.0 naming rules need not cause any performance penalty
for people who are just using ASCII or Latin-1 characters in names.
If they do, that is the result of an implementation decision.
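To make the range-test-plus-table idea concrete, here is a minimal sketch in Python. The names (TABLE, is_allowed, is_name, slow_path) are made up for illustration, and the character ranges are my reading of the XML 1.0 name productions for code points up to xFF, so check them against the spec before relying on them. The slow path for characters above xFF is deliberately left as a placeholder.

```python
# Sketch only: classify name characters in the Latin-1 range with one
# range test and one lookup in a 256-entry table.
NAME_START = 1  # character may begin a name
NAME_CHAR = 2   # character may appear after the first position


def build_table():
    table = bytearray(256)
    # Letters in the Latin-1 range (assumed from the XML 1.0 BaseChar ranges).
    letter_ranges = [(0x41, 0x5A), (0x61, 0x7A),
                     (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)]
    for lo, hi in letter_ranges:
        for cp in range(lo, hi + 1):
            table[cp] = NAME_START | NAME_CHAR
    for ch in "_:":
        table[ord(ch)] = NAME_START | NAME_CHAR
    # Digits, '.', '-' and the middle dot (xB7) may not start a name.
    for ch in "0123456789.-\u00b7":
        table[ord(ch)] = NAME_CHAR
    return bytes(table)


TABLE = build_table()


def slow_path(cp, flag):
    # Hypothetical placeholder: the full rules above xFF would go here.
    raise NotImplementedError("characters above xFF are not covered by this sketch")


def is_allowed(ch, flag):
    cp = ord(ch)
    if cp <= 0xFF:                      # the range test
        return bool(TABLE[cp] & flag)   # the single table lookup
    return slow_path(cp, flag)


def is_name(s):
    if not s or not is_allowed(s[0], NAME_START):
        return False
    return all(is_allowed(c, NAME_CHAR) for c in s[1:])
```

For names that stay within ASCII or Latin-1, every character takes the fast path, so the cost of the full rules is never paid.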
And for people using characters outside that range, if they are
using Chinese characters, then they are probably using half
the number of characters anyway, so the performance impact
of testing characters is relatively less.
What do you gain by these tests?
Here are five things:
1) Robustness by detecting some kinds of encoding errors
- see http://www.topologi.com/public/XML_Naming_Rules.html
2) Baseline readability
- no non-graphical characters are allowed, so you won't need a
hex editor to view what your names actually are. (Normalization
is also appropriate for XML documents for the same reason.)
3) Near compatibility with the Unicode Consortium's guidelines on
characters suitable for identifiers. As programming languages implement
these guidelines more, XML names can be used as tokens in
programming languages.
4) Accessibility. Symbols and marks typically have no "reading" in
speech synthesizers or Braille readers, so allowing such characters
creates a disability where none needs to exist.
5) A clear message to implementers that if they do not accept
characters outside ASCII in XML names, they do not conform.
So the rules provide a safety net, and then best practices can be
followed for the particular names chosen: for example,
using names taken from a single natural language.
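As an illustration of point 1, here is a small sketch of how a name check can catch one common encoding error: UTF-8 bytes mistakenly decoded as Latin-1. The pattern and the function name looks_like_name are made up for this example, and the pattern only covers code points up to xFF (again, my reading of the XML 1.0 productions, not a transcription).

```python
import re

# Sketch only: name pattern restricted to the Latin-1 range.
LATIN1_NAME = re.compile(
    r"[A-Za-z_:\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]"
    r"[A-Za-z0-9._:\-\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]*\Z"
)


def looks_like_name(s):
    return LATIN1_NAME.match(s) is not None


good = "caf\u00e9"
# Simulate the error: the UTF-8 bytes of the name are read as Latin-1.
misread = good.encode("utf-8").decode("latin-1")

print(looks_like_name(good))     # the intended name
print(looks_like_name(misread))  # the mis-decoded name
```

The mis-decoded name contains a copyright sign (xA9), which is not a name character, so the check rejects it. This only catches some errors: a UTF-8 sequence whose bytes all happen to map to Latin-1 name characters (for example, bytes xC4 xB7) would still get through.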
Cheers
Rick Jelliffe