xml-dev - Re: Case sensitivity

From: Steve DeRose <Steven_DeRose@brown.edu>
To: xml-dev@lists.oasis-open.org
Date: Mon, 3 Apr 2000 12:44:37 -0400

At 10:27 AM -0400 4/3/00, Eric Bohlman wrote:
>On Mon, 3 Apr 2000, Stefan van den Oord wrote:
>
>> I have a simple question, I think: is XML case sensitive? In other words,
>> are the tags case sensitive? I also mean the <?XML... tag and the <!DOCTYPE
>> tag.
>
>Yes, XML names are case-sensitive (remember that they're not restricted to
>being English names, and many non-Western languages don't even have a
>concept of case-folding).

Your answer is of course correct (XML is case-sensitive); it is also true
that "many non-Western languages don't even have a concept of
case-folding". However, the second is not the reason for the first
(granted, you didn't actually say it is -- but a reader might well take it
that way).
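A quick way to see the case-sensitivity of names in practice is to feed a parser mismatched tags. This is only a sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Start and end tags that agree in case parse fine.
same_case = is_well_formed("<doc><item/></doc>")

# An end tag that differs only in case does not close the start tag.
mixed_case = is_well_formed("<doc><item/></DOC>")

# Names differing only in case are simply two different element types.
root = ET.fromstring("<doc><Item/><item/></doc>")
child_tags = [child.tag for child in root]

print(same_case, mixed_case, child_tags)
```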

Languages with no need for case folding are not much of a problem: the
case-folding table or program would merely have no effect on characters
belonging to such languages. For English text, it would change 26 of our
52 alphabetic code points, and no others. After all, in English we already
use lots of characters that don't get case-folded (like digits).
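To make that concrete, here is a small sketch using Python's built-in str.lower() as a stand-in for "the case-folding table or program". The sample strings are my own; the point is only that caseless characters pass through untouched.

```python
import string

# Fold every ASCII code point and count how many actually change.
ascii_changed = sum(1 for cp in range(128) if chr(cp).lower() != chr(cp))

# Digits and Japanese text have no case, so folding leaves them alone.
samples = ["Hello", "2000", "日本語"]
unchanged = [s for s in samples if s.lower() == s]

print(ascii_changed, "of", len(string.ascii_letters), "ASCII letters change")
print("unchanged by folding:", unchanged)
```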

The serious problems are subtler:

The practical problem is that with Unicode your folding table gets really
big: on the order of 128 Kbytes instead of 256 bytes (barring compression).
This is a pain on small devices (like a cell-phone browser), a pain to load,
a pain to implement compression for, etc.
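One way to read those numbers is as flat lookup tables: one byte per entry for a 256-character set, and one 16-bit entry per 16-bit code point for Unicode. That reading is my assumption, not something stated above. The sketch below does the arithmetic and also counts how many 16-bit code points Python's own case mappings actually change, which hints at why compression is tempting.

```python
# Flat table: one entry per code point, each entry as wide as a code point.
small_table_bytes = 256 * 1          # 8-bit character set
flat_table_bytes = (2 ** 16) * 2     # 16-bit code points, 2 bytes each

# How many 16-bit code points does a case mapping actually touch?
# (The exact count depends on the Unicode version Python was built with.)
cased_in_bmp = 0
for cp in range(0x10000):
    ch = chr(cp)
    if ch.lower() != ch or ch.upper() != ch:
        cased_in_bmp += 1

print(small_table_bytes, "bytes vs", flat_table_bytes // 1024, "Kbytes")
print("code points affected by case mapping:", cased_in_bmp)
```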

The theoretical problem is more important: it's not the caseless languages
that pose problems, but the ones with complicated case-folding. For example,
lots of languages only apply diacritical marks to lower-case letters: "a"
may come with 6 different accents, but "A" takes none -- which makes
case-folding irreversible. If there are languages that operate the other
way as well, then neither fold-to-upper nor fold-to-lower can work for all
languages: either way would trash some languages.
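The loss of information is easy to demonstrate with Python's built-in case mappings. The characters below (German sharp s, Turkish dotless i, the Kelvin sign) are my own picks rather than the accent example above, but they show the same effect: folding in either direction can collapse distinct inputs, so it cannot be undone.

```python
def round_trips_via_upper(s: str) -> bool:
    """True if folding to upper case and back recovers the original."""
    return s.upper().lower() == s


def round_trips_via_lower(s: str) -> bool:
    """True if folding to lower case and back recovers the original."""
    return s.lower().upper() == s


# Plain ASCII survives a round trip.
print(round_trips_via_upper("abc"))

# Sharp s upper-cases to "SS", which lower-cases to "ss".
print(round_trips_via_upper("stra\u00dfe"))

# Dotless i (U+0131) upper-cases to "I", which lower-cases to dotted "i".
print(round_trips_via_upper("\u0131"))

# The Kelvin sign (U+212A) lower-cases to "k", which upper-cases to plain "K".
print(round_trips_via_lower("\u212a"))
```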

That said, I think it incumbent on XML *search engines* to support
case-folding (as well as decent treatment of accents, types of hyphens,
etc) for text content searches: Making English speakers search for

  "the" | "thE" | "tHe" | "tHE" | "The" | "ThE" | "THe" | "THE"
or
  "[tT][hH][eE]"

is patently absurd; and besides, there is no user cost to (say) a Japanese
speaker if an engine *does* case-fold. Also, many languages use Roman
characters occasionally, as for acronyms; so their speakers also pay a
price if searches aren't smart enough. And the primary problems with
case-folding do not weigh so heavily in the search engine world (for
example, AltaVista isn't likely to run their main servers on cell phones
for a while yet).

Steven_DeRose@Brown.edu; http://www.stg.brown.edu/~sjd
Chief Scientist, Scholarly Technology Group, and
   Adjunct Associate Professor, Brown University
North American Editor, the Text Encoding Initiative
