- From: Steve DeRose <Steven_DeRose@brown.edu>
- To: xml-dev@lists.oasis-open.org
- Date: Mon, 3 Apr 2000 12:44:37 -0400
At 10:27 AM -0400 4/3/00, Eric Bohlman wrote:
>On Mon, 3 Apr 2000, Stefan van den Oord wrote:
>
>> I have a simple question, I think: is XML case sensitive? In other words,
>> are the tags case sensitive? I also mean the <?XML... tag and the <!DOCTYPE
>> tag.
>
>Yes, XML names are case-sensitive (remember that they're not restricted to
>being English names, and many non-Western languages don't even have a
>concept of case-folding).
Your answer is of course correct (XML is case-sensitive); it is also true
that "many non-Western languages don't even have a concept of
case-folding". However, the second is not the reason for the first
(granted, you didn't actually say it is -- but a reader might well take it
that way).
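The case-sensitivity of element names is easy to check with any conforming
parser. Below is a minimal Python sketch, assuming the standard-library
xml.etree.ElementTree parser; the helper name is_well_formed is made up for
illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Start and end tags that differ only in case do not match.
print(is_well_formed("<note>hi</note>"))
print(is_well_formed("<note>hi</NOTE>"))
```

The second document is rejected because "note" and "NOTE" are different names.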
Languages with no need for case folding are not much of a problem: the
case-folding table or program would merely have no effect on characters
belonging to such languages. It would change 26 of our 52 alphabetic code
points, and no others. After all, in English we already use lots of
characters that don't get case-folded (like digits).
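A small Python sketch of that point, using the built-in str.lower() as a
stand-in for "the case-folding table or program" (the helper name
changed_by_fold is hypothetical):

```python
import string

def changed_by_fold(text):
    """Count the characters that a fold-to-lower would actually alter."""
    return sum(1 for ch in text if ch.lower() != ch)

# Only the upper-case half of the ASCII letters is touched.
print(changed_by_fold(string.ascii_letters), "of", len(string.ascii_letters))

# Digits and caseless scripts pass through unchanged.
for sample in ["XML 2000", "2000", "日本語", "עברית"]:
    print(repr(sample), "->", repr(sample.lower()), changed_by_fold(sample))
```

Caseless text costs the folder nothing beyond the lookup itself.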
The serious problems are subtler:
The practical problem is that with Unicode your folding table gets really
big: on the order of 128 Kbytes instead of 256 bytes (barring compression).
This is a pain on small devices (like a cell-phone browser), a pain to load,
a pain to implement compression for, etc.
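One plausible reading of those figures, sketched in Python: a flat lookup
table with one entry per code point. The assumptions here are mine, not stated
above: a 16-bit code space with 2-byte entries for Unicode, and an 8-bit
character set with 1-byte entries for the small case. The function name
flat_table_bytes is hypothetical.

```python
def flat_table_bytes(code_points, bytes_per_entry):
    """Size of an uncompressed table with one folded value per code point."""
    return code_points * bytes_per_entry

# Assumed: 8-bit character set, each entry is one byte.
eight_bit = flat_table_bytes(256, 1)

# Assumed: 16-bit code space, each entry is a 2-byte code unit.
sixteen_bit = flat_table_bytes(2 ** 16, 2)

print(eight_bit, "bytes")
print(sixteen_bit, "bytes =", sixteen_bit // 1024, "Kbytes")
```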
The theoretical problem is more important: it's not the caseless languages
that pose problems, but the ones with complicated case-folding. For example,
lots of languages only apply diacritical marks to lower-case letters:
"a" may come with 6 different accents, but "A" takes none -- which
makes case-folding irreversible. If there are languages that operate the
other way as well, then neither fold-to-upper nor fold-to-lower can work
for all languages: either way would trash some languages.
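The loss of information can be sketched in Python. The function
upper_dropping_accents below is a hypothetical fold that models the
"capitals take no accents" convention described above; it is not what
Python's own str.upper() does. The last lines show a real built-in case where
a round trip fails to restore the original.

```python
import unicodedata

def upper_dropping_accents(text):
    """Hypothetical fold-to-upper for a convention where capitals are unaccented."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, then upper-case what is left.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.upper()

# Six accented forms plus the plain letter all collapse to one capital.
forms = ["a", "à", "á", "â", "ã", "ä", "å"]
folded = {upper_dropping_accents(f) for f in forms}
print(folded)

# A built-in example of an irreversible fold: German sharp s.
round_trip = "ß".upper().lower()
print(round_trip, round_trip == "ß")
```

Once seven inputs map to one output, no inverse table can tell them apart.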
That said, I think it incumbent on XML *search engines* to support
case-folding (as well as decent treatment of accents, types of hyphens,
etc) for text content searches: Making English speakers search for
"the" | "thE" | "tHe" | "tHE" | "The" | "ThE" | "THe" | "THE"
or
"[tT][hH][eE]"
is patently absurd; and besides, there is no user cost to (say) a Japanese
speaker if an engine *does* case-fold. Also, many languages use Roman
characters occasionally, as for acronyms; so their speakers also pay a
price if searches aren't smart enough. And the primary problems with
case-folding do not weigh so heavily in the search engine world (for
example, AltaVista isn't likely to run their main servers on cell phones
for a while yet).
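A rough Python sketch of what a case-folding text search buys the user,
assuming the standard-library re module with its IGNORECASE flag; the helper
name find_caseless and the sample text are invented for illustration.

```python
import re
from itertools import product

def find_caseless(needle, haystack):
    """Return every occurrence of needle in haystack, ignoring case."""
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(haystack)]

# The eight spellings a user would otherwise have to enumerate by hand.
variants = {"".join(p) for p in product(*[(c.lower(), c.upper()) for c in "the"])}
print(len(variants), sorted(variants))

text = "The cat sat on THE mat; tHe end. 日本語のテキスト"
print(find_caseless("the", text))

# Caseless text is found exactly as before; folding costs this user nothing.
print(find_caseless("日本語", text))
```

One query covers all eight spellings, and the Japanese query behaves exactly
as it would without folding.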
Steven_DeRose@Brown.edu; http://www.stg.brown.edu/~sjd
Chief Scientist, Scholarly Technology Group, and
Adjunct Associate Professor, Brown University
North American Editor, the Text Encoding Initiative