- From: "Michael Kay" <M.H.Kay@eng.icl.co.uk>
- To: <xml-dev@ic.ac.uk>
- Date: Thu, 5 Nov 1998 12:27:12 -0000
>Hi all,
>Can anyone tell me where the difference lies in implementing a search
>engine for HTML and a search engine for XML.
The main difference is that in HTML the tagging is almost useless in
localising the query, whereas in XML it is potentially very valuable. Many
search engines support field-oriented query, e.g. find "Ireland" as a
surname; with the right input filter for XML it becomes possible to map XML
elements to the fields understood by the search engine, making such queries
a feasible proposition, which is not the case for HTML.
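To make the element-to-field mapping concrete, here is a minimal sketch of such an input filter in Python. The element names, the field names, the FIELD_MAP table and the two sample documents are all assumptions invented for illustration; a real engine would have its own field schema and a far better tokenizer.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mapping from XML element names to the field names
# understood by the search engine (both sides are illustrative).
FIELD_MAP = {"surname": "surname", "country": "place"}

def index_document(doc_id, xml_text, index):
    """Add each word found in a mapped element to a (field, word) index."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        field = FIELD_MAP.get(elem.tag)
        if field is None or not elem.text:
            continue  # element is not mapped to a field, or has no text
        for word in elem.text.lower().split():
            index[(field, word)].add(doc_id)

def search(index, field, word):
    """Return the ids of documents containing `word` in `field`."""
    return sorted(index.get((field, word.lower()), set()))

index = defaultdict(set)
index_document(1, "<person><forename>Jill</forename>"
                  "<surname>Ireland</surname></person>", index)
index_document(2, "<trip><traveller><surname>Smith</surname></traveller>"
                  "<country>Ireland</country></trip>", index)

print(search(index, "surname", "Ireland"))  # only the document where it is a surname
print(search(index, "place", "Ireland"))    # only the document where it is a country
```

With HTML there is usually no equivalent of FIELD_MAP to write, because the tags say how the text is displayed rather than what it means.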
Switching threads, I am a little surprised by Tim's remarks on word
proximity versus character proximity. Confining our attention to European
languages (as most search engines do), word proximity searching is a common
feature of the high-end search engines, whereas character proximity is
hardly found outside basic desktop tools like grep. Apart from anything
else, once you've done the word normalisation (normalising different
linguistic forms or spellings of the same word), character proximity is
meaningless. In the older boolean engines word proximity is used rather
mechanistically; in the newer engines it is used more subtly as part of a
statistical or linguistic approach to relevance ranking, but either way it
is an established feature of the scene, and it is not there on whim: the
search algorithms used are based on extensive research and benchmarking of
relevance and recall scores.
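As a rough illustration of the point about normalisation, the sketch below does a mechanistic "within N words" test over normalised tokens, in the style of the older boolean engines. The NORMAL_FORMS table and the function names are hypothetical, and real engines use proper stemmers or linguistic analysis rather than a lookup table.

```python
import re

# Toy normalisation table (an assumption for illustration): maps
# variant forms and spellings to a single indexed term.
NORMAL_FORMS = {
    "normalization": "normalisation",
    "engines": "engine",
    "searching": "search",
    "searches": "search",
}

def normalise(word):
    """Lower-case a word and map it to its normal form if one is known."""
    word = word.lower()
    return NORMAL_FORMS.get(word, word)

def tokenize(text):
    """Split text into words and normalise each one."""
    return [normalise(w) for w in re.findall(r"\w+", text)]

def within_words(text, a, b, max_distance):
    """True if terms a and b occur at most max_distance word positions apart."""
    tokens = tokenize(text)
    a, b = normalise(a), normalise(b)
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

text = "Searching with the newer engines"
print(within_words(text, "search", "engine", 4))
print(within_words(text, "search", "engine", 3))
```

Once "Searching" has been indexed as "search", the matcher works on word positions; the character offsets in the original text no longer correspond to the terms being compared, which is why a character distance tells you nothing useful at that stage.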
An interesting comparison of web search engines is at
http://www.netstrider.com/search/features.html ; this asserts that all the
well-known web search engines other than Lycos use word proximity matching.
(A good survey in spite of the fact that it fails to distinguish the
effectiveness of the query matcher from the effectiveness of the web
crawler.)
Mike Kay
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)