xml-dev - searching for search


Regarding the recent "Indexing XML Document Collections" thread...

I've been doing a breadth-first search of indexing/query
technology, and here is a summary of what I've learned.
I'm posting this because I'm interested in the area but don't
have the time to investigate all of these, and it seems like
there are some real experts on this list.

I'm interested in these questions:

- in general, why would I pick one of these over another?
(e.g. boolean query vs. structured query; scalability in size
or requests; pluggable format drivers for source data;
stemming and concept support)

- in general, what are the features that push a technology
into another level of complexity and why (i.e. what is so
hard here?)

- specifically, what are the characteristics of each of
these in performance/reliability/features (personal experience
from non-vendors and public benchmarks are of course preferred,
but vendor claims might be of interest too)

- can I safely ignore the non-open-source ones without giving
up capabilities?

- if all I wanted to do was boolean search on field values with
no stemming/concept support, then regardless of how I did the
indexing, what is wrong with using standard b-trees and/or just
putting the index data in a sql db?
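
To make the last question concrete, here is a minimal sketch of the "just put the index data in a sql db" approach, assuming Python's standard sqlite3 module. The table name `postings`, the helper names, and the sample documents are all made up for illustration. Each (field, value, doc_id) triple is one row, the SQL index over it is a b-tree, and boolean AND/OR map onto INTERSECT/UNION. Values are matched whole (lowercased), with no stemming or ranking.

```python
import sqlite3

def build_index(docs):
    """docs maps doc_id -> {field: value}. Returns an in-memory SQLite connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE postings (field TEXT, value TEXT, doc_id INTEGER)")
    # SQLite stores this index as a b-tree keyed on (field, value, doc_id).
    conn.execute("CREATE INDEX postings_fv ON postings (field, value, doc_id)")
    rows = [(f, v.lower(), d) for d, fields in docs.items() for f, v in fields.items()]
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    return conn

def _combine(conn, terms, operator):
    """Run one lookup per (field, value) term and combine them with a set operator."""
    one = "SELECT doc_id FROM postings WHERE field = ? AND value = ?"
    sql = (" %s " % operator).join([one] * len(terms))
    params = [x for f, v in terms for x in (f, v.lower())]
    return sorted(row[0] for row in conn.execute(sql, params))

def search_and(conn, terms):
    """Doc ids matching every term (boolean AND)."""
    return _combine(conn, terms, "INTERSECT")

def search_or(conn, terms):
    """Doc ids matching at least one term (boolean OR)."""
    return _combine(conn, terms, "UNION")

# Hypothetical sample data.
docs = {
    1: {"author": "Clark", "format": "sgml"},
    2: {"author": "Clark", "format": "xml"},
    3: {"author": "Bray", "format": "xml"},
}
conn = build_index(docs)
both = search_and(conn, [("author", "clark"), ("format", "xml")])
either = search_or(conn, [("author", "bray"), ("format", "sgml")])
```

What this leaves out is roughly the list of things the dedicated engines advertise: word-level tokenizing of field text, stemming, relevance ranking, and compact posting lists for very large collections.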

indexing/query technologies
---------------------------
what: sgrep
url: http://www.cs.helsinki.fi/~jjaakkol/sgrep.html
license: GPL
comment: does structured document grep, with an indexing phase.

what: Xtract
url: http://www.cs.york.ac.uk/fp/Xtract/
license: GPL
comment: another xml grep; more XQL-like. no indexing.

what: swish (Simple Web Indexing System for Humans)
url: http://www.directive.com/swish.htm
license: sort of free
comment: see swish-e

what: swish-e (swish-enhanced)
url: http://sunsite.berkeley.edu/SWISH-E/
license: GPL
comment: focused specifically on web site indexing.

what: MG (managing gigabytes)
url: http://www.mds.rmit.edu.au/mg/intro/about_mg.html
license: GPL
comment: based on book: http://www.cs.mu.oz.au/mg. commercial version is SIM: http://www.mds.rmit.edu.au

what: wais and freeWAIS and freewais-sf/SFgate
url: http://www.faqs.org/faqs/wais-faq/freeWAIS-sf/index.html
comment: now supplanted by Isearch/Isite.

what: Isearch
url: http://www.etymon.com/Isearch
license: non-copyleft free.
comment: Isearch is behind dmoz/newhoo (http://www.news.com/News/Item/0,4,28964,00.html?st.cn.News.today.ne)

what: dig or "ht://dig"
url: http://www.htdig.org/
license: GPL

what: glimpse
url: http://glimpse.cs.arizona.edu/
license: non-commercial use, open source.
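
For contrast with the indexed tools above, the "no indexing" approach noted for Xtract amounts to parsing and scanning every document on every query. A minimal sketch of that idea, using Python's ElementTree and its limited XPath subset as a stand-in. This is not Xtract's or XQL's syntax, and the function name and sample documents are made up:

```python
import xml.etree.ElementTree as ET

def xml_grep(documents, path):
    """Scan every document on every call (no index); yield (name, text) per match."""
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # findall() accepts ElementTree's XPath subset, e.g. ".//title".
        for elem in root.findall(path):
            yield name, (elem.text or "").strip()

# Hypothetical sample collection.
sample_docs = {
    "a.xml": "<doc><title>Indexing XML</title><sec><title>Intro</title></sec></doc>",
    "b.xml": "<doc><title>Search</title></doc>",
}
all_titles = list(xml_grep(sample_docs, ".//title"))
top_titles = list(xml_grep(sample_docs, "./title"))
```

Query cost here grows with the total size of the collection, which is the trade-off an indexing phase is meant to avoid.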

commercial:
Readware http://www.readware.com/products.htm
Excalibur RetrievalWare http://www.excalib.com/
verity http://www.verity.com
oracle intermedia http://www.oracle.com
fulcrum http://www.fulcrum.com (now pcdocs)
OpenText http://www.opentext.com/ (soon to be pcdocs?)
SIM: http://www.mds.rmit.edu.au

no cost, but object code only:
excite for web servers http://www.excite.com/navigate/
PLS http://www.pls.com/ (acquired by AOL)
GMD-IPSI XQL http://xml.darmstadt.gmd.de/xql/
thunderstone http://www.thunderstone.com/ (webinator is no cost, object code only)

"XML Servers" (which can mean anything)
bluestone http://www.bluestone.com/
odi excelon http://www.odi.com/
softwareag tamino http://www.softwareag.com/tamino/default.htm
poet cms http://www.poet.com/
oracle ifs, dbweb, etc. http://www.oracle.com

query/search languages and standards
-------------------------------------

Z39.50-1995 http://lcweb.loc.gov/z3950/agency
aka ISO 23950; formerly ISO 10162 and ISO 10163.
basically the U.S. started branching the original ISO standard, and now the U.S. leads the ISO standard.
WAIS was based on the first version, Z39.50-1988.
see also http://www.faqs.org/rfcs/rfc1729.html
for history see http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april97/04lynch.html and
http://slis6000.slis.uwo.ca/~jxerri/index.html

GILS (government information locator service) http://www.gils.net/locator.html
for technology, just aggregates other projects (uses Isearch, htdig, etc.).
at a standards level, it subsets Z39.50 and articulates some 150 specific attributes/elements for semantics,
in the "GILS Profile" http://www.gils.net/prof_v2.html
[there, i've now saved you from reading a horrific amount of verbiage.]

STARTS http://www-db.stanford.edu/~gravano/starts.html
a standardization effort like GILS. subsets Z39.50.
complementary (sort of) to publication/metadata/robots.txt standards like Dublin Core/RDF.

SDQL (structured document query language)
DSSSL thing. http://www.jclark.com/dsssl/sgml95/sdql.html, http://www.jclark.com/dsssl/IS/dsssl85.htm

SOIF (Summary Object Interchange Format)
first made up by Harvest in 1994.

CIP (Common Indexing Protocol)
output of the moribund IETF FIND working group

XQL and XML-QL and a gazillion more http://www.w3.org/TandS/QL/QL98/pp.html

OQL http://www.odmg.org/standard/odmgbookextract.htm#Chapter 4

Search UI
---------
what: WWWWAIS
url: http://riceinfo.rice.edu/sw/swish/patches/
comment: web interface to WAIS and SWISH search engines

what: webglimpse
url: http://donkey.cs.arizona.edu/webglimpse/
comment: web interface

what: HURL (Hypertext Usenet Reader & Linker)
url: http://impressive.net/software/hurl
license: will be free software.
comment: uses glimpse underneath

Gathering/Spidering
-------------------
what: harvest
url: http://www.tardis.ed.ac.uk/harvest/
comment: just does the spidering; the index is with glimpse
notes::
verity etc. could be used instead of glimpse.
does provide a "Broker" cgi around the indexer.
maps SGML to "SOIF".
::
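
A rough sketch of what a SOIF record looks like, for readers who have not seen one. This assumes the layout as it is commonly described for Harvest: a template type and URL, then one line per attribute carrying the value's byte count in braces, a colon, a tab, and the value. The function name and sample values are made up, and the Harvest documentation should be checked before relying on the details:

```python
def to_soif(url, attrs, template="FILE"):
    """Serialize one summary object in a SOIF-like layout (assumed, see lead-in)."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attrs.items():
        # The number in braces is the length of the value in bytes.
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"

# Hypothetical record.
record = to_soif("http://example.com/", {"Title": "Hello", "Type": "SGML"})
```

The explicit byte counts are what let a value contain newlines or braces without any escaping.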

Papers/Reading on IR
--------------------
ACM SIGIR http://www.acm.org/sigir/

news:comp.infosystems.search

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)

Follow-Ups:
- Re: searching for search
  - From: Walter Underwood <wunder@infoseek.com>
- Re: searching for search
  - From: "Edward C. Zimmermann" <edz@bsn.com>