- From: "Mark D. Anderson" <firstname.lastname@example.org>
- To: <email@example.com>
- Date: Sun, 23 May 1999 13:00:42 -0700
Regarding the recent "Indexing XML Document Collections" thread...
I've been doing a breadth-first survey of indexing/query
technology, and here is a summary of what I've learned.
I'm posting this because I'm interested in the area but don't
have the time to investigate all of these, and it seems like
there are some real experts on this list.
I'm interested in these questions:
- in general, why would I pick one of these over another
(e.g. boolean query vs. structured query; scalability in collection
size or request volume; pluggable format drivers for source data;
stemming and concept support; etc.)?
- in general, what are the features that push a technology
into another level of complexity and why (i.e. what is so
- specifically, what are the characteristics of each of
these in performance/reliability/features (personal experience
from non-vendors and public benchmarks are of course preferred,
but vendor claims might be of interest too)
- can i safely ignore the non open source ones without giving
- if all I wanted to do was boolean search on field values with
no stemming/concept support, then regardless of how I did the
indexing, what is wrong with using standard b-trees and/or just
putting the index data in a SQL db?
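
To make the last question concrete, here is a minimal sketch of the "index data in a SQL db" option. It is only an illustration under assumptions that are not from any of the tools listed below: the table name `postings`, the helper names `build_index` and `lookup`, the sample documents, and whitespace tokenization are all made up for the example. SQLite stores the index on (field, term, doc_id) as a b-tree, and the boolean operators are done as set operations on the lookup results.

```python
import sqlite3


def build_index(docs):
    """Build an in-memory postings table from {doc_id: {field: text}}.

    Hypothetical helper: one row per distinct (field, term, doc_id).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE postings (field TEXT, term TEXT, doc_id TEXT)")
    # SQLite implements this index as a b-tree keyed on (field, term, doc_id).
    conn.execute("CREATE INDEX postings_idx ON postings (field, term, doc_id)")
    rows = set()
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Naive tokenization: lowercase and split on whitespace, no stemming.
            for term in text.lower().split():
                rows.add((field, term, doc_id))
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", sorted(rows))
    return conn


def lookup(conn, field, term):
    """Return the set of doc ids whose `field` contains `term` exactly."""
    cur = conn.execute(
        "SELECT doc_id FROM postings WHERE field = ? AND term = ?",
        (field, term.lower()),
    )
    return {row[0] for row in cur}


# Made-up sample collection: field values already extracted from the documents.
docs = {
    "a.xml": {"title": "indexing xml documents", "author": "smith"},
    "b.xml": {"title": "query languages for xml", "author": "jones"},
    "c.xml": {"title": "indexing sgml", "author": "smith"},
}
conn = build_index(docs)

# Boolean operators expressed as Python set operations on the postings.
both = lookup(conn, "title", "indexing") & lookup(conn, "author", "smith")  # AND
either = lookup(conn, "title", "sgml") | lookup(conn, "title", "query")     # OR
not_xml = lookup(conn, "author", "smith") - lookup(conn, "title", "xml")    # AND NOT
```

This handles exact-term boolean search on field values and nothing more: there is no ranking, no phrase or proximity matching, and no stemming, which is presumably where the dedicated engines start to earn their complexity.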
comment: does structured document grep, with an indexing phase.
comment: another xml grep; more XQL-like. no indexing.
what: swish (Simple Web Indexing System for Humans)
license: sort of free
comment: see swish-e
what: swish-e (swish-enhanced)
comment: focused specifically on web site indexing.
what: MG (managing gigabytes)
comment: based on book: http://www.cs.mu.oz.au/mg. commercial version is SIM: http://www.mds.rmit.edu.au
what: wais and freeWAIS and freewais-sf/SFgate
comment: now supplanted by Isearch/Isite.
license: non-copyleft free.
comment: Isearch is behind dmoz/newhoo (http://www.news.com/News/Item/0,4,28964,00.html?st.cn.News.today.ne)
what: dig or "ht://dig"
license: non-commercial use, open source.
Excalibur RetrievalWare http://www.excalib.com/
oracle intermedia http://www.oracle.com
fulcrum http://www.fulcrum.com (now pcdocs)
OpenText http://www.opentext.com/ (soon to be pcdocs?)
no cost, but object code only:
excite for web servers http://www.excite.com/navigate/
PLS http://www.pls.com/ acquired by AOL.
GMD-IPSI XQL http://xml.darmstadt.gmd.de/xql/.
thunderstone http://www.thunderstone.com/. webinator is no cost, object code only.
"XML Servers" (which can mean anything)
odi excelon http://www.odi.com/
softwareag tamino http://www.softwareag.com/tamino/default.htm
poet cms http://www.poet.com/
oracle ifs, dbweb, etc. http://www.oracle.com
query/search languages and standards
Z39.50
aka ISO 23950; formerly ISO 10162 and ISO 10163.
basically the U.S. started branching the original ISO standard, and now they lead the ISO standard.
WAIS was based on the first version Z39.50-1988.
see also http://www.faqs.org/rfcs/rfc1729.html
for history see http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april97/04lynch.html and
GILS (government information locator service) http://www.gils.net/locator.html
for technology, just aggregates other projects (uses Isearch, htdig, etc.).
at a standards level, it subsets Z39.50 and articulates some 150 specific attributes/elements for semantics,
in the "GILS Profile" http://www.gils.net/prof_v2.html
[there, i've now saved you from reading a horrific amount of verbiage.]
a standardization effort like GILS. subsets Z39.50.
complementary (sort of) to publication/metadata/robots.txt standards like dublin/rdf.
SDQL (structured document query language)
DSSSL thing. http://www.jclark.com/dsssl/sgml95/sdql.html, http://www.jclark.com/dsssl/IS/dsssl85.htm
SOIF (Summary Object Interchange Format)
first made up by Harvest in 1994.
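
For a feel of what SOIF looks like, here is a rough sketch of a writer and reader for a single object. The layout is an assumption based on my reading of the Harvest description and has not been checked against the spec: a header line "@TYPE { URL", then one "Name{byte-count}:<tab>value" line per attribute, then a closing "}". The function names `format_soif` and `parse_soif` and the sample record are made up for illustration. The byte count is what lets a value contain newlines.

```python
import re

# Assumed layout: "@TYPE { URL" header, "Name{n}:<TAB>value" attributes, "}" trailer.
HEADER = re.compile(rb"@(\S+)\s*\{\s*(\S+)\s*\n")
ATTR = re.compile(rb"([^\s{}:]+)\{(\d+)\}:\t")


def format_soif(template, url, attrs):
    """Serialize one object; each value is prefixed with its length in bytes."""
    out = [f"@{template} {{ {url}\n".encode()]
    for name, value in attrs.items():
        raw = value.encode()
        out.append(f"{name}{{{len(raw)}}}:\t".encode() + raw + b"\n")
    out.append(b"}\n")
    return b"".join(out)


def parse_soif(data):
    """Parse one object from bytes; returns (template, url, attrs)."""
    m = HEADER.match(data)
    if not m:
        raise ValueError("missing SOIF header")
    template, url = m.group(1).decode(), m.group(2).decode()
    pos = m.end()
    attrs = {}
    while True:
        # Skip the line break(s) separating attributes.
        while data[pos:pos + 1] in (b"\n", b"\r"):
            pos += 1
        if data[pos:pos + 1] == b"}":
            break
        a = ATTR.match(data, pos)
        if not a:
            raise ValueError(f"bad attribute at byte {pos}")
        size = int(a.group(2))
        start = a.end()
        # Read exactly `size` bytes, so embedded newlines are preserved.
        attrs[a.group(1).decode()] = data[start:start + size].decode()
        pos = start + size
    return template, url, attrs


# Made-up record with a multi-line attribute value.
record = {"Title": "Indexing XML, part 1", "Abstract": "line one\nline two"}
encoded = format_soif("FILE", "http://example.org/doc.sgml", record)
decoded = parse_soif(encoded)
```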
CIP (Common Indexing Protocol)
output of the moribund ietf FIND working group
XQL and XML-QL and a gazillion more http://www.w3.org/TandS/QL/QL98/pp.html
OQL http://www.odmg.org/standard/odmgbookextract.htm#Chapter 4
comment: web interface to WAIS and SWISH search engines
comment: web interface
what: HURL (Hypertext Usenet Reader & Linker)
license: will be free software.
comment: uses glimpse underneath
comment: just does the spidering; the index is with glimpse
verity etc. could be used instead of glimpse.
does provide a "Broker" cgi around the indexer.
maps SGML to "SOIF".
Papers/Reading on IR
ACM SIGIR http://www.acm.org/sigir/
xml-dev: A list for W3C XML Developers. To post, mailto:firstname.lastname@example.org
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:email@example.com)