- To: <email@example.com>
- Subject: [xml-dev] off-topic -- search engines
- From: "Jason Kohls" <firstname.lastname@example.org>
- Date: Tue, 30 Sep 2003 10:20:30 -0400
- Thread-index: AcOHVJ2AtWFF2BgGRam+jB0IM3ZY7AAAA99A
- Thread-topic: [xml-dev] off-topic -- search engines
I realise this is slightly off-topic. However:
A) I can't find a search engine mailing list (know of any?)
B) I knew I could count on my knowledgeable XML brothers. :)
I'm indexing content stored as XML for a content-rich site -- many
articles, many white papers, etc. Should the "crawler" have access to
the data layer, with rules and exceptions applied much as you would in a
"normal" query, e.g. only crawl the <content> nodes whose "type"
attribute has the value "article"?
Or should it access the content at a much higher level of abstraction,
say through HTTP GET, like a GoogleBot or an AltaVistaBot?
My concerns centre on granularity, exclusivity, and accuracy --
if an article is rendered on a page with navigation items, footer,
copyright, etc., will that "skew" the results, or worse, actually
return a record for "copyright mycompany"? And what about an article
called "How to Buy a Search Engine"? It's linked many, many times
throughout the site. If I search on "Search Engine", what will the
results return -- every page that contains the title text or the link?
I realise that these search engines have built-in exclusions, but my
concern is that those operate at a high level (post-HTML rendering), not
at the data layer, where more specific, "limitless" control is available.
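For comparison, the HTTP-GET approach would have to strip the boilerplate after rendering, something like this (the class names and page structure are entirely made up -- it's only a sketch of the idea):

```python
# Post-rendering approach: crawl the HTML a browser would see, then try
# to keep only the article region and drop nav/footer text before
# indexing. Markup conventions here are hypothetical.
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collect text only inside <div class="article">."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the article div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and ("class", "article") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())

p = ArticleExtractor()
p.feed('<body><div class="nav">Search Engine</div>'
       '<div class="article">Real article text</div>'
       '<div class="footer">copyright mycompany</div></body>')
text = " ".join(c for c in p.chunks if c)
```

The fragility is obvious: the exclusion depends on recognising presentation markup, so any template change can break it -- which is exactly the control I'd rather have at the data layer.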
Thanks for humoring me.
The xml-dev list is sponsored by XML.org
<http://www.xml.org>, an initiative of OASIS
The list archives are at http://lists.xml.org/archives/xml-dev/
To subscribe or unsubscribe from this list use the subscription