   [xml-dev] off-topic -- search engines

  • To: <xml-dev@lists.xml.org>
  • Subject: [xml-dev] off-topic -- search engines
  • From: "Jason Kohls" <jkohls@infotechresearchgroup.com>
  • Date: Tue, 30 Sep 2003 10:20:30 -0400
  • Thread-index: AcOHVJ2AtWFF2BgGRam+jB0IM3ZY7AAAA99A
  • Thread-topic: [xml-dev] off-topic -- search engines


I realise this is slightly off-topic.  However:
A) I can't find a search engine mailing list (know of any?)
B) I knew I could count on my knowledgeable XML brothers. :)

Suppose you are indexing content stored in XML for a content-rich site --
many articles, many white papers, etc.  Should the "crawler" have access
to the data layer, with rules and exceptions applied much as you would in
a "normal" query, i.e. only crawl the <content> nodes with a value of
"article" for the "type" attribute?

Or should it access the content at a much higher abstraction, say
through HTTP GET, like a GoogleBot or an AltaVistaBot?
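
To make the data-layer option concrete, here is a rough sketch in Python
of a crawler that only picks up <content> nodes whose "type" attribute is
"article".  Only the <content> element and the "type" attribute come from
the description above; the <site> wrapper, the "id" attribute, the
<title>/<body> children, and the function name are made up for
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: only <content> and its "type" attribute are from
# the message; everything else here is an assumption for the example.
SAMPLE = """<site>
  <content type="article" id="a1"><title>How to Buy a Search Engine</title><body>Compare vendors first.</body></content>
  <content type="navigation" id="n1"><title>Site Menu</title><body>Home | Articles</body></content>
  <content type="article" id="a2"><title>XML Indexing</title><body>Index at the data layer.</body></content>
</site>"""


def indexable_records(xml_text):
    """Return one record per <content type="article"> node."""
    root = ET.fromstring(xml_text)
    records = []
    # The rule is applied at the data layer, like a query predicate.
    for node in root.findall(".//content[@type='article']"):
        # Join all text found inside the node, skipping whitespace-only pieces.
        text = " ".join(t.strip() for t in node.itertext() if t.strip())
        records.append({"id": node.get("id"), "text": text})
    return records


records = indexable_records(SAMPLE)
for record in records:
    print(record["id"], "->", record["text"])
```

With this approach the navigation node never reaches the index at all,
so there is nothing to filter out afterwards.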

My concerns are based around granularity, exclusivity, and accuracy --
if an article is rendered on a page with navigation items, footer,
copyright, etc., will it "skew" the results or even worse, actually
return a record for "copyright mycompany"?  What about an article called
"How to Buy a Search Engine"?  This article is linked many, many times
throughout the site.  If I search on "Search Engine", what will the
results return?  All those pages that had the title text/link in it?

I realise that these search engines have built-in exceptions, but my
concern is that these apply at a high level (post HTML rendering), not at
the data layer where more specific, "limitless" control is available.
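
For comparison, here is a rough sketch of what a post-rendering exception
might look like: the indexer fetches the rendered page and keeps only the
text inside one designated container, ignoring navigation and footer.
The id="article" convention, the sample page, and the class name are all
assumptions for illustration; this is not how any particular search
engine does it.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not affect the nesting depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ArticleText(HTMLParser):
    """Collect text found inside the element with a given id (assumed convention)."""

    def __init__(self, container_id="article"):
        super().__init__()
        self.container_id = container_id
        self.depth = 0      # 0 means we are outside the container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> never change the depth

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


# Hypothetical rendered page with navigation, article, and footer.
PAGE = """<html><body>
<div id="nav"><a href="/a1">How to Buy a Search Engine</a></div>
<div id="article"><h1>XML Indexing</h1><p>Index at the data layer.</p></div>
<div id="footer">Copyright mycompany</div>
</body></html>"""

parser = ArticleText()
parser.feed(PAGE)
text = " ".join(parser.chunks)
print(text)
```

This keeps the link text and the copyright line out of the index, but
only because the page template happens to mark the article container.
The rule depends on the rendering, not on the data.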

Thanks for humoring me.

Jason Kohls 

The xml-dev list is sponsored by XML.org 
<http://www.xml.org>, an initiative of OASIS 

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this list use the subscription
manager: <http://lists.xml.org/ob/adm.pl>

