- To: <email@example.com>
- Subject: [xml-dev] off-topic -- search engines
- From: "Jason Kohls" <firstname.lastname@example.org>
- Date: Tue, 30 Sep 2003 10:20:30 -0400
- Thread-index: AcOHVJ2AtWFF2BgGRam+jB0IM3ZY7AAAA99A
- Thread-topic: [xml-dev] off-topic -- search engines
I realise this is slightly off-topic. However:
A) I can't find a search engine mailing list (know of any?)
B) I knew I could count on my knowledgeable XML brothers. :)
I'm indexing content stored as XML for a content-rich site -- many
articles, many white papers, etc. Should the "crawler" have access to
the data layer, with rules and exceptions applied much as you would in a
"normal" query, e.g. only crawl the <content> nodes whose "type"
attribute has the value "article"?
Or should it access the content at a much higher level of abstraction,
say through HTTP GET, like a GoogleBot or an AltaVistaBot?
My concerns centre on granularity, exclusivity, and accuracy --
if an article is rendered on a page with navigation items, footer,
copyright, etc., will that "skew" the results, or worse, actually
return a record for "copyright mycompany"? And what about an article
called "How to Buy a Search Engine"? It's linked many, many times
throughout the site. If I search on "Search Engine", what will the
results return -- every page that contains the title text or the link?
I realise that these search engines have built-in exclusions, but my
concern is that those operate at a high level (post-HTML rendering), not
at the data layer, where more specific, "limitless" control is available.
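For comparison, the HTTP-GET approach would have to strip the boilerplate after rendering, something like this (the class names and page structure are entirely made up -- it's only a sketch of the idea):

```python
# Post-rendering approach: crawl the HTML a browser would see, then try
# to keep only the article region and drop nav/footer text before
# indexing. Markup conventions here are hypothetical.
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collect text only inside <div class="article">."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the article div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and ("class", "article") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())

p = ArticleExtractor()
p.feed('<body><div class="nav">Search Engine</div>'
       '<div class="article">Real article text</div>'
       '<div class="footer">copyright mycompany</div></body>')
text = " ".join(c for c in p.chunks if c)
```

The fragility is obvious: the exclusion depends on recognising presentation markup, so any template change can break it -- which is exactly the control I'd rather have at the data layer.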
Thanks for humoring me.
The xml-dev list is sponsored by XML.org
<http://www.xml.org>, an initiative of OASIS
The list archives are at http://lists.xml.org/archives/xml-dev/
To subscribe or unsubscribe from this list use the subscription