xml-dev - OT: web crawling (was: Re: [xml-dev] HGRAB. Syndication. Google. Grey ar


To: Paul T <pault12@pacbell.net>
Subject: OT: web crawling (was: Re: [xml-dev] HGRAB. Syndication. Google. Grey area.)
From: "K. Ari Krupnikov" <ari@cogsci.ed.ac.uk>
Date: Wed, 09 Jan 2002 04:11:31 +0000
Cc: xml-dev@lists.xml.org
References: <02e301c19889$99cf3680$2cd0a340@paultx> <3C3B7134.D418F69B@cogsci.ed.ac.uk> <034801c1989c$21f77470$2cd0a340@paultx>
Sender: kari@cogsci.ed.ac.uk

Paul T wrote:
> 
> > > Google does *exactly* this (and also Google
> > > provides a cached copy of the original content)
> > >
> > > That means:
> > >
> > > Either both HGRAB and Google should be sued,
> > > because they both sell the content
> > > *which does not belong to them*, or both
> > > HGRAB and Google should be considered
> > > 'just a service'.
> >
> >
> > Have a look at http://www.google.com/robots.txt
> 
> I don't understand your point. Could you please
> explain?
> 
> Because HGRAB, for example, usually polls
> only the home page of a website, and those
> are all allowed for polling.

Not all. Some sites "Disallow: /".
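
To make the "Disallow: /" case concrete, here is a minimal sketch using
Python's standard urllib.robotparser module. The robots.txt body, the
example.com URL and the helper name may_fetch are illustrative
assumptions, not anything HGRAB or Google actually runs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for a site that bans every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Even the home page is off limits when the site says "Disallow: /".
    print(may_fetch(ROBOTS_TXT, "HGRAB", "http://example.com/"))
```

A crawler that polls only the home page is still excluded by such a
site, because "/" matches every path, the home page included.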

> Also, I'm not sure if search engines
> really care about robots.txt, but that's another
> story.

Googlebot does [1], and that answers your question about the difference
between it and HGRAB.

> Also, the interesting twist is that when the
> robot encounters a website with *no*
> robots.txt ( most sites have no robots.txt )
> the robot assumes that it is *safe* for it to
> 'steal' the content.

No twist here; "if it [robots.txt] was not present [then] all robots
will consider themselves welcome" [2].
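
The default-allow rule can be sketched the same way. This is only an
illustration with Python's standard urllib.robotparser: a robots.txt
with no rules at all is modelled by parsing an empty list of lines, and
the helper name and example.com URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def may_fetch_without_rules(agent: str, url: str) -> bool:
    """Model a site that publishes no exclusion rules at all."""
    parser = RobotFileParser()
    # An empty rule set stands in for a missing or empty robots.txt.
    parser.parse([])
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # With no rules present, every robot is considered welcome.
    print(may_fetch_without_rules("HGRAB", "http://example.com/"))
```

As far as I know, the module's read() method applies the same default
when fetching robots.txt returns a 404, so a well-behaved robot built on
it treats a missing file as permission to crawl.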

> I think this is really gray area and
> robots.txt is not a solution.
> At the moment, at least.

It isn't. It is just a machine-readable version of [3], kindly provided
by Google for your crawling convenience. robots.txt has no legal meaning
[4]; you probably can't be sued for disregarding it. But you can be sued
for breaking a site's TOS agreement.


Ari.

[1] http://www.google.com/webmasters/faq.html#nocrawl

[2] http://www.robotstxt.org/wc/norobots.html#format

[3] http://www.google.com/terms_of_service.html

[4] http://www.robotstxt.org/wc/norobots.html#status
