Paul T wrote:
>
> > > Google does *exactly* this (and also Google
> > > provides a cached copy of the original content)
> > >
> > > That means:
> > >
> > > Either both HGRAB and Google should be sued,
> > > because they both sell the content
> > > *which does not belong to them*, or both
> > > HGRAB and Google should be considered
> > > 'just a service'.
> >
> >
> > Have a look at http://www.google.com/robots.txt
>
> I don't understand your point. Could you please
> explain?
>
> Because HGRAB, for example, is
> usually polling only home page of the website,
> they are all allowed for polling.
Not all. Some sites "Disallow: /".
> Also, I'm not sure if search engines
> really care about robots.txt, but that's another
> story.
Googlebot does [1], and that answers your question about the difference
between it and HGRAB.
> Also, the interesting twist is that when the
> robot encounters the website with *no*
> robots.txt ( most of the sites have no robots.txt )
> the robot assumes that it is *safe* for him to
> 'steal' the content.
No twist here; "if it [robots.txt] was not present [then] all robots
will consider themselves welcome" [2].
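To make the two cases above concrete, here is a minimal sketch of the check a polite robot could run before polling a page, using Python's standard urllib.robotparser. The helper name may_fetch, the example.com URLs, and the use of "HGRAB" as a user-agent string are assumptions made for illustration; a missing robots.txt is modelled here as an empty string.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: robots_txt is the raw file content, and an
    empty string stands in for a site that has no robots.txt at all.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A site that shuts out every robot, home page included.
closed_site = "User-agent: *\nDisallow: /"

# A site that restricts only one named robot from one directory.
partial_site = "User-agent: Googlebot\nDisallow: /private/"

if __name__ == "__main__":
    print(may_fetch("", "HGRAB", "http://example.com/"))
    print(may_fetch(closed_site, "HGRAB", "http://example.com/"))
    print(may_fetch(partial_site, "Googlebot", "http://example.com/private/a.html"))
```

With no rules at all the robot is welcome; with "Disallow: /" even the home page is off limits, which is the case a home-page poller would have to check for.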
> I think this is really gray area and
> robots.txt is not a solution.
> At the moment, at least.
It isn't. Google's robots.txt is just a machine-readable version of [3],
kindly provided by Google for your crawling convenience. robots.txt has
no legal meaning [4]; you probably can't be sued for disregarding it. But
you can be sued for breaking a site's TOS agreement.
Ari.
[1] http://www.google.com/webmasters/faq.html#nocrawl
[2] http://www.robotstxt.org/wc/norobots.html#format
[3] http://www.google.com/terms_of_service.html
[4] http://www.robotstxt.org/wc/norobots.html#status