> > I don't understand your point. Could you please
> > explain?
> >
> > Because HGRAB, for example, usually
> > polls only the home page of a website,
> > so all of those sites may be polled.
>
> Not all. Some sites "Disallow: /".
None of the sites syndicated by HGRAB has such
a robots.txt.
> > Also, I'm not sure if search engines
> > really care about robots.txt, but that's another
> > story.
>
> Googlebot does [1], and that answers your question about the difference
> between it and HGRAB.
So HGRAB 'cares' as well. But you are right, HGRAB
should read the robots.txt file in the polling script.
Just in case.
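
A minimal sketch of such a check, assuming Python and its standard
urllib.robotparser module. The helper name may_poll and the user-agent
string "HGRAB" are illustrative only and are not taken from the actual
polling script.

```python
from urllib.robotparser import RobotFileParser


def may_poll(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url.

    robots_txt is the text of the site's robots.txt, already downloaded
    by the caller (an empty string stands for a file with no rules).
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access
    # happens inside this helper.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical site that shuts out every robot.
    closed = "User-agent: *\nDisallow: /"
    # Hypothetical site that only hides one directory.
    partial = "User-agent: *\nDisallow: /private/"

    print(may_poll(closed, "HGRAB", "http://example.org/"))
    print(may_poll(partial, "HGRAB", "http://example.org/"))
    print(may_poll("", "HGRAB", "http://example.org/"))
```

With a "Disallow: /" rule the home page is refused; with a rule that
covers only one directory, or with no rules at all, the home page may
still be polled.
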
After that, to be exactly like Google,
the *only* thing I need is to write on the HGRAB website:
"if you do not want me to grab your content - protect
yourself with a tricky robots.txt". Because *that*
is what Google does (judging from your URLs).
But it would not change the actual situation.
The actual situation is that HGRAB is like Google
and I think that both 'look illegal'.
> > Also, the interesting twist is that when the
> > robot encounters a website with *no*
> > robots.txt ( most sites have no robots.txt )
> > the robot assumes that it is *safe* for it to
> > 'steal' the content.
>
> No twist here; "if it [robots.txt] was not present [then] all robots
> will consider themselves welcome" [2].
Sure. I think that's until the first lawsuit.
> > I think this is really gray area and
> > robots.txt is not a solution.
> > At the moment, at least.
>
> It isn't.
It is. That's what *you* wrote yourself. See below.
<aside>
I've looked at many robots.txt files and nobody
disallows the /. Maybe there are some special websites
that *do* that, but http://www.metasystema.org/terms.mhtml
looks like a *very* rare example to me. But that's
irrelevant, because you make a stronger point.
</aside>
> It is just a machine-readable version of [3], kindly provided
> by Google for your crawling convenience. robots.txt has no legal meaning
> [4]; you probably can't be sued for disregarding it. But you can for
> breaking sites' TOS agreements.
Exactly! So *can* Google. That was my point.
There is no significant difference between HGRAB
and Google.
Thank you for the URLs.
I'll put a few lines on the website and a few lines into
the polling script. Like Google did.
Rgds.Paul.
PS. This all means that some company may write
some very tricky TOS (that a crawler would not
understand), feed poor Google's robot some
pages and then just wait for those pages to become
available in Google's cache and then start
the game.
PPS. I have already had a situation where
Google composed a short description
that looked like
'Paul Tchistopolskii said:
"All W3C members are morons"'.
The problem was that:
1. There was a thread on some webforum,
that had a subject "All W3C members are morons".
2. I participated in that thread saying that
*this is not true* and then explaining
why I think some things may look strange
to a W3C outsider.
3. Google composed it *wrong* ( because
it just put together the title of the thread and
my name ).
4. The original web-forum thread *has been removed*.
5. So, for a couple of months, a person who
typed my name into Google's search engine
might think that that's what I said.
Not nice.
So - how much should Google pay me for this glitch
in their software?
I think that lawyers will have plenty of food in the coming years.
The ownership and operations on Web content are
tricky things.
> [1] http://www.google.com/webmasters/faq.html#nocrawl
>
> [2] http://www.robotstxt.org/wc/norobots.html#format
>
> [3] http://www.google.com/terms_of_service.html
>
> [4] http://www.robotstxt.org/wc/norobots.html#status