xml-dev - Re: web crawling (was: Re: [xml-dev] HGRAB. Syndication. Google. Greyare


To: "K. Ari Krupnikov" <ari@cogsci.ed.ac.uk>
Subject: Re: web crawling (was: Re: [xml-dev] HGRAB. Syndication. Google. Greyarea.)
From: Paul T <pault12@pacbell.net>
Date: Tue, 08 Jan 2002 21:13:16 -0800
Cc: xml-dev@lists.xml.org
References: <02e301c19889$99cf3680$2cd0a340@paultx><3C3B7134.D418F69B@cogsci.ed.ac.uk> <034801c1989c$21f77470$2cd0a340@paultx><3C3BC2F3.80B93425@cogsci.ed.ac.uk>


> > I don't understand your point. Could you please
> > explain?
> > 
> > Because HGRAB, for example,
> > usually polls only the home page of a website,
> > all of them are allowed for polling.
> 
> Not all. Some sites "Disallow: /".

None of the sites syndicated by HGRAB has such 
a robots.txt. 
 
> > Also, I'm not sure whether search engines
> > really care about robots.txt, but that's another
> > story.
> 
> Googlebot does [1], and that answers your question about the difference
> between it and HGRAB.

So HGRAB 'cares' as well.  But you are right, HGRAB 
should read the robots.txt file in the polling script. 
Just in case.
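
A minimal sketch of what such a check in a polling script could look like, 
using Python's standard urllib.robotparser. The user-agent name "HGRAB", 
the helper name may_poll and the example.org URLs are assumptions made for 
illustration; this is not the actual HGRAB script.

```python
from urllib.robotparser import RobotFileParser


def may_poll(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Two hypothetical robots.txt files.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    # A site with "Disallow: /" refuses even a home-page poll.
    print(may_poll(BLOCK_ALL, "HGRAB", "http://example.org/"))
    # A site that only hides /private/ still allows the home page.
    print(may_poll(BLOCK_PRIVATE, "HGRAB", "http://example.org/"))
```

In a real script the text would be fetched from /robots.txt on the polled 
host before the home page is requested.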

After that, to be exactly like Google, 
the *only* thing I need is to write on the HGRAB website:
"if you want me not to grab your content - protect 
yourself with a tricky robots.txt". Because *that*
is what Google does (judging from your URLs).

But it would not change the actual situation. 
The actual situation is that HGRAB is like Google
and I think that both 'look illegal'. 

> > Also, the interesting twist is that when the
> > robot encounters the website with *no*
> > robots.txt ( most of the sites have no robots.txt )
> > the robot assumes that it is *safe* for him to
> > 'steal' the content.
> 
> No twist here; "if it [robots.txt] was not present [then] all robots
> will consider themselves welcome" [2].

Sure. I think that's until the first lawsuit. 
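
The default-allow convention quoted from [2] can be sketched as below. 
The function name allowed, the use of None to stand for "no robots.txt 
was found" and the example.org URL are assumptions for illustration; 
the sketch says nothing about the legal side of the question.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Apply the convention: no robots.txt means every robot is welcome."""
    if robots_txt is None:
        # The site publishes no robots.txt at all, so the robot
        # considers itself welcome everywhere.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # No file: the robot treats the page as fair game.
    print(allowed(None, "HGRAB", "http://example.org/"))
    # An explicit "Disallow: /" is the only case here that stops it.
    print(allowed("User-agent: *\nDisallow: /\n", "HGRAB", "http://example.org/"))
```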
 
> > I think this is really gray area and
> > robots.txt is not a solution.
> > At the moment, at least.
> 
> It isn't. 

It is.  That's what *you* wrote yourself. See below.

<aside>
I've looked at many robots.txt files and nobody 
disallows /. Maybe there are some special websites 
that *do* that, but http://www.metasystema.org/terms.mhtml
looks like a *very* rare example to me. But that's 
irrelevant, because you make a stronger point.
</aside>

> It is just a machine-readable version of [3], kindly provided
> by Google for your crawling convenience. robots.txt has no legal meaning
> [4]; you probably can't be sued for disregarding it. But you can for
> breaking sites' TOS agreements.

Exactly! So *can* Google. That was my point.

There is no significant difference between HGRAB 
and Google. 

Thank you for the URLs.

I'll put a few lines on the website and a few lines into 
the polling script. Like Google did. 

Rgds.Paul.

PS. This all means that some company may write 
some very tricky TOS (that a crawler would not 
understand), feed poor Google's robot with some 
pages and then just wait for those pages to become 
available in Google's cache and then start 
the game.

PPS. I have already run into a situation where 
Google composed a short description
that looked like:

'Paul Tchistopolskii said:  
"All W3C members are morons"'.

The problem was that:

1. There was a thread on some webforum, 
that had a subject "All W3C members are morons".

2. I participated in that thread saying that 
*this is not true* and then explaining 
why I think some things may look strange 
to a W3C outsider.

3. Google composed it *wrong* (because 
it just put together the title of the thread and 
my name).

4. The original web-forum thread *has been removed*.

5. So, for a couple of months, anyone who 
typed my name into Google's search engine 
might have thought that that's what I said. 

Not nice. 

So - how much should Google pay me for this glitch
in their software? 

I think that lawyers will have plenty of food in the coming years. 
Ownership of Web content and operations on it are
tricky things.

> [1] http://www.google.com/webmasters/faq.html#nocrawl
> 
> [2] http://www.robotstxt.org/wc/norobots.html#format
> 
> [3] http://www.google.com/terms_of_service.html
> 
> [4] http://www.robotstxt.org/wc/norobots.html#status

Follow-Ups:
- Re: [xml-dev] Re: web crawling (was: Re: [xml-dev] HGRAB. Syndication. Google. Grey area.)
  - From: "Dare Obasanjo" <kpako@yahoo.com>

References:
- HGRAB. Syndication. Google. Grey area.
  - From: Paul T <pault12@pacbell.net>
- Re: [xml-dev] HGRAB. Syndication. Google. Grey area.
  - From: "K. Ari Krupnikov" <ari@cogsci.ed.ac.uk>
- Re: [xml-dev] HGRAB. Syndication. Google. Grey area.
  - From: Paul T <pault12@pacbell.net>
- OT: web crawling (was: Re: [xml-dev] HGRAB. Syndication. Google. Grey area.)
  - From: "K. Ari Krupnikov" <ari@cogsci.ed.ac.uk>
