   OT: web crawling (was: Re: [xml-dev] HGRAB. Syndication. Google. Grey ar


Paul T wrote:
> 
> > > Google does *exactly* this (and also Google
> > > provides a cached copy of the original content)
> > >
> > > That means:
> > >
> > > Either both HGRAB and Google should be sued,
> > > because they both sell the content
> > > *which does not belong to them*, or both
> > > HGRAB and Google should be considered
> > > 'just a service'.
> >
> >
> > Have a look at http://www.google.com/robots.txt
> 
> I don't understand your point. Could you please
> explain?
> 
> Because HGRAB, for example,
> usually polls only the home page of a website,
> and those are all allowed for polling.

Not all. Some sites "Disallow: /".

> Also, I'm not sure whether search engines
> really care about robots.txt, but that's another
> story.

Googlebot does [1], and that answers your question about the difference
between it and HGRAB.

> Also, the interesting twist is that when the
> robot encounters the website with *no*
> robots.txt ( most of the sites have no robots.txt )
> the robot assumes that it is *safe* for him to
> 'steal' the content.

No twist here; "if it [robots.txt] was not present [then] all robots
will consider themselves welcome" [2].
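
To make the two cases above concrete (a site that says "Disallow: /" and
a site with no robots.txt rules at all), here is a minimal sketch using
Python's standard urllib.robotparser. The helper name is_allowed, the
user agent "ExampleBot" and the example.com URLs are assumptions made up
for illustration; this is not how Googlebot or HGRAB is actually
implemented.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    is_allowed is a hypothetical helper for this sketch; it parses the
    robots.txt text directly instead of downloading it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A site that shuts out every robot, including from its home page.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# A site whose robots.txt is empty: no rules, so nothing is disallowed.
NO_RULES = ""

# A site that only fences off one directory.
PARTIAL = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    home = "http://example.com/"
    print(is_allowed(BLOCK_ALL, "ExampleBot", home))
    print(is_allowed(NO_RULES, "ExampleBot", home))
    print(is_allowed(PARTIAL, "ExampleBot", home))
    print(is_allowed(PARTIAL, "ExampleBot", "http://example.com/private/x"))
```

A polite crawler would run a check like this before each fetch, home page
included. Nothing in the protocol forces it to.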

> I think this is really gray area and
> robots.txt is not a solution.
> At the moment, at least.

It isn't. Google's robots.txt is just a machine-readable version of [3],
kindly provided by Google for your crawling convenience. robots.txt has
no legal meaning [4]; you probably can't be sued for disregarding it. But
you can be for breaking a site's TOS agreement.


Ari.

[1] http://www.google.com/webmasters/faq.html#nocrawl

[2] http://www.robotstxt.org/wc/norobots.html#format

[3] http://www.google.com/terms_of_service.html

[4] http://www.robotstxt.org/wc/norobots.html#status




 


Copyright 2001 XML.org. This site is hosted by OASIS