xml-dev - Re: [xml-dev] HGRAB. Syndication. Google. Grey area.

To: Leigh Dodds <ldodds@ingenta.com>, xml-dev@lists.xml.org
Subject: Re: [xml-dev] HGRAB. Syndication. Google. Grey area.
From: Paul T <pault12@pacbell.net>
Date: Wed, 09 Jan 2002 10:16:08 -0800
References: <NCBBKFMJCLIMOBIGKFMJEEDOHAAA.ldodds@ingenta.com>

Meerkat is cool. Really. I'm thinking

about a Meerkat - HGRAB gateway.

The few problems ( two minor and

one big ) I have with Meerkat are :

1. Not all RSS channels have a brief

description.

2. Many interesting websites don't

bother to provide RSS ( or robots.txt for

that matter ;-)

3. I want to syndicate what I want

with the GUI that I want,

not what 'they' provide me with.

To the best of my knowledge, for example,

W3C.org does not provide RSS for their

website.

Actually, I can write an HGRAB -> Meerkat

gateway (so that W3C.org would make

a channel on Meerkat).

If somebody from Meerkat would be

interested - I'd be glad to do that.
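
A gateway of this kind mostly has to re-publish grabbed items as an RSS channel that an aggregator can poll. The following is only a minimal sketch of that idea, not HGRAB's or Meerkat's actual interface: the function name build_rss_channel, the (title, url, excerpt) tuple layout, and the choice of an RSS 0.91-style document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_rss_channel(title, link, description, items):
    """Serialize grabbed items as a minimal RSS 0.91-style channel.

    `items` is an iterable of (title, url, excerpt) tuples; this layout
    is hypothetical and not taken from HGRAB itself.
    """
    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_url, excerpt in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_url
        # The excerpt fills the <description> that many feeds leave out.
        ET.SubElement(item, "description").text = excerpt
    return ET.tostring(rss, encoding="unicode")
```

Calling it with a channel title, the site URL, and a list of grabbed items returns the XML as a string, which could then be served at a URL registered with the aggregator.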

Again - you are right, that Meerkat

is a very nice thing. At the beginning

I was considering using it instead of

writing HGRAB. But now I think

that it is good to have both.

Rgds.Paul.

From: Leigh Dodds

To: Paul T ; xml-dev@lists.xml.org

Sent: Wednesday, January 09, 2002 5:15 AM

Subject: RE: [xml-dev] HGRAB. Syndication. Google. Grey area.

Why not just use Meerkat? It's got a good selection

of XML channels, and you can customise it using

the 'Mobs' feature.

I assume it also builds a local index because you can

perform a search amongst the feeds.

RSS channels usually have a brief description of

an article as well as a title and channel. Isn't this

enough?

Cheers,

L.

-----Original Message-----
From: Paul T [mailto:pault12@pacbell.net]
Sent: 08 January 2002 21:15
To: xml-dev@lists.xml.org
Subject: [xml-dev] HGRAB. Syndication. Google. Grey area.

Some time ago I proposed that some knowledge

base about XML-related words could be created

so that people can submit/organize XML-related

words into some 'clusters'.

I decided that before writing such a system,

I should write some other system, that would

allow me to "automatically get all the XML-related

news".

So I've written a simple 'news feed', which

polls some news sources and syndicates

the 'news feed' for me.

The alpha version of this syndicator is

located at http://www.pault.com/hgrab

It is still incomplete and not all XML-related

sources are syndicated ( I would greatly appreciate

any URLs to XML-news websites ), but it should

give the idea.

I believe that for some cases it is

*very* convenient to get such a news feed,

rather than browse each website

or use Google. The problem with RSS / RDF

is that none of the RSS / RDF sources

that I've seen provides any information

other than the title and URL. That is just

'not enough'.
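
One way to check this claim against a given feed is to list the items that carry no description. This is a rough sketch only: the function name items_missing_description is made up for illustration, and the code assumes a plain, un-namespaced RSS 0.9x document (RSS 1.0 / RDF feeds use XML namespaces and would need namespace-aware lookups).

```python
import xml.etree.ElementTree as ET


def items_missing_description(rss_text):
    """Return the titles of <item> elements with no usable <description>.

    Assumes un-namespaced RSS 0.9x; namespaced RSS 1.0 (RDF) items
    are not matched by the plain "item" tag used here.
    """
    root = ET.fromstring(rss_text)
    missing = []
    for item in root.iter("item"):
        # A missing element and an empty or whitespace-only one count the same.
        description = item.findtext("description", default="").strip()
        if not description:
            missing.append(item.findtext("title", default="").strip())
    return missing
```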

However, there is one fundamental

legal problem with current HGRAB

design.

The first impression is : "it is suspicious,

because it is not you who is creating the

content". ( that's why HGRAB strips

the original markup, so that the user is

forced to go to the original news source ).

However, the more I think about this, the

more interesting it gets.

So, what does HGRAB actually do? It polls

the HTML pages ( once in a while,

no harm done to the load of original

website ). Then it places some part of the

content into HGRAB database (for future

searching). Then it provides the end-user

with some 'part' of the original news item

and with the URL to the original news source.
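
The steps just described (strip the markup, keep a part of the text for searching, hand back the excerpt together with the source URL) could be sketched roughly as below. This is an illustration under stated assumptions, not HGRAB's code: HGRAB is written in XSLT, SQL and Perl, whereas this sketch uses Python with an in-memory SQLite table, and the names open_db, store_item, search, the news table, and the 200-character excerpt length are all hypothetical. Fetching the page is left out, so the functions take HTML that has already been downloaded.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and throws away all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(html_text):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Text nodes are concatenated as-is; adjacent blocks may run together.
    return " ".join("".join(parser.parts).split())


def open_db():
    """Create the (hypothetical) table that holds one excerpt per URL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE news (url TEXT PRIMARY KEY, excerpt TEXT)")
    return db


def store_item(db, url, html_text, excerpt_chars=200):
    """Keep only a plain-text part of the page, keyed by its source URL."""
    excerpt = strip_markup(html_text)[:excerpt_chars]
    db.execute(
        "INSERT OR REPLACE INTO news (url, excerpt) VALUES (?, ?)",
        (url, excerpt),
    )


def search(db, word):
    """Return (url, excerpt) pairs whose excerpt contains `word`."""
    rows = db.execute(
        "SELECT url, excerpt FROM news WHERE excerpt LIKE ? ORDER BY url",
        (f"%{word}%",),
    )
    return rows.fetchall()
```

The reader only ever gets the stripped excerpt and the URL, so reading the full item still means visiting the original site.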

Google does *exactly* this (and also Google

provides a cached copy of the original content)

That means:

Either both HGRAB and Google should be sued,

because they both sell the content

*which does not belong to them*, or both

HGRAB and Google should be considered

'just a service'.

Add the (similar) legal problems that Napster

( and many other P2P networks ) ...

"What can you do to the content created not by you"

is *really* a tricky question.

Is Google illegal ? I'm not a lawyer - I don't know.

My conclusion is that the Internet is a legal mess

and in the coming years there will be more work for

lawyers than for developers.

Whatever. I'm not a lawyer, so I will continue to

improve the syndicator (anyway, my goal is still the

'XML-words' knowledge base; HGRAB

was a side effect ) so :

1. Do you know about some nice XML-news

( or 'general IT' ) websites other than:

http://www.pault.com:88/hgrab/sources

Could you please drop me the URL?

2. If HGRAB looks interesting to you (in any way) -

you're welcome to write to me. I'm thinking of making

it into a standalone product, but ... Thinking ...

Even in alpha, it already appears to work and

adding a new 'news Source' is usually a matter

of 15 minutes (that was the goal)

Rgds.Paul.

PS.

HGRAB is written in XSLT, XML Chunks,

SQL, Perl. When writing it, I found that

XSLT is *not* actually a good tool for

*processing* of mixed content ( XSLT is

good for *rendering* mixed content - which

is a different task. XPath axes are kinda

'orthogonal' to 'good-old regular expressions'

machinery and good-old regular expressions

work 'better'. I can elaborate, if somebody

is interested )
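
The message does not say which processing tasks were meant, so the following is only a guess at the kind of job where a regular expression is convenient: pulling the first sentence out of a paragraph of mixed content, where the sentence boundary falls in the character data and ignores the element structure. The function name first_sentence and both patterns are assumptions for illustration; the tag pattern is deliberately naive and would break on a '>' inside an attribute value.

```python
import re

# Naive tag matcher; good enough for simple, well-behaved markup only.
TAG = re.compile(r"<[^>]+>")
# Shortest run ending in . ! or ? that is followed by whitespace or end of text.
FIRST_SENTENCE = re.compile(r"(.+?[.!?])(?:\s|$)", re.S)


def first_sentence(mixed_content):
    """Return the first sentence of a mixed-content fragment as plain text."""
    text = " ".join(TAG.sub("", mixed_content).split())
    match = FIRST_SENTENCE.search(text)
    # Fall back to the whole text when no sentence-ending punctuation exists.
    return match.group(1) if match else text
```

For a fragment such as '<p>The <b>W3C</b> released <a href="http://example.org/">XSLT 2.0</a> today. More details follow.</p>' the sentence runs across three elements and two text nodes; the pattern skips the dot in "2.0" because it is not followed by whitespace.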

Smart SPAM Filter
http://www.spafi.com
