HTML scraping

From: Arnaud Sahuguet <sahuguet@gradient.cis.upenn.edu>
To: XML-DEV <xml-dev@lists.xml.org>, Gul Imran <gimran@nortelnetworks.com>
Date: Sat, 24 Mar 2001 17:29:41 -0500

If you want to scrape everything (transforming HTML into XML), then Tidy
is the right way to go (as mentioned in a previous posting).

If you want to extract only SOME HTML information and map it to XML,
then you should look at W4F (http://db.cis.upenn.edu/W4F/).

There are a couple of online examples that show how to build XML
gateways that transform HTML into XML on the fly. The XML can then be
used by other applications.
http://db.cis.upenn.edu/W4F/Examples/XML-Gateway/
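
To give a rough idea of what "extract only SOME information and map it
to XML" means in practice, here is a minimal sketch in Python. It is not
W4F and does not use its rule language; it uses only the standard
library, and the names LinkExtractor and links_to_xml, as well as the
<links>/<link> output format, are made up for this illustration. It
pulls just the hyperlinks out of a (possibly sloppy) HTML page and emits
them as a small XML document.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collect (href, text) for each <a href=...>."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_to_xml(html):
    """Map the extracted links to an assumed <links><link/></links> format."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("links")
    for href, text in parser.links:
        link = ET.SubElement(root, "link", href=href)
        link.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = '<p>See <a href="http://example.org/a">First <b>one</b></a>.<br>'
    print(links_to_xml(page))
```

Everything else on the page (layout markup, unclosed tags, and so on) is
simply ignored, which is the difference from the Tidy approach, where the
whole document is converted.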

There is also an interesting related article in JavaWorld:
http://www.javaworld.com/javaworld/jw-03-2001/jw-0316-webdb.html

Regards,

Arnaud
