
HTML scraping

If you want to scrape everything (transforming the whole HTML page into
XML), then Tidy is the right way to go (as mentioned in a previous posting).

If you want to extract only SOME of the HTML information and map it to
XML, then you should look at W4F (http://db.cis.upenn.edu/W4F/).
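
To make the "extract only SOME information" idea concrete, here is a rough sketch in Python using only the standard library. It is not W4F and does not use W4F's extraction language; it only illustrates the general pattern of pulling selected pieces (here, hyperlinks) out of an HTML page and mapping them to a small XML document. The names LinkExtractor and links_to_xml and the <links>/<link> output vocabulary are made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects (href, text) for each <a href=...>."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Keep text only while inside a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_to_xml(html):
    """Map the links found in an HTML string to a small XML document."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("links")  # assumed output vocabulary
    for href, text in parser.links:
        item = ET.SubElement(root, "link", href=href)
        item.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = '<p>See <a href="http://example.org/a">First <b>page</b></a>.</p>'
    print(links_to_xml(page))
```

Everything that is not a link is simply ignored, which is the main difference from the Tidy approach: the output XML follows a structure you choose rather than mirroring the page markup.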

There are a couple of online examples that show how to build XML
gateways that transform HTML into XML on the fly. The XML can then be
used by other applications.

There is also an interesting related article in JavaWorld: