In Java you have JTidy - http://lempinen.net/sami/jtidy/ or
http://sourceforge.net/projects/jtidy/
It builds its own W3C DOM tree, but you can traverse that tree to generate
SAX events, or build a new Xerces or JDOM tree from the SAX events. Note
that Tidy doesn't handle duplicate attributes, among other things.
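The tree-to-SAX idea can be sketched roughly like this, using only the DOM and SAX classes that ship with the JDK. The class name DomToSax and the methods walk and trace are made up for illustration, and the DOM here comes from the JDK parser as a stand-in for the tree JTidy would build. Namespaces, comments and processing instructions are ignored, so treat it as a starting point rather than a finished bridge.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class DomToSax {

    /** Recursively walks a DOM node and reports it to a SAX ContentHandler. */
    public static void walk(Node node, ContentHandler handler) throws SAXException {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: {
                handler.startDocument();
                walkChildren(node, handler);
                handler.endDocument();
                break;
            }
            case Node.ELEMENT_NODE: {
                String name = node.getNodeName();
                // Copy DOM attributes into a SAX attribute list (no namespaces here).
                AttributesImpl atts = new AttributesImpl();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Node a = map.item(i);
                    atts.addAttribute("", a.getNodeName(), a.getNodeName(),
                            "CDATA", a.getNodeValue());
                }
                handler.startElement("", name, name, atts);
                walkChildren(node, handler);
                handler.endElement("", name, name);
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE: {
                char[] text = node.getNodeValue().toCharArray();
                handler.characters(text, 0, text.length);
                break;
            }
            default:
                break; // comments, processing instructions etc. are skipped in this sketch
        }
    }

    private static void walkChildren(Node parent, ContentHandler handler) throws SAXException {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, handler);
        }
    }

    /** Parses well-formed markup to a DOM, replays it as SAX, and records the events. */
    public static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start:" + qName);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end:" + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("text:" + new String(ch, start, length));
            }
        };
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            walk(doc, recorder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace("<p>Hello<br/>world</p>"));
    }
}
```

Any ContentHandler can take the place of the recorder, for example a JDOM builder or an XMLFilter chain.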
In C you have Tidy for all major platforms, and it is very fast. GUIs
exist. It can be found here - http://tidy.sourceforge.net/
In Java, Andy Clark, a Xerces programmer at IBM, has made a "preview" of
an HTML parser using the new Xerces XNI. He posted the source code to the
Xerces mailing list. Andy Clark is a parser professional, so he knows what
he is doing.
I'm also working on an HTML parser, but it is too early to talk about.
Parsing HTML documents is often about capturing information from a page,
and I find myself using XSLT, XMLFilters etc. to extract data. It is
powerful but not very simple.
HTML parsing is not always about well-formedness; often it is about
extracting information using regular expressions or simple text patterns.
Then you have XPath and XSL, which require a well-formed (X)HTML document
to work, and that requires building DOM trees, which is a memory and speed
problem.
Much more could be said on this...
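To make the contrast concrete, here is a rough sketch that pulls link targets out of a page both ways, using only the JDK (java.util.regex and javax.xml.xpath). The class LinkExtract, its method names and the regular expression are my own illustration. The pattern is deliberately naive: it only sees quoted href values, so it is an assumption about the input rather than a general HTML matcher.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LinkExtract {

    // Naive pattern: an <a ...> tag with a quoted href value, case-insensitive.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Works on any text, well-formed or not, without building a tree. */
    public static List<String> hrefsByRegex(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    /** Needs well-formed input; the XPath engine builds a DOM tree internally. */
    public static List<String> hrefsByXPath(String xhtml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//a/@href",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getNodeValue());
            }
            return out;
        } catch (XPathExpressionException e) {
            // Thrown, among other cases, when the input is not well-formed.
            throw new IllegalArgumentException("input could not be evaluated", e);
        }
    }

    public static void main(String[] args) {
        // Tag soup: unclosed <br>, unclosed <a>, mixed case.
        String soup = "<p>See <a href=\"a.html\">one<br><A HREF='b.html'>two";
        System.out.println(hrefsByRegex(soup));

        String xhtml = "<html><body><a href=\"a.html\">one</a><br/>"
                + "<a href=\"b.html\">two</a></body></html>";
        System.out.println(hrefsByXPath(xhtml));
    }
}
```

The regex route copes with the tag soup but knows nothing about document structure; the XPath route is precise but rejects the same soup outright, which is why something like Tidy has to run first.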
Digital (now Compaq) tried to make a "web language" that you can use to
fetch pages from the web and extract data. Take a look at it at -
http://www.research.compaq.com/SRC/WebL/. There are problems with Java
1.3; you need to make some small changes to the source code (I'm running
it on Java 1.3 on Mac OS X).
Niels Peter
On Monday, March 4, 2002, at 06:24 PM, Alexey N. Shananin wrote:
>Hi!
>I'm looking for a parser for HTML.
>I know that XML parsers can't correctly handle HTML tags because these
>tags might be unclosed (I mean <br> tag, but not <br/> or for example...).
>I heard about XHTML standard. It's supported by XML parsers, as far as I