I forgot one (or two) more HTML parsers:
There is Anders Kristensen's HEX -
http://www-uk.hpl.hp.com/people/sth/java/hex.html. It is quite old. It
uses SAX 1 to build a DOM tree. I updated it to SAX2 and the Xerces DOM,
but then started on my own project.
HEX enforces well-formedness while it builds the DOM tree. I moved that
logic into an XMLFilter, which lets you enforce well-formedness on the
SAX stream itself. Anders is no longer at HP; he works for another
company, so the mail address won't work.
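To make the XMLFilter idea concrete, here is a minimal sketch of what such a filter might look like. It is not the HEX code or my own filter: the class name BalancingFilter and its repair rules (drop stray end tags, close anything left open) are assumptions for illustration. It also assumes a lenient HTML tokenizer upstream that reports tags as SAX events without checking nesting, and it ignores namespaces.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: repairs unbalanced element events on a SAX stream. */
class BalancingFilter extends XMLFilterImpl {
    // qNames of the elements currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startDocument() throws SAXException {
        open.clear();
        super.startDocument();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        open.push(qName);
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (!open.contains(qName)) {
            return; // stray end tag with no matching start tag: drop it
        }
        // Close every element that was opened inside this one and never closed.
        while (!open.peek().equals(qName)) {
            String name = open.pop();
            super.endElement("", name, name); // assumption: no namespaces, as in plain HTML
        }
        open.pop();
        super.endElement(uri, local, qName);
    }

    @Override
    public void endDocument() throws SAXException {
        // Close whatever is still open when the input ends.
        while (!open.isEmpty()) {
            String name = open.pop();
            super.endElement("", name, name);
        }
        super.endDocument();
    }
}
```

The filter sits between the event source and any ordinary ContentHandler (set with setContentHandler), so the repair happens on the stream and no DOM tree is needed. A real HTML filter would also need rules for empty elements such as <br> and for tags that implicitly close others.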
IBM has a "system" called ANDES, which is (or was, I don't know) used to
parse HTML pages (and do a lot of other interesting things). From the
papers I read, it sounded like just the tool I wanted, but I could not
find any information on IBM's sites. Does anyone have any info about ANDES?
Niels Peter
On Monday, March 4, 2002, at 08:12 PM, Niels Peter Strandberg wrote:
> In Java you have JTidy - http://lempinen.net/sami/jtidy/ or
> http://sourceforge.net/projects/jtidy/
> It builds its own W3C DOM tree, but you can traverse that tree to
> generate SAX events, or build a new Xerces or JDOM tree from the SAX
> events. But Tidy doesn't handle duplicate attributes, among other things.
>
> In C you have Tidy for all major platforms, and it is very fast. GUIs
> exist. It can be found here - http://tidy.sourceforge.net/
>
> In Java, Andy Clark, an IBM Xerces programmer, has made a "preview" of
> an HTML parser using the new Xerces XNI. He posted the source code to
> the Xerces mailing list. Andy Clark is a parser pro, so he knows what
> he is doing.
>
> I'm also working on an HTML parser, but it is too early to talk about.
> Parsing HTML documents is often about capturing information from a page,
> and I find myself using XSLT, XMLFilters etc. to extract data. That
> is powerful but not very simple.
>
> HTML parsing is not always about well-formedness; often it is about
> extracting information using regular expressions or simple text
> patterns. Then you have XPath and XSL, which require a well-formed
> (X)HTML document to work, and that requires building DOM trees, which
> is a memory and speed problem. Much more could be said on this...
>
> Digital (now Compaq) tried to make a "web language" that you can use to
> fetch pages from the web and extract data. Take a look at it at
> http://www.research.compaq.com/SRC/WebL/. There are problems with Java
> 1.3; you need to make some small changes to the source code (I'm running
> it on Java 1.3 on Mac OS X).
>
> Niels Peter
>
>
> On Monday, March 4, 2002, at 06:24 PM, Alexey N. Shananin wrote:
>
> >Hi!
> >I'm looking for a parser for HTML.
> >I know that XML parsers can't correctly handle HTML tags because
> >these tags might be unclosed (I mean the <br> tag rather than <br/>,
> >for example...).
> >I heard about the XHTML standard. It's supported by XML parsers, as far
> >as I
>
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
>