xml-dev - Re: [xml-dev] HTML parser

To: shananin@park.ru, xml-dev@lists.xml.org
Subject: Re: [xml-dev] HTML parser
From: Niels Peter Strandberg <nielspeter@npstrandberg.com>
Date: Mon, 4 Mar 2002 22:09:01 +0100
In-reply-to: <C7CCC6E0-2FA3-11D6-83B3-000502CB905D@npstrandberg.com>

I forgot one (or two) more HTML parsers:

There is Anders Kristensen's HEX -
http://www-uk.hpl.hp.com/people/sth/java/hex.html. It is quite old. It
uses SAX (1) to build a DOM tree. I updated it to SAX2 and the Xerces
DOM, but then started on my own project.

HEX handles well-formedness while it builds the DOM tree. I moved that
logic into an XMLFilter, which lets you enforce well-formedness on the
SAX stream instead. Anders is not at HP anymore; he works for another
company, so the mail address won't work.
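
A minimal sketch of that idea, in Python rather than Java because its
xml.sax module mirrors the SAX2 XMLFilter interface (XMLFilterBase).
This is not the HEX code. BalancingFilter, Recorder and balance are
hypothetical names, and the repair rule (close every element still open
when an ancestor's end tag arrives, drop stray end tags) is only an
assumption about what enforcing well-formedness on the stream could
look like:

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class BalancingFilter(XMLFilterBase):
    """Hypothetical filter that makes start/end element events nest properly."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._open = []  # names of elements that are currently open

    def startDocument(self):
        self._open = []
        super().startDocument()

    def startElement(self, name, attrs):
        self._open.append(name)
        super().startElement(name, attrs)

    def endElement(self, name):
        if name not in self._open:
            return  # stray end tag: drop it
        # Close everything opened inside `name`, then `name` itself.
        while self._open:
            top = self._open.pop()
            super().endElement(top)
            if top == name:
                break

    def endDocument(self):
        # Close whatever the input left open.
        while self._open:
            super().endElement(self._open.pop())
        super().endDocument()


class Recorder(ContentHandler):
    """Downstream handler that records the repaired event stream."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("<" + name + ">")

    def endElement(self, name):
        self.events.append("</" + name + ">")


def balance(events):
    """Feed (kind, name) pairs through the filter; return what comes out."""
    recorder = Recorder()
    flt = BalancingFilter()
    flt.setContentHandler(recorder)
    flt.startDocument()
    for kind, name in events:
        if kind == "start":
            flt.startElement(name, {})
        else:
            flt.endElement(name)
    flt.endDocument()
    return recorder.events
```

In real use the filter would sit between a tag-soup tokenizer that emits
SAX events and whatever consumes them; balance() only drives it by hand
to show the effect.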

IBM has a "system" called ANDES, which is (or was, I don't know) used to
parse HTML pages (and to do a lot of other interesting things). From the
papers I read, it sounded just like the tool I wanted, but I could not
find any information on IBM's sites. Does anyone have any info about
ANDES?

Niels Peter


On Monday, March 4, 2002, at 08:12 PM, Niels Peter Strandberg wrote:

> In Java you have JTidy - http://lempinen.net/sami/jtidy/ or 
> http://sourceforge.net/projects/jtidy/
> It builds its own W3C DOM tree, but you can traverse the tree to 
> generate SAX events, or build a new Xerces or JDOM tree from the SAX 
> events. But Tidy doesn't handle duplicate attributes, among other 
> things.
>
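
The "traverse the tree to generate SAX events" step can be sketched as
follows. This is not JTidy's API: it uses Python's standard
xml.dom.minidom and xml.sax modules as stand-ins for the W3C DOM and
SAX, and dom_to_sax and Collector are hypothetical names chosen for the
illustration:

```python
from xml.dom import Node, minidom
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def dom_to_sax(node, handler):
    """Walk a DOM subtree depth-first and replay it as SAX events."""
    if node.nodeType == Node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            dom_to_sax(child, handler)
        handler.endDocument()
    elif node.nodeType == Node.ELEMENT_NODE:
        attrs = AttributesImpl(dict(node.attributes.items()))
        handler.startElement(node.tagName, attrs)
        for child in node.childNodes:
            dom_to_sax(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType == Node.TEXT_NODE:
        handler.characters(node.data)
    # Comments, processing instructions etc. are ignored in this sketch.


class Collector(ContentHandler):
    """Records events so the replayed stream can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def replay(markup):
    """Parse well-formed markup into a DOM, then replay it as SAX events."""
    collector = Collector()
    dom_to_sax(minidom.parseString(markup), collector)
    return collector.events
```

Any SAX consumer (a filter chain, or a builder for another tree model)
could be passed in place of Collector.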
> In C you have Tidy for all major platforms, and it is very fast. GUIs 
> exist. It can be found here - http://tidy.sourceforge.net/
>
> In Java, Andy Clark, an IBM Xerces programmer, has made a "preview" of 
> an HTML parser using the new Xerces XNI. He posted the source code to 
> the Xerces mailing list. Andy Clark is a parser professional, so he 
> knows what he is doing.
>
> I'm also working on an HTML parser, but it is too early to talk about. 
> Parsing HTML documents is often about capturing information from a 
> page, and I find myself using XSLT, XMLFilters etc. to extract data. 
> That is powerful but not very simple.
>
> HTML parsing is not always about well-formedness, but about extracting 
> information using regular expressions or simple text patterns. Then you 
> have XPath and XSL, which require a well-formed (X)HTML document to 
> work, and that requires building DOM trees, which is a memory and speed 
> problem. Much more could be said on this...
>
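
To make the contrast concrete, here is a small sketch using only
Python's standard library. The HREF pattern and the three function
names are illustrative assumptions, and ElementTree's limited XPath
subset stands in for a full XPath/XSL engine. The regular expression
works on tag soup, while the tree-based route needs well-formed input:

```python
import re
import xml.etree.ElementTree as ET

# Illustrative pattern: href values of <a> tags, quoted with " or '.
HREF = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def links_by_regex(html):
    """Text-pattern extraction: no parsing, tolerates broken markup."""
    return HREF.findall(html)


def links_by_xpath(markup):
    """Tree-based extraction: builds the whole tree, so the markup
    must be well-formed or ET.ParseError is raised."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//a[@href]")]


def is_well_formed(markup):
    """True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

The regex route is cheap but brittle (it knows nothing about comments,
scripts or unquoted attribute values); the tree route is robust on clean
input but pays for building the tree first.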
> Digital (now Compaq) tried to make a "web language" that you can use to 
> fetch pages from the web and extract data. Take a look at it at 
> http://www.research.compaq.com/SRC/WebL/. There are problems with Java 
> 1.3; you need to make some small changes to the source code (I'm 
> running it on Java 1.3 on Mac OS X).
>
> Niels Peter
>
>
> On Monday, March 4, 2002, at 06:24 PM, Alexey N. Shananin wrote:
>
> >Hi!
> >I'm looking for a parser for HTML.
> >I know that XML parsers can't correctly handle HTML tags because
> >these tags might be unclosed (I mean a <br> tag, but not <br/> or for
> >example...).
> >I heard about the XHTML standard. It's supported by XML parsers, as far
> >as I
