xml-dev - Re: [xml-dev] generating DOM from ill-formed HTML docs


To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] generating DOM from ill-formed HTML docs
From: "Thomas B. Passin" <tpassin@comcast.net>
Date: Sun, 14 Jul 2002 23:20:07 -0400
References: <20020715020006.44124.qmail@web9807.mail.yahoo.com>

[Robert Mena]

> Hi, I am developing an application that will have to
> build a DOM tree of html pages.
>
> I'll use such DOM trees to perform some
> analysis/comparisons.
>
> Since most of the time I'll find ill-formed documents
> I'd like to know if there are any parsers out there
> that "accept" these flaws and build the tree anyway.
>
> I've tried domxml (php) with no luck.

The usual answer is to preprocess with Tidy - see

http://www.w3.org/People/Raggett/tidy/
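
As a rough sketch of the preprocess-then-parse approach (Python is used here purely for illustration, and the function names are made up for this example): it assumes a `tidy` executable is on the PATH and that it accepts the `-quiet`, `-asxml`, `-numeric` and `-utf8` command-line options, so check them against your Tidy version. The idea is to have Tidy emit well-formed XHTML and then hand that to an ordinary XML parser.

```python
import subprocess
import xml.dom.minidom


def tidy_to_xhtml(html: str, tidy_cmd: str = "tidy") -> str:
    """Pipe possibly ill-formed HTML through Tidy and return XHTML text.

    Assumption: `tidy_cmd` names an installed HTML Tidy executable.
    """
    result = subprocess.run(
        # -asxml: emit well-formed XHTML; -numeric: write entities such as
        # &nbsp; as numeric references so a non-validating XML parser
        # does not need the XHTML DTD to resolve them.
        [tidy_cmd, "-quiet", "-asxml", "-numeric", "-utf8"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors it could not fix).
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")


def build_dom(xhtml: str) -> xml.dom.minidom.Document:
    """Parse well-formed XHTML into a DOM tree with the standard library."""
    return xml.dom.minidom.parseString(xhtml.encode("utf-8"))
```

With Tidy installed, something like `doc = build_dom(tidy_to_xhtml(page_text))` should give you a DOM document to walk or compare; pages that Tidy cannot repair raise an error instead of producing a partial tree.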

You may also want to look at NekoHTML, at

http://www.apache.org/~andyc/

It processes HTML, fixing up some of the problems along the way, and plugs
into the Xerces XNI (Xerces Native Interface) so you can build a DOM.

Cheers,

Tom P
