xml-dev - Re: [xml-dev] generating DOM from ill-formed HTML docs

To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] generating DOM from ill-formed HTML docs
From: Mike Champion <mc@xegesis.org>
Date: Sun, 14 Jul 2002 23:29:29 -0400
In-reply-to: <20020715020006.44124.qmail@web9807.mail.yahoo.com>

7/14/2002 10:00:06 PM, Robert Mena <rt_mena@yahoo.com> wrote:

>Hi, I am developing an application that will have to
>build a DOM tree of html pages.
>
>I'll use such DOM trees to perform some
>analysis/comparisons.
>
>Since most of the time I'll find ill-formed documents
>I'd like to know if there are any parsers out there
>that "accept" this flaws and builds the tree anyway.
>
>I've tried domxml (php) with no luck.

The standard answer is to use tidy (http://tidy.sourceforge.net/)
to convert the page to XHTML, and then parse the result with an
ordinary XML parser.
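
As a rough sketch of that pipeline in Python (my choice for illustration; nothing in your question requires it), assuming a tidy binary is installed and on the PATH: the helper names tidy_to_xhtml, parse_xhtml and count_elements are made up for this example. The -numeric flag asks tidy for numeric character references, so the XML parser does not have to know HTML's named entities.

```python
import subprocess
from xml.dom import minidom


def tidy_to_xhtml(html: str, tidy_cmd: str = "tidy") -> str:
    """Run the external tidy program to turn messy HTML into XHTML.

    Assumes a tidy executable is available as `tidy_cmd`.
    """
    proc = subprocess.run(
        [tidy_cmd, "-asxhtml", "-numeric", "-quiet", "-utf8"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 for warnings (expected on bad HTML) and 2 for errors.
    if proc.returncode >= 2:
        raise RuntimeError(proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")


def parse_xhtml(xhtml: str) -> minidom.Document:
    """Parse well-formed XHTML with an ordinary XML parser into a DOM tree."""
    return minidom.parseString(xhtml.encode("utf-8"))


def count_elements(doc: minidom.Document) -> int:
    """Tiny example of tree analysis: count every element in the document."""
    return len(doc.getElementsByTagName("*"))
```

With tidy installed, parse_xhtml(tidy_to_xhtml(raw_html)) should give a DOM Document that can be walked or compared like any other XML tree.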

The possibly wacky answer is to use Javascript in a browser, if
at all possible.  For better or worse (mostly worse!) the browser
vendors have worked hard to "accept the flaws and build a
tree anyway", and they expose that tree through the DOM API.  You
can essentially pretend the ill-formed HTML is XML and use the
Core DOM interfaces to work with it.  You might need some
server-side PHP or whatever to grab the web pages, insert your
Javascript code into them, and feed the result to the browser, to
work around the browser/Javascript "sandbox" limitations.
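
The server-side "insert the script" step might look roughly like this. It is only a sketch, written in Python rather than PHP, and inject_script is a hypothetical helper name. It assumes the page has already been fetched into a string and that the script text does not itself contain a closing script tag.

```python
import re


def inject_script(html: str, script: str) -> str:
    """Insert an inline script just before the last </body> tag.

    If the (possibly ill-formed) page has no </body>, append the script
    at the end and let the browser's error recovery sort it out.
    """
    tag = '<script type="text/javascript">\n' + script + "\n</script>"
    matches = list(re.finditer(r"</body\s*>", html, flags=re.IGNORECASE))
    if not matches:
        return html + tag
    pos = matches[-1].start()
    return html[:pos] + tag + html[pos:]
```

The injected script would then run inside the browser against the tree the browser built, for example by calling document.getElementsByTagName('*') and reporting what it finds back to the server.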

Follow-Ups:
- RE: [xml-dev] generating DOM from ill-formed HTML docs
  - From: "Ramin Firoozye" <ramin@wizen.com>

References:
- generating DOM from ill-formed HTML docs
  - From: Robert Mena <rt_mena@yahoo.com>
