
Parsing HTML in Perl

From: Frank Boumphrey <bckman@ix.netcom.com>
To: "XML-DEV (E-mail)" <xml-dev@lists.xml.org>
Date: Wed, 09 May 2001 16:36:31 -0400

Perhaps of interest to you Perl programmers out there.

I asked one of our programmers (Gabe Schaeffer) to write a function to
parse a malformed HTML file, prior to converting it to XHTML. Here is
what he produced!

I've never seen an HTML file parsed with a single line of Perl RegEx
before!

sub ParseHTML
{
# Pass in an HTML string to be parsed and a boolean indicating whether
# whitespace between elements should be trimmed.
# Returns a dictionary with the elements in the string.
 my ($html, $trim) = @_;
 my ($i, $element, $dict);
 $dict = $Server->CreateObject("Scripting.Dictionary");

 foreach $element ($html =~
/(.*?)(<(?:(?:!--.*?--)|(?:\/?[a-z0-9_:.-]+(?:\s+[a-z0-9_:.-]+(?:=(?:[^> '"\t\n]+|(?:'.*?')|(?:".*?")))?)*))\s*\/?\s*>)/isg)
 {
  $element = TrimWS($element) if $trim;
  $dict->Add($i++, ParseTag($element, $trim)) if length $element;
 }
 return $dict;
}
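
For anyone who wants to try the pattern outside ASP, here is a minimal standalone sketch. It uses the same regular expression but returns a plain Perl list instead of a Scripting.Dictionary. The names tokenize_html and trim_ws are hypothetical: trim_ws only stands in for the TrimWS helper, which is not shown above, and the ParseTag step is left out entirely because its behavior is not shown either.

```perl
use strict;
use warnings;

# Same pattern as in ParseHTML, compiled once. Group 1 is the text that
# precedes a tag; group 2 is the tag itself (a comment or an element tag).
my $TOKEN = qr/(.*?)(<(?:(?:!--.*?--)|(?:\/?[a-z0-9_:.-]+(?:\s+[a-z0-9_:.-]+(?:=(?:[^> '"\t\n]+|(?:'.*?')|(?:".*?")))?)*))\s*\/?\s*>)/is;

# Hypothetical stand-in for TrimWS: strips leading and trailing whitespace.
sub trim_ws {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Returns the text runs and tags of $html as a flat list, in document order.
sub tokenize_html {
    my ($html, $trim) = @_;
    my @tokens;
    # In list context, m//g returns every capture group of every match,
    # so the list alternates between text (group 1) and tag (group 2).
    for my $piece ($html =~ /$TOKEN/g) {
        $piece = trim_ws($piece) if $trim;
        push @tokens, $piece if length $piece;   # skip empty text runs
    }
    return @tokens;
}

# Unclosed <p> and <b>, an unquoted attribute value, and a comment.
my @tokens = tokenize_html(qq{<p class=intro>Hello <b>world\n<br/> <!-- note -->}, 1);
print "$_\n" for @tokens;
```

With trimming enabled, this sample yields six tokens: <p class=intro>, Hello, <b>, world, <br/> and <!-- note -->. Note that every match has to end in a tag, so any text after the last tag in the input is never returned; a caller who needs that tail would have to pick it up separately.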

Follow-Ups:
- Re: Parsing HTML in Perl
  - From: Matt Sergeant <matt@sergeant.org>
