Parsing HTML in Perl
- From: Frank Boumphrey <bckman@ix.netcom.com>
- To: "XML-DEV (E-mail)" <xml-dev@lists.xml.org>
- Date: Wed, 09 May 2001 16:36:31 -0400
Perhaps of interest to the Perl programmers out there.
I asked one of our programmers (Gabe Schaeffer) to write a function to
parse a malformed HTML file, prior to converting it to XHTML. Here is
what he produced!
I've never seen an HTML file parsed with a single line of Perl RegEx
before!
sub ParseHTML
{
# Pass in an HTML string to be parsed and a boolean indicating whether
# whitespace between elements should be trimmed.
# Returns a dictionary with the elements in the string.
my ($html, $trim) = @_;
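my ($element, $dict);   # parentheses are needed to declare both as lexicals
my $i = 0;              # running key for the dictionary entries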
$dict = $Server->CreateObject("Scripting.Dictionary");
# The pattern must stay on one line: a line break inside it changes its meaning.
foreach $element ($html =~ /(.*?)(<(?:(?:!--.*?--)|(?:\/?[a-z0-9_:.-]+(?:\s+[a-z0-9_:.-]+(?:=(?:[^> '"\t\n]+|(?:'.*?')|(?:".*?")))?)*))\s*\/?\s*>)/isg)
{
$element = TrimWS($element) if $trim;
$dict->Add($i++, ParseTag($element, $trim)) if length $element;
}
return $dict;
}
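
For anyone who wants to try the pattern outside ASP, here is a minimal stand-alone sketch. It is not Gabe's code: the name tokenize_html is hypothetical, a plain array stands in for the Scripting.Dictionary, an inline substitution stands in for TrimWS, and tags are kept as raw strings because ParseTag is not shown in this post.

```perl
use strict;
use warnings;

# Hypothetical stand-alone variant of ParseHTML: same pattern, but the
# pieces are collected in a plain array and tags are left unparsed.
sub tokenize_html {
    my ($html, $trim) = @_;
    my @tokens;
    # In list context with /g, each match contributes two captures:
    # the text before a tag ($1) and the tag itself ($2).
    foreach my $match ($html =~ /(.*?)(<(?:(?:!--.*?--)|(?:\/?[a-z0-9_:.-]+(?:\s+[a-z0-9_:.-]+(?:=(?:[^> '"\t\n]+|(?:'.*?')|(?:".*?")))?)*))\s*\/?\s*>)/isg) {
        my $piece = $match;
        $piece =~ s/^\s+|\s+$//g if $trim;   # assumed equivalent of TrimWS
        push @tokens, $piece if length $piece;
    }
    return @tokens;
}

my @tokens = tokenize_html(qq{<p class="x">Hello <b>world</b>\n<!-- note --><br/>}, 1);
print "$_\n" for @tokens;
```

With trimming enabled, the sample input should come back as seven pieces: the opening p tag, "Hello", the opening b tag, "world", the closing b tag, the comment, and the br tag. One thing the sketch makes easy to see: every match has to end in a tag, so any text after the last tag in the string is not returned.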