OASIS Mailing List Archives

Parsing HTML in Perl



Perhaps of interest to the Perl programmers out there.

I asked one of our programmers (Gabe Schaeffer) to write a function to
parse a malformed HTML file, prior to converting it to XHTML. Here is
what he produced!

I've never seen an HTML file parsed with a single line of Perl RegEx
before!

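sub ParseHTML
{
 # Pass in an HTML string to be parsed and a boolean indicating whether
 # whitespace between elements should be trimmed.
 # Returns a dictionary with the elements in the string.
 # TrimWS and ParseTag are helper functions not shown in this post.
 my ($html, $trim) = @_;
 my ($i, $element, $dict);
 $i = 0;
 $dict = $Server->CreateObject("Scripting.Dictionary");

 # In list context with /g, the match returns both captures for every
 # match: the text before a tag, then the tag (or comment) itself.
 # The pattern must stay on one line: without /x a line break inside it
 # would be matched literally.
 foreach $element ($html =~ /(.*?)(<(?:(?:!--.*?--)|(?:\/?[a-z0-9_:.-]+(?:\s+[a-z0-9_:.-]+(?:=(?:[^> '"\t\n]+|(?:'.*?')|(?:".*?")))?)*))\s*\/?\s*>)/isg)
 {
  $element = TrimWS($element) if $trim;
  $dict->Add($i++, ParseTag($element, $trim)) if length $element;
 }
 return $dict;
}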
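
For anyone who wants to try the pattern outside ASP, here is a minimal standalone sketch. It uses the same regular expression, spread over several lines with the /x modifier for readability, and collects the pieces into a plain Perl list instead of a Scripting.Dictionary. The names $TAG_RE and tokenize_html are made up for this illustration and are not part of Gabe's code; trimming and ParseTag are left out.

```perl
use strict;
use warnings;

# Same pattern as in ParseHTML, laid out with /x (hypothetical name).
# Under /x the space inside the bracketed class is still literal.
my $TAG_RE = qr{
    (.*?)                              # 1: text before the next tag
    (<                                 # 2: the tag or comment itself
      (?: (?:!--.*?--)                 #    comment body
        | (?: /?[a-z0-9_:.-]+          #    optional slash, element name
              (?: \s+[a-z0-9_:.-]+     #    attribute name
                  (?: = (?: [^> '"\t\n]+ | (?:'.*?') | (?:".*?") ) )?
              )*
          )
      )
      \s*/?\s*>
    )
}isx;

# Split an HTML string into alternating text and tag tokens,
# skipping empty text pieces (hypothetical helper).
sub tokenize_html {
    my ($html) = @_;
    my @tokens;
    while ($html =~ /$TAG_RE/g) {
        my ($text, $tag) = ($1, $2);
        push @tokens, $text if length $text;
        push @tokens, $tag;
    }
    return @tokens;
}

my @tokens = tokenize_html('<p class="a">Hi <br/>there</p> tail');
print "$_\n" for @tokens;
```

One thing this makes visible: every match has to end in a tag, so text after the last tag in the string (" tail" in the call above) is not returned. If that text matters, it would need to be picked up separately.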