Hi All,
I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant content of the document is as follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
……..
</head>
<body>
<div id="container">
<div id="content">
<table class="sresults">
<tr>
<td>
<a href="http://www.abc.com/areas" title="
</td>
<td>
<a href="http://www.abc.com/areas" title="
</td>
<td>
<a href="http://www.abc.com/areas" title="
</td>
<td>
<a href="http://www.abc.com/areas" title="
</td>
</tr>
……….
</body>
</html>
Below are the relevant code snippets, which illustrate how I have attempted to retrieve the contents (the value of each <a>):
import java.io.*;
import java.util.*;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.xpath.*;
import org.saxpath.*;
import org.ccil.cowan.tagsoup.Parser;
( 1 ) frInHtml = new FileReader("C:\\Tmp\\ABC.html");
( 2 ) brInHtml = new BufferedReader(frInHtml);
( 3 ) // SAXBuilder saxBuilder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
( 4 ) SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
( 5 ) org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);
( 6 ) XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
( 7 ) xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
( 8 ) java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
( 9 ) Iterator iterator = list.iterator();
( 10 ) while (iterator.hasNext())
( 11 ) {
( 12 ) Object object = iterator.next();
( 13 ) // if (object instanceof Element)
( 14 ) // System.out.println(((Element)object).getTextNormalize());
( 15 ) if (object instanceof Content)
( 16 ) System.out.println(((Content)object).getValue());
( 17 ) }
….
This program works on the same document when the default namespace is removed, in which case the “ns” prefix in the XPath statements (lines 6-7) is not needed either. Moreover, I have successfully used “org.apache.xerces.parsers.SAXParser” in the past to retrieve the content of <a> from the same document without the default namespace.
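To make the namespace effect described above concrete, here is a minimal sketch that uses only the JDK's built-in DOM and javax.xml.xpath classes rather than JDOM, Saxon and TagSoup (an assumption, so that it runs without extra jars). The class name NamespaceXPathDemo, the helper countMatches and the cut-down sample markup with its link text are all made up for illustration; only the namespace URI and the "ns" prefix come from my code.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, cut-down stand-in for ABC.html (link text invented for the demo).
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
      + "<table class=\"sresults\"><tr>"
      + "<td><a href=\"http://www.abc.com/areas\">one</a></td>"
      + "<td><a href=\"http://www.abc.com/areas\">two</a></td>"
      + "</tr></table></body></html>";

    // Parses SAMPLE namespace-aware and returns how many nodes the expression selects.
    static int countMatches(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "ns" prefix to the XHTML namespace, like addNamespace() in JDOM.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names only match elements in no namespace.
        System.out.println(countMatches("/html/body/table/tr/td/a"));
        // Prefixed names match the elements in the XHTML namespace.
        System.out.println(countMatches("/ns:html/ns:body/ns:table/ns:tr/ns:td/ns:a"));
    }
}
```

In XPath 1.0 an unprefixed name test refers to no namespace, which is why the first expression selects nothing here while the second selects both links.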
I would like to achieve the following objectives if possible:
( i ) Exclude the DTD and the namespace in order to simplify the parsing process. How could this be done?
( ii ) If this is not possible, how should the namespace be included in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?
( iii ) Would changing from “org.apache.xerces.parsers.SAXParser” to “org.ccil.cowan.tagsoup.Parser” make any difference as far as using XPath is concerned?
( iv ) Failing to exclude the DTD, how can the lookup of the PUBLIC DTD be changed to a local SYSTEM one, so that a local DTD is used for reference?
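For objectives (i) and (iv), the kind of thing I have in mind is a SAX EntityResolver that intercepts the DTD lookup. The following is only a rough sketch under stated assumptions: it uses the JDK's own DocumentBuilder instead of JDOM and TagSoup so that it is self-contained, and the class name SkipDtdDemo, the helper parseRootName and the cut-down sample document are hypothetical. JDOM's SAXBuilder also has a setEntityResolver method, so the same resolver could presumably be passed there, but I have not verified that with my setup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class SkipDtdDemo {
    // Hypothetical, cut-down stand-in for ABC.html with the same DOCTYPE.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
      + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
      + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
      + "<body><p>text</p></body></html>";

    // Public IDs the parser asked for, recorded to show the resolver was consulted.
    static final List<String> requested = new ArrayList<String>();

    // Parses SAMPLE without fetching the remote DTD and returns the root element name.
    static String parseRootName() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            builder.setEntityResolver(new EntityResolver() {
                public InputSource resolveEntity(String publicId, String systemId) {
                    requested.add(publicId);
                    // Hand back an empty DTD so nothing is fetched from www.w3.org.
                    // To use a local copy instead, return an InputSource that
                    // points at the URL of the local DTD file.
                    return new InputSource(new StringReader(""));
                }
            });
            Document doc = builder.parse(new InputSource(new StringReader(SAMPLE)));
            return doc.getDocumentElement().getLocalName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRootName() + " " + requested);
    }
}
```

One caveat with the empty-DTD variant: entities that the real DTD declares, such as &nbsp;, would then be undeclared, so returning a local copy of the DTD may be the safer choice for documents that use them.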
I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon6-5-5 and TagSoup 1.2 on the Windows XP platform.
Any assistance would be appreciated.
Thanks in advance,
Jack