How to parse an XML document with a default namespace using JDOM XPath

Hi All,

I am having difficulty parsing a namespaced XHTML document using Saxon and the TagSoup parser. The relevant content of the document is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

……..

</head>

<body>

<tr>

<td>

<a href="http://www.abc.com/areas" title="Hollywood, CA">hollywood</a>

</td>

<td>

</td>

<td>

<a href="http://www.abc.com/areas" title="San Francisco, CA">san francisco</a>

</td>

<td>

<a href="http://www.abc.com/areas" title="San Diego, CA">San diego</a>

</td>

</tr>

……….

</body>

</html>

Below are the relevant code snippets, which illustrate how I have attempted to retrieve the content (the value of each <a>):

import java.io.*;

import java.util.*;

import org.jdom.*;

import org.jdom.input.SAXBuilder;

import org.jdom.xpath.*;

import org.saxpath.*;

import org.ccil.cowan.tagsoup.Parser;

( 1 ) FileReader frInHtml = new FileReader("C:\\Tmp\\ABC.html");

( 2 ) BufferedReader brInHtml = new BufferedReader(frInHtml);

( 3 ) // SAXBuilder saxBuilder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");

( 4 ) SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");

( 5 ) org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);

( 6 ) XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");

( 7 ) xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");

( 8 ) java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));

( 9 ) Iterator iterator = list.iterator();

( 10 ) while (iterator.hasNext())

( 11 ) {

( 12 ) Object object = iterator.next();

( 13 ) // if (object instanceof Element)

( 14 ) // System.out.println(((Element)object).getTextNormalize());

( 15 ) if (object instanceof Content)

( 16 ) System.out.println(((Content)object).getValue());

( 17 ) }

….

This program works on the same document when the default namespace is removed, in which case the “ns” prefix in the XPath statements (lines 6-7) is not needed either. In the past I used “org.apache.xerces.parsers.SAXParser” to successfully retrieve the content of <a> from the same document without the default namespace.
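
To make the symptom concrete, here is a minimal, self-contained sketch of the difference between an unprefixed and a prefixed XPath on a document with a default namespace. It deliberately uses the JDK's built-in javax.xml.xpath API rather than JDOM, so it is not the code above. The class name NamespaceXPathDemo, the method countMatches and the cut-down sample document are made up for illustration only.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceXPathDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical cut-down document. It has no DOCTYPE, so no DTD is fetched.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table><tr>"
        + "<td><a href=\"http://www.abc.com/areas\" title=\"Hollywood, CA\">hollywood</a></td>"
        + "<td><a href=\"http://www.abc.com/areas\" title=\"San Diego, CA\">San diego</a></td>"
        + "</tr></table></body></html>";

    // Parses the XML with namespace awareness and counts the nodes the expression selects.
    static int countMatches(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the arbitrary prefix "ns" to the XHTML namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed names only match elements in no namespace.
        System.out.println(countMatches(SAMPLE, "//a"));     // prints 0
        // The prefixed name matches the elements in the XHTML namespace.
        System.out.println(countMatches(SAMPLE, "//ns:a"));  // prints 2
    }
}
```

The prefix itself is arbitrary; what matters is that it is bound to the same URI as the document's default namespace, which is what xpath.addNamespace does in line 7 of the JDOM code.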

I would like to achieve the following objectives if possible:

( i ) Exclude the DTD and the namespace in order to simplify the parsing process. How could this be done?

( ii ) If this is not possible, how should the namespace be included in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?

( iii ) Would changing from “org.apache.xerces.parsers.SAXParser” to “org.ccil.cowan.tagsoup.Parser” make any difference as far as using XPath is concerned?

( iv ) Failing to exclude the DTD, how can I change the lookup of the PUBLIC DTD to a local SYSTEM one and include a local DTD for reference?
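
For objectives ( i ) and ( iv ), one approach that may apply is a SAX EntityResolver, which intercepts the DTD lookup before the parser goes to the network. The sketch below is only an illustration under stated assumptions: it uses the JDK's DOM parser instead of JDOM, and the names LocalDtdDemo, parseOffline and requested are hypothetical. JDOM's SAXBuilder also has a setEntityResolver method, which should accept the same resolver, but that is untested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalDtdDemo {

    // Records the public IDs the parser asked for, to show the resolver was consulted.
    static final List<String> requested = new ArrayList<String>();

    // Hypothetical cut-down document carrying the same DOCTYPE as the real one.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>hollywood</p></body></html>";

    static Document parseOffline(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                requested.add(publicId);
                // Hand back an empty DTD so nothing is downloaded. To use a local
                // copy instead, return an InputSource pointing at that file.
                // Caveat: with an empty DTD, entities such as &nbsp; are undefined.
                return new InputSource(new StringReader(""));
            }
        });

        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseOffline(SAMPLE);
        System.out.println(doc.getDocumentElement().getLocalName());
        System.out.println(requested);
    }
}
```

Note that this only deals with the DTD. The namespace declaration is still in the document, so the XPath would still need the prefix binding from lines 6-7.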

I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon 6.5.5 and TagSoup 1.2 on Windows XP.

Any assistance would be appreciated.

Thanks in advance,

Jack