Re: [xml-dev] How to parse XML document with default namespace with JDOM XPath

Hi Michael,
 
Thanks for responding to this question.
 
I have not had any luck with the jdom-interest@jdom.org list at all since subscribing to it a few months back.
 
In the meantime, can you confirm whether, as http://www.cafeconleche.org/books/xmljava/chapters/ch16s05.html suggests, it is not possible to use Saxon 6.5.x with JDOM? Or is it simply that you are not familiar with JDOM?
 
Could anyone point me to a more useful JDOM forum for assistance with this question?
 
Many thanks,
 
Jack


From: Michael Kay <mike@saxonica.com>
To: Jack Bush <netbeansfan@yahoo.com.au>; xml-dev@lists.xml.org
Sent: Wednesday, 5 November, 2008 12:39:48 AM
Subject: RE: [xml-dev] How to parse XML document with default namespace with JDOM XPath

I see no Saxon code here. You are using the XPath engine that comes with JDOM. You might be better off asking on the JDOM list. I have to confess I'm surprised to see you declaring namespaces AFTER compiling the XPath expression, but I can't say I'm familiar with this API.
 
Michael Kay
http://www.saxonica.com/


From: Jack Bush [mailto:netbeansfan@yahoo.com.au]
Sent: 04 November 2008 13:02
To: xml-dev@lists.xml.org
Subject: [xml-dev] How to parse XML document with default namespace with JDOM XPath

Hi All,

 

I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant contents of this document are as follows:

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
……..
</head>
<body>
    <div id="container">
        <div id="content">
            <table class="sresults">
                <tr>
                    <td>
                        <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Jose , CA "> san jose </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Francisco , CA "> san francisco </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Diego , CA "> San diego </a>
                    </td>
                </tr>
……….
</body>
</html>

 

Below are the relevant code snippets, which illustrate how I have attempted to retrieve the contents (the value of <a>):

 

             import java.io.*;
             import java.util.*;
             import org.jdom.*;
             import org.jdom.input.SAXBuilder;
             import org.jdom.xpath.*;
             import org.saxpath.*;
             import org.ccil.cowan.tagsoup.Parser;

( 1 )       frInHtml = new FileReader("C:\\Tmp\\ABC.html");
( 2 )       brInHtml = new BufferedReader(frInHtml);
( 3 ) //    SAXBuilder saxBuilder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
( 4 )       SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
( 5 )       org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);
( 6 )       XPath xpath =  XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
( 7 )       xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");   // bind the "ns" prefix to the XHTML namespace
( 8 )       java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
( 9 )       Iterator iterator = list.iterator();
( 10 )     while (iterator.hasNext())
( 11 )     {
( 12 )            Object object = iterator.next();
( 13 ) //         if (object instanceof Element)
( 14 ) //               System.out.println(((Element)object).getTextNormalize());
( 15 )             if (object instanceof Content)
( 16 )                   System.out.println(((Content)object).getValue());
              }

….

 

This program works on the same document when the default namespace is removed, in which case the “ns” prefix in the XPath statements (lines 6-7) is not needed either. Moreover, I have previously used “org.apache.xerces.parsers.SAXParser” to retrieve the content of <a> successfully from the same document without the default namespace.
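
The behaviour described above (unprefixed names stop matching once the document declares a default namespace) can be reproduced in isolation. The following is only a minimal sketch and makes several assumptions: it uses the JDK's built-in javax.xml.xpath API rather than JDOM's XPath class, so that it runs without extra libraries; the class name DefaultNamespaceXPathSketch, the method countMatches, and the cut-down SAMPLE string are hypothetical and exist purely for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DefaultNamespaceXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, cut-down version of the document (no DOCTYPE, two links only).
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div id=\"container\"><div id=\"content\"><table class=\"sresults\"><tr>"
        + "<td><a href=\"http://www.abc.com/areas\"> hollywood </a></td>"
        + "<td><a href=\"http://www.abc.com/areas\"> san jose </a></td>"
        + "</tr></table></div></div></body></html>";

    // Parses xml namespace-aware and returns how many nodes the expression selects.
    // The prefix "ns" is bound to the XHTML namespace; any other prefix is unbound.
    static int countMatches(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "ns" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("ns").iterator()
                        : Collections.<String>emptyList().iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Prefixed steps match the namespaced elements; unprefixed steps do not.
        System.out.println(countMatches(SAMPLE, "/ns:html/ns:body//ns:a"));
        System.out.println(countMatches(SAMPLE, "/html/body//a"));
    }
}
```

In XPath 1.0 an unprefixed name test always means "no namespace", so the prefix chosen in the expression ("ns" here) is arbitrary; only the URI it is bound to has to match the document's default namespace.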

 

I would like to achieve the following objectives if possible:

 

( i ) Exclude the DTD and namespace in order to simplify the parsing process. How could this be done?

( ii ) If this is not possible, how should the namespace be included in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?

( iii ) Would changing from “org.apache.xerces.parsers.SAXParser” to “org.ccil.cowan.tagsoup.Parser” make any difference as far as using XPath is concerned?

( iv ) Failing to exclude the DTD, how can the lookup of the PUBLIC DTD be changed to a local SYSTEM one, so that a local DTD is used for reference?
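
One possible approach to objectives (i) and (iv) is sketched below, under stated assumptions. It uses the JDK's own DOM parser and javax.xml.xpath rather than JDOM or TagSoup; the class name SkipDtdSketch, the method anchorTexts, and the shortened SAMPLE document are hypothetical. An org.xml.sax.EntityResolver that returns an empty input source stops the parser from fetching the external DTD, and local-name() tests in the XPath expression sidestep the namespace without removing it from the document. Returning an InputSource that points at a local copy of the DTD instead of an empty one would be the usual variation for (iv).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SkipDtdSketch {
    // Hypothetical, shortened document that keeps the DOCTYPE and default namespace.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<table class=\"sresults\"><tr>"
        + "<td><a href=\"http://www.abc.com/areas\"> hollywood </a></td>"
        + "<td><a href=\"http://www.abc.com/areas\"> san jose </a></td>"
        + "</tr></table></body></html>";

    // Returns the trimmed text of every <a> inside the "sresults" table.
    static List<String> anchorTexts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // Answer every external-entity request (including the DTD) with an
            // empty stream, so nothing is downloaded from www.w3.org.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // local-name() compares only the local part, ignoring the namespace.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "//*[local-name()='table'][@class='sresults']//*[local-name()='a']",
                doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(anchorTexts(SAMPLE));
    }
}
```

Skipping the DTD this way also discards any entity declarations it would have supplied, so a document that relies on named entities such as &nbsp; would need a local copy of the DTD rather than an empty one.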

 

I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon6-5-5 and TagSoup 1.2 on the Windows XP platform.

 

Any assistance would be appreciated.

 

Thanks in advance,

 

Jack




Copyright 1993-2007 XML.org. This site is hosted by OASIS