Hi Michael,
The following statements generated the state.xml file:
URL stateUrl = new URL("http://www.abc.com");
URLConnection stateconnection = stateUrl.openConnection();
stateisInHtml = stateconnection.getInputStream();
statedisInHtml = new DataInputStream(new BufferedInputStream(stateisInHtml));
System.out.flush();
statefosOutHtml = new FileOutputStream("state.html");
while ((oneChar=statedisInHtml.read()) != -1)
    statefosOutHtml.write(oneChar);
.....
statefrInHtml = new FileReader("state.html");
statebrInHtml = new BufferedReader(statefrInHtml);
SAXBuilder statesaxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser", false);
org.jdom.Document statejdomDocument = statesaxBuilder.build(statebrInHtml);
XMLOutputter stateoutputter = new XMLOutputter();
statefwOutXml = new FileWriter("state.xml");
statebwOutXml = new BufferedWriter(statefwOutXml);
stateoutputter.output(statejdomDocument, statebwOutXml);
XPath had no problem looking up state.xml.
Thanks,
Jack
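
One thing that may be worth checking in the statements above: FileReader and FileWriter encode with the platform default charset, while the header of state.xml declares UTF-8. If the default is not UTF-8 and the page contains any non-ASCII character, the bytes on disk will not match the declaration. The sketch below reproduces that mismatch using only the JDK's built-in parser (not JDOM or TagSoup). The class name, the method names and the sample document are made up for illustration, and ISO-8859-1 merely stands in for a non-UTF-8 platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class EncodingMismatchDemo {
    // Hypothetical document: declares UTF-8 and contains one non-ASCII character.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><state>Qu\u00e9bec</state>";

    // Writes the document through a Writer that uses the given charset,
    // much as FileWriter would use the platform default.
    static byte[] write(Charset writerEncoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, writerEncoding)) {
            w.write(XML);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Returns true if the parser accepts the bytes, false if it rejects them.
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException e) {
            return false; // e.g. an invalid UTF-8 byte sequence
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("written as UTF-8:      "
            + parses(write(StandardCharsets.UTF_8)));
        System.out.println("written as ISO-8859-1: "
            + parses(write(StandardCharsets.ISO_8859_1)));
    }
}
```

If this is what is happening, one option is to hand the serializer an OutputStream (or an OutputStreamWriter created with an explicit UTF-8 charset) instead of a FileWriter, so that the declared and actual encodings agree.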
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.

There's only one explanation for that: the parser is expecting the document to be encoded in UTF-8, but it isn't. To understand why it isn't, you need to examine how the document was created and any transcodings that might have taken place before it reached the parser.

at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:489)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:928)
at JDOMTrAXPojoInvestmentBean.main(JDOMTrAXPojoInvestmentBean.java:45)

The header of state.xml is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html (View Source for full doctype...)>
Any ideas on what the cause of this issue is and how to overcome it? Likewise, how do I define the correct namespace prefix? Is it possible that this document has two namespaces, a default one and one with the prefix 'html'? If so, which one should I use?

It's certainly inelegant to bind the same namespace to two prefixes like this, though it's not incorrect. Again, to prevent it happening we need to understand how you created the document.

Michael Kay
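
On the question of which prefix to use: in XPath 1.0 the prefixes written in the document do not matter, only the namespace URI does. The expression can use any prefix, provided that prefix is bound to the same URI when the expression is evaluated. A minimal sketch follows, using the JDK's javax.xml.xpath API rather than JDOM's XPath class. The prefix "h", the class and method names and the sample document are hypothetical, and the XHTML namespace URI is assumed to be the one bound in state.xml.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlPrefixDemo {
    // Assumption: this is the namespace bound in state.xml.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical document binding the same namespace twice:
    // once as the default namespace and once to the prefix "html".
    static final String XML =
        "<html xmlns=\"" + XHTML_NS + "\" xmlns:html=\"" + XHTML_NS + "\">"
        + "<head><title>States</title></head><body/></html>";

    // Evaluates an XPath expression in which the prefix "h" means XHTML_NS.
    static String evaluate(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for namespace matching
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "h" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prefixed names match because "h" is bound to the XHTML namespace.
        System.out.println("[" + evaluate("/h:html/h:head/h:title") + "]");
        // Unprefixed names mean "no namespace" in XPath 1.0, so nothing matches.
        System.out.println("[" + evaluate("/html/head/title") + "]");
    }
}
```

JDOM 1.x offers a comparable hook through XPath.addNamespace(prefix, uri); the idea is the same, namely that the prefix in the expression is chosen by the caller and does not have to match either binding in the document.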