As most people on this list will know, OpenOffice.org documents
are stored as XML within a ZIP format file. The main file
within the ZIP is called content.xml and starts with:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office:document-content
PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
"office.dtd">
(line breaks added for mailability).
The "office.dtd" system identifier here is a relative URI but,
because the DTD is not in the document ZIP and is probably not
even in the same directory as the document, it is awfully
confusing to any XML processor that tries to resolve it.
A while ago I processed some OOo document contents using Saxon
by the expedient of hand editing the small number of files to
delete the doctype declarations - which felt dreadfully wrong.
Now I want to read the contents of some other files using JDom,
which throws an exception when it can't resolve the DTD. The
document is coming from an InputStream from the Java zip library,
so it doesn't have a base URI.
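
To make the setup concrete, here is a minimal sketch of how the
stream is obtained. The class name ContentXmlSource and the method
open are made up for illustration; only the java.util.zip and
org.xml.sax calls are real. The point is that an InputSource built
from a bare stream has no system id, so the parser has nothing
sensible to resolve "office.dtd" against.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class ContentXmlSource {

    /**
     * Wraps content.xml from an OOo document in a SAX InputSource.
     * The caller remains responsible for closing the ZipFile.
     */
    static InputSource open(ZipFile zip) throws IOException {
        ZipEntry entry = zip.getEntry("content.xml");
        if (entry == null) {
            throw new IOException("no content.xml in " + zip.getName());
        }
        InputStream in = zip.getInputStream(entry);
        // No setSystemId call here, so getSystemId() returns null and
        // the relative "office.dtd" has no document base URI.
        return new InputSource(in);
    }
}
```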
Solutions considered include filtering the DOCTYPE declaration
out of the file or writing a custom EntityResolver. I tried a toy
EntityResolver but it didn't seem to get called; maybe that's an
issue with my program or with JDom. The O'Reilly book
OpenOffice.org XML Essentials on the xml.openoffice.org site says
to use this technique with a programmatic call to invoke an XSLT
transformation.
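
For reference, the EntityResolver approach can be sketched with
plain JAXP/SAX as below. This is only a sketch under assumptions:
the class SkipDtdExample, the method countElements and the
ElementCounter handler are made-up names, and the resolver simply
hands back an empty external subset for every external entity
rather than locating a real copy of office.dtd. A document that
relies on entities or attribute defaults declared in the DTD would
not parse the same way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of any library.
final class SkipDtdExample {

    /** Stand-in for the real content handler: just counts start tags. */
    static final class ElementCounter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            count++;
        }
    }

    /** Parses the stream without ever opening the DTD named in the DOCTYPE. */
    static int countElements(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Assumption: an empty external subset is acceptable for this
        // document, so every external entity resolves to an empty source.
        reader.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));

        ElementCounter counter = new ElementCounter();
        reader.setContentHandler(counter);
        reader.parse(new InputSource(in));
        return counter.count;
    }
}
```

The resolver has to be set on the XMLReader that actually does the
parsing. If a wrapper library creates its own reader internally, a
resolver set on some other reader will never be called, which may
be what happened with my toy version. I believe JDom's SAXBuilder
has its own setEntityResolver method for this purpose, but I have
not verified that here.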
My questions:
1. Is having a system id which doesn't actually refer to a DTD
a sign of faulty XML (i.e., not valid or not well formed)?
2. Is this true even if the public identifier is OK?
3. What is the best way to deal with this case in a program
using a SAX reader?
4. What is the best way to deal with it when using a standalone
XML tool like an XSLT program?
5. Would it help if OOo included standalone="yes" in the XML
declaration? (If the processor isn't validating and knows
the document is standalone then presumably it doesn't have
any reason to read the DTD?)
Ed Davies