xml-dev - OpenOffice.org DOCTYPE declaration

To: xml-dev@lists.xml.org
Subject: OpenOffice.org DOCTYPE declaration
From: Ed Davies <edavies@nildram.co.uk>
Date: Fri, 30 Apr 2004 16:58:22 +0100

As most people on this list will know, OpenOffice.org documents
are stored as XML within a ZIP format file.  The main file 
within the ZIP is called content.xml and starts with:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE office:document-content 
      PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
      "office.dtd">

(line breaks added for mailability).

The "office.dtd" system identifier here is a relative URI.  Because
the DTD is not in the document ZIP, and is probably not even in the
same directory as the document, any XML processor that tries to
fetch it will fail to find it.

A while ago I processed some OOo document contents using Saxon
by the expedient of hand-editing the small number of files to
delete the DOCTYPE declarations, which felt dreadfully wrong.

Now I want to read the contents of some other files using JDOM,
which throws an exception when it can't resolve the DTD.  The
document comes from an InputStream supplied by the Java zip
library, so it doesn't have a base URI.

Solutions considered include filtering the DOCTYPE declaration
out of the file or installing a custom EntityResolver.  I tried a
toy EntityResolver but it didn't seem to get called; maybe that's
an issue with my program or with JDOM.  The O'Reilly book
OpenOffice.org XML Essentials on the xml.openoffice.org site
recommends this technique when invoking an XSLT transformation
programmatically.
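
For reference, the EntityResolver approach can be sketched with plain
SAX from the JDK, as below.  This is a minimal sketch, not a tested
fix for the JDOM case: the class name, the sample document and the
namespace URI in it are assumptions made for illustration, and the
resolver simply hands the parser an empty DTD whenever it sees the
OOo public identifier or a system identifier ending in "office.dtd".

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DtdSkippingParse {

    // Illustrative stand-in for the start of content.xml (the namespace URI is an assumption).
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE office:document-content "
        + "PUBLIC \"-//OpenOffice.org//DTD OfficeDocument 1.0//EN\" \"office.dtd\">\n"
        + "<office:document-content xmlns:office=\"http://openoffice.org/2000/office\"/>";

    // Parses the stream without fetching office.dtd and returns the root element's local name.
    static String rootElementName(InputStream in) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Substitute an empty external subset for the OOo DTD; defer to the parser otherwise.
        reader.setEntityResolver((publicId, systemId) -> {
            boolean isOfficeDtd =
                "-//OpenOffice.org//DTD OfficeDocument 1.0//EN".equals(publicId)
                || (systemId != null && systemId.endsWith("office.dtd"));
            return isOfficeDtd ? new InputSource(new StringReader("")) : null;
        });

        final String[] root = new String[1];
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = localName; // remember only the first element seen
                }
            }
        });

        // No system id is set on the InputSource, mirroring a stream read from a ZIP entry.
        reader.parse(new InputSource(in));
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        System.out.println(rootElementName(in));
    }
}
```

The resolver has to be set on the XMLReader that actually does the
parsing.  JDOM's SAXBuilder has its own setEntityResolver method, so a
resolver set elsewhere may never be consulted; whether that explains
the behaviour described above is only a guess.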

My questions:

1. Is having a system id which doesn't actually refer to a DTD 
   a sign of faulty XML (i.e., not valid or not well formed)?  

2. Is this true even if the public identifier is OK?

3. What is the best way to deal with this case in a program
   using a SAX reader?

4. What is the best way to deal with it when using a standalone 
   XML tool like an XSLT program?

5. Would it help if OOo included standalone="no" in the XML 
   declaration?  (If the processor isn't validating and knows 
   the document is standalone then presumably it doesn't have 
   any reason to read the DTD?)
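
For the programmatic XSLT case that the O'Reilly book is said to
describe, one possible shape is to hand the transformer a SAXSource
wrapping an XMLReader that already has the resolver installed.  The
sketch below uses the JDK's identity transformer in place of a real
stylesheet; the class name and sample document are hypothetical, and
this is not necessarily the book's own code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class DtdSkippingTransform {

    // Illustrative stand-in for content.xml (the namespace URI is an assumption).
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE office:document-content "
        + "PUBLIC \"-//OpenOffice.org//DTD OfficeDocument 1.0//EN\" \"office.dtd\">\n"
        + "<office:document-content xmlns:office=\"http://openoffice.org/2000/office\"/>";

    // Runs an identity transformation over the XML without reading office.dtd.
    static String transform(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Answer requests for the OOo DTD with an empty external subset.
        reader.setEntityResolver((publicId, systemId) ->
            "-//OpenOffice.org//DTD OfficeDocument 1.0//EN".equals(publicId)
                ? new InputSource(new StringReader(""))
                : null);

        // newTransformer() with no arguments gives an identity transformation;
        // a real stylesheet would be passed to newTransformer(Source) instead.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(xml)));
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(SAMPLE));
    }
}
```

This only covers XSLT driven from Java code.  A standalone
command-line tool would need its own mechanism, such as a catalog
mapping the public identifier to a local copy of the DTD.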

Ed Davies
