After all the discussion about "What is data?" I don't know if this list is the place to discuss actual details of implementation, but please feel free to send me elsewhere if you can think of a better venue.
For my part, I find it refreshing to have a place where one can discuss such fundamental matters as well as the lineaments of running code. I think you'll find in the archives plenty of discussion of code, and plenty of code-free discussion alike.
I have a need to handle XML that references a non-existent DTD. The DTD is irrelevant to the actual processing of the XML, and isn't available anywhere, but it is declared in the DOCTYPE. I'm sure many of you have encountered this situation: it's practically the norm, in my experience.
After years of dealing with this inherently unsatisfactory situation in a variety of ways, I came up with a new one that I am liking at the moment, which is to insert a Stream into a Java XML processing stack that strips out the prolog of the XML document before handing it off to a parser. This has the nice property that it doesn't require modifications to the stored XML files. It loses PIs and comments and the XML decl, but I can live with that.
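To make the prolog-stripping idea concrete, here is a minimal Python sketch. It is an analogue only: the setup described above is a Java stream, and this is not that code. The names strip_prolog, _skip_doctype and SAMPLE are made up for illustration. It assumes the document is already decoded to a string (dropping the XML declaration also drops any encoding declaration), and it makes no attempt to cope with odd cases such as unbalanced quotes inside a PI in the internal subset.

```python
import xml.etree.ElementTree as ET


def _skip_doctype(text, i):
    """Return the index just past the DOCTYPE declaration that starts at i."""
    depth = 0     # > 0 while inside the [ ... ] internal subset
    quote = None  # the active quote character, if inside a literal
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None
        elif text.startswith("<!--", i):
            end = text.find("-->", i + 4)
            if end < 0:
                raise ValueError("unterminated comment in DOCTYPE")
            i = end + 3
            continue
        elif ch in "\"'":
            quote = ch
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == ">" and depth == 0:
            return i + 1
        i += 1
    raise ValueError("unterminated DOCTYPE")


def strip_prolog(text):
    """Drop the XML declaration, PIs, comments and DOCTYPE before the root element."""
    i = 0
    while i < len(text):
        if text.startswith("<?", i):        # XML declaration or PI
            end = text.find("?>", i)
            if end < 0:
                raise ValueError("unterminated processing instruction")
            i = end + 2
        elif text.startswith("<!--", i):    # comment
            end = text.find("-->", i + 4)
            if end < 0:
                raise ValueError("unterminated comment")
            i = end + 3
        elif text.startswith("<!DOCTYPE", i):
            i = _skip_doctype(text, i)
        elif text[i] == "<":                # start tag of the root element
            return text[i:]
        elif text[i].isspace() or text[i] == "\ufeff":
            i += 1
        else:
            raise ValueError("unexpected text before the root element")
    raise ValueError("no root element found")


# Hypothetical sample: the DTD named here does not exist anywhere.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    "<!-- note -->\n"
    '<!DOCTYPE doc SYSTEM "missing.dtd" [<!ENTITY x "a>b">]>\n'
    "<doc>hi</doc>\n"
)

root = ET.fromstring(strip_prolog(SAMPLE))
```

One more loss to add to the list above: any entities declared in the internal subset go away with the DOCTYPE, so a document body that references them will no longer parse.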
Expat allows you to specify a standalone flag, which in effect expunges all external parameter entity declarations (and other such external resources incompatible with standalone="yes"). This certainly skates the edges of XML spec compliance, but I think it's legit, because I see it as an implicit transform. Anyway, your Java tools might have the equivalent. FWIW, I know that Jython 2.5 includes Expat wrapped for the core XML libs, so that might be an option.
In Amara 2.x we expose this flag very conveniently. You can do:
import amara
doc = amara.parse(myxml, standalone=True)  # flag uses boolean values, not strings
And it will in effect ignore those pesky parameter entity decls, including the declaration of the external subset.
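If Amara isn't to hand, a roughly similar effect can be had from the Expat binding in the Python standard library. This is only a sketch and is not Amara's standalone flag: it uses xml.parsers.expat with parameter entity parsing explicitly turned off, so the external subset named in the DOCTYPE is never requested. The function name element_names and the example.invalid URL are made up for illustration.

```python
import xml.parsers.expat

# Hypothetical document whose DOCTYPE points at a DTD that does not exist.
XML = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE doc SYSTEM "http://example.invalid/missing.dtd">\n'
    b"<doc><item>one</item></doc>"
)


def element_names(data):
    """Parse data with Expat and return element names in document order."""
    parser = xml.parsers.expat.ParserCreate()
    # Never parse parameter entities, which includes the external DTD subset.
    parser.SetParamEntityParsing(
        xml.parsers.expat.XML_PARAM_ENTITY_PARSING_NEVER
    )
    names = []
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return names


print(element_names(XML))
```

No ExternalEntityRefHandler is installed, so nothing is fetched over the network; the parse succeeds even though the DTD is missing.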
The rest of your post is Java-specific, so I'll snip and run like hell :)