[xml-dev] Perl XML::Parser and nested external entity references


I'm wondering if anyone has experience using Perl's XML::Parser module
to parse an XML document represented as a tree of .xml files connected
by external entity references.  (Apologies if this question is
inappropriate for xml-dev.)  I'm having a problem with the relative URIs
(here, simply relative file system pathnames) used in such a situation.
The problem is illustrated by the following example.

Here's a small tree of files:


    +- dtd-dir/
    |    |
    |    +- more-ents-dir/
    |    |    |
    |    |    +- adolph.ent
    |    |
    |    +- thangle.dtd
    |
    +- hub.xml
    |
    +- some-shared-content/
         |
         +- eleanor.xml

The top level document is hub.xml:

  <?xml version="1.0" encoding="us-ascii"?>
  <!DOCTYPE thang PUBLIC "-//blub//DTD thangle//EN" "dtd-dir/thangle.dtd"[

The DTD it references is thangle.dtd:

  <!-- thangle.dtd -->
  <!ELEMENT thang (bang) >

  <!ELEMENT bang (thud+)>

  <!ELEMENT thud (#PCDATA)>

  <!ENTITY % adolph SYSTEM "more-ents-dir/adolph.ent">


The external entity referenced from thangle.dtd is adolph.ent:

  <!ENTITY eleanor SYSTEM "../../some-shared-content/eleanor.xml">

The external entity declared in adolph.ent and referenced back in the
top level document is eleanor.xml:

      This element and its parent are from top/some-shared-content/eleanor.xml.


I'm using the XML::Parser ExternEnt hook to handle the external entity
reference events.  This handler is given the following parameters:

  $xp    - reference to the XML::Parser::Expat instance that's running
           the parse
  $base  - base to be used for resolving a relative URI (may be undefined)
  $sysid - the URI of external entity
  $pubid - the PUBID of the external entity (may be undefined)

When XML::Parser reaches the %adolph; reference inside thangle.dtd
(which was itself opened because of the reference to it in the DOCTYPE
declaration of hub.xml), the $base parameter arrives in the handler
empty; one would expect it to hold something like "dtd-dir/", the path
of the previous, 'parent' external entity reference.  Without a base the
handler cannot locate the entity, so the parse fails.
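
To make the expected behaviour concrete, here is a minimal sketch of the
resolution rule a handler could apply if $base carried the path of the
entity containing the declaration.  The helper name resolve_sysid is
hypothetical (it is not part of XML::Parser), and the sketch assumes
plain file system paths rather than general URIs.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Hypothetical helper: resolve $sysid against the directory of $base,
# where $base is the path of the entity that contains the declaration.
sub resolve_sysid {
    my ($base, $sysid) = @_;
    return $sysid if File::Spec->file_name_is_absolute($sysid);
    return $sysid if !defined $base || $base eq '';   # nothing to resolve against
    return File::Spec->catfile(dirname($base), $sysid);
}

# The chain of references from the example tree, starting at hub.xml.
my $dtd = resolve_sysid('hub.xml', 'dtd-dir/thangle.dtd');
my $ent = resolve_sysid($dtd, 'more-ents-dir/adolph.ent');
my $xml = resolve_sysid($ent, '../../some-shared-content/eleanor.xml');

print "$_\n" for $dtd, $ent, $xml;
```

Each step uses the path of the file in which the declaration appears,
not the path of the file in which the reference appears.  That
distinction matters for eleanor.xml, which is declared in adolph.ent but
referenced from hub.xml.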

(As an interesting aside, the Saxon 6.4.3 XSLT processor (I don't know
which version of AElfred it uses) gets similarly lost in a tree of XML
fragment files connected by relative URIs, while the Xalan-J 2.0.1
XSLT processor (which uses Xerces 1.23) does not have this problem.
In the example above, Saxon *does* find the adolph.ent entity but not
the eleanor.xml one.)

I could get around this empty value for $base by keeping a pushdown list
of the relative URI 'bases' (dirname parts of URIs of previously seen
external entity references) except for one thing:  in the version of
XML::Parser I'm constrained to use, 2.27, there's no hook for the end of
an external entity reference event, only the beginning.  Without this
hook I don't know when to pop a relative base from the pushdown, and so
again get lost in the tree.  As of version 2.28 of XML::Parser there is
such a hook:  "ExternEntFin".
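
Where the end hook is available, the pushdown idea might be sketched
roughly as follows.  This is a sketch under assumptions, not something
that helps on 2.27 itself: it assumes XML::Parser 2.28 or later, that
the base() method of the XML::Parser::Expat object sets the base handed
to later ExternEnt calls for entities declared while it is in effect,
and that ExternEntFin fires once for each entity the ExternEnt handler
successfully opened.  The handler names and @saved_bases are made up
for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;
use IO::File;

my @saved_bases;   # hypothetical pushdown list of bases to restore

# ExternEnt handler: open the entity and make its path the current base.
sub on_extern_ent {
    my ($xp, $base, $sysid, $pubid) = @_;

    # Resolve a relative system id against the declaring entity's directory.
    my $path = $sysid;
    if (defined $base && length $base
        && !File::Spec->file_name_is_absolute($sysid)) {
        $path = File::Spec->catfile(dirname($base), $sysid);
    }

    # Returning undef tells the parser the entity could not be found.
    my $fh = IO::File->new($path, 'r') or return undef;

    push @saved_bases, $base;   # may be undef at the top level
    $xp->base($path);           # declarations inside $path should record this
    return $fh;
}

# ExternEntFin handler: restore the base of the enclosing entity.
sub on_extern_ent_fin {
    my ($xp) = @_;
    $xp->base(pop @saved_bases);
}
```

The two subs would be registered through the Handlers option, along the
lines of Handlers => { ExternEnt => \&on_extern_ent, ExternEntFin =>
\&on_extern_ent_fin }.  Setting the base, rather than only stacking
directory names, is a deliberate choice: eleanor.xml is referenced from
hub.xml after the stack has already unwound, so its path has to come
from the base in effect when adolph.ent declared it.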

Is there any way to make XML::Parser 2.27 give meaningful values for the
$base parameter to one's ExternEnt handler, or am I doomed to come up
with a different Perl-accessible XML parser?

               James Miller in Austin, Texas

       Internet:   jamesm@bga.com
      alternate:   jamesm@wixer.bga.com