RE: [xml-dev] Handling internal general entities with SAX

At 9:29 AM -0700 2001-10-22, David Brownell wrote:
> At 8:37 PM -0700 2001-10-21, Devlin, Kurt wrote:
>> The reason for this is that we are taking our XML to several
>> different output formats and each will want to handle some
>> entities differently.
>The normal way to do that involves each output stream having
>different entity declarations.  That means each must have a
>different DTD, either with different external subsets or with
>conditional sections or (most simply) like
>    <!DOCTYPE my-app-rootnode
>        SYSTEM "http://www.example.com/dtds/my-app.dtd"
>    [
>    <!ENTITY test "[this is a test]">
>    ]>
>Alternatively, some folk have adopted "no DTD" policies for
>the data they interchange, and then paste their own DTDs
>(with entity declarations) in front of files.  It's easy enough to
>splice one Reader (or InputStream) in front of another, using
>an InputStream.
>- Dave
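
A minimal sketch of the splicing idea, assuming the interchanged
document carries no XML declaration or DOCTYPE of its own (either
would make the spliced result ill-formed).  The class and method
names are hypothetical, and java.io.SequenceInputStream is just one
way to do the splice, not necessarily the one Dave had in mind.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class DtdSplicer {

    // Parses `body` with `prolog` spliced in front of it and returns
    // the concatenated character data seen by the SAX handler.
    static String textWithProlog(String prolog, String body) {
        // The parser sees one continuous stream: prolog first, then body.
        InputStream spliced = new SequenceInputStream(
            new ByteArrayInputStream(prolog.getBytes(StandardCharsets.UTF_8)),
            new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(spliced),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Each output format would supply its own prolog with its own
        // entity declarations.
        String prolog = "<!DOCTYPE doc [\n"
                      + "<!ENTITY test \"[this is a test]\">\n"
                      + "]>\n";
        String body = "<doc>before &test; after</doc>";
        System.out.println(textWithProlog(prolog, body));
        // prints: before [this is a test] after
    }
}
```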

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 


Since you're at WestGroup, I suspect that you might be working
with a pretty large set of documents, and you might find Dave's
suggestion of using conditional sections worthwhile.

Here's an example of conditional sections that I set up this past
summer for either (a) producing HTML files or (b) loading an OODB
system.  The "load-oodb" entity is set to INCLUDE when documents
are being loaded into an OODB that back-ends some web servers. 
The "make-html" entity is set to INCLUDE when generating plain
HTML files to go onto CD-ROM.

    <!--    NOTE!
            Activate either load-oodb or make-html by setting 
            it to "INCLUDE".  Set the other one to "IGNORE".
            The "obj-article", "obj-chapter", and "obj-page"
            entities identify the doctypes/object types in the
            Versant OODB.
    -->
    <!ENTITY % load-oodb "IGNORE"   >
    <!ENTITY % make-html "INCLUDE"  >

    <!-- for loading the OODB -->
    <![ %load-oodb; [
    <!ENTITY servlet              "/servlet/handler?id="   > 
    <!ENTITY obj-article          "&amp;obj=Article"       > 
    <!ENTITY obj-chapter          "&amp;obj=Chapter"       > 
    <!ENTITY obj-page             "&amp;obj=Page"          > 
    <!ENTITY main_nav   SYSTEM    "main_nav-servlet.inc"   >
    ]]>

    <!-- for making HTML files -->
    <![ %make-html; [
    <!ENTITY servlet              ""                        >
    <!ENTITY obj-article          ".html"                   > 
    <!ENTITY obj-chapter          ".html"                   > 
    <!ENTITY obj-page             ".html"                   > 
    <!ENTITY main_nav SYSTEM      "main_nav-html.inc"       >
    ]]>

There is a single list of entities for all of the 1,200 or so web
pages that currently exist or can be generated, one entity per
page, in a single collection of declarations like this:

    <!ENTITY link-ab_partners  '&servlet;ab_partners&obj-page;' >
    <!ENTITY link-askus        '&servlet;askus&obj-page;' >
    <!ENTITY link-contact_us   '&servlet;contact_us&obj-page;' >

et cetera, and throughout our XML the links are encoded like this:

    learn about <LINK ref="&link-ab_partners;">our partners</LINK>.

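which, because the big list of entity declarations containing
link-ab_partners has already been read, gets replaced with

    learn about <LINK ref="&servlet;ab_partners&obj-page;">our partners</LINK>.

which is then replaced with one of these:

    learn about <LINK ref="/servlet/handler?id=ab_partners&amp;obj=Page">our partners</LINK>.
    learn about <LINK ref="ab_partners.html">our partners</LINK>.

depending on which of the two initial entities (load-oodb
or make-html) has been set to INCLUDE.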
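
To see the switch in action, here is a small Java/SAX sketch that
feeds a cut-down version of the declarations above to the parser
through an EntityResolver and reports the resolved "ref" attribute.
The class name, the "links.dtd" system identifier, and the "doc"
root element are assumptions made for the illustration.  It also
assumes the parser reads the external DTD subset even when not
validating, which is the default for the Xerces-based parser
bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class ConditionalLinkDemo {

    // Builds a cut-down external DTD subset; each argument must be
    // "INCLUDE" or "IGNORE".
    static String dtd(String loadOodb, String makeHtml) {
        return "<!ENTITY % load-oodb \"" + loadOodb + "\">\n"
             + "<!ENTITY % make-html \"" + makeHtml + "\">\n"
             + "<![ %load-oodb; [\n"
             + "<!ENTITY servlet  \"/servlet/handler?id=\">\n"
             + "<!ENTITY obj-page \"&amp;obj=Page\">\n"
             + "]]>\n"
             + "<![ %make-html; [\n"
             + "<!ENTITY servlet  \"\">\n"
             + "<!ENTITY obj-page \".html\">\n"
             + "]]>\n"
             + "<!ENTITY link-ab_partners '&servlet;ab_partners&obj-page;'>\n";
    }

    // Parses a one-link document against the DTD and returns the
    // fully expanded value of the LINK element's ref attribute.
    static String resolveRef(String loadOodb, String makeHtml) {
        String doc = "<!DOCTYPE doc SYSTEM \"links.dtd\">\n"
                   + "<doc>learn about <LINK ref=\"&link-ab_partners;\">"
                   + "our partners</LINK>.</doc>";
        String[] ref = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Serve the DTD from memory instead of from disk.
                if (systemId != null && systemId.endsWith("links.dtd")) {
                    InputSource src =
                        new InputSource(new StringReader(dtd(loadOodb, makeHtml)));
                    src.setSystemId(systemId);
                    return src;
                }
                return null; // fall back to default resolution
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("LINK".equals(qName)) {
                    ref[0] = atts.getValue("ref");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(doc)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return ref[0];
    }

    public static void main(String[] args) {
        System.out.println(resolveRef("IGNORE", "INCLUDE"));
        // prints: ab_partners.html
        System.out.println(resolveRef("INCLUDE", "IGNORE"));
        // prints: /servlet/handler?id=ab_partners&obj=Page
    }
}
```

Note that conditional sections are only legal in the external
subset (or in external parameter entities), which is why the
sketch serves the declarations through the resolver rather than
pasting them into the DOCTYPE's internal subset.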

In a similar fashion, I've set up the use of entities throughout
the (current) 1,600+ documents for over 200 images, several
hundred "body" inclusions, and about 3,650 links, and both Xerces
and SP exhibit virtually no performance hit compared to processing
the equivalent set of documents with all of the entities already
fully expanded.  The system could easily be extended to support
SVG, CGM, EPS, Flash, or some other graphic format for specialized
CD-ROMs or print output, or some other linking mechanism for
proprietary systems like DynaText or Folio.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

At 11:45 AM -0700 2001-10-22, Devlin, Kurt wrote:
>Yes, I realize that I want to "break" the XML rules, but I feel
>like my intentions are good.
>We definitely fall into the "no DTD" group for our data
>exchange.  I had considered chaining an InputStream in before the
>Reader to "import" the entity declarations.  This handles the
>case for all of the known entities, but not for unknown ones.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 


Danger, Will Robinson -- unknown monster coming toward us! 

I'm assuming that you are going to be processing documents such
as statutes or case law, not data-centric XML such as purchase
orders.

Without DTDs you will almost certainly end up building all sorts
of custom "validation" into your software.  It rarely turns out
nicely.  The code is usually developed and then maintained with
fixes, updates, and patches accumulating as new variations in the
input documents are discovered.  

It's better to analyze the documents and develop your own DTDs for
them if you can't get DTDs from the people who are creating the
documents.  As well as being used for XML validation, the DTDs can
act as the documentation of your understanding of the allowable
structures in the documents instead of burying that understanding
in your programming.  When you get some new variation that isn't
valid, a validating parser makes it clear as to what the variation
is, thus making the updating of your software much cleaner.
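
As a rough illustration of the point about validating parsers,
the following Java sketch runs a validating SAX parse and collects
the validity errors it reports.  The tiny "statute" DTD, the
element names, and the class name are invented for the example and
are not taken from any real document set.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class DtdValidationDemo {

    // Invented DTD: a statute is a title followed by one or more sections.
    static final String SAMPLE_DTD =
          "<!DOCTYPE statute [\n"
        + "<!ELEMENT statute (title, section+)>\n"
        + "<!ELEMENT title   (#PCDATA)>\n"
        + "<!ELEMENT section (#PCDATA)>\n"
        + "]>\n";

    // Returns one message per validity error reported by the parser;
    // an empty list means the document is valid against its DTD.
    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        // Validity errors are recoverable; record and continue.
                        errors.add("line " + e.getLineNumber()
                                   + ": " + e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = SAMPLE_DTD
            + "<statute><title>T</title><section>S</section></statute>";
        // A "new variation": an undeclared element where a section belongs.
        String variant = SAMPLE_DTD
            + "<statute><title>T</title><note>N</note></statute>";
        System.out.println(validationErrors(valid));
        validationErrors(variant).forEach(System.out::println);
    }
}
```

The messages name the offending element and the content model it
violated, so the variation is visible without any hand-written
checking code.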

/s/ Ernest G. Allen
    Sunnyvale, CA, USA