OASIS Mailing List Archives

 



Re: Generating a DTD from XML files?



At 7:21 AM -0700 2001-05-16, Rob Lawson wrote:
>Hi,
>
>Does anyone know a utility or package to generate a basic 
>DTD from XML files?
>
>Second question, does anyone have a link to useful XML FAQs, 
>so I don't ask anymore possibly silly questions?
>
>Many thanks for any help,
>
>-------
>Robert G. Lawson (rob.lawson@ktiworld.com)
>KPM Consultant
>Knowledge Technologies International Ltd.
>Phone: +44 (0) 7866 610409
>Fax: +44 (0) 7970 030914
>http://www.ktiworld.com
>

There was a good paper at SGML '95 on this.  See "Creating DTDs 
via the GB-Engine and Fred" by Keith E. Shafer at:

  http://www.oclc.org/fred/docs/sgml95.html

See especially sections 4, "Automatic DTD Creation Process" 
and 5, "Reductions".  It should help a lot. 

I'll include an Awk program that I use as the first step when 
creating a DTD from a collection of tagged documents.  It might 
help you get started.  

It reads the ESIS output from SGMLS and writes out the full 
"path" for each element like this:

doc (
doc chapter (
doc chapter section (
doc chapter section para (
doc chapter section para )
doc chapter section para xref (
doc chapter section para xref )
doc chapter section para )
 .
 .
 .

The same approach could be used with SAX events.  
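As a rough illustration of the SAX variant, here is a minimal sketch in Python using the standard xml.sax module. It is not part of the original Awk tool; the names PathHandler and element_paths are made up for this example, and it assumes the input is a well-formed XML string.

```python
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    """Record the root-to-element path at every start and end tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements, root first
        self.lines = []   # one output line per start or end tag

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.lines.append(" ".join(self.stack) + " (")

    def endElement(self, name):
        # Emit the path before popping so the closing element is still on it.
        self.lines.append(" ".join(self.stack) + " )")
        self.stack.pop()


def element_paths(xml_text):
    """Hypothetical helper: parse an XML string and return the path lines."""
    handler = PathHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


if __name__ == "__main__":
    sample = "<doc><chapter><section><para><xref/></para></section></chapter></doc>"
    print("\n".join(element_paths(sample)))
```

The handler keeps the same stack discipline as the Awk script: push on a start tag, print, and pop after printing the end tag.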

You can then write other utilities that use this output to count 
the various nestings, which gives you a better chance of making the 
cardinality constraints a little tighter instead of just using 
* or + for each element.  At the least, it makes it easy to build 
loose content models such as (X | Y | Z)* in order to get started.  
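One possible follow-on utility, sketched in Python under stated assumptions: it reads path lines in the format shown above, collects the set of child elements seen under each parent, and writes a loose (X | Y | Z)* declaration for each. The function names are invented for this example. The path output carries no information about character data, so declaring childless elements as (#PCDATA) is only a guess to be refined by hand, and elements with mixed content will need editing too.

```python
from collections import defaultdict


def loose_content_models(path_lines):
    """Map each element name to the sorted list of child names seen under it."""
    children = defaultdict(set)
    for line in path_lines:
        parts = line.split()
        # Only start-tag lines ("... name (") contribute; skip everything else.
        if len(parts) < 2 or parts[-1] != "(":
            continue
        path = parts[:-1]
        children[path[-1]]  # register the element even if it never has children
        if len(path) >= 2:
            children[path[-2]].add(path[-1])
    return {name: sorted(kids) for name, kids in children.items()}


def dtd_declarations(models):
    """Turn the child sets into loose <!ELEMENT ...> declarations."""
    decls = []
    for name in sorted(models):
        kids = models[name]
        if kids:
            model = "(" + " | ".join(kids) + ")*"
        else:
            model = "(#PCDATA)"  # assumption: leaf elements hold text
        decls.append("<!ELEMENT %s %s>" % (name, model))
    return decls


if __name__ == "__main__":
    import sys
    for decl in dtd_declarations(loose_content_models(sys.stdin)):
        print(decl)
```

Run as a filter, it would take the path listing on standard input and print one declaration per element type. Counting how often each child occurs per parent instance would be the next step toward tighter cardinality.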



/s/ Ernest G. Allen

//----------------------------------------------------------

##  GI_path.awk -- accepts ESIS input, writes the full path 
#   from the root element to each element start and end tag.
#
#   Uses "(" and ")" at the end of each line of output to 
#   indicate that the last GI on the line is a start tag or 
#   end tag, respectively.
#
#   by Ernest G. Allen, 1995-2001
#   
#   No copyrights held, placed into the Public Domain.
#

/^\(/ { GI = $1; GI = substr(GI, 2); push(GI); next; }

/^\)/ { pop(); }

/^\?/ { print; }

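# Push an element name onto the stack and print the path to its start tag.
function push(s) {
    stack_ptr++;
    stack[stack_ptr] = s;    # store the argument rather than the global GI
    print_stack();
    print "(";
    return;
}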

function pop() {
    print_stack();
    print ")";
    stack_ptr--;
    return;
}

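# Print the names on the stack, root first, each followed by a space.
# The stack is 1-based (push() increments before storing), so start at 1;
# starting at 0 would print the empty slot as a leading space.
# "i" is declared as an extra parameter to keep it local.
function print_stack(    i) {
    for (i = 1; i <= stack_ptr; i++) {
        printf("%s ", stack[i]);
    }
}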

\\----------------------------------------------------------