Re: XML Schemas: Best Practices

From: "Roger L. Costello" <costello@mitre.org>
To: xml-dev@lists.xml.org
Date: Tue, 16 Jan 2001 18:05:12 -0500

Hi Folks,

I would like to start on a new issue.  I think that this issue will
generate a lot of interest, as it is critical to designing robust
schemas.

Issue: What is Best Practice for creating extensible content models?

Below I have jotted down some initial thoughts on this subject.  I
am sure that I have missed many techniques for creating extensible
content models. What are your thoughts on this topic?

Techniques for Creating Extensible Content Models

[1] Use types to create extensible content models.  Consider this 
    schema snippet:

    <element name="BookCatalogue">
        <complexType>
             <sequence>
                 <element name="Book" minOccurs="0" 
                          maxOccurs="unbounded">
                     <complexType>
                         <sequence>
                             <element name="Title" type="string"/>
                             <element name="Author" type="string"/>
                             <element name="Date" type="year"/>
                             <element name="ISBN" type="string"/>
                             <element name="Publisher" type="string"/>
                         </sequence>
                     </complexType>
                 </element>
            </sequence>
        </complexType>
    </element>

This schema snippet dictates that in instance documents each <Book>
element must consist of exactly five elements: <Title>, <Author>, 
<Date>, <ISBN>, and <Publisher>.  For example:

     <Book>
          <Title>The First and Last Freedom</Title>
          <Author>J. Krishnamurti</Author>
          <Date>1954</Date>
          <ISBN>0-06-064831-7</ISBN>
          <Publisher>Harper &amp; Row</Publisher>
     </Book>

Instance documents that conform to this schema are completely static
and non-extensible.

On the other hand, consider this version of the schema, where I have 
defined Book's content model with a type definition:

     <complexType name="BookType">
        <sequence>
            <element name="Title" type="string"/>
            <element name="Author" type="string"/>
            <element name="Date" type="year"/>
            <element name="ISBN" type="string"/>
            <element name="Publisher" type="string"/>
        </sequence>
    </complexType>
    <element name="BookCatalogue">
        <complexType>
             <sequence>
                 <element name="Book" type="c:BookType" minOccurs="0" 
                          maxOccurs="unbounded"/>
            </sequence>
        </complexType>
    </element>

Recall that via the mechanism of type substitutability, the contents 
of <Book> can be substituted by any type that derives from BookType.  
For example, if we create a type which derives from BookType: 

    <complexType name="BookTypePlusReviewer">
        <complexContent>
            <extension base="c:BookType" >
                <sequence>
                    <element name="Reviewer" type="string"/>
                </sequence>
            </extension>
        </complexContent>
    </complexType>

then instance documents can create a <Book> element that
contains a <Reviewer> element, along with the other five elements:

        <Book xsi:type="BookTypePlusReviewer">
             <Title>My Life and Times</Title>
             <Author>Paul McCartney</Author>
             <Date>1998</Date>
             <ISBN>94303-12021-43892</ISBN>
             <Publisher>McMillin Publishing</Publisher>
             <Reviewer>Roger Costello</Reviewer>
        </Book>

In my example, I defined BookTypePlusReviewer within the same
schema as BookType.  In general, however, this may not be the case.
Other schemas can import the BookCatalogue schema and define types 
which derive from BookType.  Thus, the contents of Book may be 
extended, without modifying the BookCatalogue schema!
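
As a rough sketch of what such an importing schema might look like
(the namespace URIs, the r: prefix, the file name BookCatalogue.xsd,
and the type name ReviewedBookType are all made up for illustration,
and I am assuming the schema namespace of the final Recommendation):

```xml
<?xml version="1.0"?>
<!-- Hypothetical extending schema. Every URI, prefix, and file name
     here is an assumption, not part of the BookCatalogue schema. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.example.org/reviews"
        xmlns:r="http://www.example.org/reviews"
        xmlns:c="http://www.example.org/catalogue"
        elementFormDefault="qualified">

    <!-- Pull in the unmodified BookCatalogue schema -->
    <import namespace="http://www.example.org/catalogue"
            schemaLocation="BookCatalogue.xsd"/>

    <!-- Derive a new type from BookType by extension -->
    <complexType name="ReviewedBookType">
        <complexContent>
            <extension base="c:BookType">
                <sequence>
                    <element name="Reviewer" type="string"/>
                </sequence>
            </extension>
        </complexContent>
    </complexType>
</schema>
```

An instance document would then select the new type by putting
something like xsi:type="r:ReviewedBookType" on a <Book> element.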

This type substitutability mechanism is a powerful extensibility 
mechanism.  However, it suffers from two problems:

[1] Location Restricted Extensibility: The extensibility is restricted  
    to appending elements onto the end of the content model 
    (after the <Publisher> element).  What if we wanted to extend 
    <Book> by adding elements to the beginning (before <Title>), or in 
    the middle, etc?  We can't do it with this mechanism.  

[2] Unexpected Extensibility: If you look at the declaration for Book:

     <element name="Book" type="c:BookType" minOccurs="0" 
              maxOccurs="unbounded"/>

and the definition for BookType:

     <complexType name="BookType">
        <sequence>
            <element name="Title" type="string"/>
            <element name="Author" type="string"/>
            <element name="Date" type="year"/>
            <element name="ISBN" type="string"/>
            <element name="Publisher" type="string"/>
        </sequence>
    </complexType>

it is easy to be fooled into thinking that in instance documents the
<Book> elements will always contain just <Title>, <Author>, <Date>, 
<ISBN>, and <Publisher>.  It is easy to forget that someone could 
extend the content model using the type substitutability mechanism.  
Extensibility is unexpected! Consequently, if you write a program to 
process BookCatalogue instance documents, you may forget to take into
account the fact that a <Book> element may contain more than five 
children. 

It would be nice if there were a way to explicitly flag places where
extensibility may occur: "hey, instance documents may extend <Book> at 
this point, so be sure to write your code taking this possibility into 
account."  In addition, it would be nice if we could extend Book's 
content model at locations other than just the end ... The <any> 
element gives us these capabilities beautifully:

    <element name="BookCatalogue">
        <complexType>
             <sequence>
                 <element name="Book" minOccurs="0" 
                          maxOccurs="unbounded">
                     <complexType>
                         <sequence>
                             <element name="Title" type="string"/>
                             <element name="Author" type="string"/>
                             <element name="Date" type="year"/>
                             <element name="ISBN" type="string"/>
                             <element name="Publisher" type="string"/>
                             <any namespace="##any" minOccurs="0"/>
                         </sequence>
                     </complexType>
                 </element>
            </sequence>
        </complexType>
    </element>
 
In this version of the schema I have made explicit the fact that after
the <Publisher> element any well-formed XML element may occur (at most
one, since maxOccurs defaults to 1), and that the XML element may come
from any namespace.
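
To illustrate, here is a sketch of a <Book> that uses this extension
point.  The rev: prefix and its namespace URI are made up.  One caveat:
as far as I know, processContents on <any> defaults to "strict", so a
validator will want to find a declaration for <rev:Reviewer>.  Writing
processContents="lax" (or "skip") on the <any> element should relax
that, if truly arbitrary content is the goal.

```xml
<!-- The rev prefix and its namespace URI are hypothetical -->
<Book xmlns:rev="http://www.example.org/reviews">
    <Title>My Life and Times</Title>
    <Author>Paul McCartney</Author>
    <Date>1998</Date>
    <ISBN>94303-12021-43892</ISBN>
    <Publisher>McMillin Publishing</Publisher>
    <!-- Matched by the <any> wildcard that follows Publisher -->
    <rev:Reviewer>Roger Costello</rev:Reviewer>
</Book>
```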

Note that I could have put the <any> element within a BookType:

     <complexType name="BookType">
        <sequence>
            <element name="Title" type="string"/>
            <element name="Author" type="string"/>
            <element name="Date" type="year"/>
            <element name="ISBN" type="string"/>
            <element name="Publisher" type="string"/>
            <any namespace="##any" minOccurs="0" maxOccurs="1"/>
        </sequence>
    </complexType>

and then declared Book to be of type BookType:

    <element name="Book" type="c:BookType" minOccurs="0" 
             maxOccurs="unbounded"/>

However, then we are back to the "unexpected extensibility" problem. 
Namely, after the <Publisher> element any well-formed XML element
may occur, and after that a type derived from BookType could append
still more content.  

Thus, I chose not to use a type so that I could control the 
extensibility.

There is another way to control the extensibility and still use a type.
I can use the BookType and add a block attribute to Book:

    <element name="Book" type="c:BookType" block="#all"
             minOccurs="0" maxOccurs="unbounded"/>

The block attribute prohibits derived types from being used in
Book's content model. I prefer this latter way of controlling 
extensibility to the in-line version because it creates a reusable 
component (BookType), and yet we still have control over the 
extensibility.
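
As a sketch of the effect, and assuming the block="#all" declaration
above, I would expect a validator to reject a <Book> like the one
below, because its xsi:type names a type derived from BookType.  (The
xsi namespace URI shown is the one from the final Recommendation, and
the unprefixed type name assumes that the default namespace is the
catalogue's target namespace; both are assumptions on my part.)

```xml
<!-- Expected to be rejected when Book is declared with block="#all":
     BookTypePlusReviewer is derived from BookType by extension. -->
<Book xsi:type="BookTypePlusReviewer"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Title>My Life and Times</Title>
    <Author>Paul McCartney</Author>
    <Date>1998</Date>
    <ISBN>94303-12021-43892</ISBN>
    <Publisher>McMillin Publishing</Publisher>
    <Reviewer>Roger Costello</Reviewer>
</Book>
```

If blocking everything is too strong, block also accepts a list such
as block="extension", which should leave types derived by restriction
usable.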

With the <any> element we have complete control over where, and how 
much, extensibility we want to allow.  For example, suppose that we 
want to allow at most two new elements at the top of Book's content 
model.  Here's how to specify that using the <any> element:

     <complexType name="BookType">
        <sequence>
            <any namespace="##any" minOccurs="0" maxOccurs="2"/>
            <element name="Title" type="string"/>
            <element name="Author" type="string"/>
            <element name="Date" type="year"/>
            <element name="ISBN" type="string"/>
            <element name="Publisher" type="string"/>
        </sequence>
    </complexType>

Note how I have placed the <any> element at the top of the content
model, and have set maxOccurs="2".  Thus, in instance documents the
<Book> content will always end with <Title>, <Author>, <Date>, <ISBN>,
and <Publisher>.  Prior to that, up to two well-formed XML elements 
may occur.
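
One caveat that I am fairly (but not completely) sure about: with
namespace="##any" a leading wildcard also matches <Title> itself, so a
processor that enforces the Unique Particle Attribution rule may
reject the content model as ambiguous.  A variant that I would expect
to avoid this restricts the wildcard to other namespaces.  This is
only a sketch; the processContents="lax" setting is my own addition:

```xml
<!-- Variant of BookType: the leading wildcard only matches elements
     from namespaces other than the schema's target namespace, so it
     cannot be confused with <Title>. -->
<complexType name="BookType">
    <sequence>
        <any namespace="##other" processContents="lax"
             minOccurs="0" maxOccurs="2"/>
        <element name="Title" type="string"/>
        <element name="Author" type="string"/>
        <element name="Date" type="year"/>
        <element name="ISBN" type="string"/>
        <element name="Publisher" type="string"/>
    </sequence>
</complexType>
```

With this variant a <Book> could begin with, say, <x:Edition> and
<x:Series> elements from some other namespace (hypothetical names),
followed by the five standard elements.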

I must admit that I am biased towards using the <any> element as a
mechanism for achieving content model extensibility.  It provides much 
greater control for where extensibility occurs and how much occurs.  In 
addition, I like the fact that it alerts me to where extensibility may 
occur, so I can write my programs to process the content model 
appropriately.  I don't like surprises in my data.

What are your thoughts on this topic?  I am sure that in my bias, I
am missing some disadvantages of using the <any> element.  Can you
think of any disadvantages? What other techniques are there for 
extending content models?  /Roger
