Re: XML Schemas: Best Practices
- From: "Roger L. Costello" <costello@mitre.org>
- To: xml-dev@lists.xml.org
- Date: Tue, 16 Jan 2001 18:05:12 -0500
Hi Folks,
I would like to start on a new issue. I think that this issue will
generate a lot of interest, as it is critical to designing robust
schemas.
Issue: What is Best Practice for creating extensible content models?
Below I have jotted down some initial thoughts on this subject. I
am sure that I have missed many techniques for creating extensible
content models. What are your thoughts on this topic?
Techniques for Creating Extensible Content Models
[1] Use types to create extensible content models. Consider this
schema snippet:
<element name="BookCatalogue">
<complexType>
<sequence>
<element name="Book" minOccurs="0"
maxOccurs="unbounded">
<complexType>
<sequence>
<element name="Title" type="string"/>
<element name="Author" type="string"/>
<element name="Date" type="year"/>
<element name="ISBN" type="string"/>
<element name="Publisher" type="string"/>
</sequence>
</complexType>
</element>
</sequence>
</complexType>
</element>
This schema snippet dictates that in instance documents <Book> elements
must always consist of exactly five elements: <Title>, <Author>,
<Date>, <ISBN>, and <Publisher>. For example:
<Book>
<Title>The First and Last Freedom</Title>
<Author>J. Krishnamurti</Author>
<Date>1954</Date>
<ISBN>0-06-064831-7</ISBN>
<Publisher>Harper &amp; Row</Publisher>
</Book>
The schema permits only instance documents that are completely static
and non-extensible.
On the other hand, consider this version of the schema, where I have
defined Book's content model with a type definition:
<complexType name="BookType">
<sequence>
<element name="Title" type="string"/>
<element name="Author" type="string"/>
<element name="Date" type="year"/>
<element name="ISBN" type="string"/>
<element name="Publisher" type="string"/>
</sequence>
</complexType>
<element name="BookCatalogue">
<complexType>
<sequence>
<element name="Book" type="c:BookType" minOccurs="0"
maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
Recall that via the mechanism of type substitutability, the type of
<Book> can be substituted by any type that derives from BookType.
For example, if we create a type which derives from BookType:
<complexType name="BookTypePlusReviewer">
<complexContent>
<extension base="c:BookType" >
<sequence>
<element name="Reviewer" type="string"/>
</sequence>
</extension>
</complexContent>
</complexType>
then instance documents can create a <Book> element that
contains a <Reviewer> element, along with the other five elements:
<Book xsi:type="BookTypePlusReviewer">
<Title>My Life and Times</Title>
<Author>Paul McCartney</Author>
<Date>1998</Date>
<ISBN>94303-12021-43892</ISBN>
<Publisher>McMillin Publishing</Publisher>
<Reviewer>Roger Costello</Reviewer>
</Book>
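For reference, here is a sketch of a complete instance document around that
<Book>. The catalogue namespace URI is hypothetical, and the sketch assumes
that the schema uses it as its targetNamespace with
elementFormDefault="qualified". The xsi namespace shown is the one from the
final XML Schema Recommendation; earlier drafts used a different URI.

```xml
<?xml version="1.0"?>
<!-- Hypothetical namespace URI for the catalogue schema. The unprefixed
     xsi:type value resolves through the default namespace declaration. -->
<BookCatalogue xmlns="http://www.example.org/catalogue"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Book xsi:type="BookTypePlusReviewer">
    <Title>My Life and Times</Title>
    <Author>Paul McCartney</Author>
    <Date>1998</Date>
    <ISBN>94303-12021-43892</ISBN>
    <Publisher>McMillin Publishing</Publisher>
    <Reviewer>Roger Costello</Reviewer>
  </Book>
</BookCatalogue>
```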
In my example, I defined BookTypePlusReviewer within the same
schema as BookType. In general, however, this may not be the case.
Other schemas can import the BookCatalogue schema and define types
which derive from BookType. Thus, the contents of Book may be
extended, without modifying the BookCatalogue schema!
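A minimal sketch of what such an importing schema might look like. Both
namespace URIs and the file name BookCatalogue.xsd are hypothetical, and the
schema namespace is the one from the final XML Schema Recommendation:

```xml
<?xml version="1.0"?>
<!-- Hypothetical extension schema; namespace URIs and file name are
     assumptions made for illustration. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.example.org/reviews"
        xmlns:c="http://www.example.org/catalogue"
        elementFormDefault="qualified">
  <!-- Pull in the unmodified BookCatalogue schema -->
  <import namespace="http://www.example.org/catalogue"
          schemaLocation="BookCatalogue.xsd"/>
  <!-- Derive a new type from the imported BookType -->
  <complexType name="BookTypePlusReviewer">
    <complexContent>
      <extension base="c:BookType">
        <sequence>
          <element name="Reviewer" type="string"/>
        </sequence>
      </extension>
    </complexContent>
  </complexType>
</schema>
```

An instance document would then select the derived type with a prefixed name,
for example xsi:type="r:BookTypePlusReviewer", where r is bound to the
extension schema's namespace.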
This type substitutability mechanism is a powerful extensibility
mechanism. However, it suffers from two problems:
[1] Location Restricted Extensibility: The extensibility is restricted
to appending elements onto the end of the content model
(after the <Publisher> element). What if we wanted to extend
<Book> by adding elements to the beginning (before <Title>), or in
the middle, etc? We can't do it with this mechanism.
[2] Unexpected Extensibility: If you look at the declaration for Book:
<element name="Book" type="c:BookType" minOccurs="0"
maxOccurs="unbounded"/>
and the definition for BookType:
<complexType name="BookType">
<sequence>
<element name="Title" type="string"/>
<element name="Author" type="string"/>
<element name="Date" type="year"/>
<element name="ISBN" type="string"/>
<element name="Publisher" type="string"/>
</sequence>
</complexType>
it is easy to be fooled into thinking that in instance documents the
<Book> elements will always contain just <Title>, <Author>, <Date>,
<ISBN>, and <Publisher>. It is easy to forget that someone could
extend the content model using the type substitutability mechanism.
Extensibility is unexpected! Consequently, if you write a program to
process BookCatalogue instance documents, you may forget to take into
account the fact that a <Book> element may contain more than five
children.
It would be nice if there were a way to explicitly flag places where
extensibility may occur: "hey, instance documents may extend <Book> at
this point, so be sure to write your code taking this possibility into
account." In addition, it would be nice if we could extend Book's
content model at locations other than just the end ... The <any>
element gives us these capabilities beautifully:
<element name="BookCatalogue">
<complexType>
<sequence>
<element name="Book" minOccurs="0"
maxOccurs="unbounded">
<complexType>
<sequence>
<element name="Title" type="string"/>
<element name="Author" type="string"/>
<element name="Date" type="year"/>
<element name="ISBN" type="string"/>
<element name="Publisher" type="string"/>
<any namespace="##any" minOccurs="0"/>
</sequence>
</complexType>
</element>
</sequence>
</complexType>
</element>
In this version of the schema I have made explicit the fact that after
the <Publisher> element any well-formed XML element may occur and
the XML element may come from any namespace.
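A sketch of an instance fragment that uses this extension point. The r prefix
and its namespace URI are hypothetical. One hedge: in the final XML Schema
Recommendation, processContents on <any> defaults to "strict", so a validator
would need a declaration for r:Reviewer to be available; adding
processContents="lax" or "skip" to the <any> element relaxes that requirement.

```xml
<!-- The r namespace is a hypothetical example of "any namespace" -->
<Book xmlns:r="http://www.example.org/reviews">
  <Title>My Life and Times</Title>
  <Author>Paul McCartney</Author>
  <Date>1998</Date>
  <ISBN>94303-12021-43892</ISBN>
  <Publisher>McMillin Publishing</Publisher>
  <!-- Matched by the <any> wildcard; at most one such element is allowed -->
  <r:Reviewer>Roger Costello</r:Reviewer>
</Book>
```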
Note that I could have put the <any> element within a BookType:
<complexType name="BookType">
<sequence>
<element name="Title" type="string"/>
<element name="Author" type="string"/>
<element name="Date" type="year"/>
<element name="ISBN" type="string"/>
<element name="Publisher" type="string"/>
<any namespace="##any" minOccurs="0" maxOccurs="1"/>
</sequence>
</complexType>
and then declared Book to be of type BookType:
<element name="Book" type="c:BookType" minOccurs="0"
maxOccurs="unbounded"/>
However, then we are back to the "unexpected extensibility" problem:
after the <Publisher> element any well-formed XML element may occur,
and after that a type derived from BookType could append anything else.
Thus, I chose not to use a type so that I could control the
extensibility.
There is another way to control the extensibility and still use a type.
I can use the BookType and add a block attribute to Book:
<element name="Book" type="c:BookType" block="#all"
minOccurs="0" maxOccurs="unbounded"/>
The block attribute prohibits derived types from being used in
Book's content model. I prefer this latter way of controlling
extensibility to the in-line version because it creates a reusable
component (BookType), and yet we still have control over the
extensibility.
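The block attribute on an element declaration also accepts a list drawn from
"extension", "restriction", and "substitution" in place of "#all". As a sketch
of a finer-grained variant (not something proposed above), this declaration
would block only types derived by extension, while still allowing types
derived by restriction to be named with xsi:type:

```xml
<!-- Variant: reject extension-derived types such as BookTypePlusReviewer,
     but still permit restriction-derived types in instance documents -->
<element name="Book" type="c:BookType" block="extension"
         minOccurs="0" maxOccurs="unbounded"/>
```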
With the <any> element we have complete control over where, and how
much, extensibility we want to allow. For example, suppose that we
want to allow at most two new elements at the top of
Book's content model. Here's how to specify that using the <any>
element:
<complexType name="BookType">
<sequence>
<any namespace="##any" minOccurs="0" maxOccurs="2"/>
<element name="Title" type="string"/>
<element name="Author" type="string"/>
<element name="Date" type="year"/>
<element name="ISBN" type="string"/>
<element name="Publisher" type="string"/>
</sequence>
</complexType>
Note how I have placed the <any> element at the top of the content
model, and have set maxOccurs="2". Thus, in instance documents the
<Book> content will always end with <Title>, <Author>, <Date>, <ISBN>,
and <Publisher>. Prior to that, up to two well-formed XML elements may
occur.
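One caveat, stated with some hedging: under the final XML Schema
Recommendation, a ##any wildcard followed directly by <Title> can run afoul of
the Unique Particle Attribution rule, because a validator that sees <Title>
cannot tell whether it matches the wildcard or the Title declaration. A sketch
of a variant that sidesteps this, assuming the schema has a target namespace,
restricts the wildcard to ##other. The processContents="lax" setting is also
an assumption, added so that undeclared elements are tolerated.

```xml
<!-- Variant of BookType: the wildcard admits only elements from namespaces
     other than the schema's target namespace, so it can never match <Title> -->
<complexType name="BookType">
  <sequence>
    <any namespace="##other" processContents="lax"
         minOccurs="0" maxOccurs="2"/>
    <element name="Title" type="string"/>
    <element name="Author" type="string"/>
    <element name="Date" type="year"/>
    <element name="ISBN" type="string"/>
    <element name="Publisher" type="string"/>
  </sequence>
</complexType>
```

With this variant, the zero to two leading elements in a <Book> must come from
a foreign namespace, for example a prefixed element whose prefix is bound to
some other schema's namespace.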
I must admit that I am biased towards using the <any> element as a
mechanism for achieving content model extensibility. It provides much
greater control for where extensibility occurs and how much occurs. In
addition, I like the fact that it alerts me to where extensibility may
occur, so I can write my programs to process the content model
appropriately. I don't like surprises in my data.
What are your thoughts on this topic? I am sure that in my bias, I
am missing some disadvantages of using the <any> element. Can you
think of any disadvantages? What other techniques are there for
extending content models? /Roger