Issue
Should unbounded occurrences be permitted in an XML Schema?
Two Approaches to Allowing an Unbounded Number of Occurrences
There are two approaches in XML Schemas for allowing an unbounded number of
occurrences. Below I discuss these two approaches. Following that I discuss two
viewpoints on whether unbounded number of occurrences should or should not be
permitted. I then discuss the advantages and disadvantages of each viewpoint.
Finally, I propose a compromise of the two viewpoints.
Allowing Unbounded Occurrences using maxOccurs
The first approach to
allowing an unbounded number of occurrences is to explicitly state that you want
an unbounded number of occurrences by using maxOccurs="unbounded". For example,
the following declaration says that Bookstore can contain an unbounded number of
Book elements:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
Allowing Unbounded Occurrences using a Recursive Expression
The second approach to allowing an unbounded number of occurrences is less
obvious. Unboundedness occurs implicitly when you create a recursive structure.
In the following example there is no limit to the depth of the Section elements.
That is, a Section can contain a Section which contains a Section which contains
a Section ...
<element name="Section" type="SectionType"/>
<complexType name="SectionType">
<sequence>
<element name="Title" type="..."/>
<element name="Section" type="SectionType"/>
</sequence>
</complexType>
Comparison of the Two Approaches
Both of the above approaches allow an unbounded number of occurrences. Let's
compare the two approaches:
- Explicit vs Implicit: With the first approach you explicitly state
that you are allowing an unbounded number of occurrences. With the second
approach unboundedness is implicit. Although it is obvious in the example
presented, in a large Schema where the recursion extends through many
complexTypes it may not be obvious that an unbounded number of occurrences are
being allowed.
- Ability to "throttle back" on the Number of Occurrences: With the
first approach is it easy to reduce the number of occurrences. If you don't
want an unbounded number of occurrences, and want, say, only 10 occurrences
then you simply specify maxOccurs="10". With the second approach there is no
means to control the depth of the recursion. There is no means to say, "There
cannot be more than 10 deep Section elements".
Permit Unbounded Occurrences or Constrain the Occurrences?
Now that we have seen the two approaches for allowing an unbounded number of
occurrences we return to the main issue: when designing an XML Schema should you
permit an unbounded number of occurrences? There are two viewpoints on this
issue. One viewpoint is that you should design your Schemas to permit an
unbounded occurrences. The other viewpoint is that you should not permit an
unbounded number of occurrences.
To keep the discussion concrete, let's take the above Bookstore Schema as the
example. Suppose that it is decided that Bookstore should not allow more than
30,000 Books. Should the Schema be designed to allow an unbounded number of
Books:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
Or, should the Schema be designed to explicitly state 30,000 as the
maximum:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="30000"/>
</sequence>
</complexType>
</element>
Viewpoint 1: Permit an Unbounded Number of Occurrences
This viewpoint says that it's better to use maxOccurs="unbounded".
There is a technical problem with setting maxOccurs="30000". Michael Kay
nicely summarizes the problem: "the classical algorithms for turning grammars
into finite state machines produce very inefficient machines when there are
occurrence limits that are large but finite. Many schema processors break or
consume seriously large amounts of memory if you specify a maxOccurs value
(other than unbounded) that's greater than a couple of hundred." In other words,
a Schema validator will choke if you specify maxOccurs="30000".
If the Bookstore wants, at a later date, to expand to accommodate, say,
35,000 Books then no change will be needed to the Schema.
The example being considered is just one element. A Schema is, of course,
usually comprised of many elements. Suppose that each element is constrained as
precisely as possible. Then a document may be rejected because one element
exceeded its constraints while all other others were within their
constraints.
Viewpoint 2: Constrain the Number of Occurrences
This viewpoint says that it's better to use maxOccurs="30000".
It is important to distinguish between theoretical constraints and
practical constraints. Theoretically, a Bookstore may have an unbounded
number of Books, but for performance/capacity/security reasons this Bookstore
can only handle 30,000 Books.
Another example of the difference between theoretical and practical limits:
theoretically an HTML document may have an infinite number of <p>
elements, but in practice a browser supports only a specific maximum.
Every system has practical constraints. There are many possible reasons for
the constraint such as performance, capacity, or security constraints. The
particulars are irrelevant. What is relevant, however, is that it is guaranteed
that no system is infinite and consequently all systems have practical
constraints.
Expressing the practical limits in an XML Schema are especially important for
Service Level Agreements (SLAs).
So, there are theoretical constraints and there are practical constraints.
And as we've seen, typically the two are not identical. The purpose of an XML
Schema is to express constraints. Which constraints should a Schema express -
theoretical constraints or practical constraints? Viewpoint 2 says that a Schema
should express the practical constraints. Thus, for the Bookstore example, set
maxOccurs="30000".
See Greg Hunt's messages for an excellent discussion of this viewpoint.
Viewpoint 1: Permit an Unbounded Number of Occurrences
-
Advantages
- Schema Validator Efficiency: Schema validators are more efficient
when you use maxOccurs="unbounded" than when you set maxOccurs to a large
number
- Accommodates Growth: as a system is expanded to support larger
amounts of data, there is no need to change the Schema.
-
Disadvantages
- Pushes the Constraint-Checking Problem to Another Part of the
System: if the practical limits are not expressed in the Schema, then
they have to be expressed somewhere else - somewhere less visible, less
maintainable, and with less tool support.
- Vulnerable to Denial of Service Attack: XML guards depend upon an
XML Schema to indicate whether an XML instance document should be allowed to
pass. It will be unable to detect a denial of service attack since the
Schema sets a theoretical limit and not a practical limit.
Viewpoint 2: Constrain the Number of Occurrences
-
Advantages
- Constraints are Visible, Maintainable, and with Tool Support:
since the practical limits are expressed in the Schema, they are visible,
maintainable, and with tool support.
- Prevent Denial of Service Attack: XML guards depend upon an XML
Schema to indicate whether an XML instance document should be allowed to
pass. It will be able to detect a denial of service attack since the Schema
sets a practical limit and not a theoretical limit.
Disadvantages
- Schema Validator is Inefficient: Schema validators are not
efficient when you set maxOccurs to a large number
- Requires Periodic Update: as a system is expanded to support
larger amounts of data, the Schema will need to be updated.
- Exposing System Limits: by setting maxOccurs="30000" you are
giving information to people about the limitations of your system.
Proposal: Constraining Data without the Validator Inefficiencies
As was described above the current implementation of Schema validators are
very inefficient when you specify a large number for the value of maxOccurs. So,
even if you want to express practical limits you cannot, due to this Schema
validator implementation problem.
Here is a proposal to avoid the inefficiency: express the theoretical limits
using XML Schemas and express the practical limits using Schematron assertions.
Let's consider the Bookstore example. Below I show that XML Schemas is to
indicate that Bookstore is comprised of an unbounded number of Books, and I use
Schematron to indicate that the practical limit to the number of Books must not
exceed 30,000:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded">
<schematron:assert test="count(Book) <= 30000"/>
</element>
</sequence>
</complexType>
</element>
Comments?