Issue
Should unbounded occurrences be permitted in an XML Schema?
Two Approaches to Allowing an Unbounded Number of Occurrences
There are two approaches in XML Schemas for allowing an unbounded
number of occurrences. Below I discuss these two approaches. Following
that I discuss two viewpoints on whether unbounded number of
occurrences should or should not be permitted. I then discuss the
advantages and disadvantages of each viewpoint. Finally, I propose a
compromise of the two viewpoints.
Allowing Unbounded Occurrences using maxOccurs
The first approach to allowing an unbounded number of occurrences is to
explicitly state that you want an unbounded number of occurrences by
using maxOccurs="unbounded". For example, the following declaration
says that Bookstore can contain an unbounded number of Book elements:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
Allowing Unbounded Occurrences using a Recursive Expression
The second approach to allowing an unbounded number of occurrences
is less obvious. Unboundedness occurs implicitly when you create a
recursive structure. In the following example there is no limit to the
depth of the Section elements. That is, a Section can contain a Section
which contains a Section which contains a Section ...
<element name="Section" type="SectionType"/>
<complexType name="SectionType">
<sequence>
<element name="Title" type="..."/>
<element name="Section" type="SectionType"/>
</sequence>
</complexType>
Comparison of the Two Approaches
Both of the above approaches allow an unbounded number of
occurrences. Let's compare the two approaches:
- Explicit vs Implicit: With the first approach you
explicitly state that you are allowing an unbounded number of
occurrences. With the second approach unboundedness is implicit.
Although it is obvious in the example presented, in a large Schema
where the recursion extends through many complexTypes it may not be
obvious that an unbounded number of occurrences are being allowed.
- Ability to "throttle back" on the Number of Occurrences:
With the first approach is it easy to reduce the number of occurrences.
If you don't want an unbounded number of occurrences, and want, say,
only 10 occurrences then you simply specify maxOccurs="10". With the
second approach there is no means to control the depth of the
recursion. There is no means to say, "There cannot be more than 10 deep
Section elements".
Permit Unbounded Occurrences or Constrain the Occurrences?
Now that we have seen the two approaches for allowing an unbounded
number of occurrences we return to the main issue: when designing an
XML Schema should you permit an unbounded number of occurrences? There
are two viewpoints on this issue. One viewpoint is that you should
design your Schemas to permit an unbounded occurrences. The other
viewpoint is that you should not permit an unbounded number of
occurrences.
To keep the discussion concrete, let's take the above Bookstore
Schema as the example. Suppose that it is decided that Bookstore should
not allow more than 30,000 Books. Should the Schema be designed to
allow an unbounded number of Books:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
Or, should the Schema be designed to explicitly state 30,000 as
the maximum:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="30000"/>
</sequence>
</complexType>
</element>
Viewpoint 1: Permit an Unbounded Number of Occurrences
This viewpoint says that it's better to use maxOccurs="unbounded".
There is a technical problem with setting maxOccurs="30000".
Michael Kay nicely summarizes the problem: "the classical algorithms
for turning grammars into finite state machines produce very
inefficient machines when there are occurrence limits that are large
but finite. Many schema processors break or consume seriously large
amounts of memory if you specify a maxOccurs value (other than
unbounded) that's greater than a couple of hundred." In other words, a
Schema validator will choke if you specify maxOccurs="30000".
If the Bookstore wants, at a later date, to expand to accommodate,
say, 35,000 Books then no change will be needed to the Schema.
The example being considered is just one element. A Schema is, of
course, usually comprised of many elements. Suppose that each element
is constrained as precisely as possible. Then a document may be
rejected because one element exceeded its constraints while all other
others were within their constraints.
Viewpoint 2: Constrain the Number of Occurrences
This viewpoint says that it's better to use maxOccurs="30000".
It is important to distinguish between theoretical constraints
and practical constraints. Theoretically, a Bookstore may have
an unbounded number of Books, but for performance/capacity/security
reasons this Bookstore can only handle 30,000 Books.
Another example of the difference between theoretical and
practical limits: theoretically an HTML document may have an infinite
number of <p> elements, but in practice a browser supports only a
specific maximum.
Every system has practical constraints. There are many possible
reasons for the constraint such as performance, capacity, or security
constraints. The particulars are irrelevant. What is relevant, however,
is that it is guaranteed that no system is infinite and consequently
all systems have practical constraints.
Expressing the practical limits in an XML Schema are especially
important for Service Level Agreements (SLAs).
So, there are theoretical constraints and there are practical
constraints. And as we've seen, typically the two are not identical.
The purpose of an XML Schema is to express constraints. Which
constraints should a Schema express - theoretical constraints or
practical constraints? Viewpoint 2 says that a Schema should express
the practical constraints. Thus, for the Bookstore example, set
maxOccurs="30000".
See Greg Hunt's messages for an excellent discussion of this
viewpoint.
Viewpoint 1: Permit an Unbounded Number of Occurrences
-
Advantages
- Schema Validator Efficiency: Schema validators are
more efficient when you use maxOccurs="unbounded" than when you set
maxOccurs to a large number
- Accommodates Growth: as a system is expanded to
support larger amounts of data, there is no need to change the Schema.
-
Disadvantages
- Pushes the Constraint-Checking Problem to Another Part
of the System: if the practical limits are not expressed in the
Schema, then they have to be expressed somewhere else - somewhere less
visible, less maintainable, and with less tool support.
- Vulnerable to Denial of Service Attack: XML guards
depend upon an XML Schema to indicate whether an XML instance document
should be allowed to pass. It will be unable to detect a denial of
service attack since the Schema sets a theoretical limit and not a
practical limit.
Viewpoint 2: Constrain the Number of Occurrences
-
Advantages
- Constraints are Visible, Maintainable, and with Tool
Support: since the practical limits are expressed in the Schema,
they are visible, maintainable, and with tool support.
- Prevent Denial of Service Attack: XML guards depend
upon an XML Schema to indicate whether an XML instance document should
be allowed to pass. It will be able to detect a denial of service
attack since the Schema sets a practical limit and not a theoretical
limit.
-
Disadvantages
- Schema Validator is Inefficient: Schema validators
are not efficient when you set maxOccurs to a large number
- Requires Periodic Update: as a system is expanded to
support larger amounts of data, the Schema will need to be updated.
- Exposing System Limits: by setting maxOccurs="30000"
you are giving information to people about the limitations of your
system.
Proposal: Constraining Data without the Validator Inefficiencies
As was described above the current implementation of Schema
validators are very inefficient when you specify a large number for the
value of maxOccurs. So, even if you want to express practical limits
you cannot, due to this Schema validator implementation problem.
Here is a proposal to avoid the inefficiency: express the
theoretical limits using XML Schemas and express the practical limits
using Schematron assertions. Let's consider the Bookstore example.
Below I show that XML Schemas is to indicate that Bookstore is
comprised of an unbounded number of Books, and I use Schematron to
indicate that the practical limit to the number of Books must not
exceed 30,000:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded">
<schematron:assert test="count(Book) <= 30000"/>
</element>
</sequence>
</complexType>
</element>
Comments?