[
Lists Home |
Date Index |
Thread Index
]
- To: "Roger L. Costello" <costello@mitre.org>,"XML Developers List" <xml-dev@lists.xml.org>
- Subject: RE: [xml-dev] [Summary] Constrain the Number of Occurrences of Elements in your XML Schema
- From: "Derek Denny-Brown" <derekdb@microsoft.com>
- Date: Mon, 8 Aug 2005 13:09:02 -0700
- Cc: "Rick Jelliffe" <ricko@allette.com.au>
- Thread-index: AcWb9bs0l5YzZnkCRVGW4BeMik74qwAGcawgABCwJDA=
- Thread-topic: [xml-dev] [Summary] Constrain the Number of Occurrences of Elements in your XML Schema
Your arguments for why it is good to provide explicit upper bounds seem to boil down to making resource constraints explicit. The problem is that resource constraints have always been extremely multi-faceted in my experience, and every attempt to provide bounds on secondary entities is dangerous, as it is not robust. In your bookstore example, they may only be able to handle 30,000 books _on_average_, but since the actual measure depends on book size, local weather, staffing, etc... using the book count as the measure only provides a false sense of security. Similarly, in your example about <p> tags in HTML, the ultimate resource is memory, and the formula for computing how much memory an HTML document uses is far more complicated than counting <p> tags.
I've worked with a lot of customers on understanding denial-of-service issues with XML. I have never seen a situation where max-occurs was a valid solution. Every time, max-occurs only pushes the attack to a new vector. If you are trying to protect memory, then your parser (or it's host environment) should provide means to bound CPU/Memory usage. ASP.Net provides configurable bounds for both CPU(time) and memory, and Windows 'Jobs' also provide ways to bound processes. I believe Solaris has similar functionality for bounding processes.
Pushing the logic for resource constraints into the schema has the additional problem that every back-end upgrade changes the schema! If this really was a bound that everyone _needed_ to know about, then it should be in the schema, but in all your examples, most users don't care what the constraint is, because they will never hit it. You will now be versioning a schema for reasons that most customers are uninterested in? That doesn't sound good.
-derek
[note: the forthcoming releases of MSXML & System.Xml should handle large-but-finite min/max-occurs bounds without issue, although it will incur a small overhead when compared to max-occurs = unbounded]
________________________________________
From: Roger L. Costello [mailto:costello@mitre.org]
Sent: Monday, August 08, 2005 4:58 AM
To: 'XML Developers List'
Cc: 'Rick Jelliffe'
Subject: [xml-dev] [Summary] Constrain the Number of Occurrences of Elements in your XML Schema
Hi Folks,
Outstanding discussion! Many thanks for all the comments. I think that this is an important issue. Below I have tried to summarize the discussion (it doesn't include the most recent comments). Also, at the bottom of this message I have a proposal for getting the best of both viewpoints by using a combination of XML Schemas and Schematron. /Roger
Issue
Should unbounded occurrences be permitted in an XML Schema?
Two Approaches to Allowing an Unbounded Number of Occurrences
There are two approaches in XML Schemas for allowing an unbounded number of occurrences. Below I discuss these two approaches. Following that I discuss two viewpoints on whether unbounded number of occurrences should or should not be permitted. I then discuss the advantages and disadvantages of each viewpoint. Finally, I propose a compromise of the two viewpoints.
Allowing Unbounded Occurrences using maxOccurs
The first approach to allowing an unbounded number of occurrences is to explicitly state that you want an unbounded number of occurrences by using maxOccurs="unbounded". For example, the following declaration says that Bookstore can contain an unbounded number of Book elements:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
Allowing Unbounded Occurrences using a Recursive Expression
The second approach to allowing an unbounded number of occurrences is less obvious. Unboundedness occurs implicitly when you create a recursive structure. In the following example there is no limit to the depth of the Section elements. That is, a Section can contain a Section which contains a Section which contains a Section ...
<element name="Section" type="SectionType"/>
<complexType name="SectionType">
<sequence>
<element name="Title" type="..."/>
<element name="Section" type="SectionType"/>
</sequence>
</complexType>
Comparison of the Two Approaches
Both of the above approaches allow an unbounded number of occurrences. Let's compare the two approaches:
1. Explicit vs Implicit: With the first approach you explicitly state that you are allowing an unbounded number of occurrences. With the second approach unboundedness is implicit. Although it is obvious in the example presented, in a large Schema where the recursion extends through many complexTypes it may not be obvious that an unbounded number of occurrences are being allowed.
2. Ability to "throttle back" on the Number of Occurrences: With the first approach is it easy to reduce the number of occurrences. If you don't want an unbounded number of occurrences, and want, say, only 10 occurrences then you simply specify maxOccurs="10". With the second approach there is no means to control the depth of the recursion. There is no means to say, "There cannot be more than 10 deep Section elements".
Permit Unbounded Occurrences or Constrain the Occurrences?
Now that we have seen the two approaches for allowing an unbounded number of occurrences we return to the main issue: when designing an XML Schema should you permit an unbounded number of occurrences? There are two viewpoints on this issue. One viewpoint is that you should design your Schemas to permit an unbounded occurrences. The other viewpoint is that you should not permit an unbounded number of occurrences.
To keep the discussion concrete, let's take the above Bookstore Schema as the example. Suppose that it is decided that Bookstore should not allow more than 30,000 Books. Should the Schema be designed to allow an unbounded number of Books:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
Or, should the Schema be designed to explicitly state 30,000 as the maximum:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="30000"/>
</sequence>
</complexType>
</element>
Viewpoint 1: Permit an Unbounded Number of Occurrences
This viewpoint says that it's better to use maxOccurs="unbounded".
There is a technical problem with setting maxOccurs="30000". Michael Kay nicely summarizes the problem: "the classical algorithms for turning grammars into finite state machines produce very inefficient machines when there are occurrence limits that are large but finite. Many schema processors break or consume seriously large amounts of memory if you specify a maxOccurs value (other than unbounded) that's greater than a couple of hundred." In other words, a Schema validator will choke if you specify maxOccurs="30000".
If the Bookstore wants, at a later date, to expand to accommodate, say, 35,000 Books then no change will be needed to the Schema.
The example being considered is just one element. A Schema is, of course, usually comprised of many elements. Suppose that each element is constrained as precisely as possible. Then a document may be rejected because one element exceeded its constraints while all other others were within their constraints.
Viewpoint 2: Constrain the Number of Occurrences
This viewpoint says that it's better to use maxOccurs="30000".
It is important to distinguish between theoretical constraints and practical constraints. Theoretically, a Bookstore may have an unbounded number of Books, but for performance/capacity/security reasons this Bookstore can only handle 30,000 Books.
Another example of the difference between theoretical and practical limits: theoretically an HTML document may have an infinite number of <p> elements, but in practice a browser supports only a specific maximum.
Every system has practical constraints. There are many possible reasons for the constraint such as performance, capacity, or security constraints. The particulars are irrelevant. What is relevant, however, is that it is guaranteed that no system is infinite and consequently all systems have practical constraints.
Expressing the practical limits in an XML Schema are especially important for Service Level Agreements (SLAs).
So, there are theoretical constraints and there are practical constraints. And as we've seen, typically the two are not identical. The purpose of an XML Schema is to express constraints. Which constraints should a Schema express - theoretical constraints or practical constraints? Viewpoint 2 says that a Schema should express the practical constraints. Thus, for the Bookstore example, set maxOccurs="30000".
See Greg Hunt's messages for an excellent discussion of this viewpoint.
Viewpoint 1: Permit an Unbounded Number of Occurrences
* Advantages
o Schema Validator Efficiency: Schema validators are more efficient when you use maxOccurs="unbounded" than when you set maxOccurs to a large number
o Accommodates Growth: as a system is expanded to support larger amounts of data, there is no need to change the Schema.
* Disadvantages
o Pushes the Constraint-Checking Problem to Another Part of the System: if the practical limits are not expressed in the Schema, then they have to be expressed somewhere else - somewhere less visible, less maintainable, and with less tool support.
o Vulnerable to Denial of Service Attack: XML guards depend upon an XML Schema to indicate whether an XML instance document should be allowed to pass. It will be unable to detect a denial of service attack since the Schema sets a theoretical limit and not a practical limit.
Viewpoint 2: Constrain the Number of Occurrences
* Advantages
o Constraints are Visible, Maintainable, and with Tool Support: since the practical limits are expressed in the Schema, they are visible, maintainable, and with tool support.
o Prevent Denial of Service Attack: XML guards depend upon an XML Schema to indicate whether an XML instance document should be allowed to pass. It will be able to detect a denial of service attack since the Schema sets a practical limit and not a theoretical limit.
* Disadvantages
o Schema Validator is Inefficient: Schema validators are not efficient when you set maxOccurs to a large number
o Requires Periodic Update: as a system is expanded to support larger amounts of data, the Schema will need to be updated.
o Exposing System Limits: by setting maxOccurs="30000" you are giving information to people about the limitations of your system.
Proposal: Constraining Data without the Validator Inefficiencies
As was described above the current implementation of Schema validators are very inefficient when you specify a large number for the value of maxOccurs. So, even if you want to express practical limits you cannot, due to this Schema validator implementation problem.
Here is a proposal to avoid the inefficiency: express the theoretical limits using XML Schemas and express the practical limits using Schematron assertions. Let's consider the Bookstore example. Below I show that XML Schemas is to indicate that Bookstore is comprised of an unbounded number of Books, and I use Schematron to indicate that the practical limit to the number of Books must not exceed 30,000:
<element name="Bookstore">
<complexType>
<sequence>
<element name="Book" type="..." maxOccurs="unbounded">
<schematron:assert test="count(Book) <= 30000"/>
</element>
</sequence>
</complexType>
</element>
Comments?
|