Most XML vocabularies are too large and inevitably have lots of "holes"
• From: "Costello, Roger L." <costello@mitre.org>
• To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
• Date: Sat, 17 Dec 2011 19:50:10 +0000

```
Hi Folks,

Recently I have been learning Lambda Calculus [1].

A fascinating thing about Lambda Calculus is its richness, despite it being extraordinarily simple.

The set of expressions (lambda-terms) that can be created in Lambda Calculus is defined as follows:

a. All variables are lambda-terms

b. If M and N are any lambda-terms, then (M N) is a lambda-term (called an application)

c. If M is any lambda-term and x is any variable, then (\x -> M) is a lambda-term (called an abstraction)
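
# One possible encoding of rules (a)-(c) above, sketched in Python purely as
# an illustration. The names Var, App, Lam and show() are hypothetical; they
# do not come from reference [1] or from any standard library.
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:                 # rule (a): every variable is a lambda-term
    name: str


@dataclass(frozen=True)
class App:                 # rule (b): application (M N)
    func: "Term"
    arg: "Term"


@dataclass(frozen=True)
class Lam:                 # rule (c): abstraction (\x -> M)
    param: str
    body: "Term"


Term = Union[Var, App, Lam]


def show(t: Term) -> str:
    """Render a term in the notation used in rules (a)-(c)."""
    if isinstance(t, Var):
        return t.name
    if isinstance(t, App):
        return f"({show(t.func)} {show(t.arg)})"
    return f"(\\{t.param} -> {show(t.body)})"


# Example: the identity abstraction applied to a variable y.
identity_applied = App(Lam("x", Var("x")), Var("y"))
print(show(identity_applied))   # ((\x -> x) y)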

Wow!

With just a few items and a few combination rules, an entire field was spawned.

Because it is so limited, it has been possible to characterize Lambda Calculus formally.

A few days ago Michael Kay made this startling statement regarding XML Schema:

... the more you read the XSD spec, the more holes you find.

And on the xmlschema-dev list Michael Kay recently stated this:

... the schema construction model is not defined very formally ...

1. XML Schema is a comparatively small XML vocabulary. I haven't counted the elements and attributes, but let me guess that the total is 100 (probably fewer).

2. XML Schema is pretty rigorously specified.

Yet despite its small size and fairly rigorous specification, it still has "holes" in it.

ASSERTION: An XML vocabulary consisting of 100 items (or more) is too much. It can never be formally specified and it will forever have "holes."

Let's do a little math. Suppose an XML vocabulary consists of 5 elements -- A, B, C, D, E -- and one of them must be the root element, which must contain exactly one child element (one of the other four). Here are some valid instances:

<A>
<B>___</B>
</A>

<A>
<C>___</C>
</A>

<B>
<A>___</A>
</B>

And so forth.

With this extremely constrained XML vocabulary there are: 5 * 4 = 20 permutations (XML instances with differing arrangements of markup).

If we allow the root element to have one or two child elements then there are: 5 * 4 + 5 * 4**2 = 100 permutations (with two children, each of the two positions can hold any of the 4 non-root elements).
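
# A brute-force check of the two counts above, sketched in Python.
# Assumptions made for this illustration (the text does not spell them out):
# a child never repeats the root's name, two children form an ordered pair,
# and the two children may be the same element.
from itertools import product

VOCAB = ["A", "B", "C", "D", "E"]


def instances(num_children):
    """Yield (root, children) for every instance with exactly num_children children."""
    for root in VOCAB:
        others = [e for e in VOCAB if e != root]
        for children in product(others, repeat=num_children):
            yield root, children


one_child = sum(1 for _ in instances(1))      # 5 roots * 4 children
two_children = sum(1 for _ in instances(2))   # 5 roots * 16 ordered pairs
print(one_child, one_child + two_children)    # 20 100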

The complexity grows at a breathtaking rate as the size of the vocabulary increases and as the number of ways of combining its items increases.
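
# A rough Python sketch of that growth, under the same toy rules as the
# 5-element example: one root, then 1..max_children ordered children drawn
# from the other names, and no deeper nesting. The function name and the
# sample sizes are illustrative assumptions, not measurements of any real
# vocabulary.
def flat_instance_count(vocab_size, max_children):
    """Sum of vocab_size * (vocab_size - 1)**k for k = 1..max_children."""
    return sum(vocab_size * (vocab_size - 1) ** k
               for k in range(1, max_children + 1))


# Print the counts for a few vocabulary sizes and child limits.
for size in (5, 12, 100):
    print(size, [flat_instance_count(size, k) for k in (1, 2, 3)])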

How will you possibly avoid "holes" in an XML vocabulary that has a complexity space that is in the trillions of trillions of trillions of permutations?

You can't.

ASSERTION: Large XML vocabularies must be avoided.

So, what's the solution?

The solution is to do what Lambda Calculus has done and what Simon Peyton Jones has described in his article "How to write a financial contract". That is, create a small set of simple, well-specified primitives and a few combination rules.

So, how many primitives and how many combination rules?

Let me toss out a number: an XML vocabulary should not contain more than a dozen primitive elements and a handful of combination rules. That should be enough to generate all the richness one could possibly ever need. And you just might be able to formally specify your XML vocabulary and ensure that it has no "holes."

Clearly this is the only way to go for mission-critical applications.