   [Summary] Fallacies of Validation

  • To: <xml-dev@lists.xml.org>
  • Subject: [Summary] Fallacies of Validation
  • From: "Roger L. Costello" <costello@mitre.org>
  • Date: Sat, 4 Sep 2004 16:15:15 -0400
  • Thread-index: AcSSu+QMSgbYODjkS8mnr975A9tALA==

Fallacies of Validation

Introduction

The purpose of this document is to identify common "fallacies" with regard to validation and its role in a system architecture.  These fallacies were identified through discussions on the xml-dev list; this document is a record of those discussions.

Fallacies of Validation

1. Fallacy of "THE Schema"

2. Fallacy of Schema Locality

3. Fallacy of Requisite Validation

4. Fallacy of Validation as a Pass/Fail Operation

5. Fallacy of a Universal Validation Language

6. Fallacy of Closed System Validation

7. Fallacy that Validation is Exclusively for Constraint Checking

Each of these fallacies is examined below.

1. Fallacy of "THE Schema"

This fallacy was identified by Michael Kay:

... there's no harm in using XML Schema to check data against the business rules, so long as you realize this is *an* XML Schema, not *the* XML Schema. We need to stop thinking that there can only be one schema.

Len Bullard made a similar statement:

... most fundamental errors are ... to consider only a single schema.

and at another point Len states:

 ... fall into the trap of thinking of THE schema and not recognizing the system as a declarative ecosystem of schemas and schema components.

Both Michael and Len are saying that a system should accommodate numerous schemas, not a single one.

Sidebar

Len was asked to define "declarative ecosystem".  This is a very important term and underlies much of what is presented
here. Here's what "declarative ecosystem" means:

Every system lives within a world where there is a lot of variety, i.e., systems aren't islands.  For example, the Wal-Mart system must coexist with its supplier systems, its distributor systems, and its retailer systems.  One can think of this system-of-systems as an "ecosystem".  Thus, the Wal-Mart system resides in an ecosystem.  Each system within the ecosystem has its own local requirements, which are documented by its own (declarative) schemas.  Thus, not only must a bunch of systems coexist; a bunch of schemas must coexist as well.  This ecosystem of schemas is a "declarative ecosystem".

One more comment on declarative ecosystems.  Len made this remark which is important:

... [if two systems are interoperating in a closed environment then] it doesn't matter how singular or multiple they [the schemas] are; but when they are in an ecosystem, they typically overlap and exchange information, and adapt as a result.


Now back to the fallacy of "THE schema" ...

Many examples were provided to demonstrate the value of multiple validations:

Len provided an example of a distributed reporting system:

Look at any large reporting system.  You can build that up into a large schema, but given local variations, do you have sufficient power/force/authority to make them stick, or will you be constantly adjusting them, loosening them, strengthening them?  And how will you know which is the right thing to do?

Here is further elaboration on this.  Suppose that a company has offices in London, Hong Kong, and Sydney, all reporting to the main office in New York.  With such a geographically dispersed collection of offices, it is easy to imagine that there will be local variations.  There will probably be some data that is common to all the offices (Rick Jelliffe calls the constraints on this type of data invariant constraints).  Then there will be locale-specific data (variant constraints).  So it doesn't seem reasonable to assume that a single reporting schema would suffice for this geographically dispersed organization.
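To make the invariant/variant split concrete, here is a minimal sketch (Python with lxml); the schema files and locale keys are hypothetical:

from lxml import etree

# One schema for the invariant constraints shared by every office, plus one
# variant schema per locale.  All file names here are hypothetical.
CORE_SCHEMA = etree.XMLSchema(etree.parse("report-core.xsd"))
LOCALE_SCHEMAS = {
    "london":    etree.XMLSchema(etree.parse("report-london.xsd")),
    "hong-kong": etree.XMLSchema(etree.parse("report-hongkong.xsd")),
    "sydney":    etree.XMLSchema(etree.parse("report-sydney.xsd")),
}

def validate_report(doc, locale):
    # Check the shared (invariant) constraints first, then the locale-specific
    # (variant) constraints.  Neither schema needs to know about the other.
    if not CORE_SCHEMA.validate(doc):
        return False, [str(e) for e in CORE_SCHEMA.error_log]
    variant = LOCALE_SCHEMAS.get(locale)
    if variant is not None and not variant.validate(doc):
        return False, [str(e) for e in variant.error_log]
    return True, []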

Mary Holstege and Michael Kay gave examples of the value of multiple schemas in a workflow environment:

From Mary Holstege:

... suppose all you care about in some phase of processing is picking up the IDs in a document. Then you define a minimal schema where everything is open with the appropriate ID attributes. Maybe you're going to generate an index. In another
phase of processing all you care about is checking that dates are in the right date range. So you have another minimal schema that only pays attention to dates.
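A "schema" this minimal does not even need a schema language.  The sketch below (Python, with an assumed convention that date values live in elements named date) checks nothing but the date range Mary describes and ignores everything else:

import datetime
import xml.etree.ElementTree as ET

def dates_out_of_range(doc_path, earliest, latest, date_tag="date"):
    # Ignore everything in the document except elements named `date_tag`
    # (an assumed convention) and report any value outside [earliest, latest].
    bad = []
    for elem in ET.parse(doc_path).iter(date_tag):
        value = datetime.date.fromisoformat(elem.text.strip())
        if not (earliest <= value <= latest):
            bad.append(value)
    return bad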

From Michael Kay:

One example I am thinking of is where a document is gradually built up in the course of a workflow. At each stage in the workflow the validation constraints are different. You can think of each schema as a filter that allows the document to proceed to the next stage of processing.
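Michael's "schema as filter" idea might be sketched like this (Python with lxml; the stage schemas and the stand-in processing steps are hypothetical):

from lxml import etree

def run_workflow(doc, stages):
    # `stages` is a list of (schema, process) pairs.  At each stage the schema
    # acts as a filter: only a document that validates is handed to that
    # stage's processing step and allowed to proceed.
    for schema, process in stages:
        if not schema.validate(doc):
            raise ValueError("rejected by stage schema: %s" % schema.error_log)
        doc = process(doc)
    return doc

# Hypothetical wiring: a different schema for each point in the workflow;
# the lambdas stand in for real processing steps.
STAGES = [
    (etree.XMLSchema(etree.parse("order-draft.xsd")),    lambda d: d),
    (etree.XMLSchema(etree.parse("order-priced.xsd")),   lambda d: d),
    (etree.XMLSchema(etree.parse("order-complete.xsd")), lambda d: d),
]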

Finally, Len made a good statement:

Sometimes, a single schema suffices for the whole system.  Sometimes, you need lots of little ones.

2. Fallacy of Schema Locality

Len identified this fallacy:

... most fundamental errors are to consider schemas only at the external system junctions ...

The point is this: building a system with local customs hardcoded into it and then deploying it into a global environment is a serious mistake.  Consider Michael Kay's example of interacting with an online U.S. service that insisted on users providing a state code.  Clearly, the service was built with local customs hardcoded and then deployed in a global environment.
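A sketch of the alternative, with the locale rule made conditional rather than hardcoded (plain Python; the country list and field names are illustrative only):

# Countries whose addresses carry a state/province code (illustrative, not
# exhaustive).  The point is that the rule is conditional, not hardcoded.
REGION_REQUIRED = {"US", "CA", "AU"}

def check_address(address):
    # `address` is a dict such as {"country": "GB", "city": "London", ...}.
    # A state code is demanded only where the locale actually uses one.
    problems = []
    if address.get("country") in REGION_REQUIRED and not address.get("region"):
        problems.append("state/province code is required for this country")
    return problems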

Here's a comment that Len made on this fallacy:

The problem of locale is that it is declared locally but might require global management.

3. Fallacy of Requisite Validation

Michael Kay made a very compelling statement with regard to whether validation should be done at all in certain situations. Michael was responding to the example of an online service validating a user's address. Here's what Michael said about the online service's insistence on validating the user's address:

The strategy (validating the user's address) assumes that you know better than your customers what constitutes a valid address. Let's face it, you don't, and you never will. A much better strategy is to let them (the user) express their address in their own terms. After all, that's what they do in old-fashioned paper correspondence, and it seems to work quite well.

Michael argues very effectively that in this situation it makes no sense to do any validation at all!

Jonathan Robie rebutted Michael's argument, saying that validation is necessary for machine processing:

In old-fashioned paper correspondence, addresses are interpreted by human beings, and this is a perfectly fine strategy in an application that formats addresses so that they can be read by human beings. But if I have a program that needs to be able to identify customers in a given region, or that needs to be able to compute the shipping costs before sending an item, then my program needs to know how to read the address. I'm not asking the customer to provide an address in a format that they might recognize, I'm asking the customer to provide an address in a format that my program can use. In that context, even if the customer finds it a little painful, I'm going to make them communicate at least the basic information.

For addresses, many applications have a certain middle ground. They insist on knowing the country and postal code, and perhaps street name and number, but allow other information to be added in a way that the program might not recognize. One more useful application of partial understanding.
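That middle ground amounts to a "required core plus open extensions" check, roughly like this sketch (the field names are illustrative):

REQUIRED_FIELDS = {"country", "postal_code"}

def accept_address(fields):
    # Insist only on the fields the program can actually use; everything else
    # is kept rather than rejected: partial understanding in practice.
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError("missing required fields: %s" % sorted(missing))
    return fields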

Then Frank Manola rebutted Jonathan, emphasizing that oftentimes constraints are unknowable:

Part of the problem, though, is when the people defining the constraints think they know the requirements of actually performing the activity the program is supposed to help implement, but really don't; or in the example you cite, think the address constraints they define will actually help deliver the goods, but they actually get in the way. I experience this problem quite frequently. My street "number" (some other street "numbers" in our neighborhood do too) has a letter in it: 50A Butters Row (don't ask me why: I'm not responsible for how addresses are assigned here). Sometimes a program accepting addresses won't allow me to enter the letter (or the letter magically becomes an apartment number, which it isn't; this is a single-family house), because the writers of the constraints think they know how addresses are supposed to look. Not having a street number that matches the actual address of the house doesn't help delivery very much.

4. Fallacy of Validation as a Pass/Fail Operation

Mary Holstege identified this fallacy.  Here's what she said:

[Many people think that validation is a pass/fail operation.] Not so, although lots of people are still stuck in that way of thinking, including, alas, a lot of the vendors. The schema design goes to great pains to make it possible to do things like this, for example: validate a document against a tight schema, and then ask questions of the result such as "show me all the item counts that failed validation because they were too high".

Rick Jelliffe notes that XML Schema validators are limited with regard to providing useful information on where an error occurred.  In fact, he argues that oftentimes an XML Schema validator will provide wrong information about the location of an error.  He notes that this problem is not specific to XML Schema validators but applies to all grammar-based validators (e.g., XML Schema and RELAX NG).  Rick notes that with Schematron (which is not grammar-based) you can associate specific error messages with each assertion.  Here's an example that Rick provided:

<!-- the rule lives inside a <sch:pattern>, and the diagnostic inside a <sch:diagnostics>, within the enclosing <sch:schema> -->
<sch:rule context="beatles/member">
  <sch:assert test="count(../member)=4" flag="tooManyBeatles" diagnostics="tmb">
     The beatles should have four members
  </sch:assert>
</sch:rule>
...
<sch:diagnostic id="tmb">
  Check that <sch:value-of select="."/> is a correct Beatle
</sch:diagnostic>

If the number of beatles/member elements is not equal to 4 then a specific, user-defined error message is spawned.  This is very nice!

The above example showed generating a user-defined message when an error occurs in the data.  Rick also notes that Schematron can generate messages when the data is found to be correct, not just when it is in error.
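(Schematron's sch:report element is the mirror image of sch:assert: its message is emitted when its test is true, i.e. when the data looks right.)  As a rough sketch of driving such rules from code, assuming the rule above has been wrapped into a complete schema and saved as beatles.sch, one might use Python with lxml:

from lxml import etree
from lxml.isoschematron import Schematron

# Assumes the rule above has been placed inside <sch:schema>/<sch:pattern>,
# with the diagnostic inside <sch:diagnostics>, and saved as beatles.sch.
schematron = Schematron(etree.parse("beatles.sch"))
band = etree.parse("beatles.xml")        # hypothetical instance document
print(schematron.validate(band))         # False when the assert fails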

The ability to specify user-defined messages is a very important feature, as will be seen when fallacy #7 is examined.

5. Fallacy of a Universal Validation Language

Dave Pawson identified this fallacy.  He noted that documents conforming to the Atom specification cannot be validated using a single technology:

From [Atom, version] 0.3 onwards it's not been possible to validate an instance against a single schema, not even Relax NG. They need a mix of Schema and 'other' processing before being given a clean bill of health.
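Such a mixed pipeline might be sketched as follows (Python with lxml; the file name, the namespace placeholder, and the particular extra check are illustrative, not the actual Atom rules):

from lxml import etree

ATOM_NS = {"atom": "http://example.org/atom-namespace"}  # substitute the namespace of the Atom version in question
relaxng = etree.RelaxNG(etree.parse("atom.rng"))          # hypothetical grammar file

def extra_checks(feed):
    # Constraints a grammar alone cannot express comfortably, e.g. a
    # co-occurrence rule.  This particular check is illustrative only.
    problems = []
    for entry in feed.findall("atom:entry", ATOM_NS):
        if (entry.find("atom:content", ATOM_NS) is None and
                entry.find("atom:summary", ATOM_NS) is None):
            problems.append("entry has neither content nor summary")
    return problems

def clean_bill_of_health(feed):
    # Grammar check plus "other" processing; only together do they pass the feed.
    grammar_ok = relaxng.validate(feed)
    problems = [str(e) for e in relaxng.error_log] + extra_checks(feed)
    return grammar_ok and not problems, problems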

6. Fallacy of Closed System Validation

This fallacy was identified by Len a long time ago.  He was discussing closed versus open systems when he stated, "Systems leak.  There's no such thing as a closed system".  This is an important comment.  Many people imagine that they can create a monolithic, invariant schema because "there's just me and my well-known trading partners".  This statement fails to recognize the existence of a changing world; more precisely, a changing ecosystem.

7. Fallacy that Validation is Exclusively for Constraint Checking

I suspect that many people have the same mentality that I had regarding validation: "An XML instance document arrives; I forward it to a validator tool; if the validator doesn't complain, I forward the instance to some software to process it; if there's an error, I discard the instance."  Len has enlightened me to the greater role that validation can play in a system.  This is discussed below:

Launching Point for Messages

While validating an instance document it is reasonable to generate messages - "error messages" when errors are encountered, and even "success messages" when instance data is found to be conformant.  Thus, validation can spawn messages that are sent around the system and activate other parts of it.  Where do the messages go?  What parts of the system receive them?  One approach is subscription - a part of the system receives a message only if it has subscribed to that type of message.  Here are some snippets of a message from Len on this use of validation:

... if the expectation of the contract is violated, a flag goes up and is sent to whoever has subscribed to that event type.

And this snippet:

Aka, event-driven intelligence: a flag is raised given recognition of a pattern/error/trend and the system sends a subscriber(s) a message.
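The subscription approach Len describes might be sketched like this (plain Python; the event kinds and the subscribing components are hypothetical):

from collections import defaultdict

_subscribers = defaultdict(list)

def subscribe(kind, handler):
    # A part of the system registers for the kinds of messages it cares about.
    _subscribers[kind].append(handler)

def publish(kind, message):
    # The validator just publishes; it does not know who is listening.
    for handler in _subscribers[kind]:
        handler(message)

# Hypothetical wiring: a logger listens for errors, a metrics component
# listens for successes.
subscribe("error",   lambda msg: print("logger:", msg))
subscribe("success", lambda msg: print("metrics:", msg))

def validate_address(address):
    # Toy validation that spawns both kinds of message.
    if address.get("country") == "US" and not address.get("state"):
        publish("error", "missing state code on a U.S. address")
    else:
        publish("success", "address passed the basic checks")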

Feedback-Mediated Evolution

Suppose that during validation an exception is raised.  As discussed above, this may result in spawning a message to some part of the system.  (For example, a user enters an invalid value for a U.S. state code, which results in a message being sent to a logger routine.)  Len notes, "an exception is not an error, it's a learning (feedback) mechanism".  The recipient of the message can take advantage of the information that the message provides.  (For example, the recipient of the invalid state code message may realize that the system should not be forcing non-U.S. users to enter a state code; the recipient then changes the system.)  Thus, validation messages become valuable feedback, which may be used to facilitate evolution of the system.

Darwinian Selection Process

In Darwinian evolution less fit species are filtered out.  Only the fittest species survive.  You may view validation as a process in which less fit (erroneous) instances are filtered out, and only the fittest (conforming) instances survive.  Here's a snippet from Len on this:

... the model of selection based on fitness or other criteria can be used to direct the evolution of the system based on feedback in the form of messages.
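In code, the selection step is simply a filter whose rejects become feedback rather than silent discards (a sketch; the validator functions are assumed to return lists of problems, and publish could be the publish function from the subscription sketch above):

def select_fit(instances, validators, publish):
    # Keep only the instances that survive every fitness test; publish the
    # problems found in the others as feedback for evolving the system.
    survivors = []
    for doc in instances:
        problems = [p for check in validators for p in check(doc)]
        if problems:
            publish("error", problems)
        else:
            survivors.append(doc)
    return survivors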





 
