xml-dev - Fallacies of Validation, version #3

To: <xml-dev@lists.xml.org>
Subject: Fallacies of Validation, version #3
From: "Roger L. Costello" <costello@mitre.org>
Date: Fri, 27 Aug 2004 11:25:06 -0400
Thread-index: AcSMSgg9InsSfZLER364UjPLMoJC/g==
Hi Folks,

I have added one new fallacy: 

   Fallacy that Validation is Exclusively for Constraint Checking.  

For me, this fallacy has been, by far, the most mind-opening of all the
fallacies.  I hope that you will find the writeup useful, and look forward
to your comments.

Also, I have incorporated into the writeup of fallacy 4 the comments that
Rick Jelliffe raised yesterday.  After reading the writeup for fallacy 7 you
will understand the importance of Rick's comments.  They have deep
ramifications for which validation language you should use.  (Rick, please
see my question for you in the writeup of fallacy 7.)

Preamble

The purpose of documenting the "fallacies" below is to identify common but
erroneous ways of thinking about validation and its role in a system
architecture.  Perhaps "assumptions" would be a better term than
"fallacies".  In any case, the aim of this writeup (which is a compilation
of discussions on the xml-dev list) is to provoke new ways of thinking about
validation, and to reject limiting and static views of validation.
 
Fallacies of Validation

1. Fallacy of "THE Schema"

2. Fallacy of Schema Locality

3. Fallacy of Requisite Validation

4. Fallacy of Validation as a Pass/Fail Operation

5. Fallacy of a Universal Validation Language

6. Fallacy of Closed System Validation

7. Fallacy that Validation is Exclusively for Constraint Checking

Let's examine each of these fallacies.

1. Fallacy of "THE Schema"

This fallacy was identified by Michael Kay:

> ... there's no harm in using XML Schema to check data
> against the business rules, so long as you realize this
> is *an* XML Schema, not *the* XML Schema. We need to stop
> thinking that there can only be one schema.

Len Bullard made a similar statement:

> ... most fundamental errors are ... to consider only a single schema.

and at another point Len states:

> ... fall into the trap of thinking of THE schema and not
> recognizing the system as a declarative ecosystem of schemas
> and schema components.

Both Michael and Len are stating that in a system there should be numerous
schemas. This is a big mindshift for me. I admit being trapped into thinking
that there should be a single schema.

Len responded to my query to define "declarative ecosystem".  I think that
this term is a very important term and underlies much of what is presented
here. Here's what "declarative ecosystem" means:

Every system lives within a world where there is a lot of variety, i.e.,
systems aren't islands.  For example, the Wal-Mart system must coexist with
its supplier systems, its distributor systems, and its retailer systems.
One can think of this system-of-systems as an "ecosystem".  Thus, the
Wal-Mart system resides in an ecosystem.  Each system within the ecosystem
has its own local requirements, which are documented by its own
(declarative-based) schemas.  Thus, not only are there a bunch of systems
which must coexist, there are a bunch of schemas that must coexist.  This
ecosystem of schemas is a "declarative ecosystem".  [Len, have I accurately
defined the term?]

Oh, one more comment on declarative ecosystems.  Len made this remark which
I think is important:

> ... [if two systems are interoperating in a
> closed environment then] it doesn't matter how
> singular or multiple they [the schemas] are;
> but when they are in an ecosystem, they typically
> overlap and exchange information, and adapt as a
> result.

Okay, now back to the fallacy of "THE schema" ...

Many examples were provided to demonstrate the value of multiple
validations:

Len provided an example of a distributed reporting system:

> Look at any large reporting system.  You can build
> that up a large schema but given local variations,
> do you have sufficient power/force/authority to
> make them stick or will you be constantly adjusting
> them, loosening them, strengthening them, and how
> will you know which is the right thing to do?

I would like to elaborate further on this.  Suppose that a company has
offices in London, Hong Kong, and Sydney.  They all report to the main office
in New York.  With such a geographically dispersed collection of offices, it
is easy to imagine that there will be local variations.  There will probably
be some data that is common to all the offices (Rick Jelliffe calls the
constraints on this type of data invariant constraints).  Then there will be
locale-specific data (variant constraints).  So, it doesn't seem reasonable
to assume that a single reporting schema would suffice for this
geographically-dispersed organization.  [Len, have I captured your example
accurately?]

Mary Holstege and Michael Kay gave examples of the value of multiple schemas
in a workflow environment:

From Mary Holstege:

> ... suppose all you care about in some phase of
> processing is picking up the IDs in a document.
> Then you define a minimal schema where everything
> is open with the appropriate ID attributes. Maybe
> you're going to generate an index. In another
> phase of processing all you care about is checking
> that dates are in the right date range. So you have
> another minimal schema that only pays attention to dates.

From Michael Kay:

> One example I am thinking of is where a document is
> gradually built up in the course of a workflow. At
> each stage in the workflow the validation constraints
> are different. You can think of each schema as a filter
> that allows the document to proceed to the next stage of
> processing.
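
To make Mary's and Michael's examples concrete, here is a minimal sketch of
"each schema as a filter".  It is my own illustration, not code from either of
them: the sample document, the shipDate attribute, the date range, and the
function names are all assumptions, and plain Python checks stand in for real
minimal schemas.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical sample document; element and attribute names are invented.
DOC = """
<order id="o1">
  <item id="i1" shipDate="2004-08-27"/>
  <item id="i2" shipDate="2004-09-03"/>
</order>
"""

def check_ids(root):
    """Stage 1: only cares that every id attribute value is unique."""
    ids = [el.get("id") for el in root.iter() if el.get("id") is not None]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    return ["duplicate id: " + i for i in duplicates]

def check_dates(root, earliest, latest):
    """Stage 2: only cares that shipDate values fall within a date range."""
    problems = []
    for el in root.iter():
        raw = el.get("shipDate")
        if raw is None:
            continue
        try:
            when = date.fromisoformat(raw)
        except ValueError:
            problems.append("unparseable date: " + raw)
            continue
        if not (earliest <= when <= latest):
            problems.append("date out of range: " + raw)
    return problems

def run_stages(xml_text, stages):
    """Run each stage in order; stop at the first stage that reports problems."""
    root = ET.fromstring(xml_text)
    for name, stage in stages:
        problems = stage(root)
        if problems:
            return name, problems
    return None, []

STAGES = [
    ("ids", check_ids),
    ("dates", lambda r: check_dates(r, date(2004, 8, 1), date(2004, 8, 31))),
]

failed_stage, problems = run_stages(DOC, STAGES)
print(failed_stage, problems)
```

Each stage ignores everything it does not care about, so the ID stage and the
date stage can evolve independently, and a document only moves on to the next
stage once the current one reports no problems.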

Finally, Len made a good statement:

> Sometimes, a single schema suffices for the whole
> system.  Sometimes, you needs lots of little ones.

2. Fallacy of Schema Locality

Len identified this fallacy:

> ... most fundamental errors are to consider schemas only at the external
> system junctions ...

What is being said is this: if you build a system with local customs
hardcoded into it, but then deploy it into a global environment, that is a
serious mistake.  An example of this is Michael Kay's example of interacting
with an online U.S. service that insisted on users providing a state code.
Clearly, the online service was built with local customs hardcoded, but then
deployed in a global environment.

Here's a comment that Len made on this fallacy:

> The problem of locale is that it is declared
> locally but might require global management.

3. Fallacy of Requisite Validation

Michael Kay made a very compelling statement with regard to whether
validation should be done at all in certain situations.  Michael was
responding to the example of an online service validating a user's address.
Here's what Michael said about the online service's insistence on validating
the user's address:

> The strategy (validating the user's address) assumes that
> you know better than your customers what constitutes a
> valid address. Let's face it, you don't, and you never
> will. A much better strategy is to let them (the user) express
> their address in their own terms. After all, that's what they
> do in old-fashioned paper correspondence, and it seems
> to work quite well.

Michael argues very effectively that in this situation it makes no sense to
do any validation at all!

4. Fallacy of Validation as a Pass/Fail Operation

Mary Holstege identified this fallacy.  Here's what she said:

> [Many people think that validation is a pass/fail operation.]
> Not so, although lots of people are still stuck in that way
> of thinking, including, alas, a lot of the vendors.
> The schema design goes to great pains to make it possible to
> do things like this, for example: validate a document against
> a tight schema, and then ask questions of the result such as
> "show me all the item counts that failed validation because they
> were too high"
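
Here is a minimal sketch of what "asking questions of the result" could look
like.  This is not the XML Schema post-validation infoset API; it is a
hand-rolled stand-in, and the Finding record, the MAX_COUNT limit, and the
order/item/count document shape are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

MAX_COUNT = 100  # hypothetical business limit on an item count

@dataclass
class Finding:
    path: str   # where in the document the value was found
    rule: str   # which outcome applied: in-range, too-high, not-a-number
    value: str  # the raw text that was checked
    ok: bool    # whether the value passed

def validate_counts(xml_text):
    """Return one Finding per item count instead of a single pass/fail flag."""
    root = ET.fromstring(xml_text)
    findings = []
    for index, item in enumerate(root.findall("item"), start=1):
        raw = item.findtext("count", default="")
        path = "/order/item[%d]/count" % index
        if not raw.strip().isdecimal():
            findings.append(Finding(path, "not-a-number", raw, False))
        elif int(raw) > MAX_COUNT:
            findings.append(Finding(path, "too-high", raw, False))
        else:
            findings.append(Finding(path, "in-range", raw, True))
    return findings

DOC = (
    "<order>"
    "<item><count>5</count></item>"
    "<item><count>250</count></item>"
    "<item><count>abc</count></item>"
    "</order>"
)

report = validate_counts(DOC)
# "Show me all the item counts that failed validation because they were too high"
too_high = [f.path for f in report if f.rule == "too-high"]
print(too_high)
```

Because the validator returns a report rather than a boolean, the caller
decides which questions to ask of it, and the same report can answer several
different questions.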

Rick Jelliffe notes that XML Schema validators are limited with regard to
providing useful information on where an error occurred.  In fact, he argues
that oftentimes an XML Schema validator will provide wrong information about
the location of an error.  He notes that this problem is not specific to XML
Schema validators, but applies to all grammar-based validators (e.g., XML
Schema and RELAX NG validators).  Rick notes that with Schematron (which is
not grammar-based) you can associate specific error messages with each
assertion.  Here's an example that Rick provided:

<sch:rule context="beatles/member">
  <sch:assert test="count(../member)=4" flag="tooManyBeatles"
     diagnostics="tmb">
     The beatles should have four members
  </sch:assert>
</sch:rule>
..
<sch:diagnostic id="tmb">
  Check that <sch:value-of select="."/> is a correct Beatle
</sch:diagnostic>

If the number of beatles/member elements is not equal to 4 then a specific,
user-defined error message is spawned.  This is very nice!

NOTE: This issue that Rick is raising has a big impact on fallacy #7 (useful
messages are of supreme importance).

5. Fallacy of a Universal Validation Language

Dave Pawson identified this fallacy.  He noted that the Atom specification
cannot be validated using a single technology:

> From [Atom, version] 0.3 onwards it's not been possible
> to validate an instance against a single schema, not
> even Relax NG. They need a mix of Schema and 'other'
> processing before being given a clean bill of health.

6. Fallacy of Closed System Validation

This fallacy was identified by Len a long time ago.  I still remember
something he said one day when discussing closed versus open systems,
"Systems leak.  There's no such thing as a closed system".  This is an
important comment.  Many people imagine that they can create a monolithic,
invariant schema because "there's just me and my well-known trading
partners".  This statement fails to recognize the existence of a changing
world; more precisely, a changing ecosystem.

7. Fallacy that Validation is Exclusively for Constraint Checking

I suspect that many people have the same mentality that I have regarding
validation: "An XML instance document arrives, I forward it to a validator
tool, if the validator tool doesn't complain then forward the instance to
some software to process it.  If there's an error then discard the
instance."  Len has enlightened me to the greater role that validation can
play in a system, which I discuss below:

Launching Point for Messages

While validating an instance document it is reasonable to generate messages
- "error messages" when errors are encountered, and even "success messages"
when instance data is found to be conformant.  Thus, validation can result
in spawning messages that are sent around the system, which activate other
parts of the system.  Where do the messages go?  What parts of the system
receive the message? One possibility is subscription - a part of the system
will receive a message only if it has subscribed to receive that type of
message.  Here are some snippets of a message from Len on this use of
validation:

> ... if the expectation of the contract is 
> violated, a flag goes up and is sent 
> to whoever has subscribed to that event 
> type.

And this snippet:

> Aka, event-driven intelligence: a flag is raised 
> given recognition of a pattern/error/trend and 
> the system sends a subscriber(s) a message.
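
Here is a minimal sketch of validation as a launching point for messages,
using subscription.  The ValidationBus class, the event type names, and the
two-letter state code rule are my own assumptions for illustration; they are
not taken from Len's messages or from any particular product.

```python
from collections import defaultdict

class ValidationBus:
    """Tiny in-process publish/subscribe hub keyed by event type."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, detail):
        # Only handlers that subscribed to this event type are called.
        for handler in self._subscribers[event_type]:
            handler(event_type, detail)

def validate_state_code(record, bus):
    """Check one hypothetical rule and publish a message for either outcome."""
    code = record.get("state", "")
    if len(code) == 2 and code.isalpha() and code.isupper():
        bus.publish("state.valid", record)
        return True
    bus.publish("state.invalid", record)
    return False

log = []             # stands in for a logger routine
shipping_queue = []  # stands in for a downstream part of the system

bus = ValidationBus()
bus.subscribe("state.invalid", lambda kind, rec: log.append((kind, rec["name"])))
bus.subscribe("state.valid", lambda kind, rec: shipping_queue.append(rec["name"]))

validate_state_code({"name": "Ann", "state": "VA"}, bus)
validate_state_code({"name": "Bob", "state": ""}, bus)
print(log, shipping_queue)
```

Note that the "success message" is treated exactly like the "error message":
both are just event types, and whichever parts of the system subscribed to
them get activated.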

A question for Rick Jelliffe: I know that Schematron can spawn a message
when an assertion fails (i.e., when the data is found to be erroneous).  Can
Schematron spawn a message to inform that a chunk of data is valid?

Feedback-Mediated Evolution

Suppose that during validation an exception is raised.  As discussed above,
this may result in spawning a message to some part of the system.  (For
example, a user enters an invalid value for a U.S. state code, which results
in sending a message to a logger routine.)  Len notes, "an exception is not
an error, it's a learning (feedback) mechanism".  The recipient of the
message can take advantage of the information that the message provides.
(For example, the recipient of the invalid state code message may realize
that the system should not be forcing non-U.S. users to enter a state code;
the recipient then changes the system.)  Thus, validation messages become
valuable feedback, which may be used to facilitate evolution of the system.
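
One way a recipient might turn such messages into feedback is to tally them
and flag any rule that fails suspiciously often.  This is only a sketch of
that idea under my own assumptions: the message shape, the rule names, and
the 50% threshold are all hypothetical.

```python
from collections import Counter

def rules_to_review(messages, threshold):
    """Return the rules whose share of failures meets the threshold."""
    attempts = Counter(m["rule"] for m in messages)
    failures = Counter(m["rule"] for m in messages if not m["ok"])
    return sorted(
        rule for rule, total in attempts.items()
        if failures[rule] / total >= threshold
    )

# Hypothetical validation messages collected by a logger routine.
messages = [
    {"rule": "us-state-code", "ok": False},
    {"rule": "us-state-code", "ok": False},
    {"rule": "us-state-code", "ok": True},
    {"rule": "postal-code", "ok": True},
    {"rule": "postal-code", "ok": True},
]

print(rules_to_review(messages, threshold=0.5))
```

A rule that most users fail is more likely to be a wrong rule than a sign of
wrong users, which is exactly the state code situation described above.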


Darwinian Selection Process

In Darwinian evolution less fit species are filtered out.  Only the fittest
species survive.  You may view validation as a process in which less fit
(erroneous) instances are filtered out, and only the fittest (conforming)
instances survive.  Here's a snippet from Len on this:

> ... the model of selection based on fitness 
> or other criteria can be used to direct 
> the evolution of the system based on feedback 
> in the form of messages.

...

Finally, my favorite term of the day, and my favorite quote of the day.

Favorite Term: Feedback-mediated evolution

Favorite Quote:

[From Len] "A schema knows how to do one thing: tell you if the message
conforms or doesn't to its expectations.  Where and when you use this in a
system or system of systems is entirely up to you.  XML Doesn't Care. It
isn't and can't be imaginative."

Thanks again everyone!  Please keep the comments coming.

/Roger