1. Fallacy of "THE Schema"
This
fallacy was identified by Michael Kay:
... there's no
harm in using XML Schema to check data against the business rules, so long as
you realize this is *an* XML Schema, not *the* XML Schema. We need to stop
thinking that there can only be one schema.
Len Bullard made a
similar statement:
... most fundamental errors are
... to consider only a single schema.
and at another point Len
states:
... fall into the trap of thinking of
THE schema and not recognizing the system as a declarative ecosystem of schemas
and schema components.
Both Michael and Len are stating that in a
system there should be numerous schemas.
Sidebar
Len was asked to define "declarative
ecosystem". This is a very important term and underlies much of what is
presented
here. Here's what "declarative ecosystem" means:
Every
system lives within a world where there is a lot of variety, i.e., systems
aren't islands. For example, the Wal-Mart system must coexist with its
supplier systems, its distributor systems, and its retailer systems. One can
think of this system-of-systems as an "ecosystem". Thus, the Wal-Mart
system resides in an ecosystem. Each system within the ecosystem has its
own local requirements, which are documented by its own (declarative)
schemas. Thus, not only are there a bunch of systems that must coexist,
there are a bunch of schemas that must coexist. This ecosystem of schemas
is a "declarative ecosystem".
One more comment on declarative
ecosystems. Len made this remark which is important:
... [if two
systems are interoperating in a closed environment then] it doesn't matter how
singular or multiple they [the schemas] are; but when they are in an ecosystem,
they typically overlap and exchange information, and adapt as a
result.
Now back to the fallacy of "THE schema"
...
Many examples were provided to demonstrate the value of multiple
validations:
Len provided an example of a distributed reporting
system:
Look at any large reporting system. You
can build that up as a large schema, but given local variations, do you have
sufficient power/force/authority to make them stick, or will you be constantly
adjusting them, loosening them, strengthening them? And how will you know which
is the right thing to do?
Here is further elaboration on
this. Suppose that a company has offices in London, Hong Kong, and
Sydney. They all report to the main office in New York. With such a
geographically dispersed collection of offices, it is easy to imagine that there
will be local variations. There will probably be some data that is common
to all the offices (Rick Jelliffe calls the constraints on this type of data
invariant constraints). Then there will be locale-specific data (variant
constraints). So, it doesn't seem reasonable to assume that a single
reporting schema would suffice for this geographically-dispersed
organization.
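The invariant/variant split can be sketched in code. The following Python sketch is purely illustrative (the field names, offices, and currency rules are assumptions, not from the discussion): one rule set shared by every office, plus a per-locale rule set, composed into a single validation.

```python
# Hypothetical sketch: invariant constraints shared by all offices,
# plus variant (locale-specific) constraints per office.

def invariant_errors(report):
    """Constraints common to every office (Jelliffe's invariant constraints)."""
    errors = []
    if "office" not in report:
        errors.append("missing office")
    if report.get("total", -1) < 0:
        errors.append("total must be non-negative")
    return errors

# Variant constraints: one rule per office (illustrative currencies).
VARIANT_RULES = {
    "London":    lambda r: [] if r.get("currency") == "GBP" else ["London reports in GBP"],
    "Hong Kong": lambda r: [] if r.get("currency") == "HKD" else ["Hong Kong reports in HKD"],
    "Sydney":    lambda r: [] if r.get("currency") == "AUD" else ["Sydney reports in AUD"],
}

def validate(report):
    """One logical validation composed of invariant plus variant schemas."""
    errors = invariant_errors(report)
    variant = VARIANT_RULES.get(report.get("office"))
    if variant:
        errors += variant(report)
    return errors
```

The point of the composition: no single monolithic schema is written down anywhere; each office owns its variant rules, and the main office owns only the invariants.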
Mary Holstege and Michael Kay gave examples of the value of multiple schemas
in a workflow environment:
From Mary Holstege:
... suppose all you care about in some phase of processing is
picking up the IDs in a document. Then you define a minimal schema where
everything is open with the appropriate ID attributes. Maybe you're going to
generate an index. In another
phase of processing all you care about is
checking that dates are in the right date range. So you have another minimal
schema that only pays attention to dates.
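Mary's two minimal "schemas" can be approximated procedurally. Here is a rough Python sketch using only the standard library (the sample document and attribute names are illustrative): one pass that cares only about id attributes, and another that cares only about dates.

```python
import xml.etree.ElementTree as ET
from datetime import date

DOC = """<report>
  <item id="a1" date="2024-03-01"/>
  <section><item id="a2" date="2024-07-15"/></section>
</report>"""

def collect_ids(xml_text):
    """'Schema' #1: everything is open; only id attributes matter."""
    return [e.get("id") for e in ET.fromstring(xml_text).iter() if e.get("id")]

def dates_in_range(xml_text, lo, hi):
    """'Schema' #2: ignore everything except date attributes."""
    for e in ET.fromstring(xml_text).iter():
        d = e.get("date")
        if d is not None and not (lo <= date.fromisoformat(d) <= hi):
            return False
    return True
```

Each pass tolerates any structure it doesn't care about, which is exactly what "everything is open" means.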
From Michael
Kay:
One example I am thinking of is where a document
is gradually built up in the course of a workflow. At each stage in the workflow
the validation constraints are different. You can think of each schema as a
filter that allows the document to proceed to the next stage of
processing.
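Michael's schema-per-stage idea can be sketched as a pipeline of filters. The stage names and fields below are illustrative assumptions, not from the discussion:

```python
# Each workflow stage has its own validator; a stage's validator acts as
# a filter deciding whether the document may proceed to the next stage.

def drafted(doc):   return "title" in doc
def reviewed(doc):  return drafted(doc) and "reviewer" in doc
def approved(doc):  return reviewed(doc) and doc.get("approved") is True

PIPELINE = [("drafted", drafted), ("reviewed", reviewed), ("approved", approved)]

def furthest_stage(doc):
    """Return the last stage whose constraints the document satisfies."""
    reached = None
    for name, check in PIPELINE:
        if not check(doc):
            break
        reached = name
    return reached
```

A document that is valid at one stage may be invalid at the next; that is a property of the workflow, not a defect in the document.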
Finally, Len made a good statement:
Sometimes, a single schema suffices for the whole system.
Sometimes, you need lots of little ones.
2.
Fallacy of Schema Locality
Len identified this
fallacy:
... most fundamental errors are to consider
schemas only at the external system junctions ...
What is being
said is this: if you build a system with local customs hardcoded into it, but
then deploy it into a global environment ... that's a really bad mistake. An
example of this is Michael Kay's example of interacting with an online U.S.
service that insisted on users providing a state code. Clearly, the online
service was built with local customs hardcoded, but then deployed in a global
environment.
Here's a comment that Len made on this fallacy:
The problem of locale is that it is declared locally but might
require global management.
3. Fallacy of
Requisite Validation
Michael Kay made a very compelling
statement with regard to whether validation should be done at all in certain
situations. Michael was responding to the example of an online service
validating a user's address. Here's what Michael said about the online service's
insistence on validating the user's address:
The
strategy (validating the user's address) assumes that you know better than your
customers what constitutes a valid address. Let's face it, you don't, and you
never will. A much better strategy is to let them (the user) express their
address in their own terms. After all, that's what they do in old-fashioned
paper correspondence, and it seems to work quite well.
Michael
argues very effectively that in this situation it makes no sense to do any
validation at all!
Jonathan Robie rebutted Michael's argument, saying that validation is
necessary for machine processing:
In old-fashioned paper correspondence, addresses
are interpreted by human beings, and this is a perfectly fine strategy in an
application that formats addresses so that they can be read by human beings. But
if I have a program that needs to be able to identify customers in a given
region, or that needs to be able to compute the shipping costs before sending an
item, then my program needs to know how to read the address. I'm not asking the
customer to provide an address in a format that they might recognize, I'm asking
the customer to provide an address in a format that my program can use. In that
context, even if the customer finds it a little painful, I'm going to make them
communicate at least the basic information.
For addresses, many applications have a certain
middle ground. They insist on knowing the country and postal code, and perhaps
street name and number, but allow other information to be added in a way that
the program might not recognize. One more useful application of partial
understanding.
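Jonathan's "middle ground" can be sketched in a few lines. The field names below are illustrative assumptions: the validator insists only on the fields the program actually needs and passes everything else through uninterpreted.

```python
# Partial understanding: require the fields the program needs, allow
# arbitrary extra fields rather than rejecting them.

REQUIRED = ("country", "postal_code")

def validate_address(addr):
    """Return (ok, errors); unknown extra fields are allowed, not rejected."""
    errors = ["missing " + f for f in REQUIRED if not addr.get(f)]
    return (not errors, errors)
```

A free-form street line such as "50A Butters Row" sails through untouched, which is the point of Frank's rebuttal below: the program should not pretend to know how street numbers look.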
Then Frank Manola rebutted Jonathan, emphasizing that
oftentimes constraints are unknowable:
Part of the problem, though, is when the people
defining the constraints think they know the requirements of actually performing
the activity the program is supposed to help implement, but really don't; or in
the example you cite, think the address constraints they define will actually
help deliver the goods, but they actually get in the way. I experience this
problem quite frequently. My street "number" (some other street "numbers" in our
neighborhood do too) has a letter in it: 50A Butters Row (don't ask me why: I'm
not responsible for how addresses are assigned here). Sometimes a program
accepting addresses won't allow me to enter the letter (or the letter magically
becomes an apartment number, which it isn't; this is a single-family house),
because the writers of the constraints think they know how addresses are
supposed to look. Not having a street number that matches the actual address of
the house doesn't help delivery very much.
4. Fallacy of Validation as a Pass/Fail
Operation
Mary Holstege identified this fallacy.
Here's what she said:
[Many people think that
validation is a pass/fail operation.] Not so, although lots of people are still
stuck in that way of thinking, including, alas, a lot of the vendors. The schema
design goes to great pains to make it possible to do things like this, for
example: validate a document against a tight schema, and then ask questions of
the result such as "show me all the item counts that failed validation because
they were too high"
Rick Jelliffe notes that XML Schema
validators are limited in the useful information they can provide about where an
error occurred. In fact, he argues that an XML Schema validator will often
provide wrong information about the location of an error. He notes
that this problem is not specific to XML Schema validators; it applies to all
grammar-based validators (e.g., XML Schema and RELAX NG). Rick notes that
with Schematron (which is not grammar-based) you can associate specific error
messages with each assertion. Here's an example that Rick
provided:
<sch:rule context="beatles/member">
    <sch:assert test="count(../member)=4"
                flag="tooManyBeatles" diagnostics="tmb">
        The beatles should have four members
    </sch:assert>
</sch:rule>
..
<sch:diagnostic id="tmb">
    Check that <sch:value-of select="."/> is a correct
    Beatle
</sch:diagnostic>
If the number of beatles/member
elements is not equal to 4 then a specific, user-defined error message is
spawned. This is very nice!
The above example showed generating a user-defined message when an error
occurs in the data. Rick also notes that Schematron can
generate messages when the data is found to be valid.
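The same assertion-with-message style can be mimicked procedurally. Here is a rough Python sketch using only the standard library (the success-message wording is an assumption): it returns a specific, user-defined message for both the failure and the success case, rather than a bare pass/fail verdict.

```python
import xml.etree.ElementTree as ET

def check_beatles(xml_text):
    """Assertion-style check: return a specific, user-defined message
    instead of a bare pass/fail verdict."""
    root = ET.fromstring(xml_text)
    n = len(root.findall("member"))
    if n != 4:
        return "The beatles should have four members (found %d)" % n
    return "All four members present"  # a 'success message'
```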
The ability to specify user-defined messages is a very important feature, as
will be seen when fallacy #7 is examined.
5. Fallacy
of a Universal Validation Language
Dave Pawson identified
this fallacy. He noted that the Atom specification cannot be validated
using a single technology:
From [Atom, version] 0.3
onwards it's not been possible to validate an instance against a single schema,
not even Relax NG. They need a mix of Schema and 'other' processing before being
given a clean bill of health.
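Dave's point can be sketched as a chain of validators in different formalisms. The checks below are illustrative stand-ins, not the actual Atom constraints: a grammar-style structural check plus a procedural check that no grammar language can express.

```python
# A 'clean bill of health' requires passing several validators, each in
# a different formalism; neither alone suffices.

def grammar_pass(entry):
    """Stand-in for a RELAX NG / XSD structural validation."""
    return "title" in entry and "updated" in entry

def procedural_pass(entry):
    """A co-constraint a grammar cannot express: updated >= published."""
    return entry.get("updated", "") >= entry.get("published", "")

def clean_bill_of_health(entry):
    return all(check(entry) for check in (grammar_pass, procedural_pass))
```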
6. Fallacy of
Closed System Validation
This fallacy was identified by
Len a long time ago. He was discussing closed versus open
systems when he stated, "Systems leak. There's no
such thing as a closed system". This is an important comment.
Many people imagine that they can create a monolithic, invariant schema because
"there's just me and my well-known trading partners". This statement fails
to recognize the existence of a changing world; more precisely, a changing
ecosystem.
7. Fallacy that Validation is Exclusively
for Constraint Checking
I suspect that many people have
the same mentality that I had regarding validation: "An XML instance document
arrives, I forward it to a validator tool, if the validator tool doesn't
complain then I forward the instance to some software to process it. If
there's an error then discard the instance." Len has enlightened me to the
greater role that validation can
play in a system. This is discussed
below:
Launching Point for Messages
While
validating an instance document it is reasonable to generate messages - "error
messages" when errors are encountered, and even "success messages" when instance
data is found to be conformant. Thus, validation can result in spawning
messages that are sent around the system, which activate other parts of the
system. Where do the messages go? What parts of the system receive
the message? One approach is subscription - a part of the system will
receive a message only if it has subscribed to receive that type of
message. Here are some snippets of a message from Len on this use of
validation:
... if the expectation of the contract is
violated, a flag goes up and is sent to whoever has subscribed to that
event type.
And this snippet:
Aka,
event-driven intelligence: a flag is raised given recognition of a
pattern/error/trend and the system sends a subscriber(s) a
message.
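Len's subscription idea can be sketched as a tiny publish/subscribe loop. All names below (the event type, the toy list of valid state codes) are illustrative assumptions: validation raises a flag, and only the parts of the system subscribed to that event type receive the message.

```python
# Subscription-based validation messaging: validation publishes flags;
# subscribers to that event type receive them.

subscribers = {}

def subscribe(event_type, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event_type, message):
    for handler in subscribers.get(event_type, []):
        handler(message)

def validate_state_code(code):
    """Raise a flag instead of failing; subscribers decide what to do."""
    if code not in {"NY", "CA", "TX"}:  # toy list of valid codes
        publish("invalidStateCode", "invalid state code: " + code)

# Usage: a logger subscribes to the flag raised during validation.
log = []
subscribe("invalidStateCode", log.append)
validate_state_code("ZZ")
validate_state_code("NY")
```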
Feedback-Mediated
Evolution
Suppose that during validation an exception is
raised. As discussed above, this may result in spawning a message to some
part of the system. (For example, a user enters an invalid value for a U.S.
state code, which results in sending a message to a logger routine.) Len
notes, "an exception is not an error, it's a learning
(feedback) mechanism". The recipient of the
message can take
advantage of the information that the message provides. (For example, the
recipient of the invalid state code message may realize that the system should
not be forcing non-U.S. users to enter a state code; the recipient then changes
the system.) Thus, validation messages become valuable feedback, which may
be used to facilitate evolution of the system.
Darwinian
Selection Process
In Darwinian evolution less fit species are
filtered out. Only the fittest species survive. You may view
validation as a process in which less fit (erroneous) instances are filtered
out, and only the fittest (conforming) instances survive. Here's a snippet
from Len on this:
... the model of selection based on
fitness or other criteria can be used to direct the evolution of the system
based on feedback in the form of messages.
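Putting the selection and feedback ideas together, here is a minimal Python sketch (the fitness criterion is an illustrative assumption): conforming instances survive, and each rejection spawns a feedback message that can direct evolution of the system.

```python
# Validation as Darwinian selection: fit instances survive, unfit
# instances are filtered out and reported as feedback.

def select(instances, feedback):
    """Return the 'fit' instances; report the rest via feedback."""
    survivors = []
    for inst in instances:
        if inst.get("total", -1) >= 0:  # the fitness criterion
            survivors.append(inst)
        else:
            feedback("filtered out: %r" % (inst,))
    return survivors
```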