Re: [xml-dev] Data versioning strategy: address semantic, relationship, and syntactic changes?

From: Ian Graham <ian.graham@utoronto.ca>
To: "Costello, Roger L." <costello@mitre.org>
Date: Thu, 20 Dec 2007 17:24:07 -0500

In my discussions internally at my organization, I try to couch the versioning discussion as an impact discussion.

That is, I ask the question: "If we change something to a new version, what impact will this change have?"

Impact can be things like code rework, degree of regression testing, etc. The goal, of course, is to tune the change to minimize the impact. That proves to be hard unless you control or know how consumers use your XML data, schema, SOAP message, or whatever.

So I go with Greg's comments, because when you pop up a level of abstraction, this also applies to business processes.

Truth be told, no matter how idiot- or future-proof you make your design, there is always a suitably qualified idiot who can do something you didn't expect.

Ian

Costello, Roger L. wrote:

 
Hi Folks,

Thanks for your excellent insights into the creation of a data
versioning strategy!  

I am still in the process of assimilating all of your ideas.  

The discussion has given me a glimpse into the immensity and complexity
of the "versioning strategy problem."

To help me cope with all the information, I have focused on a few
comments that were made. 

A FEW SELECT COMMENTS

Greg Hunt challenges us to think in terms of managing change as part of
a "business process":

I think that you need to look at some other things, 
semantics, structure and syntax are at too low a 
level because useful version management needs to 
be embedded in a business process or a set of 
business agreements.


Greg notes that a change may not cause syntax problems or semantic
problems, but may nonetheless cause problems:

A semantically non-breaking change for one class of 
consumer might present problems for another.  Consider 
a statistical data flow with a number of elements in it 
that are not summed (e.g. a structure containing a count 
of heart attacks, count of ambulance movements and a 
textual status report).  On the face of it, in semantic 
terms adding another statistical element for morbidity 
should not be a problem if the element can be ignored.  
However, someone out there will eventually try to count 
instances of morbidity statistics.


Bruce Cox challenges us to create a change management strategy that
makes no assumptions about the consumers of the data:

We cannot even dream of placing any constraints on the consumers of
the data.


CLARITY SOUGHT

What does this mean: "The version management needs to be embedded in a
business process"?

What does it mean: "Avoid placing constraints on consumers of the
data"?

Can we view an example of: "A semantically non-breaking change for one
class of consumer might present problems for another"?
 
EXAMPLE

Let's take an example to illustrate the ideas that Greg and Bruce are
raising.

Suppose that the Centers for Disease Control and Prevention (CDC) makes
available data about deaths in the U.S.  Here is sample data:
 
VERSION 1 DATA

<deaths year="2004" source="http://www.cdc.gov/nchs/fastats/lcod.htm">
      <heart-disease>652486</heart-disease>
      <cancer>553888</cancer>
      <stroke>150074</stroke>
      <chronic-lower-respitory-diseases>121987</chronic-lower-respitory-diseases>
      <accidents>112012</accidents>
      <diabetes>73138</diabetes>
      <alzheimers>65965</alzheimers>
      <influenza-and-pneumonia>59664</influenza-and-pneumonia>
      <nephritis-and-nephrotic-syndrome-and-nephrosis>42480</nephritis-and-nephrotic-syndrome-and-nephrosis>
</deaths>

The data conforms to an XML Schema that the CDC created [see the schema
below].  Further, the CDC has documented the meaning of each piece of
data. [The document defines, for example, what is meant by "the number
of deaths due to accidents"]

Consumers of the CDC data happily use it.

Later, the CDC updates the data to also provide information on "the
number of deaths due to septicemia."  Here is a sample of the updated data:

VERSION 2 DATA

<deaths year="2004" source="http://www.cdc.gov/nchs/fastats/lcod.htm">
      <heart-disease>652486</heart-disease>
      <cancer>553888</cancer>
      <stroke>150074</stroke>
      <chronic-lower-respitory-diseases>121987</chronic-lower-respitory-diseases>
      <accidents>112012</accidents>
      <diabetes>73138</diabetes>
      <alzheimers>65965</alzheimers>
      <influenza-and-pneumonia>59664</influenza-and-pneumonia>
      <nephritis-and-nephrotic-syndrome-and-nephrosis>42480</nephritis-and-nephrotic-syndrome-and-nephrosis>
      <septicemia>33373</septicemia>
</deaths>

This data conforms to the CDC's updated XML schema, which now includes
a declaration of the <septicemia> element [see updated schema below].
The document containing the meaning of each piece of data is also
updated to define what is meant by "the number of deaths due to
septicemia."


BREAKAGE?

What will break as a result of the CDC adding the data on septicemia?


VALIDATE NEW DATA AGAINST OLD SCHEMA

Validation of the new data against the old XML Schema will result in
validation errors.  
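
To make the failure concrete, here is a minimal Python sketch. It is not a real XML Schema validator; it only imitates the one rule that matters here, namely that the version 1 <sequence> allows exactly nine named children in a fixed order. The function name check_against_v1 and the helper make_instance are hypothetical, and the element values are placeholder zeros.

```python
import xml.etree.ElementTree as ET

# Child element order required by the version 1 <sequence>.
V1_SEQUENCE = [
    "heart-disease", "cancer", "stroke",
    "chronic-lower-respitory-diseases", "accidents", "diabetes",
    "alzheimers", "influenza-and-pneumonia",
    "nephritis-and-nephrotic-syndrome-and-nephrosis",
]

def check_against_v1(xml_text):
    """Return a list of problems; an empty list means the children match."""
    root = ET.fromstring(xml_text)
    names = [child.tag for child in root]
    problems = []
    # Any child the version 1 content model does not declare is an error.
    for name in names:
        if name not in V1_SEQUENCE:
            problems.append("unexpected element: " + name)
    # The declared children must all be present, in the declared order.
    known = [n for n in names if n in V1_SEQUENCE]
    if known != V1_SEQUENCE:
        problems.append("known elements missing or out of order")
    return problems

def make_instance(names):
    """Build a <deaths> document with placeholder values (assumption: 0)."""
    body = "".join("<%s>0</%s>" % (n, n) for n in names)
    return '<deaths year="2004">' + body + "</deaths>"

v1_doc = make_instance(V1_SEQUENCE)
v2_doc = make_instance(V1_SEQUENCE + ["septicemia"])

print(check_against_v1(v1_doc))  # []
print(check_against_v1(v2_doc))  # ['unexpected element: septicemia']
```

A schema-aware validator would report the same kind of error in its own wording: the <septicemia> element is not permitted by the version 1 content model.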


AVERAGE NEW DATA AGAINST OLD COUNT OF DEATH CAUSES

In the version 1 data there are nine causes of death listed
(heart-disease, cancer, stroke, etc). An application which computes the
average number of deaths per cause by summing all the values and
dividing by nine will produce an incorrect answer with the new data. 
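
The arithmetic can be sketched in Python as follows. The two function names are hypothetical: average_hardcoded stands for the application described above, and average_counted for one that counts the causes it actually finds. The figures are the version 2 sample data.

```python
import xml.etree.ElementTree as ET

V2_DATA = """<deaths year="2004" source="http://www.cdc.gov/nchs/fastats/lcod.htm">
  <heart-disease>652486</heart-disease>
  <cancer>553888</cancer>
  <stroke>150074</stroke>
  <chronic-lower-respitory-diseases>121987</chronic-lower-respitory-diseases>
  <accidents>112012</accidents>
  <diabetes>73138</diabetes>
  <alzheimers>65965</alzheimers>
  <influenza-and-pneumonia>59664</influenza-and-pneumonia>
  <nephritis-and-nephrotic-syndrome-and-nephrosis>42480</nephritis-and-nephrotic-syndrome-and-nephrosis>
  <septicemia>33373</septicemia>
</deaths>"""

def average_hardcoded(xml_text):
    """Sum every child value but divide by the version 1 count of nine."""
    root = ET.fromstring(xml_text)
    total = sum(int(child.text) for child in root)
    return total / 9

def average_counted(xml_text):
    """Sum every child value and divide by the number of children found."""
    root = ET.fromstring(xml_text)
    values = [int(child.text) for child in root]
    return sum(values) / len(values)

print(average_hardcoded(V2_DATA))  # too high: ten values divided by nine
print(average_counted(V2_DATA))    # ten values divided by ten
```

With the version 2 data the ten values sum to 1,865,067, so the hard-coded version reports roughly 207,230 where the correct average is 186,506.7. Nothing fails loudly; the answer is simply wrong.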

UNANTICIPATED PROBLEMS

We cannot anticipate or control what consumers of the data do with the
data or how they write their applications.  The new data could cause
problems that we cannot anticipate.   


LESSONS LEARNED?

1. Greg challenges us to think in terms of managing change as part of a
"business process."  What does this mean for the CDC example?  For
example, should the CDC post "usage rules" for consumers of its data,
such as:

--> Do not validate the data

--> Anticipate new data will be added
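
As an illustration of what the second rule might look like on the consumer side, here is a minimal Python sketch of a reader that looks up only the elements it understands and ignores everything else. The function name read_known_causes and the shortened sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def read_known_causes(xml_text, known_names):
    """Return {name: count} for the elements this consumer understands."""
    root = ET.fromstring(xml_text)
    counts = {}
    for name in known_names:
        element = root.find(name)
        # Skip elements that are absent; never touch unknown ones.
        if element is not None and element.text:
            counts[name] = int(element.text)
    return counts

# Shortened sample: two version 1 elements plus the new version 2 element.
sample = """<deaths year="2004">
  <accidents>112012</accidents>
  <diabetes>73138</diabetes>
  <septicemia>33373</septicemia>
</deaths>"""

print(read_known_causes(sample, ["accidents", "diabetes"]))
# {'accidents': 112012, 'diabetes': 73138}
```

A consumer written this way is unaffected by the added <septicemia> element. It does not help the consumer in Greg's example, who counts or sums whatever elements are present.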

2. Bruce challenges us to create a change management strategy that
makes no assumptions about the consumers of the data. What does this
mean for the CDC, which wants to add data about the number of deaths
due to septicemia? Can the CDC meet the challenge by simply setting up
two URLs, one for the old version and one for the new version?

/Roger 

----------------------------------------------
CDC VERSION 1 SCHEMA

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        elementFormDefault="qualified">
    <element name="deaths">
        <complexType>
            <sequence>
                <element name="heart-disease" type="unsignedInt"/>
                <element name="cancer" type="unsignedInt"/>
                <element name="stroke" type="unsignedInt"/>
                <element name="chronic-lower-respitory-diseases" type="unsignedInt"/>
                <element name="accidents" type="unsignedInt"/>
                <element name="diabetes" type="unsignedInt"/>
                <element name="alzheimers" type="unsignedInt"/>
                <element name="influenza-and-pneumonia" type="unsignedInt"/>
                <element name="nephritis-and-nephrotic-syndrome-and-nephrosis" type="unsignedInt"/>
            </sequence>
            <attribute name="year" type="gYear"/>
            <attribute name="source" type="anyURI"/>
        </complexType>
    </element>
</schema>

CDC VERSION 2 SCHEMA

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        elementFormDefault="qualified">
    <element name="deaths">
        <complexType>
            <sequence>
                <element name="heart-disease" type="unsignedInt"/>
                <element name="cancer" type="unsignedInt"/>
                <element name="stroke" type="unsignedInt"/>
                <element name="chronic-lower-respitory-diseases" type="unsignedInt"/>
                <element name="accidents" type="unsignedInt"/>
                <element name="diabetes" type="unsignedInt"/>
                <element name="alzheimers" type="unsignedInt"/>
                <element name="influenza-and-pneumonia" type="unsignedInt"/>
                <element name="nephritis-and-nephrotic-syndrome-and-nephrosis" type="unsignedInt"/>
                <element name="septicemia" type="unsignedInt"/>
            </sequence>
            <attribute name="year" type="gYear"/>
            <attribute name="source" type="anyURI"/>
        </complexType>
    </element>
</schema>
 

-- 
Ian Graham // <http://www.iangraham.org>
