Greg,
Roger, I hope you won’t mind if I give some of your interesting ideas a bit of a reality test.
In summary:
For us, changes are usually business driven and decided on cost, and, no, it makes little or no difference what kind of change it is.
In exhausting detail:
At the USPTO, our versioning strategy for the DTDs and style sheets used for patent publications is driven almost entirely by cost. When a change in a business process provokes a change in patent publications (about 10,000 documents per week), we look at the entire pipeline, including data source, storage, processing, validation, export to publishing contractor, publication, dissemination, consumption by internal search systems, consumption by international exchange partners, consumption by commercial value-added resellers, archives, and final disposition. Changes to the governing DTD and style sheets are based on that entire analysis. To the extent possible, changes are made no more frequently than annually and announced six months in advance, primarily so that everyone can get the funding in place in time, make changes, test changes, notify customers, test changes, retrain staff, test changes, update product descriptions, test changes, etc. We like to test changes on a minimum of two weeks of data (20 to 40 thousand documents), but sometimes do it across many months of data through parallel runs.
Granted, our universe is limited in scope. There are only about 120 patent offices in the world, only a handful use our XML data, and there are fewer than 50 value-added resellers who use our XML data that we know of. Nevertheless, we identify all changes to everyone we know to be using the data, since we cannot predict what will or won’t break someone else’s system.
Our business is such that we cannot even dream of placing any constraints on
the consumers of the data. If we miss some of the unknown users, and a
change breaks their system, we usually hear about it, especially if it tends to
put them out of business. This has happened with the most innocuous or
seemingly trivial of changes as well as the more dramatic changes. Sometimes
we can fix it, sometimes not; you can imagine the rest … .
It has happened here more than once that some bright idea that seemed to solve a major problem received enough analysis for us to realize that the cost of implementation far outweighed the benefit. All our changes are “strong” in the sense of being well-specified. If they aren’t well-specified, they become well-specified, or they don’t survive analysis and don’t get implemented. Even the bright ideas that are ultimately abandoned have to be sufficiently well-specified to determine if they can be implemented.
Ontologies and such are usually indecipherable to those who don’t know the business they describe, and superfluous to those who do. Most major business changes in the patent system occur as a result of an act of Congress or as the outcome of some litigation. In both cases, the Office writes rules that set the meaning of terms for better or worse (and sometimes get revised accordingly), usually based on the language used by Congress or the court. I don’t think there is any mechanical substitute for learning the business you want to engage with. The world of commerce is far too dynamic for that. In any case, all changes bite someone, hard or not, sooner or later, so we have little choice but to treat them all much the same; once agreed, we don’t categorize them in any way.
During analysis, we take into account the expected benefit as compared to cost, where it can sometimes be useful to understand a change as syntactical only (very low cost as a rule) or structural (more costly, depending on the scope). Semantic changes are always very costly in the sense of having to retrain habitual users of the data in the new interpretations required. However, this rarely impacts the DTD (unless there are corresponding changes in structure as well) and is therefore not usually funded from the IT budget. Nevertheless, considerations for the cost of training can stop an inexpensive DTD change.
There are a number of WIPO Standards that document the meaning of industrial property terminology. These formed the basis of the vocabulary used in WIPO Standard ST.36, which the USPTO implements as Red Book. For the most part, for a given element name, all the member states of WIPO assign the same meaning. However, the harmony is often somewhat superficial, hiding a multitude of variations in rules, traditions, and understanding among the member states. That there is as much agreement as there is might be considered an achievement worthy of note. Without that, I dare say ST.36 could not exist.
And yes, the intellectual property community uses those two-letter ISO country codes for a number of purposes, including place of birth, primary residence, place of filing, mailing address, agent’s address, states designated under the PCT, etc., etc. WIPO Standard ST.3 incorporates, sometimes modifies, and even augments the list with codes for regional authorities that play the role of a patent office for more than one country. WIPO member states frequently revisit the list as political boundaries change, since the scope of patents is generally limited to a political territory. Countries usually enact legislation defining the changes in scope of the rights attached to a patent corresponding to the changes in political boundaries.
Bruce B Cox
Manager, Standards Development Division
U.S. Patent & Trademark Office

This email expresses my personal opinions only and should not be construed as representing official USPTO policy.
From: Greg Hunt [mailto:greg@firmansyah.com]
Sent: 2007 December 09, Sunday 14:47
To: xml-dev@lists.xml.org
Subject: Fwd: [xml-dev] Data versioning strategy: address semantic, relationship, and syntactic changes?
Roger,
I think that you need to look at some other things: semantics, structure and syntax are at too low a level, because useful version management needs to be embedded in a business process or a set of business agreements. The real question is how do we identify breaking and non-breaking changes? And then: how do we embed that identification in a change management process that minimises pain? For simple exchanges we can nail down the purposes that people put the data to fairly easily, and change management is straightforward. For data exchanges involving multiple parties and multiple uses (I am thinking of some operational/statistical exchanges between Government agencies here), it is much, much harder. The agency concerns do not overlap neatly.
We might distinguish between strong and weak change management in this context,
strong being highly specified change management and weak relying on human
inspection and thought. Most of what follows addresses strong change
management.
Change management requires constraints on users
For a versioning strategy to work there need to be constraints on the consumers (how do they extract element values, what aspects of the schema or message structure are they sensitive to?). Once that is done, you can start to specify what a breaking or non-breaking change is. For example, I have seen code that simply walks through the DOM nodes of a document, extracting element values as it goes, making assumptions about the order and types of nodes. For that code there is no technical/structural change that is a non-breaking change, because even whitespace is significant to that consumer.
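A minimal sketch of the kind of brittle consumer described above, with Python's minidom standing in for whatever DOM library the consumer actually used, and hypothetical element names:

```python
import xml.dom.minidom as minidom

def fragile_extract(xml_text):
    """Positional DOM walking: assumes the first child element is <item>
    and the second is <qty>, with no intervening text nodes."""
    order = minidom.parseString(xml_text).documentElement
    item = order.childNodes[0]   # assumption: first child is <item>
    qty = order.childNodes[1]    # assumption: second child is <qty>
    return item.firstChild.data, qty.firstChild.data

# Works on a compact document:
fragile_extract("<order><item>bread</item><qty>12</qty></order>")
# -> ('bread', '12')

# Pretty-printing the same data inserts whitespace text nodes, so
# childNodes[0] is no longer the <item> element and the walk breaks:
# fragile_extract("<order>\n  <item>bread</item>\n  <qty>12</qty>\n</order>")
```

For this consumer, even a whitespace-only change to the serialization is a breaking change.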
On Meaning: Certain meaning is harder than "good enough"; "good enough" breaks in surprising ways
Technical breakage is one thing, semantic breakage is another. As an industry, and despite a lot of people going on about semantic markup and ontologies, we tend to underestimate just how fuzzy the terms that we use are. Too many ontologies deal only with simply and sharply defined nouns; many business processes deal with things that are difficult to pin down really precisely, because they are fuzzy around the edges through rapid change, limited knowledge, or because we simply do not care about the fine details of the definition. An example of the not caring is the common and unthinking use of postal delivery geography from ISO 3166 as a set of country codes: Bouvet Island, with a permanent human population of zero, is a country, but Scotland, with its own parliament and historical identity, is not? Given that, do we really know what country codes mean? There are other, more business-context-specific examples: where do order prices come from? What is a bread order? What exactly is delivery? Can you tell that I am not wild about your statement that semantics can be defined in a data dictionary? I am not sure that we can really pin down meaning in a complete way, but we can get to "good enough" without too much difficulty. The problem is that "good enough" must be checked and renegotiated whenever the world changes. For example, consider what happens when you try to use ISO 3166 for country of birth (I've seen it done). Any idea how many distinct political entities called Lithuania there have been in the last 100 years? Is East Germany the same as the Federal Republic of Germany? What do we do with decomposition, like Yugoslavia, where there is no simple mapping between the aggregate and the current set of entities?
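The country-of-birth trap can be made concrete with a deliberately simplified sketch. The withdrawn codes (DD, SU, YU) are real ISO 3166 history, but the successor sets shown here are incomplete and illustrative only, not authoritative:

```python
# Withdrawn ISO 3166 alpha-2 codes mapped to (some of) the current
# codes for their successor territories. Simplified for illustration.
WITHDRAWN = {
    "DD": {"DE"},                                # German Democratic Republic
    "SU": {"RU", "UA", "LT", "LV", "EE"},        # USSR (among many others)
    "YU": {"SI", "HR", "BA", "MK", "RS", "ME"},  # Yugoslavia
}

def birth_country_today(code):
    """Map a historical country-of-birth code to current codes.
    An aggregate like YU fans out to many entities, so there is no
    simple one-to-one mapping from the old record to today's list."""
    return WITHDRAWN.get(code, {code})
```

Any system that treats the code list as a stable enumeration silently breaks when the world, rather than the schema, changes.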
Consumers use data for different purposes; defining the purposes is difficult beyond very small numbers of consumers of the data
A semantically non-breaking change for one class of consumer might present problems for another. Consider a statistical data flow with a number of elements in it that are not summed (e.g. a structure containing a count of heart attacks, a count of ambulance movements, and a textual status report). On the face of it, in semantic terms, adding another statistical element for morbidity should not be a problem if the element can be ignored. However, someone out there will eventually try to count instances of morbidity statistics. If the semantics is like a set of Russian dolls, where do you stop?
Some thoughts about semantic operations - an ontology of purposes?
If we are going to try to manage semantic change, we need to address the scope of the semantics beyond dictionary definitions. There are operations that are based in semantics. For example, are the structures that make up a document countable, summable or comparable inside and between instances of the document? Countable meaning whether the number of instances of the "thing" has any meaning at all; summable meaning whether two instances of the element can be combined in some way; comparable meaning whether two instances of an element can be compared (comparison by name? comparison by structure and name?). Are two instances of an address structure comparable if they have different structure versions? That depends on the intent of the comparer. Addresses have a number of purposes, and a change may only impact one purpose. Adding a postal delivery point ID to a physical address used for legal service is likely (only likely, not guaranteed) to have no effect at all on the service purpose. If these types of operations are defined, then the impact of a change can be more clearly specified.
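One way to sketch this idea: declare the supported operations per element, then state a change's impact in terms of which operations it withdraws. The element names and the simple set-based model below are hypothetical, not a proposal:

```python
# Declared semantic operations per element (hypothetical names).
OPERATIONS = {
    "heart-attacks": {"countable", "summable", "comparable"},
    "status-report": set(),          # free text: no semantic guarantees
    "address":       {"comparable"},
}

def change_breaks(element, withdrawn_operations):
    """A change is breaking for consumers of `element` only if it
    withdraws an operation the element was declared to support."""
    return bool(OPERATIONS.get(element, set()) & set(withdrawn_operations))

# Merging address structures might withdraw comparability between
# structure versions, which impacts comparers but not anyone else:
change_breaks("address", {"comparable"})  # -> True
change_breaks("address", {"summable"})    # -> False
```

The point is not the mechanism but that, once operations are explicit, "breaking" becomes a checkable claim per consumer purpose rather than a guess.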
Is an ontology of purposes possible?
For a strong versioning strategy/change management strategy to work, we need an ontology that is tied to the document structure so that we can minimise ambiguity. For this element, what guarantees can we make and what operations are supported? What operations will we support? If we merge a postal address and a physical address (because they are identical), are we allowed to count address elements, or do we have to count the number of purposes that the elements are put to? Is this possible at all?
Versioning - the original question
It's not a versioning strategy that is needed. We can attach some kind of version identifier, do stuff to make the versions identifiable and, to an extent, backward compatible, but the real problem is the change management strategy. Can we identify change that has an impact? For some purposes we can, but in non-trivial cases we can never be really sure that we have captured the definition of a significant change.
Are the distinctions between the types of change significant? I suspect that in reality they are not. They will all bite in interesting ways. We can minimise the amount of breaking change through various techniques, but those techniques are like those applied to object models: if you get it right it works really well; if you mistake the direction of change you have a big problem. The XML tool sets that we have make responding to breaking change a bit easier, but they are not guaranteed to make it simple, and it probably should in any case not be transparent.
Greg
On 12/8/07, Costello, Roger L. <costello@mitre.org> wrote:
Hi Folks,
Oftentimes when discussing a "versioning strategy" I focus on how to design schemas in a fashion to lessen the impact of changes. It occurs to me that this addresses only one aspect of the data versioning problem. Below I have attempted to identify other issues to be addressed in a data versioning strategy. I am interested in hearing your thoughts on this.
EVOLVING DATA
Suppose some data is regularly exchanged between machines:
Machine 1 --> data --> Machine 2
Machine 1 <-- data <-- Machine 2
Periodically the data changes due to requirement changes, additional insights, or innovation.
A change results in a new "version" of the data.
PROBLEM
What are the categories of changes that may occur? What categories of changes must be dealt with by a data versioning strategy?
CATEGORIES OF CHANGE
1. Semantic - the meaning of the data changes.
Example:
version 1 data: a "distance" value means the distance from the center
of town.
version 2 data: a distance value means the distance from the town line.
2. Relationship - the relationship between the data changes.
Example:
version 1 data: there is a co-constraint between the start-time and the
end-time.
version 2 data: there is a three-way co-constraint between start-time,
end-time, and mode-of-transportation.
3. Syntax - the structure of the data changes.
Example:
version 1 data: the employee data is listed first and the person's name
is given by his given-name and surname.
version 2 data: the department data is listed first and in the employee
data each person's name additionally contains a middle name.
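The syntax example above, sketched as XML (all element names hypothetical):

```xml
<!-- version 1: employee data first; name is given-name + surname -->
<staff>
  <employee>
    <name><given-name>Jane</given-name><surname>Roe</surname></name>
  </employee>
  <department>Publications</department>
</staff>

<!-- version 2: department first; name additionally contains a middle name -->
<staff>
  <department>Publications</department>
  <employee>
    <name>
      <given-name>Jane</given-name>
      <middle-name>Q</middle-name>
      <surname>Roe</surname>
    </name>
  </employee>
</staff>
```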
SUPPORTING TECHNOLOGIES
Suppose the data being exchanged is formatted using the XML syntax.
Machine 1 --> XML --> Machine 2
Machine 1 <-- XML <-- Machine 2
What technologies support the above categories of change?
1. Semantic: A data dictionary may be used to define meaning.
2. Relationship: Schematron may be used to express relationships
between data.
3. Syntax: XML Schema, Relax NG, or DTD may be used to express the
structure of the data.
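A minimal Schematron sketch of the two-way co-constraint from the Relationship example. The element names (trip, start-time, end-time) are hypothetical, and the test assumes times are serialized as HH:MM strings:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="trip">
      <!-- strip the colon so HH:MM compares numerically in XPath 1.0 -->
      <assert test="number(translate(end-time, ':', '')) &gt;=
                    number(translate(start-time, ':', ''))">
        end-time must not precede start-time
      </assert>
    </rule>
  </pattern>
</schema>
```

A version 2 three-way constraint would extend the same rule with a condition on mode-of-transportation, which is exactly the kind of change a versioning strategy for the Schematron schema has to track.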
REQUIREMENTS ON A VERSIONING STRATEGY
A versioning strategy must take into consideration:
- changes in the semantics of the data
- changes in the relationships of the data
- changes in the syntax of the data
When data is in an XML format then a versioning strategy must
implement:
- versioning a data dictionary
- versioning a Schematron schema
- versioning an XML Schema, Relax NG schema, or DTD
QUESTIONS
a. Do you agree with the three categories of change?
b. Do these categories represent all types of change?
c. Do you agree that a versioning strategy must address semantic,
relationship, and syntactic changes?
/Roger
_______________________________________________________________________
XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.
[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php