Greg,
Roger, I hope you won’t mind if I give some of your interesting ideas a bit of a reality test.
In summary:
For us, changes are usually business driven and decided on cost, and, no, it makes little or no difference what kind of change it is.
In exhausting detail:
At the USPTO, our versioning strategy for the DTDs and style sheets used for patent publications is driven almost entirely by cost. When a change in a business process provokes a change in patent publications (about 10,000 documents per week), we look at the entire pipeline, including data source, storage, processing, validation, export to publishing contractor, publication, dissemination, consumption by internal search systems, consumption by international exchange partners, consumption by commercial value-added resellers, archives, and final disposition. Changes to the governing DTD and style sheets are based on that entire analysis. To the extent possible, changes are made no more frequently than annually and announced six months in advance, primarily so that everyone can get the funding in place in time, make changes, test changes, notify customers, test changes, retrain staff, test changes, update product descriptions, test changes, etc. We like to test changes on a minimum of two weeks of data (20 to 40 thousand documents), but sometimes do it across many months of data through parallel runs.
Granted, our universe is limited in scope. There are only about 120 patent offices in the world, only a handful use our XML data, and there are fewer than 50 value-added resellers who use our XML data that we know of. Nevertheless, we identify all changes to everyone we know to be using the data, since we cannot predict what will or won’t break someone else’s system.
Our business is such that we cannot even dream of placing any constraints on
the consumers of the data. If we miss some of the unknown users, and a
change breaks their system, we usually hear about it, especially if it tends to
put them out of business. This has happened with the most innocuous or
seemingly trivial of changes as well as the more dramatic changes. Sometimes
we can fix it, sometimes not; you can imagine the rest … .
It has happened here more than once that some bright idea that seemed to solve a major problem received enough analysis for us to realize that the cost of implementation far outweighed the benefit. All our changes are “strong” in the sense of being well-specified. If they aren’t well-specified, they become well-specified, or they don’t survive analysis and don’t get implemented. Even the bright ideas that are ultimately abandoned have to be sufficiently well-specified to determine if they can be implemented.
Ontologies and such are usually indecipherable to those who don’t know the business they describe, and superfluous to those who do. Most major business changes in the patent system occur as a result of an act of Congress or as the outcome of some litigation. In both cases, the Office writes rules that set the meaning of terms for better or worse (and sometimes get revised accordingly), usually based on the language used by Congress or the court. I don’t think there is any mechanical substitute for learning the business you want to engage with. The world of commerce is far too dynamic for that. In any case, all changes bite someone, hard or not, sooner or later, so we have little choice but to treat them all much the same; once agreed, we don’t categorize them in any way.
During analysis, we take into account the expected benefit as compared to cost, where it can sometimes be useful to understand a change as syntactical only (very low cost as a rule) or structural (more costly, depending on the scope). Semantic changes are always very costly in the sense of having to retrain habitual users of the data in the new interpretations required. However, this rarely impacts the DTD (unless there are corresponding changes in structure as well) and is therefore not usually funded from the IT budget. Nevertheless, considerations for the cost of training can stop an inexpensive DTD change.
There are a number of WIPO Standards that document the meaning of industrial property terminology. These formed the basis of the vocabulary used in WIPO Standard ST.36, which the USPTO implements as Red Book. For the most part, for a given element name, all the member states of WIPO assign the same meaning. However, the harmony is often somewhat superficial, hiding a multitude of variations in rules, traditions, and understanding among the member states. That there is as much agreement as there is might be considered an achievement worthy of note. Without that, I dare say ST.36 could not exist.
And yes, the intellectual property community uses those two-letter ISO country codes for a number of purposes, including place of birth, primary residence, place of filing, mailing address, agent’s address, states designated under the PCT, etc., etc. WIPO Standard ST.3 incorporates, sometimes modifies, and even augments the list with codes for regional authorities that play the role of a patent office for more than one country. WIPO member states frequently revisit the list as political boundaries change, since the scope of patents is generally limited to a political territory. Countries usually enact legislation defining the changes in scope of the rights attached to a patent corresponding to the changes in political boundaries.
Bruce B Cox
Manager, Standards Development Division
U.S. Patent & Trademark Office

This email expresses my personal opinions only and should not be construed as representing official USPTO policy.
From: Greg Hunt [mailto:greg@firmansyah.com]
Sent: 2007 December 09, Sunday 14:47
To: xml-dev@lists.xml.org
Subject: Fwd: [xml-dev] Data versioning strategy: address semantic, relationship, and syntactic changes?
Roger,
I think that you need to look at some other things: semantics, structure and syntax are at too low a level, because useful version management needs to be embedded in a business process or a set of business agreements. The real question is how do we identify breaking and non-breaking changes? And then: how do we embed that identification in a change management process that minimises pain? For simple exchanges we can nail down the purposes that people put the data to fairly easily, and change management is straightforward. For data exchanges involving multiple parties and multiple uses (I am thinking of some operational/statistical exchanges between Government agencies here), it is much, much harder. The agency concerns do not overlap neatly.
We might distinguish between strong and weak change management in this context,
strong being highly specified change management and weak relying on human
inspection and thought. Most of what follows addresses strong change
management.
Change management requires constraints on users
For a versioning strategy to work there need to be constraints on the consumers (how do they extract element values, what aspects of the schema or message structure are they sensitive to?). Once that is done, you can start to specify what a breaking or non-breaking change is. For example, I have seen code that simply walks through the DOM nodes of a document, extracting element values as it goes, making assumptions about the order and types of nodes. For that code there is no technical/structural change that is a non-breaking change, because even whitespace is significant to that consumer.
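A minimal sketch of the kind of brittle consumer described above, with Python's minidom standing in for whatever DOM library the consumer actually used, and hypothetical element names:

```python
import xml.dom.minidom as minidom

def fragile_extract(xml_text):
    """Positional DOM walking: assumes the first child element is <item>
    and the second is <qty>, with no intervening text nodes."""
    order = minidom.parseString(xml_text).documentElement
    item = order.childNodes[0]   # assumption: first child is <item>
    qty = order.childNodes[1]    # assumption: second child is <qty>
    return item.firstChild.data, qty.firstChild.data

# Works on a compact document:
fragile_extract("<order><item>bread</item><qty>12</qty></order>")
# -> ('bread', '12')

# Pretty-printing the same data inserts whitespace text nodes, so
# childNodes[0] is no longer the <item> element and the walk breaks:
# fragile_extract("<order>\n  <item>bread</item>\n  <qty>12</qty>\n</order>")
```

For this consumer, even a whitespace-only change to the serialization is a breaking change.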
On Meaning: Certain meaning is harder than "good enough"; "good enough" breaks in surprising ways
Technical breakage is one thing, semantic breakage is another. As an industry, and despite a lot of people going on about semantic markup and ontologies, we tend to underestimate just how fuzzy the terms that we use are. Too many ontologies deal only with simply and sharply defined nouns; many business processes deal with things that are difficult to pin down really precisely, because they are fuzzy around the edges through rapid change, limited knowledge, or because we simply do not care about the fine details of the definition. An example of the not caring is the common and unthinking use of postal delivery geography from ISO 3166 as a set of country codes: Bouvet Island, with a permanent human population of zero, is a country, but Scotland, with its own parliament and historical identity, is not? Given that, do we really know what country codes mean? There are other, more business-context-specific examples: where do order prices come from? What is a bread order? What exactly is delivery? Can you tell that I am not wild about your statement that semantics can be defined in a data dictionary? I am not sure that we can really pin down meaning in a complete way, but we can get to "good enough" without too much difficulty. The problem is that "good enough" must be checked and renegotiated whenever the world changes. For example, consider what happens when you try to use ISO 3166 for country of birth (I've seen it done). Any idea how many distinct political entities called Lithuania there have been in the last 100 years? Is East Germany the same as the Federal Republic of Germany? What do we do with decomposition, like Yugoslavia, where there is no simple mapping between the aggregate and the current set of entities?
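The country-of-birth trap can be made concrete with a deliberately simplified sketch. The withdrawn codes (DD, SU, YU) are real ISO 3166 history, but the successor sets shown here are incomplete and illustrative only, not authoritative:

```python
# Withdrawn ISO 3166 alpha-2 codes mapped to (some of) the current
# codes for their successor territories. Simplified for illustration.
WITHDRAWN = {
    "DD": {"DE"},                                # German Democratic Republic
    "SU": {"RU", "UA", "LT", "LV", "EE"},        # USSR (among many others)
    "YU": {"SI", "HR", "BA", "MK", "RS", "ME"},  # Yugoslavia
}

def birth_country_today(code):
    """Map a historical country-of-birth code to current codes.
    An aggregate like YU fans out to many entities, so there is no
    simple one-to-one mapping from the old record to today's list."""
    return WITHDRAWN.get(code, {code})
```

Any system that treats the code list as a stable enumeration silently breaks when the world, rather than the schema, changes.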
Consumers use data for different purposes; defining the purposes is difficult beyond very small numbers of consumers of the data
A semantically non-breaking change for one class of consumer might present problems for another. Consider a statistical data flow with a number of elements in it that are not summed (e.g. a structure containing a count of heart attacks, a count of ambulance movements, and a textual status report). On the face of it, in semantic terms, adding another statistical element for morbidity should not be a problem if the element can be ignored. However, someone out there will eventually try to count instances of morbidity statistics. If the semantics is like a set of Russian dolls, where do you stop?
Some thoughts about semantic operations - an ontology of purposes?
If we are going to try to manage semantic change, we need to address the scope of the semantics beyond dictionary definitions. There are operations that are based in semantics. For example, are the structures that make up a document countable, summable or comparable inside and between instances of the document? Countable meaning whether the number of instances of the "thing" has any meaning at all; summable meaning whether two instances of the element can be combined in some way; comparable meaning whether two instances of an element can be compared (comparison by name? comparison by structure and name?). Are two instances of an address structure comparable if they have different structure versions? That depends on the intent of the comparer. Addresses have a number of purposes, and a change may only impact one purpose. Adding a postal delivery point ID to a physical address used for legal service is likely (only likely, not guaranteed) to have no effect at all on the service purpose. If these types of operations are defined, then the impact of a change can be more clearly specified.
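One way to sketch this idea: declare the supported operations per element, then state a change's impact in terms of which operations it withdraws. The element names and the simple set-based model below are hypothetical, not a proposal:

```python
# Declared semantic operations per element (hypothetical names).
OPERATIONS = {
    "heart-attacks": {"countable", "summable", "comparable"},
    "status-report": set(),          # free text: no semantic guarantees
    "address":       {"comparable"},
}

def change_breaks(element, withdrawn_operations):
    """A change is breaking for consumers of `element` only if it
    withdraws an operation the element was declared to support."""
    return bool(OPERATIONS.get(element, set()) & set(withdrawn_operations))

# Merging address structures might withdraw comparability between
# structure versions, which impacts comparers but not anyone else:
change_breaks("address", {"comparable"})  # -> True
change_breaks("address", {"summable"})    # -> False
```

The point is not the mechanism but that, once operations are explicit, "breaking" becomes a checkable claim per consumer purpose rather than a guess.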
Is an ontology of purposes possible?
For a strong versioning strategy/change management strategy to work, we need an ontology that is tied to the document structure so that we can minimise ambiguity. For this element, what guarantees can we make and what operations are supported? What operations will we support? If we merge a postal address and a physical address (because they are identical), are we allowed to count address elements, or do we have to count the number of purposes that the elements are put to? Is this possible at all?
Versioning - the original question
It's not a versioning strategy that is needed. We can attach some kind of version identifier, do stuff to make the versions identifiable and, to an extent, backward compatible, but the real problem is the change management strategy. Can we identify change that has an impact? For some purposes we can, but in non-trivial cases we can never be really sure that we have captured the definition of a significant change.
Are the distinctions between the types of change significant? I suspect that in reality they are not. They will all bite in interesting ways. We can minimise the amount of breaking change through various techniques, but those techniques are like those applied to object models: if you get it right it works really well; if you mistake the direction of change you have a big problem. The XML tool sets that we have make responding to breaking change a bit easier, but they are not guaranteed to make it simple, and it probably should in any case not be transparent.
Greg
On 12/8/07, Costello, Roger L. <costello@mitre.org> wrote:
Hi Folks,
Oftentimes when discussing a "versioning strategy" I focus on how to design schemas in a fashion to lessen the impact of changes. It occurs to me that this addresses only one aspect of the data versioning problem. Below I have attempted to identify other issues to be addressed in a data versioning strategy. I am interested in hearing your thoughts on this.
EVOLVING DATA
Suppose some data is regularly exchanged between machines:
Machine 1 --> data --> Machine 2
Machine 1 <-- data <-- Machine 2
Periodically the data changes due to requirement changes, additional insights, or innovation.
A change results in a new "version" of the data.
PROBLEM
What are the categories of changes that may occur? What categories of changes must be dealt with by a data versioning strategy?
CATEGORIES OF CHANGE
1. Semantic - the meaning of the data changes.
Example:
version 1 data: a "distance" value means the distance from the center
of town.
version 2 data: a distance value means the distance from the town line.
2. Relationship - the relationship between the data changes.
Example:
version 1 data: there is a co-constraint between the start-time and the
end-time.
version 2 data: there is a three-way co-constraint between start-time,
end-time, and mode-of-transportation.
3. Syntax - the structure of the data changes.
Example:
version 1 data: the employee data is listed first and the person's name
is given by his given-name and surname.
version 2 data: the department data is listed first and in the employee
data each person's name additionally contains a middle name.
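The syntax example above, sketched as XML (all element names hypothetical):

```xml
<!-- version 1: employee data first; name is given-name + surname -->
<staff>
  <employee>
    <name><given-name>Jane</given-name><surname>Roe</surname></name>
  </employee>
  <department>Publications</department>
</staff>

<!-- version 2: department first; name additionally contains a middle name -->
<staff>
  <department>Publications</department>
  <employee>
    <name>
      <given-name>Jane</given-name>
      <middle-name>Q</middle-name>
      <surname>Roe</surname>
    </name>
  </employee>
</staff>
```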
SUPPORTING TECHNOLOGIES
Suppose the data being exchanged is formatted using the XML syntax.
Machine 1 --> XML --> Machine 2
Machine 1 <-- XML <-- Machine 2
What technologies support the above categories of change?
1. Semantic: A data dictionary may be used to define meaning.
2. Relationship: Schematron may be used to express relationships
between data.
3. Syntax: XML Schema, Relax NG, or DTD may be used to express the
structure of the data.
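A minimal Schematron sketch of the two-way co-constraint from the Relationship example. The element names (trip, start-time, end-time) are hypothetical, and the test assumes times are serialized as HH:MM strings:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="trip">
      <!-- strip the colon so HH:MM compares numerically in XPath 1.0 -->
      <assert test="number(translate(end-time, ':', '')) &gt;=
                    number(translate(start-time, ':', ''))">
        end-time must not precede start-time
      </assert>
    </rule>
  </pattern>
</schema>
```

A version 2 three-way constraint would extend the same rule with a condition on mode-of-transportation, which is exactly the kind of change a versioning strategy for the Schematron schema has to track.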
REQUIREMENTS ON A VERSIONING STRATEGY
A versioning strategy must take into consideration:
- changes in the semantics of the data
- changes in the relationships of the data
- changes in the syntax of the data
When data is in an XML format then a versioning strategy must
implement:
- versioning a data dictionary
- versioning a Schematron schema
- versioning an XML Schema, Relax NG schema, or DTD
QUESTIONS
a. Do you agree with the three categories of change?
b. Do these categories represent all types of change?
c. Do you agree that a versioning strategy must address semantic,
relationship, and syntactic changes?
/Roger
_______________________________________________________________________
XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.
[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php