xml-dev - Re: [xml-dev] A multi-step approach on defining object-oriented nature o

To: XML DEV <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] A multi-step approach on defining object-oriented nature of DOM
From: "W. E. Perry" <wperry@fiduciary.com>
Date: Thu, 22 Aug 2002 16:05:22 -0400
Organization: Fiduciary Automation
References: <8BD7226E07DDFF49AF5EF4030ACE0B7E06621F77@red-msg-06.redmond.corp.microsoft.com> <1029896844.25933.160.camel@marajen> <3D63D550.4090904@textuality.com> <3D648D0A.A35D10C5@fiduciary.com> <3D6506D3.9BE6944A@prescod.net>

Paul Prescod wrote:

> The use of the word "data structure" is confusing. I see "data structure" as being an
> aspect of implementation and unrelated to the data on the wire. A purchase order schema
> is not defining a "data structure", it is defining an XML "vocabulary" or "file format".

I would say that the primary feature of XML documents is that each is explicitly a data
structure. I am actually astonished to see that questioned. A DTD or schema describes the
data structure to which a document conforms, but any instance of simple well-formed XML no
less explicitly presents its data structure through markup.

> I will defend to the death the right of applications to define their own implementation
> data structures without any concern whatsoever for the rest of the world. But I cannot
> understand why you deny the widely held belief (held both inside and outside the XML
> world) that the signature of a function is both its input and its output and that both
> should be defined formally.

Because that design traps an application in a weave of shared a priori agreements, which
vitiates the expertise of the application. Applications are valuable because they execute
functions with particular expertise. Very significant components of that expertise are data
collection and instantiation on the input side and on the output side the presentation of a
form most specifically suited to expression of the value which the expertise of the
application has added. If either of these is constrained by anything other than the
expertise implemented in the application itself, that expertise is thereby compromised.

It is because of the particularity of their expertise that a pair of applications is
unlikely to have a single specific data structure in common, as the best expression both of
the expert output of one and the particular data input requirements of the other. Three
applications are orders of magnitude less likely to share such a structure. Yet the
internetwork topology of the Web--and, I would argue, your own REST principles--operate
because of the publication of each output for the potential use of many applications for
different purposes. This freedom to share is implemented with http verbs. It is madness to
constrain it further by demanding that interprocess communication--already well implemented
by http verbs--be cast in agreed data structures which serve the particular needs of
neither party to a communication and, in fact, constrain both parties from the necessary
exercise of their expertise in data collection or output presentation.

> Well, no, there are (at least?) three different roles for input validation. The first is
> merely to offload syntactic error checking from your application to a purpose-built
> component, the schema validator. The second is to communicate these expectations to a
> third party (to create either a compatible producer application or an alternate consumer
> application). The third is to build agreement between multiple parties before
> implementation begins. I believe you are concentrating on the last.

Only to damn it, to inveigh against it, and to persuade implementors that that way lies
madness.

> It would really help if you could provide some specific, concrete examples. The usual
> model is that in order to buy something you submit a purchase order in a well-known
> vocabulary and receive in return a receipt (modulo all kinds of negotiations, acceptance
> protocols, etc.). Input->function->output. Input and output are publicly, formally
> defined. The function is defined only by its implementation. I can show this in
> mind-numbing syntactic detail if it helps. Now what does your model look like?

[These are much repeated examples from my day job, offered here with apologies to those who
may have seen them more than once before.]

In my world, in order to buy something you submit an order in a vocabulary well-known to
the order document creator but quite possibly unknown and unusable on its own terms to the
application which will fill the order. A money manager in heartland USA has never before
submitted an order to buy securities in Malaysia, but is now persuaded by a salesman (of
Malaysian securities, presumably) that this would be a good idea. That money manager has a
computer system which it must use to produce that order because all of the automation of
the compliance and regulatory reporting tasks required of that money manager is built
around the basic operations of that system. Therefore as processed by that system the buy
order produced as output is specific to the US-only domestic form which makes up the
majority of that money manager's business. Unfortunately, that form is unknown to, and
contains content which does not apply to, the Malaysian order execution application which
must now process it.

In the early 1980's we solved these problems by building massive any-to-any transformation
switches, capable of going from the output of any process used by any of our customers or
their counterparties to the input of any other application to which we had ever seen it
connected. This is a disaster (though still the norm in our industry). The permutations
increase geometrically, as does the complexity involved in applying any changes required by
the input or the output of any one application. And that is before you deal with the
scoping issues of which pairs of applications (and in which order of operation!) have
private understandings of each other's vocabularies.
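
As a rough illustration of that growth, here is a minimal sketch. It assumes one
transformation per ordered pair of applications, which is a simplification: private
pairwise understandings and ordering effects would only add to the count. The function
names are hypothetical.

```python
def pairwise_transforms(n_apps: int) -> int:
    """Point-to-point transformations an any-to-any switch must maintain,
    assuming one per ordered (producer, consumer) pair."""
    return n_apps * (n_apps - 1)


def transforms_touched_by_change(n_apps: int) -> int:
    """Transformations to revisit when a single application changes both
    its input and its output form."""
    return 2 * (n_apps - 1)


for n in (5, 20, 100):
    print(n, pairwise_transforms(n), transforms_touched_by_change(n))
```

Under the same counting, a design in which each node owns its own data instantiation
layer maintains one such layer per application, and a change to one application's
output is absorbed by whichever consumers actually read it.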

For the past 12 years (first with homegrown syntactic rules, and since 1998 with
well-formed XML) we have built and operated all of our systems on the principles I am
promoting here. We assume that the form of order output by the money manager's system is
fixed by the expertise (and in this case the legal compliance and reporting requirements)
of its operation in its local milieu. We also assume, which the designers of that system
did not, that its output will have to be used as input by other applications which have
never seen anything quite like it. Correspondingly, we assume that the only output of the
Malaysian order processing application is an order execution of the form understood in
Malaysia, but that that application will have to take in 'orders' from all over the world,
each in a format local to its origin.

Building systems which put expertise first turns out to be straightforward and, even
better, when existing systems are adapted to operate in this way all of their specialized
processes can continue to run unchanged. These systems can be quite cleanly wrapped in a
data instantiation layer which looks at the internal data structures of existing processes
in order to derive the form of data presentation which the application requires if it is to
operate. In the case of the Malaysian order execution application, the data instantiation
layer begins with the assumption that the input document presented is, in some sense, an
order because it appears at this location where orders are presented. It may, of course,
turn out not to be an order, in which case it will have to be rejected because the order
execution application can do nothing with it.
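
A minimal sketch of that idea, assuming the wrapped application's internal structure can
be expressed as a Python dataclass. The class `ExecutionOrder`, its fields, and the
sample values are hypothetical. The point is only that the list of required inputs is
derived from the application's own structure, not from an external schema.

```python
from dataclasses import dataclass, fields
from decimal import Decimal, InvalidOperation


@dataclass
class ExecutionOrder:
    """Hypothetical internal structure of the order execution application."""
    security_id: str
    side: str
    quantity: int
    limit_price: Decimal


def required_inputs(cls):
    """Derive what must be instantiated from the internal structure itself."""
    return {f.name: f.type for f in fields(cls)}


def instantiate(cls, candidate: dict):
    """Build the internal structure from whatever was presented, or return
    None so the caller can reject the document."""
    try:
        return cls(**{name: typ(candidate[name])
                      for name, typ in required_inputs(cls).items()})
    except (KeyError, TypeError, ValueError, InvalidOperation):
        return None


# Extra content the application does not use is simply ignored.
print(instantiate(ExecutionOrder, {
    "security_id": "MYZZ00000017", "side": "B",
    "quantity": "500", "limit_price": "12.50", "broker_memo": "not used here",
}))
# Something that arrived at the order location but is not an order.
print(instantiate(ExecutionOrder, {"subject": "lunch?"}))
```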

The data instantiation layer begins by searching its locally-maintained history of previous
successful data instantiations for a match first on the provenance of this document or,
failing that, on its element structure. In practice, the overwhelming majority of orders
from a given source exhibit exactly the same structure, and the internal application data
structure required can therefore be immediately instantiated on the model of previously
successful instantiations in the history. Where a given instance is different from the
usual pattern from that source, the change is usually small and quite often actually occurs
in some portion of the offered input which is not used by this particular application. In
such cases, again, the locally required data instantiation can be accomplished immediately.

Where there does not appear to be another order from the same source recorded in the
history, the correct instantiation will very often be immediately identified from the form
of the input presented. This is simply an acknowledgment that in a given specialized domain
there are often only a few software vendors, and while this Malaysian node may have never
seen an order from this particular money manager before, it has probably seen an identical
form of order from another money manager running the same order creation software.
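
The two lookups described above (by provenance, then by element structure) might be
sketched as follows. This is an illustration under stated assumptions, not the system
described in this message: the `InstantiationHistory` class, the idea of storing a
mapping from internal field names to element paths, and the sample documents are all
hypothetical.

```python
import xml.etree.ElementTree as ET


def structure_signature(root):
    """The set of element paths in a document, ignoring all content."""
    paths = set()

    def walk(elem, path):
        paths.add(path)
        for child in elem:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return frozenset(paths)


def apply_mapping(root, mapping):
    """Pull each internal field from its element path; None if any is missing."""
    values = {}
    for field, path in mapping.items():
        text = root.findtext(path)
        if text is None:
            return None
        values[field] = text.strip()
    return values


class InstantiationHistory:
    """Locally maintained record of mappings that instantiated successfully."""

    def __init__(self):
        self.by_source = {}
        self.by_signature = {}

    def record(self, source, root, mapping):
        self.by_source[source] = mapping
        self.by_signature[structure_signature(root)] = mapping

    def instantiate(self, source, root):
        # Provenance first, then element structure.
        candidates = (self.by_source.get(source),
                      self.by_signature.get(structure_signature(root)))
        for mapping in candidates:
            if mapping is not None:
                values = apply_mapping(root, mapping)
                if values is not None:
                    return values
        return None  # fall through to other identification routes


history = InstantiationHistory()
first = ET.fromstring(
    "<Order><Sec><Sym>AAA</Sym></Sec><Qty>500</Qty><Side>B</Side></Order>")
history.record("manager-a", first,
               {"symbol": "Sec/Sym", "quantity": "Qty", "side": "Side"})

# New source running the same order creation software: matched on structure.
second = ET.fromstring(
    "<Order><Sec><Sym>BBB</Sym></Sec><Qty>200</Qty><Side>S</Side></Order>")
print(history.instantiate("manager-b", second))

# Known source, small change in a portion this application never reads.
third = ET.fromstring(
    "<Order><Sec><Sym>AAA</Sym></Sec><Qty>100</Qty><Side>B</Side>"
    "<Memo>n/a</Memo></Order>")
print(history.instantiate("manager-a", third))
```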

Failing both of those routes to instantiation, there are still only a few fields which are
likely to appear in a securities order, and only a subset of those are of interest to this
Malaysian order execution software. Quite often some, even all of the fields of interest
can be identified through the form of their content as examined through simple regex
processing. I hesitate (greatly!) to use the term 'brute force', but in my experience there
have also been quite a few cases of data identified and correctly instantiated because the
data instantiation layer knew that there were, for example, only five fields of interest to the
application, and it did not take much compute time to try instantiating every field
presented as every one of the possible fields of interest, to see if any permutation gave a
whole which made sense. Bear in mind also that in securities processing every step of
execution is followed by a step of comparison of the outcomes between counterparties. In
the extremely unlikely case that a brute force instantiation attempt through all the
permutations of input to internal data fields could have resulted in a sensible whole of a
data structure, which then successfully executed in the application--if against those
astronomical odds the data was in fact processed in error, that error will be picked up at
the very next step, when it fails to compare.
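
A sketch of the regex and permutation routes just described. The five field names, their
patterns, and the sample document are assumptions chosen for illustration (the ISIN-like
pattern checks shape only, not the check digit). Element names are deliberately
unfamiliar and are never consulted.

```python
import itertools
import re
import xml.etree.ElementTree as ET

# Hypothetical fields of interest, each recognised by the form of its content.
FIELD_PATTERNS = {
    "side": re.compile(r"BUY|SELL|B|S", re.IGNORECASE),
    "quantity": re.compile(r"\d+"),
    "price": re.compile(r"\d+\.\d{2,4}"),
    "currency": re.compile(r"[A-Z]{3}"),
    "isin": re.compile(r"[A-Z]{2}[A-Z0-9]{9}\d"),
}


def leaf_values(root):
    """Text of every element that has no children, whatever it is called."""
    return [e.text.strip() for e in root.iter()
            if len(e) == 0 and e.text and e.text.strip()]


def candidate_instantiations(root):
    """Try every assignment of presented values to fields of interest and
    keep those in which every field's content has the right form."""
    fields = list(FIELD_PATTERNS)
    found = []
    for combo in itertools.permutations(leaf_values(root), len(fields)):
        if all(FIELD_PATTERNS[f].fullmatch(v) for f, v in zip(fields, combo)):
            candidate = dict(zip(fields, combo))
            if candidate not in found:
                found.append(candidate)
    return found


doc = ET.fromstring(
    "<Ord><Acct>77-1204</Acct><Dir>Buy</Dir><Shs>500</Shs><Lmt>12.50</Lmt>"
    "<Ccy>MYR</Ccy><Id>MYZZ00000017</Id><Note>rush</Note></Ord>")
print(candidate_instantiations(doc))
```

Exactly one candidate suggests a usable instantiation. None, or more than one, means the
document is an exception for a human to look at. Either way, the comparison step between
counterparties remains the backstop for a plausible but wrong assignment.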

Yet there will of course be some very small number of input documents which the
instantiation layer can do nothing with, particularly when it is seeing a form of data
input for the first time. Humans will have to get involved here. But, again, securities
processing has long established the standard business practice of kicking out exceptions
for fixup offline. Getting humans involved in the instances where truly necessary is not a
case of abdicating in the automation of domain expertise, but actually the appropriate
deference to industry practice.
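
Put together, the flow might look like the sketch below: try each identification route in
turn, kick the document out to an exception queue if all of them fail, and compare
outcomes with the counterparty after execution. The strategy functions here are trivial
stand-ins, and every name is hypothetical.

```python
from typing import Callable, Optional

# Each strategy returns an instantiated record, or None if it cannot cope.
Strategy = Callable[[str], Optional[dict]]


def instantiate_or_kick_out(document, strategies, exceptions):
    """Try each strategy in order; queue the document for offline fixup
    by a human if none succeeds."""
    for name, strategy in strategies:
        record = strategy(document)
        if record is not None:
            return name, record
    exceptions.append(document)
    return None


def compares(ours: dict, theirs: dict,
             keys=("security_id", "side", "quantity")) -> bool:
    """Post-execution check: do both counterparties report the same outcome?"""
    return all(ours.get(k) == theirs.get(k) for k in keys)


# Stand-in strategies, for illustration only.
def from_history(document):
    if "<Order>" in document:
        return {"security_id": "MYZZ00000017", "side": "B", "quantity": 500}
    return None


def by_brute_force(document):
    return None


strategies = [("history", from_history), ("brute force", by_brute_force)]
exceptions = []
print(instantiate_or_kick_out("<Order>...</Order>", strategies, exceptions))
print(instantiate_or_kick_out("<Memo>lunch?</Memo>", strategies, exceptions))
print(exceptions)
```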

> > ... The autonomous processing nodes of the
> > internetwork topology are of value because of what they produce, which is to
> > say the expertise which they implement.
>
> Not necessarily. Sometimes they are of value simply because they know how to accept
> information.

Knowing how to accept information--that is, knowing how to instantiate it--is knowing how
to process it. As implemented, it is, in fact, processing it. That is one of my two main
points. (The other is that *how* a process knows how to instantiate data is based on its
internal data structure, not on some external schema.)

> It is the *side effects* of this information acceptance (i.e. the shipping of running
> shoes to a consumer) that is of value. The same goes for email. My computer produces
> little interesting output after accepting an email.

This is not side effects; this is process outcome. Process outcome is the intended effect,
and in fact I am bucking the orthodoxy when I insist that it is the *only* intended effect.
The purpose of executing a process is to produce an expert outcome, not to fulfill the
expectations of some upstream process which believes that it is invoking another process by
presenting it with a given data structure.

> Perhaps you differ in your view of the architecture because you have a different problem
> domain than most other people.

Perhaps. I hope that I have adequately illustrated it above.

> How do you get the output until you've given input?

Autonomous processes produce very particular output and publish it in Web-accessible
locations. (This strikes me as the soul of REST.) Go get it and see what it is. If you
declare that you are an interested party to that output, even if it is only a sample, and
you satisfy standard Web tools controlling access to it, you should be permitted to GET it.

> When you say "discovery" do you mean the human being looking at the output at design
> time or the runtime software process doing so without human intervention? I can't say I
> enjoy picking through electronic scat but neither do I know how to write a program that
> can do it.

As detailed above, this is not at design time but at the time of resolving exceptions from
run time. I think I describe above how much of this picking through scat can be automated,
and what small percentage of it falls to humans.

> How can a process select the data it will work upon without a priori knowledge of the
> semantics of the element types?

It has a priori (and authoritative) knowledge of its own data needs.

> If you use "P" to mean paragraph and I use it to mean purchase order (perhaps even with
> the same namespace qualification) then the process cannot reliably act on the data it
> receives from us.

This is the usual objection and is a red herring. It doesn't matter what I use "P" to mean,
or what you use it for. We have not set out to agree on a data structure or a standard data
vocabulary. All that my data instantiation model is concerned with is whether in your "P",
or anywhere else in what you present me, I can find what I need to instantiate the
particular data structure which is my true prerequisite to executing an instance of my
expert processing.

> I feel, therefore, that semantics must be agreed upon in advance (or at least mappable to
> agreed semantics). Standardizing syntax in a schema at the same time bolsters this
> semantic agreement and makes application writing easier.

Nope. Otherwise my specific expertise, and its requisite domain vocabulary, become muddied
with yours and both of our processes are dragged down by it into a common denominator of
mediocrity.

Respectfully,

Walter Perry
