Re: [xml-dev] defining correctness for an XML transformation

I haven't read all the thread so apologies if I am re-making some point, or saying something dumb that has been pointed out.

Three quick points.

1. If your schema is recursive, then the problem of determining whether it is always correct is NP, isn't it? Infinite number of inputs. Which, if so, probably needs to limit or condition our expectations or approaches.

2. Obviously people use Schematron to validate pre and post conditions, including co-constraints between input and output all the time. it is a common use.

(Where Schematron is lacking is that it does not have a built in mechanism to force completeness: obviously it can report if there are unexplained elements that no rule in a pattern catches, but it cannot enforce eg if there is an element found that is caught by no pattern... though you can make another pattern for that. )

What becomes useful (unless every element name is only used once per document instance) is for the transformation to generate some tracing data; so that e.g. for each output element we know which element ID or xpath of the input document(s) generated it. That gets rid of many logical errors: the issue is rarely "does this input value appear in the output" but rather "does this input value appear in the output because it actually came from that input?"

3. The problem might be considered as a category error: we do not actually want to validate the Inputs and outputs, we want to validate the XSLT. I came up with an approach for a bank years ago where I converted an XSLT into just the possible element branches it could generate. This then could be stitched in with other similar data from subsequent stages in the pipeline, so that the result was an abstract document (on a spreadsheet) that showed for each input what elements could appear in the final output. This helped them feel on top of the pipeline, to reason about flaws and behaviour, without needing XSLT expertise.

This then could be paired with data extracted from a schema: for example to just do pairwise tests: for every parent/child in the abstract output, is it an allowed pair by the output schema? (IIRC James Clark put in a partial validation mode in Jing that allowed you to just test parent-child constraints not sequence or absense: this could be useful too. ) Are any required pairs missing in the input or output?

Rick

On Fri, Jul 5, 2024 at 3:01 AM C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:

Roger L Costello <costello@mitre.org> writes:

> Michael Sperberg-McQueen made this fascinating statement:
>
> ... my question [is] focused not on how to prove a transformation
> correct, but how to specify what correctness is for that
> transform.
>
> I would really like to understand that. What is the difference between
> “proving a transformation is correct” versus “specifying what
> correctness is for a transform”?

(Michael Kay has already answered this pretty well; this is partly to
say that I agree with what he said and partly to add a simple example.)

The second is (the formulation of) a statement; the first is (the
construction of) a proof that the statement is true.

For example: Suppose we specify as follows the correctness of a
transformation T taking input I and producing output O:

pre-condition: I is an XML document containing zero or more
Airport_Name elements.

post-condition: O is an XML document containing zero or more 'name'
elements, such that

(1) for every Airport_Name element $i in I, there is some name
element $o in O such that $o has only one child node (a text
node) and string($o) = normalize-space($i);

(2) for every name element $o in O, there is some Airport_Name
element $i in I such that string($o) = normalize-space($i).

Then for any specific input/output pair we can imagine checking that the
pre-condition and post-condition both hold. If I have succeeded in
making the specification unambiguous, it should always be clear whether
a given pair of I and O do or do not satisfy the conditions. From this
spec, for example, it's clear that any name in the input may occur any
non-zero number of times in the output, and any name in the output may
have occurred any non-zero number of times in the input, and that the
ordering of names in the output, and the structure of the output
document, are not constrained.

And to prove T correct, instead of just proving an individual I/O pair
correct, we can imagine proving that *whenever* the pre-condition is
satisfied, the result of running T will *always* satisfy the
post-condition. Suppose T is written in XSLT. Then we might want to
show that for every Airport_Name element in I, a particular template
will be evaluated, that that template produces zero or more suitable
'name' elements, and that no other template in the stylesheet will
produce any 'name' elements. Or we might show that a particular
variable is assigned a sequence of strings containing every distinct
Airport_Name string in I, and that a given expression in the stylesheet
serializes them all in 'name' elements (and again that nothing else in
the program will ever produce a 'name' element).

[Note: the example pre- and post-conditions I've given are formulated in
English; many people prefer a more formal language, because
statements in more formal languages can be manipulated
mechanically in ways that are helpful. Proving that T satisfies
its specification can similarly be done in prose or in a purely
formal way.

We get more confidence that the proof is correct if it can be
checked mechanically, but plenty of anecdotal evidence suggests
that even writing conditions and proofs in prose can reduce the
defect rate for software. (See for example the 'clean room
engineering' approach developed by Harlan Mills at IBM, perhaps
best described as 'semi-formal'.) There are also anecdotal
reports that just formulating the pre- and post-conditions helps,
even if there is no effort to produce a proof. I find this
plausible, since formulating pre- and post-conditions even in
prose makes me think about the program and possible edge cases in
a way that I do not otherwise always do.

I will be happiest if the answer to my question of the other day
describes a formal language analogous to those used in formal
proofs of correctness for imperative languages, but suitable for
XML. Even a coherent account of how to formulate clear, crisp,
complete, and accurate specifications of correctness in prose,
however, would look to me like progress. End of note.]

I hope this helps.

Michael

--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php