Roger L Costello <costello@mitre.org> writes:
> Hi Folks,
>
> There exists a proprietary data format called Format1.
>
> A smart group of people created a non-proprietary data format called
> Format2. It's used to store the same kinds of data as Format1. Of
> course, Format2 uses different fields and structures.
>
> I defined a mapping from Format1 to Format2.
>
> I want to know if my Format1 --> Format2 mapping is correct.
You may find a paper I wrote a few years ago of interest. “What
constitutes successful format conversion? Towards a formalization of
‘intellectual content’.” International journal of digital curation 6.1
(2011): 153-164.
Some further work specific to equivalence of XML documents is reported
in Sperberg-McQueen, C. M., Yves Marcoux and Claus Huitfeldt. “Document
lattices: Equivalence, compatibility, and contradiction in document
markup.” Presented at Balisage: The Markup Conference 2014, Washington,
DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup
Conference 2014. Balisage Series on Markup Technologies, vol. 13
(2014). https://doi.org/10.4242/BalisageVol13.Sperberg-McQueen01.
> Here's my idea on how to ascertain the correctness of the Format1 -->
> Format2 mapping:
>
> Use a technology to parse Format1 instances to XML and to parse
> Format2 instances to XML. Compare the resulting XML's. If they are
> identical, then the mapping is correct.
The identity of the XML may establish the correct mapping *for that one
document*, but only if some conditions hold.
- The Fubar elements generated from format1 and those generated from
format2 must have the same meaning. That may be easy to guarantee
in a particular case, but I doubt that it's guaranteed in general.
- Your translation from format1 to XML must be correct.
- Your translation from format2 to XML must be correct.
I think you have started with an obligation to prove one proposition
(that your mapping from format1 to format2 is correct), and ended up
with an obligation to prove two propositions, each of which has the
same form and looks about as hard as the one you started with.
It's helpful, I think, to be able to show that the mapping works
correctly for a specific document. But I notice that it's not what you
started out to prove. You started out to prove that your mapping from
format1 to format2 was correct -- I think that that must mean that your
mapping is correct for all documents in format1.
Proving correctness for one document doesn't in itself establish
correctness for all documents.
If I told you I have a program which produces the correct results for at
least one set of inputs, you wouldn't accept that as an argument that
the program was correct. (Or, at least, you shouldn't.)
In that sense, your proposed approach is too weak: it doesn't not
establish what you said you want.
In another sense, your proposed approach may be too strict: if the
target XML format has more than one way to convey the same information,
then XML identity is too strict a test.
> Data Format Description Language (DFDL) is a standard language for
> parsing data formats. Here's the idea using DFDL:
>
> - Create a DFDL schema for Format1.
> - Create a DFDL schema for Format2.
> - Design the DFDL schemas to produce the same XML format.
> - To ascertain the correctness of the Format1 --> Format2 mapping do this:
> - Use the Format1 DFDL schema to parse a Format1 document to XML.
> - Use the Format2 DFDL schema to parse a Format2 document to XML.
> - Compare the XML's:
> - If they are identical, then the mapping is correct.
> - If they are within a certain tolerance, then the mapping is correct.
> E.g. of an acceptable tolerance: the Format2 XML omits an
> optional element.
> - Otherwise the mapping is not correct.
>
> Am I thinking clearly about this problem (ascertain the correctness of
> the Format1 --> Format2 mapping)?
I'm not sure.
You say "If [the two XML documents] are within a certain tolerance, then
the mapping is correct. E.g. of an acceptable tolerance: the Format2
XML omits an optional element."
If XML document d1 (derived from format1) specifies three things, and
XML document d2 (in format2) specifies two of those things but not the
third, I don't quite see why you should conclude that the mapping from
format1 to format2 was successful. You've lost information. Why would
the fact that you defined it as optional in your XML format make a
difference?
To take a simple case: we have translated images in format1 and format2
into XML -- for concreteness, let's say we have translated them into
something very like SVG. Document d1 specifies a rectangle, shaded in
yellow, with blue text of a certain font and size. d2 also species a
rectangle and text, but doesn't specify the color of either. Color
information is optional in the XML format.
So ... blue text on a yellow background has become
&default-foreground-color; text on a &default-fill-color; background.
Depending on the goals of your mapping, this either matters (i.e. the
mapping has failed) or it doesn't (because all you care about is the
sequence of characters in the text, and everything else is irrelevant).
It's probably better to start by saying what information needs to be
preserved in the mapping and what information does not need to be
preserved (and perhaps should be thrown away systematically -- are you
worried about steganography?).
> Is mapping instances of the Format1 and Format2 data formats to a
> common XML format logical?
It's certainly a useful way to get them into a format that allows them
to be compared directly. The problem of equivalence between XML
documents is not simple (as the example in the second paper mentioned
above illustrates), but I like to think it's soluble in principle. But
if you have a skeptical audience, then the correctness of your
translations into XML must also be shown. If you have an audience
willing to take the translations from format1 and format2 into XML on
faith, why do they not want to take the translation from format1 to
format2 on faith?
Thought-provoking question - thanks.
--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
_______________________________________________________________________
XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.
[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php