It depends on how similar Format 1 and Format 2 are.
I have been working on this for a large government department here, on and off, for a couple of decades.
There has been a succession of formats. They started with a CSV public format, which serialized out a hierarchical database. They changed to a relational database, serialized that to an XML structure, and generated the CSV from that. Some users adopted this and converted it to their own XML structure along entirely different lines, and ultimately generate other formats (including DOCX) from it. The department then released another XML format, and generates both the CSV and the first XML structure from that. Now they are moving to JSON, but in a flat arrangement quite like the early CSVs.
So we have at least 8 comprehensively different arrangements ("editions" as distinct from "revisions") of the same data, all for different purposes and stages, and with different SDLC justifications. Some are fully normalized, some are flat, some are shallow, some are deep. The /*/* roots are all different. Some of the flat formats have tables for entities and tables for relationships, some coalesce these. Some of the XML formats are designed to be RDF-friendly, others to be publication-friendly. The publications from these (including legislation) are ordered and grouped differently. A particular issue is that items have not historically had unique identifiers: so as an item goes through a pipeline, it needs to carry every key used to identify it at each stage, so that we can establish that a problematic downstream item [X,Y,Z] is the same as item [X,Y,Z] in the database; and this information has often been stripped out, requiring fuzzy matching.
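A minimal sketch of the idea, in Python: each item carries every identifier it has ever been given, and when those have been stripped we fall back to fuzzy matching on a text field. The stage and field names here are purely illustrative.

```python
from dataclasses import dataclass, field
from difflib import get_close_matches

@dataclass
class TrackedItem:
    # An item carrying every key assigned to it along the pipeline,
    # e.g. {"csv": "X", "xml2": "Y", "json": "Z"} -- stage names are illustrative.
    title: str
    keys: dict = field(default_factory=dict)

def find_same_item(problem, candidates):
    # 1. Exact match on any pipeline key the two records share.
    for cand in candidates:
        if any(cand.keys.get(stage) == key for stage, key in problem.keys.items()):
            return cand
    # 2. Keys stripped out: fall back to fuzzy matching on a text field.
    by_title = {c.title: c for c in candidates}
    close = get_close_matches(problem.title, list(by_title), n=1, cutoff=0.8)
    return by_title[close[0]] if close else None
```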
So in this situation, the testing is ordered by risk into various levels of granularity and coverage:
* At the coarsest level, we just have raw counts: there should be 10,001 distinct Items of class X, there should be 8,000 unique notes, and so on, regardless of edition. This is a sanity check, and has the advantage that it can be extracted moderately readily from each radically different format. Furthermore, it can potentially be published and used by external users to verify their data. (The granularity of the counts needs to match the grain of editions that only subset the data.) A minimal count check is sketched just after this list.
* Then we validate each individual format and try to find anomalies, in particular orphaned data items (i.e. items that nothing refers to). Schematron is quite good for this; an example of the kind of rule also follows the list. When we find an anomaly in one format, we check for other instances of the same issue, and check the other formats.
* Then, prioritized by risk, we do more fine-grained checks between editions. For this, I have been using yet another XML format: this is just the JSON de-normalized into a tree, to remove as much dereferencing and as many complex XPaths as we can, but so that results can be given in terms of the JSON names and (implicit) structures. The converters from the various XML editions to this JSON-derived XML only cover the high-risk areas we are testing (a sketch of the conversion follows the list).
* The intent is to try to get as near to end-to-end testing as possible. In doing this, it is sometimes not possible to match the same item exactly at each end. So for these cases we do some broader tests: for example, does every phrase in the appropriate cells of the DOCX appear anywhere in any text field in any JSON data? And vice versa. (That check is also sketched below.)
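To make the count check concrete, here is a minimal Python sketch. The element, column and field names are placeholders, and the expected totals would come from whatever figures are published for the edition.

```python
import csv
import json

from lxml import etree  # assumption: lxml is installed (needed for the XPath below)

def count_xml_items(path, xpath="//item/@id"):
    # Distinct items in an XML edition; element and attribute names are placeholders.
    return len(set(etree.parse(path).xpath(xpath)))

def count_csv_items(path, key_column="item_id"):
    with open(path, newline="", encoding="utf-8") as f:
        return len({row[key_column] for row in csv.DictReader(f)})

def count_json_items(path, list_field="items"):
    with open(path, encoding="utf-8") as f:
        return len({record["id"] for record in json.load(f)[list_field]})

# The same published expectation is checked against every edition.
EXPECTED_ITEMS = 10_001
for count in (count_xml_items("edition_xml2.xml"),
              count_csv_items("edition_csv.csv"),
              count_json_items("edition_json.json")):
    assert count == EXPECTED_ITEMS, count
```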
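For the orphan check, the rule in practice is a Schematron assert; it is expressed here in Python just to keep these sketches in one language. The note/item names and attributes are invented.

```python
from lxml import etree

def orphaned_notes(doc):
    """Ids of <note> elements that nothing refers to (names are invented;
    the real check is a Schematron assert built from the same two XPaths)."""
    note_ids = set(doc.xpath("//note/@id"))
    referenced = set(doc.xpath("//item/@note-ref"))
    return sorted(note_ids - referenced)

doc = etree.parse("edition_xml2.xml")
for orphan in orphaned_notes(doc):
    print("orphaned note:", orphan)
```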
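The JSON-derived pivot XML is nothing clever: roughly this kind of conversion defines the target shape, and the converters from the XML editions then produce the same shape for the areas under test. This sketch assumes the JSON names happen to be legal XML names.

```python
import json
from lxml import etree

def json_to_xml(name, value):
    # Mirror the JSON names as element names so that test failures can be
    # reported in the JSON's own vocabulary. Assumes the names are legal XML names.
    elem = etree.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(json_to_xml(key, child))
    elif isinstance(value, list):
        for child in value:
            elem.append(json_to_xml("entry", child))  # arbitrary wrapper name
    else:
        elem.text = "" if value is None else str(value)
    return elem

with open("edition_json.json", encoding="utf-8") as f:
    pivot = json_to_xml("root", json.load(f))
print(etree.tostring(pivot, pretty_print=True).decode())
```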
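And the broad end-to-end check amounts to this sort of thing: pull the text of every table cell out of the DOCX (which is just WordprocessingML in a zip) and see whether each phrase occurs anywhere in the JSON's text. The file names are of course placeholders, and the reverse direction needs its own pass.

```python
import json
import zipfile
from lxml import etree

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_cell_phrases(path):
    # Text of every table cell (w:tc) in the document body.
    with zipfile.ZipFile(path) as z:
        body = etree.fromstring(z.read("word/document.xml"))
    return {"".join(t.text or "" for t in cell.iter(W + "t")).strip()
            for cell in body.iter(W + "tc")} - {""}

def json_strings(value, acc=None):
    # Every string value anywhere in the JSON data.
    acc = set() if acc is None else acc
    if isinstance(value, str):
        acc.add(value)
    elif isinstance(value, dict):
        for v in value.values():
            json_strings(v, acc)
    elif isinstance(value, list):
        for v in value:
            json_strings(v, acc)
    return acc

with open("edition_json.json", encoding="utf-8") as f:
    haystack = "\n".join(json_strings(json.load(f)))
missing = [p for p in docx_cell_phrases("publication.docx") if p not in haystack]
print(len(missing), "DOCX phrases not found in any JSON text field")
```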
To give an example of the kind of problems I have seen: at another job, they changed XML formats in a pipeline from a format where lists could have lead and final paragraphs to one without. So if you had a triple-nested list and the innermost list had a terminal paragraph, then in the output that paragraph would be promoted up six levels (out of the list-item and list tags) and then down into a paragraph of its own. So the structures were very different.
So, yes, converting to a common XML format that allows comparison is good, but the more the structure and selection of data differ between the original formats, the less easy it may be. The choice of pivot format may need to be determined by reporting requirements: how do the humans involved think about it?
Also, where you have editions with different selections or denormalizations, you need to consider whether you are just checking that everything in A has a match in B, or that everything in B has a match in A. Either one-directional check can be much simpler than finding all differences.
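In code, that asymmetry is just a one-way set difference on whatever composite key both editions can produce; the key shape below is made up.

```python
# Composite keys both editions can produce, e.g. (class, chapter, ordinal) -- made up.
a_keys = {("X", 1, 1), ("X", 1, 2), ("Y", 3, 1)}
b_keys = {("X", 1, 1), ("Y", 3, 1)}

unmatched_in_b = a_keys - b_keys   # everything in A with no match in B
print(unmatched_in_b)              # {('X', 1, 2)}

# A full difference report would also need b_keys - a_keys, plus a
# field-by-field comparison of the pairs that do match -- much more work.
```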
Regards
Rick