It depends on how similar Format 1 and Format 2 are.
I have been working on this for a large government department here, on and off, for a couple of decades.
There has been a succession of formats. They started with a CSV public format, which serialized out a hierarchical database. They changed to a relational database, serialized that to an XML structure, and generated the CSV from that. Some users adopted this and converted it to their own XML structure along entirely different lines, and ultimately generate other formats (including DOCX) from it. The department then released another XML format, and generates both the CSV and the first XML structure from that. Now they are moving to JSON, but in a flat arrangement quite like the early CSVs.
So we have at least 8 comprehensively different arrangements ("editions" as distinct from "revisions") of the same data, all for different purposes and stages, and with different SDLC justifications. Some are fully normalized, some are flat, some are shallow, some are deep. The /*/* roots are all different. Some of the flat formats have tables for entities and tables for relationships, some coalesce these. Some of the XML formats are designed to be RDF-friendly, others to be publication-friendly. The publications from these (including legislation) are ordered and grouped differently. A particular issue is that items have not historically had unique identifiers: so as an item goes through a pipeline, it needs to carry every key used to identify it at each stage, so that we can establish that a problematic downstream item [X,Y,Z] is the same as item [X,Y,Z] in the database; and this information has often been stripped out, requiring fuzzy matching.
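A minimal sketch of the idea, in Python: each item carries every identifier it has ever been given, and when those have been stripped we fall back to fuzzy matching on a text field. The stage and field names here are purely illustrative.

```python
from dataclasses import dataclass, field
from difflib import get_close_matches

@dataclass
class TrackedItem:
    # An item carrying every key assigned to it along the pipeline,
    # e.g. {"csv": "X", "xml2": "Y", "json": "Z"} -- stage names are illustrative.
    title: str
    keys: dict = field(default_factory=dict)

def find_same_item(problem, candidates):
    # 1. Exact match on any pipeline key the two records share.
    for cand in candidates:
        if any(cand.keys.get(stage) == key for stage, key in problem.keys.items()):
            return cand
    # 2. Keys stripped out: fall back to fuzzy matching on a text field.
    by_title = {c.title: c for c in candidates}
    close = get_close_matches(problem.title, list(by_title), n=1, cutoff=0.8)
    return by_title[close[0]] if close else None
```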
So in this situation, the testing is ordered by risk into various levels of granularity and coverage:
* At the coarsest level, we just have raw counts: there should be 10,001 distinct Items of class X, there should be 8,000 unique notes, and so on, regardless of edition. This is a sanity check, and has the advantage that it can be extracted moderately readily from each radically different format. Furthermore, it can potentially be published and used by external users to verify their data. (The granularity of the counts needs to match the grain of editions that only subset the data.) A minimal count check is sketched just after this list.
* Then we validate each individual format and try to find anomalies, in particular orphaned data items (i.e. items that nothing refers to). Schematron is quite good for this; an example of the kind of rule also follows the list. When we find an anomaly in one format, we check for other instances of the same issue, and check the other formats.
* Then, prioritized by risk, we do more fine-grained checks between editions. For this, I have been using yet another XML format: this is just the JSON de-normalized into a tree, to remove as much dereferencing and as many complex XPaths as we can, but so that results can be given in terms of the JSON names and (implicit) structures. The converters from the various XML editions to this JSON-derived XML only cover the high-risk areas we are testing (a sketch of the conversion follows the list).
* The intent is to try to get as near to end-to-end testing as possible. In doing this, it is sometimes not possible to match the same item exactly at each end. So for these cases we do some broader tests: for example, does every phrase in the appropriate cells of the DOCX appear anywhere in any text field in any JSON data? And vice versa. (That check is also sketched below.)
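To make the count check concrete, here is a minimal Python sketch. The element, column and field names are placeholders, and the expected totals would come from whatever figures are published for the edition.

```python
import csv
import json

from lxml import etree  # assumption: lxml is installed (needed for the XPath below)

def count_xml_items(path, xpath="//item/@id"):
    # Distinct items in an XML edition; element and attribute names are placeholders.
    return len(set(etree.parse(path).xpath(xpath)))

def count_csv_items(path, key_column="item_id"):
    with open(path, newline="", encoding="utf-8") as f:
        return len({row[key_column] for row in csv.DictReader(f)})

def count_json_items(path, list_field="items"):
    with open(path, encoding="utf-8") as f:
        return len({record["id"] for record in json.load(f)[list_field]})

# The same published expectation is checked against every edition.
EXPECTED_ITEMS = 10_001
for count in (count_xml_items("edition_xml2.xml"),
              count_csv_items("edition_csv.csv"),
              count_json_items("edition_json.json")):
    assert count == EXPECTED_ITEMS, count
```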
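For the orphan check, the rule in practice is a Schematron assert; it is expressed here in Python just to keep these sketches in one language. The note/item names and attributes are invented.

```python
from lxml import etree

def orphaned_notes(doc):
    """Ids of <note> elements that nothing refers to (names are invented;
    the real check is a Schematron assert built from the same two XPaths)."""
    note_ids = set(doc.xpath("//note/@id"))
    referenced = set(doc.xpath("//item/@note-ref"))
    return sorted(note_ids - referenced)

doc = etree.parse("edition_xml2.xml")
for orphan in orphaned_notes(doc):
    print("orphaned note:", orphan)
```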
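The JSON-derived pivot XML is nothing clever: roughly this kind of conversion defines the target shape, and the converters from the XML editions then produce the same shape for the areas under test. This sketch assumes the JSON names happen to be legal XML names.

```python
import json
from lxml import etree

def json_to_xml(name, value):
    # Mirror the JSON names as element names so that test failures can be
    # reported in the JSON's own vocabulary. Assumes the names are legal XML names.
    elem = etree.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(json_to_xml(key, child))
    elif isinstance(value, list):
        for child in value:
            elem.append(json_to_xml("entry", child))  # arbitrary wrapper name
    else:
        elem.text = "" if value is None else str(value)
    return elem

with open("edition_json.json", encoding="utf-8") as f:
    pivot = json_to_xml("root", json.load(f))
print(etree.tostring(pivot, pretty_print=True).decode())
```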
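And the broad end-to-end check amounts to this sort of thing: pull the text of every table cell out of the DOCX (which is just WordprocessingML in a zip) and see whether each phrase occurs anywhere in the JSON's text. The file names are of course placeholders, and the reverse direction needs its own pass.

```python
import json
import zipfile
from lxml import etree

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_cell_phrases(path):
    # Text of every table cell (w:tc) in the document body.
    with zipfile.ZipFile(path) as z:
        body = etree.fromstring(z.read("word/document.xml"))
    return {"".join(t.text or "" for t in cell.iter(W + "t")).strip()
            for cell in body.iter(W + "tc")} - {""}

def json_strings(value, acc=None):
    # Every string value anywhere in the JSON data.
    acc = set() if acc is None else acc
    if isinstance(value, str):
        acc.add(value)
    elif isinstance(value, dict):
        for v in value.values():
            json_strings(v, acc)
    elif isinstance(value, list):
        for v in value:
            json_strings(v, acc)
    return acc

with open("edition_json.json", encoding="utf-8") as f:
    haystack = "\n".join(json_strings(json.load(f)))
missing = [p for p in docx_cell_phrases("publication.docx") if p not in haystack]
print(len(missing), "DOCX phrases not found in any JSON text field")
```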
To give an example of the kind of problems I have seen: at another job, they changed XML formats in a pipeline from a format where lists could have lead and final paragraphs to one without. So if you had a triple-nested list and the innermost list had a terminal paragraph, then in the output that paragraph would be promoted up six levels (out of the list-item and list tags) and then down into a paragraph of its own. So the structures were very different.
So, yes, converting to a common XML format that allows comparison is good, but the more the structure and selection of data differ between the original formats, the less easy it may be. The choice of pivot format may need to be determined by reporting requirements: how do the humans involved think about it?
Also, where you have editions with different selections or denormalizations, you need to consider whether you are just checking that everything in A has a match in B, or that everything in B has a match in A. Either one-directional check can be much simpler than finding all differences.
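In code, that asymmetry is just a one-way set difference on whatever composite key both editions can produce; the key shape below is made up.

```python
# Composite keys both editions can produce, e.g. (class, chapter, ordinal) -- made up.
a_keys = {("X", 1, 1), ("X", 1, 2), ("Y", 3, 1)}
b_keys = {("X", 1, 1), ("Y", 3, 1)}

unmatched_in_b = a_keys - b_keys   # everything in A with no match in B
print(unmatched_in_b)              # {('X', 1, 2)}

# A full difference report would also need b_keys - a_keys, plus a
# field-by-field comparison of the pairs that do match -- much more work.
```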
Regards
Rick