Re: [xml-dev] How to avoid (minimize) errors due to copying, pasting, and transcribing?

Your problem is how to enforce the desired invariants between a source and a target data set.

I have used Schematron for this on several large sets of documents (hundreds of thousands). For example, one invariant might be that the source and target documents have the same number of headings.
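
A minimal sketch of that first invariant in XQuery, assuming hypothetical file names source.xml and target.xml and taking *:title elements as the headings (in Schematron the same comparison would be the test of an assert):

(: Compare heading counts between source and target documents;
   the file names and the choice of *:title are illustrative. :)
let $src := count(doc("source.xml")//*:title)
let $tgt := count(doc("target.xml")//*:title)
return
  if ($src eq $tgt) then ()
  else concat("Heading count mismatch: source ", $src, ", target ", $tgt)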

Or that every unique hexagram (six-character string) made of the last 3 visible characters in a paragraph plus the first 3 of the immediately following paragraph of the source should also be found in the target: this catches dropped or reordered paragraphs even if the transformation has restructured branches. These problems are the bane of ETL tools used for XML. Serious problems.
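
A sketch of that check in XQuery, again with assumed inputs (source.xml, target.xml, and para as the paragraph element):

(: Collect the 6-grams that straddle each paragraph boundary:
   the last 3 characters of one para plus the first 3 of the next.
   Paragraphs shorter than 3 characters are a corner case left aside here. :)
declare function local:boundary-grams($doc as document-node()) as xs:string* {
  let $paras := $doc//*:para ! normalize-space(.)
  for $i in 1 to count($paras) - 1
  let $tail := substring($paras[$i], string-length($paras[$i]) - 2)
  let $head := substring($paras[$i + 1], 1, 3)
  return concat($tail, $head)
};
let $source := distinct-values(local:boundary-grams(doc("source.xml")))
let $target := distinct-values(local:boundary-grams(doc("target.xml")))
return $source[not(. = $target)]  (: non-empty means dropped or reordered paragraphs :)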

The first thing is to classify each kind of error by severity and, if known, likelihood: SEV1, SEV2, etc. Then write tests: a high-severity error needs good tests, but an unlikely, low-risk one may just need a canary in a cage, allowing false positives for simplicity.
Determine what failure rate is acceptable (e.g., 0 x SEV1, 0 x SEV2, 1 x SEV3 errors per 100 documents).

If you have a large corpus, decide how many documents to look at, and select them randomly. If the source corpus has fewer than 10,000 documents, test them all. If it has more, randomly select 10,000 to process and check. (Or use a statistics calculator to see the sample sizes you need for 99.9% confidence, etc.)
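
The selection itself fits in a few lines of standard XQuery 3.1; the collection URI here is hypothetical:

(: Randomly permute the corpus and keep the first 10,000 documents. :)
let $docs := collection("source-corpus")
return
  if (count($docs) le 10000) then $docs
  else subsequence(random-number-generator()?permute($docs), 1, 10000)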

Another way is to round-trip the document back to your source format, then compare the result with the original. That makes the assertions easier, but the transform might be harder, and it depends on what information is lost in each direction.
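
Once the round trip has been produced, the comparison itself can be a single assertion; roundtrip.xml stands in for the output of the down-and-up transformation:

(: deep-equal is strict about whitespace and attribute values,
   so in practice both sides usually need normalizing first. :)
let $original := doc("source.xml")
let $roundTripped := doc("roundtrip.xml")
return
  if (deep-equal($original, $roundTripped)) then ()
  else "round trip lost or altered information"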

Regards
Rick

On Wed, 23 May 2018, 17:08 Hans-Juergen Rennau, <hrennau@yahoo.de> wrote:
Hi Roger, dividing the problem into creating and checking resources, and focusing on the second, I think the magic word is *structured information*. Unfortunately, awareness of structured information and its potential usefulness is very low. Or let me be more precise: what is low is the awareness of chances to use structured information creatively, spontaneously, inventively, in response to your needs of quality assurance, rather than along the trodden and obvious paths.

To illustrate the thought: imagine a specification written in docbook, and a CSV file compiling some data paths in the second column. The following XQuery (using an extension function offered by BaseX)

let $pathExpected := unparsed-text('paths.csv') ! csv:parse(.)//record/entry[2]
let $pathFound := doc("rethinking13.xml")/descendant::*:table[@xml:id eq 'paths']//*:row/*:entry[1]/string()
return $pathExpected[not(. = $pathFound)]/string()

gives me all paths found in the CSV but forgotten in the docbook table. I do not think many people would have recognized this possibility, although all that is in front of them is a docbook file and a CSV file. So part one of an attempt at an answer is: SEE the structured information which is there.
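
Swapping the roles gives the symmetric check, under the same assumptions: paths present in the docbook table but missing from the CSV:

let $pathExpected := unparsed-text('paths.csv') ! csv:parse(.)//record/entry[2]/string()
let $pathFound := doc("rethinking13.xml")/descendant::*:table[@xml:id eq 'paths']//*:row/*:entry[1]
return $pathFound[not(. = $pathExpected)]/string()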

While part 2 is: ADD it, where it isn't.

The rest is XQuery, or any other language that natively speaks the structured information found in resources.

With kind regards,
Hans-Jürgen

On Thursday, 17 May 2018 at 13:59:41 CEST, Costello, Roger L. <costello@mitre.org> wrote:


Hi Folks,

I am working on a project that has created a large, complex data specification. There are tables in the data specification, from which I created Schematron rules. The tables specify a bunch of codes. When I created the Schematron rules, I accidentally missed some of the codes. I discovered this omission only after considerable effort and expense.

I got to thinking about all the other places along the path to creating the data specification where data might have accidentally been dropped, altered, added, or put in the wrong place. I don't know, but I suspect the data specification was produced something like this: several subject matter experts jotted down some ideas on a piece of paper and handed it to another person who typed up their ideas. [Potential for errors at this step] The typed document then goes to a publication office which typesets and officially publishes the data specification. [Potential for errors at this step] Then, of course, people use the data specification in their own endeavors, which provides more opportunities for errors to be introduced.

It occurs to me that quite possibly lots of errors are due to simple human mistakes in copying, pasting, and transcribing. How to avoid this?

/Roger


