OASIS Mailing List Archives

Re: [xml-dev] Ensuring samples are representative

You are right to be concerned.

Clients rarely know or appreciate how much variation exists in their document set.

If the documents are known to be completely uniform, then a single representative example may be fine. If the documents vary, then a single sample is disastrously wrong-headed.

And the worst approach is for the client to say something like: we will provide you with some small number of example documents to do your work, and we will test against some other randomly selected small sample.

Honestly, I think the best approach is to assume that variation is normally distributed and use vanilla statistical techniques for estimating the sample size of documents you need to use. **

My rule of thumb is that for a document corpus of fewer than 1,000 documents, you may as well test them all. At the other end, there are severely diminishing returns on checking more than 6,000 documents, even for a large corpus. (These numbers came from playing with sample size estimators.)
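
To make the rule of thumb concrete, here is a minimal sketch of one common sample size estimator (Cochran's formula with a finite population correction). The function name and the defaults (95% confidence, 1% margin of error, worst-case proportion 0.5) are assumptions chosen for illustration, not necessarily the estimators or settings behind the figures above, so the exact thresholds will shift with the parameters you pick.

```python
import math

def sample_size(population, margin=0.01, z=1.96, p=0.5):
    """Estimate how many documents to sample from a corpus of `population`.

    Assumed defaults: z=1.96 (about 95% confidence), 1% margin of error,
    p=0.5 (worst-case variability).
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: small corpora need nearly every document.
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))

if __name__ == "__main__":
    for size in (500, 1_000, 10_000, 100_000, 1_000_000):
        print(size, sample_size(size))
```

With these settings the estimate for a small corpus is close to the whole corpus, and it flattens out as the corpus grows, which is the general shape behind "test them all" at one end and diminishing returns at the other.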

A BDD approach (or a TDD approach) is really useful: the question is not "how do I convert these files" but "how do I prove to management (or to myself) that the conversions work?" 

Four other approaches:

* Six Sigma. Do a kind of function point analysis on document structures, and then estimate what the acceptable failure rate will be.

* 100% sample. I like this approach best. It requires some infrastructure, such as a Schematron schema to compare the invariants between the input and output documents, and some running and reporting framework. But that is no more than a unit testing or regression testing framework may already provide and require. Indeed, sometimes the errors are *worse* than normally distributed: sometimes you have large numbers of errors, each occurring only a small number of times. By the time you have a large enough sample, you may as well have tested 100%.

* TQM. If your corpus is being added to, you cannot guarantee that new errors will not crop up. What QC mechanism is in place? What QA mechanism? GIGO should not be forgotten either. You need to have a clear agreement about what needs to be fixed in markup or manually. You cannot perform miracles.

* Smarter. A few years ago I worked on a large, long-running corpus. We wrote some analysis tools and found there were over 17,000 unique absolute XPaths of the form /a/b/c, just for elements. We made a corpus of sample documents so that there was a representative of each XPath. (This worked well on one project, but on another project it failed: we needed to include context, such as /a/b/c[preceding-sibling::*[1][self::e]], because a lot of data was being moved between branches.)
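
The "smarter" approach can be sketched roughly as follows. This is an illustration, not the analysis tooling from that project: the function names are hypothetical, the greedy selection is my own assumption about how one might pick the documents, and it covers plain element paths only (no sibling context, and namespaced tags appear in ElementTree's `{uri}local` form).

```python
import xml.etree.ElementTree as ET

def element_paths(xml_text):
    """Return the set of absolute element paths (/a/b/c) found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths

def pick_representatives(docs):
    """Greedily choose document names until every path in the corpus is covered.

    `docs` maps a document name to its XML text (hypothetical input shape).
    """
    paths_by_doc = {name: element_paths(text) for name, text in docs.items()}
    uncovered = set().union(*paths_by_doc.values())
    chosen = []
    while uncovered:
        # Take the document that covers the most still-uncovered paths;
        # sorting the names makes ties deterministic.
        best = max(sorted(paths_by_doc),
                   key=lambda name: len(paths_by_doc[name] & uncovered))
        chosen.append(best)
        uncovered -= paths_by_doc[best]
    return chosen

if __name__ == "__main__":
    corpus = {
        "d1": "<a><b><c/></b></a>",
        "d2": "<a><b/></a>",
        "d3": "<a><e/></a>",
    }
    print(pick_representatives(corpus))
```

Greedy selection does not guarantee the smallest possible sample set, but it is simple and usually gets close. Adding context would mean making the path keys richer, for example by appending the name of the preceding sibling to each step.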

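For the 100% sample approach, the invariant checking does not have to be elaborate to be useful. The sketch below uses Python rather than Schematron, and checks just one assumed invariant: that the whitespace-normalized text content of each input document survives in its output. The function names and the shape of `pairs` are hypothetical, and a real project would have its own list of invariants.

```python
import xml.etree.ElementTree as ET

def normalized_text(xml_text):
    """Concatenate all text in the document and collapse runs of whitespace."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(root.itertext()).split())

def check_corpus(pairs):
    """Return the names of documents whose text content changed in conversion.

    `pairs` maps a document name to (input_xml, output_xml).
    """
    failures = []
    for name, (source, converted) in sorted(pairs.items()):
        if normalized_text(source) != normalized_text(converted):
            failures.append(name)
    return failures

if __name__ == "__main__":
    pairs = {
        "ok": ("<a><b>Hi </b>there</a>", "<x><y>Hi</y> there</x>"),
        "bad": ("<a>one two</a>", "<x>one</x>"),
    }
    print(check_corpus(pairs))
```

Run over the whole corpus after every change to the conversion, this gives a failure list you can report on, which is the "prove that the conversions work" part.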

** If the small sample size is unavoidable, then you use the statistics to say to the client "OK, you have provided 7 documents as a sample from the 1,000-strong corpus, so if the variation is normally distributed, then we might expect that at best 50% of documents will go through conversion OK." (Actually, to an extent, blithely using the normal distribution is mumbo jumbo: but the point is to show that one method of estimating --which may or may not be applicable here but perhaps sets a middle point-- indicates that small sample sizes are quite ineffective. If you have the knowledge, use a more appropriate statistical technique!) If the customer is prepared to go ahead on that basis, it would not be uncommon: frequently people hope they get lucky. But you don't want to be penalized for their gambling habit!
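
As one example of an alternative technique, here is a sketch of the exact zero-failure binomial bound: if every document in a random sample converts cleanly, what failure rate in the corpus can you still not rule out? This is a different calculation from the normal-distribution estimate quoted above and will not reproduce its 50% figure; the function name and the 95% confidence default are assumptions, and it ignores the finite size of the corpus.

```python
def max_failure_rate(sample_size, confidence=0.95):
    """Upper bound on the corpus failure rate after `sample_size` clean conversions.

    Solves (1 - p) ** sample_size = 1 - confidence for p: the largest failure
    rate that would still produce an all-clean sample this often by chance.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 1 - (1 - confidence) ** (1 / sample_size)

if __name__ == "__main__":
    for n in (7, 30, 300, 3000):
        print(n, round(max_failure_rate(n), 4))
```

For 7 clean samples the bound comes out at roughly a one-in-three failure rate, which makes the same point to the client: a handful of passing documents says very little about the other 993.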


Copyright 1993-2007 XML.org. This site is hosted by OASIS