On Mon, 2002-01-28 at 17:00, Jeff Greif wrote:
> It's also a question of volume. A 1% error rate that needs human cleanup is
> not a big deal when you only see 100 docs per day, but it mounts up when
> there are a million.
Sure, if you start with a million docs and need to train the system on
all those errors the first day. If you start with a hundred and add new
variations - remember, "errors" is not the right turn of phrase - over a
period of time, one manual mapping a day can lead to a lot of automated
processing after a few weeks.
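A rough sketch of what I mean, under assumptions of my own: each document
carries a label for its "variation", a human maps a variation the first
time it shows up, and every later document with that variation is handled
automatically. The names (process, variant-N) and the five-variation
stream are made up for illustration, not taken from any real system.

```python
from collections import Counter

def process(stream, known=None):
    """Route each document: automated if its variation is already mapped,
    manual otherwise. A manual mapping is remembered for later documents."""
    known = set() if known is None else set(known)
    counts = Counter()
    for variation in stream:
        if variation in known:
            counts["automated"] += 1
        else:
            counts["manual"] += 1
            known.add(variation)  # one-time human effort, reused afterwards
    return counts, known

# Hypothetical stream: 100 documents cycling through 5 variations.
stream = [f"variant-{i % 5}" for i in range(100)]
counts, known = process(stream)
print(counts["manual"], counts["automated"])
```

In this toy stream the manual work is bounded by the number of distinct
variations, not by the number of documents. How well that holds in
practice depends on how often genuinely new variations keep turning up.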
> Analogy: A friend is slowly scanning and turning into PDF files all the
> reprints and preprints (in planetary science) that he's collected since the
> late 1960's. He runs the scanner more or less continuously while at home,
> and takes the files produced on his laptop when he travels, and does a sort
> of desultory fixup of the OCR (since he has the page images as well) as a
> lulling airplane activity. Serious fixup occurs when he actually has to
> consult the paper for details.
OCR is a trickier case, since it pretty much requires a human to go over
everything to validate its content. I remember early systems (some on
the Mac, if I remember the early 90's well enough) which did ask people
for help with difficult characters along the way, but I think that got
automated out in favor of bulk batch processing.
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!