Re: [xml-dev] Structured from/within unstructured documents

If the data is critical, then you should look at the specific source documents and their origins and confirm for yourself whether any particular tool has a low-enough error rate for the population of source documents that you have to deal with.  It is always possible to create documents that cannot be converted; the question is whether you have to deal with them.
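
One way to make "low-enough error rate" concrete is to hand-verify a sample of your own documents and score the tool against it. The sketch below is only an illustration: the field dictionaries, the function names and the choice of a Wilson score bound are my assumptions, not something any particular conversion tool provides.

```python
import math


def error_rate(extracted, verified):
    """Fraction of hand-verified fields that the tool got wrong or missed.

    Both arguments are hypothetical dicts mapping a field name to its value.
    """
    if not verified:
        raise ValueError("need at least one verified field")
    wrong = sum(1 for key, truth in verified.items()
                if extracted.get(key) != truth)
    return wrong / len(verified)


def wilson_upper(errors, n, z=1.96):
    """Upper end of the Wilson score interval for an observed error count.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    p = errors / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom
```

With this bound, a clean run over 100 sampled fields still only supports an error rate below roughly 3.7%, so the sample has to be sized to the risk you can tolerate and drawn from the documents you actually receive.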


On 12/16/07, Stephen Green <stephengreenubl@gmail.com> wrote:
I notice there are commercial tools advertised to convert PDF to .doc
or .odt, etc., or to extract data in one way or another. How reliable do
people find such tools? Is it realistic yet to be extracting data and
converting it to, say, XML documents in large volumes and with crucial
data such as financial, technical or medical

On 16/12/2007, Edward C. Zimmermann <edz@bsn.com> wrote:
> On Sun, 16 Dec 2007 18:15:05 +1100, Greg Hunt wrote
> > Stephen,
> > The problem with processing the physical PDF file is precisely its
> > presentation orientation.
> You have to render PDF (at least internally into a buffer). It's a format with
> a graphical "language" not totally unlike (and built upon) PostScript where,
> among a host of features, each individual character can be positioned.
> Popular "freely" available PDF tools that can be used to "extract text"
> are, among others, Adobe's Acrobat Reader, Derek Noonburg's Xpdf, Poppler
> and Ghostscript. M$ Windows includes a "filter mechanism" called IFilter
> for their own search. It apparently includes, among others, a filter
> supplied by Adobe intended for the extraction of text from PDF.
> > A perverse document can mix image and text or even embed the text in the
> > reverse order that it would be displayed in.
> Not really all that uncommon--- calculating glyph position from the right.
> >
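
To illustrate the point about glyph order: if each glyph carries its own position, the order in the file says nothing reliable about reading order, and it has to be rebuilt from coordinates. This is a minimal sketch under stated assumptions: the `Glyph` tuple and `line_tolerance` are hypothetical, the text is left-to-right, and y grows upward as in default PDF user space. Real extractors have to deal with much more (rotation, fonts, spacing).

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    x: float    # horizontal position of the glyph
    y: float    # baseline position; larger means higher on the page
    char: str


def reading_order(glyphs, line_tolerance=2.0):
    """Rebuild left-to-right, top-to-bottom text from positioned glyphs."""
    lines = []  # list of (baseline_y, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of the page first
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))
    # Within each line, order glyphs by x regardless of stream order.
    return "\n".join(
        "".join(g.char for g in sorted(row, key=lambda g: g.x))
        for _, row in lines
    )


# Glyphs emitted right to left in the stream still come out as "cat".
stream = [Glyph(30, 700, "t"), Glyph(20, 700, "a"), Glyph(10, 700, "c")]
print(reading_order(stream))
```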
> In rendering, however, you need or want to keep paragraph blocks together
> and **not** (as is the case with a "screen scrape" of a display-rendered page)
> preserve the columns and visual flow elements, as these not only make it
> much more difficult to extract simple things like sentences but also
> don't deliver any contextual information. That a text was set in two columns
> with a center picture is a result of its chosen style and not content
> structure--- recall that output can look different on different devices.
> Linking structural semantics for many of these style elements is tenuous.
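
The difference between keeping paragraph blocks together and preserving the visual layout can be tried with the `pdftotext` program that ships with Xpdf and Poppler. The wrapper below is a sketch, not a recommendation: the function names are hypothetical, and it assumes `pdftotext` is installed and on the PATH. Without `-layout` the tool attempts reading order; with `-layout` it tries to reproduce the physical columns.

```python
import subprocess


def build_command(pdf_path, keep_layout=False):
    """Build a pdftotext command line that writes UTF-8 text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # preserve columns and visual flow
    cmd += ["-enc", "UTF-8", pdf_path, "-"]  # "-" sends the text to stdout
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text (raises on failure)."""
    result = subprocess.run(build_command(pdf_path, keep_layout),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8")
```

Running both modes over the same two-column document and comparing the output is a quick way to see how much the layout mode interleaves unrelated sentences.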
> --
>  Edward C. Zimmermann, Basis Systeme netzwerk, Munich
>  Office Leo (R&D):
>    Leopoldstrasse 53-55, D-80802 Munich,
>    Federal Republic of Germany
>  http://www.nonmonotonic.net

Stephen Green

SystML, http://www.systml.co.uk
Tel: +44 (0) 117 9541606

http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice
