Re: [xml-dev] Structured from/within unstructured documents

Many thanks for the helpful answers.

What would be particularly helpful is an API or equivalent. Has anyone
heard of one? Perhaps even a tool with a scripting language, so that
large numbers of documents can be converted?
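
To make the question concrete, here is a rough sketch of the kind of scripted batch conversion I have in mind. It assumes the `pdftotext` command-line tool that ships with Xpdf and Poppler is installed and on the PATH. The `<document>`/`<page>`/`<para>` layout and the function names are made up for illustration and are not taken from any existing tool. It treats blank lines as paragraph breaks, which is only a heuristic, so the output would still need checking against the source documents.

```python
import re
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_text(pdf_path):
    """Run pdftotext and return the text it writes to stdout ('-')."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_to_xml(text, source_name):
    """Wrap extracted text in a hypothetical document/page/para structure."""
    root = ET.Element("document", {"source": source_name})
    # pdftotext separates pages with a form feed character.
    for number, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue
        page_el = ET.SubElement(root, "page", {"number": str(number)})
        # Heuristic: a blank line ends a paragraph; re-join wrapped lines.
        for block in re.split(r"\n\s*\n", page):
            words = block.split()
            if words:
                ET.SubElement(page_el, "para").text = " ".join(words)
    return root


def convert_directory(in_dir, out_dir):
    """Convert every *.pdf in in_dir; return (file name, error) pairs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        try:
            root = text_to_xml(extract_text(pdf), pdf.name)
        except (OSError, subprocess.CalledProcessError) as exc:
            # Record the failure and carry on with the remaining files.
            failures.append((pdf.name, str(exc)))
            continue
        ET.ElementTree(root).write(
            out / (pdf.stem + ".xml"), encoding="utf-8", xml_declaration=True
        )
    return failures
```

Calling `convert_directory("pdfs", "xml")` (both directory names are placeholders) would write one XML file per PDF and return the list of files that could not be converted, which is where the error-rate question comes back in.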

On 16/12/2007, Greg Hunt <greg@firmansyah.com> wrote:
> Stephen,
>  If the data is critical, then you should look at the specific source
> documents and their origins and confirm for yourself whether any particular
> tool has a low-enough error rate for the population of source documents that
> you have to deal with.  It is always possible to create documents that
> cannot be converted; the question is whether you have to deal with them.
>
>  Greg
>
> On 12/16/07, Stephen Green <stephengreenubl@gmail.com> wrote:
> > I notice there are commercial tools advertised to convert PDF to .doc
> > or .odt, etc., or to extract data in one way or another. How reliable
> > do people find such tools? Is it realistic yet to be extracting data
> > and converting it to, say, XML documents in large volumes and with
> > crucial data such as financial, technical or medical records?
> >
> > On 16/12/2007, Edward C. Zimmermann <edz@bsn.com> wrote:
> > > On Sun, 16 Dec 2007 18:15:05 +1100, Greg Hunt wrote
> > > > Stephen,
> > > > The problem with processing the physical PDF file is precisely its
> > > > presentation orientation.
> > >
> > > You have to render PDF (at least internally into a buffer). It's a
> > > format with a graphical "language" not totally unlike (and built upon)
> > > PostScript where, among a host of features, each individual character
> > > can be positioned.
> > >
> > > Popular "freely" available PDF tools that can be used to "extract text"
> > > are, among others, Adobe's Acrobat Reader, Derek Noonburg's Xpdf,
> > > Poppler and Ghostscript. M$ Windows includes a "filter mechanism" called
> > > iFilter for their own search. It apparently includes, among others, a
> > > filter supplied by Adobe intended for the extraction of text from PDF.
> > >
> > > > A perverse document can mix image and text or even embed the text in
> > > > the reverse order that it would be displayed in.
> > >
> > > Not really all that uncommon: calculating glyph position from the right.
> > >
> > > >
> > >
> > > In rendering, however, you need or want to keep paragraph blocks
> > > together and **not** (as is the case with a "screen scrape" of a
> > > display-rendered page) preserve the columns and visual flow elements,
> > > as these not only make it much more difficult to extract simple things
> > > like sentences but also don't deliver any contextual information. That
> > > a text was set in two columns with a center picture is a result of its
> > > chosen style and not its content structure; recall that different
> > > output devices can look different. Linking structural semantics to
> > > many of these style elements is tenuous.
> > >
> > >
> > >
> > > --
> > >
> > >  Edward C. Zimmermann, Basis Systeme netzwerk, Munich
> > >  Office Leo (R&D):
> > >    Leopoldstrasse 53-55, D-80802 Munich,
> > >    Federal Republic of Germany
> > >  http://www.nonmonotonic.net
> > >
> > >
> >
> >
> > --
> > Stephen Green
> >
> > Partner
> > SystML, http://www.systml.co.uk
> > Tel: +44 (0) 117 9541606
> >
> > http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice
> >
>
>


-- 
Stephen Green

Partner
SystML, http://www.systml.co.uk
Tel: +44 (0) 117 9541606

http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice



Copyright 1993-2007 XML.org. This site is hosted by OASIS