Re: [xml-dev] Structured from/within unstructured documents
- From: Jonathan Robie <jonathan.robie@redhat.com>
- To: Stephen Green <stephengreenubl@gmail.com>
- Date: Sat, 15 Dec 2007 14:31:31 -0500
Stephen Green wrote:
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
>
> [!!! SNIP !!!]
>
> Is this all there is?
>
Microsoft Word and OpenOffice both export to XML, and Antiword is a
program that does a pretty good job of converting Word files to DocBook.
For PDF, though, I don't know of any really good tools. The following
page, from someone who has played with the problem, gives a summary of
what's out there:
http://discerning.com/hacks/docutils/pdf2xml/readme.html
I'd love it if someone would tell me there's something actively
maintained in the open source world that does this job. I don't know of one yet.
Jonathan
Red Hat Enterprise MRG: http://www.redhat.com/mrg/