OASIS Mailing List Archives


Re: [xml-dev] Structured from/within unstructured documents

Stephen Green wrote:
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
> [!!! SNIP !!!]
> Is this all there is?

Microsoft Word and OpenOffice both export to XML, and Antiword is a 
program that does a pretty good job of converting Word files to DocBook.
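
For what it's worth, the Antiword route is easy to script. Here is a minimal sketch in Python, assuming antiword is installed and on the PATH. The helper names and the report.doc file name are made up for illustration; "-x db" is Antiword's option for DocBook XML output.

```python
import shutil
import subprocess
from typing import List


def build_antiword_command(doc_path: str) -> List[str]:
    """Return the argv for converting a Word file to DocBook XML."""
    # "-x db" asks Antiword for XML output using the DocBook document type.
    return ["antiword", "-x", "db", doc_path]


def word_to_docbook(doc_path: str) -> str:
    """Run Antiword and return the DocBook XML it writes to stdout."""
    # Assumption: antiword is a separate install; fail clearly if it is missing.
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    result = subprocess.run(
        build_antiword_command(doc_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if antiword exits non-zero
    )
    return result.stdout
```

Calling word_to_docbook("report.doc") should hand back the DocBook markup as a string, which can then go to whatever XML tooling you already use. How clean the result is will depend on the Word file.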

For PDF, though, I don't know of any really good tools. The following 
page, from someone who has played with the problem, gives a summary of 
what's out there:


I'd love it if someone would tell me there's something actively 
maintained that does this job in the open source world. I don't know of one yet.

Red Hat Enterprise MRG: http://www.redhat.com/mrg/


Copyright 1993-2007 XML.org. This site is hosted by OASIS