XML.org: OASIS Mailing List Archives

Re: [xml-dev] Re: Structured from/within unstructured documents

On Tue, 18 Dec 2007 10:17:31 +1100, Marcus Carr wrote
> Stephen Green wrote:
> 
> > What methods are there, these days, for extracting structured data from
> > unstructured documents (such as PDF)?
> 
> Maybe I'm missing something, but I didn't see anyone suggest saving 
> the PDF as XML straight from Acrobat. If you have a full licence, it 

To be honest, I've not looked at it for years (I don't have Acrobat,
only the Reader), but, if I recall, the "save as XML" functionality was
part of their XML architecture (borrowed, I think, from FrameMaker+SGML,
which I do have). This means that either the data was pre-tagged or one
defined an appropriate mapping table. This can be good in a controlled
environment where one is converting existing documentation in a
consistent corporate style, but it is ill-suited to converting the typical
wild-west mix that most companies tend to have. Given the effort involved,
I think one is better off using old-school, purpose-built tools in the
spirit of OmniMark, or something like ClearForest, or any of a number of
auto-tagging and content-categorization solutions in between. It's an
industry with a host of companies specialized in converting data to XML
using these and their own proprietary tools.
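
For readers unfamiliar with the mapping-table idea mentioned above, here
is a minimal sketch in Python. It is not how Acrobat or FrameMaker do it;
the style names, element names and the apply_mapping function are all
hypothetical, and the input is assumed to be (style, text) pairs that some
earlier step has already extracted.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping table: paragraph style name -> XML element name.
# A real table would be written per corporate style guide.
STYLE_TO_ELEMENT = {
    "Heading1": "title",
    "Body": "para",
    "Bullet": "item",
}


def apply_mapping(styled_paragraphs, fallback="para"):
    """Build an XML tree from (style, text) pairs using the mapping table."""
    root = ET.Element("document")
    for style, text in styled_paragraphs:
        tag = STYLE_TO_ELEMENT.get(style)
        if tag is None:
            # Unmapped style: keep the text, but record the style name
            # so the gap in the table can be reviewed later.
            elem = ET.SubElement(root, fallback, {"unmapped-style": style})
        else:
            elem = ET.SubElement(root, tag)
        elem.text = text
    return root


if __name__ == "__main__":
    doc = apply_mapping([
        ("Heading1", "Install"),
        ("Body", "Run setup."),
        ("Sidebar7", "Note"),  # a style the table does not know about
    ])
    print(ET.tostring(doc, encoding="unicode"))
```

The unmapped "Sidebar7" entry shows why this works well for a consistent
house style and poorly for a wild-west mix: every style the table does not
anticipate falls through to the fallback.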

As I wrote earlier: sucking in the metadata and marking up sentences,
paragraphs and pages can be done with good quality in a relatively generic
manner (adequate enough, I found, to be applied to all-purpose PDF
indexing). You really need to decide what you need.
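
The generic page/paragraph/sentence markup described above could be
sketched roughly as follows in Python. This assumes the text of each page
and any metadata have already been pulled out of the PDF by some other
tool; the markup function, the element names and the naive sentence rule
are illustrative assumptions, not the tooling referred to in this thread.

```python
import re
import xml.etree.ElementTree as ET

# Naive sentence boundary: ".", "!" or "?" followed by whitespace.
# Abbreviations such as "e.g. foo" will be split wrongly; good enough
# for a sketch, not for production.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def markup(pages, metadata=None):
    """Wrap already-extracted page text in page/p/s elements."""
    root = ET.Element("document")
    meta = ET.SubElement(root, "metadata")
    for key, value in (metadata or {}).items():
        ET.SubElement(meta, "field", {"name": key}).text = value
    for number, page_text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", {"n": str(number)})
        # Treat blank lines as paragraph breaks.
        for block in re.split(r"\n\s*\n", page_text):
            text = " ".join(block.split())  # collapse line wraps
            if not text:
                continue
            para = ET.SubElement(page, "p")
            for sentence in SENTENCE_END.split(text):
                ET.SubElement(para, "s").text = sentence
    return root


if __name__ == "__main__":
    tree = markup(
        ["First one. Second one?\n\nNew para.", "Page two."],
        {"Title": "Demo"},
    )
    print(ET.tostring(tree, encoding="unicode"))
```

Nothing here depends on the document's subject or house style, which is
the sense in which this level of markup is generic.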

> does a pretty respectable job, getting you paragraph and character 
> tagging, tables and images. You can also batch process, converting 
> entire directories or what have you. The results are at least as 
> good as saving the PDF to something like Word first and you could be 
> forgiven for expecting that they might even be better.

Using Word as an in-between is like flying through Mogadishu to get to
Los Angeles from Boston. It may get you there, but chances are that you'll
lose some luggage.


--

Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
   Leopoldstrasse 53-55, D-80802 Munich,
   Federal Republic of Germany
http://www.nonmonotonic.net




Copyright 1993-2007 XML.org. This site is hosted by OASIS