Re: [xml-dev] Re: Structured from/within unstructured documents

On Tue, 18 Dec 2007 10:17:31 +1100, Marcus Carr wrote
> Stephen Green wrote:
> > What methods are there, these days, for extracting structured data from
> > unstructured documents (such as PDF)?
> Maybe I'm missing something, but I didn't see anyone suggest saving 
> the PDF as XML straight from Acrobat. If you have a full licence, it 

To be honest, I've not looked at it for years (I don't have Acrobat,
only the Reader), but, if I recall, the "save as XML" functionality was
part of their XML architecture (borrowed, I think, from FrameMaker+SGML,
which I do have). This means that either the data was pre-tagged or one
defined an appropriate mapping table. This can work well in a controlled
environment where one is converting existing documentation written in a
consistent corporate style, but it is ill-suited to converting the typical
wild-west mix that most companies tend to have. Given the effort involved,
I think one is better off using old-school, purpose-built tools in the spirit
of OmniMark, or something like ClearForest, or any of a number of auto-tagging
and content categorization solutions in between. There is a whole industry
of companies specialized in converting data to XML using these and
their own proprietary tools.

As I wrote earlier: sucking in the metadata and marking up sentences,
paragraphs and pages can be done with good quality in a relatively generic
manner (adequate enough, I found, to be applied to all-purpose PDF
indexing). You really need to decide what you need.
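
To make the generic pass concrete, here is a minimal Python sketch. It is
not any particular product's method. It assumes the text has already been
extracted page by page by some other tool, and the function name
markup_pages, the element names (document, metadata, field, page, p, s)
and the naive sentence rule are all illustrative assumptions of mine.

```python
import re
import xml.etree.ElementTree as ET

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
# Real text (abbreviations, decimals) needs something smarter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Paragraph boundary: one or more blank lines.
_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def markup_pages(pages, metadata=None):
    """Wrap already-extracted page texts in page/paragraph/sentence markup.

    pages    -- list of strings, one per page (hypothetical input format)
    metadata -- optional dict of document-level fields, e.g. {"Title": "..."}
    Returns the root xml.etree.ElementTree.Element.
    """
    root = ET.Element("document")
    meta = ET.SubElement(root, "metadata")
    for key, value in (metadata or {}).items():
        ET.SubElement(meta, "field", name=key).text = value

    for number, page_text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", n=str(number))
        for para_text in _PARAGRAPH_BREAK.split(page_text):
            # Collapse line wraps and runs of whitespace inside a paragraph.
            para_text = " ".join(para_text.split())
            if not para_text:
                continue
            para = ET.SubElement(page, "p")
            for sentence in _SENTENCE_END.split(para_text):
                if sentence:
                    ET.SubElement(para, "s").text = sentence
    return root


if __name__ == "__main__":
    demo = ["First sentence. Second one!\n\nNew paragraph here.", "Page two text."]
    print(ET.tostring(markup_pages(demo, {"Title": "Demo"}), encoding="unicode"))
```

Anything beyond this level (tables, headings, domain-specific tagging) is
where the generic approach stops and the decision about what you actually
need begins.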

> does a pretty respectable job, getting you paragraph and character 
> tagging, tables and images. You can also batch process, converting 
> entire directories or what have you. The results are at least as 
> good as saving the PDF to something like Word first and you could be 
> forgiven for expecting that they might even be better.

Using Word as an in-between is like flying through Mogadishu to get to
Los Angeles from Boston. It may get you there, but chances are that you'll
lose some luggage.


Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
   Leopoldstrasse 53-55, D-80802 Munich,
   Federal Republic of Germany
