Re: [xml-dev] Re: Structured from/within unstructured documents
- From: "Edward C. Zimmermann" <edz@bsn.com>
- To: Marcus Carr <mcarr@allette.com.au>, xml-dev@lists.xml.org
- Date: Tue, 18 Dec 2007 12:46:55 +0100
On Tue, 18 Dec 2007 10:17:31 +1100, Marcus Carr wrote
> Stephen Green wrote:
>
> > What methods are there, these days, for extracting structured data from
> > unstructured documents (such as PDF)?
>
> Maybe I'm missing something, but I didn't see anyone suggest saving
> the PDF as XML straight from Acrobat. If you have a full licence, it
To be honest, I've not looked at it for years (I don't have Acrobat,
only the Reader), but, if I recall, the "save as XML" functionality was
part of their XML architecture (borrowed, I think, from FrameMaker+SGML,
which I do have). This means that either the data was pre-tagged or one
defined an appropriate mapping table. This can be good in a controlled
environment where one is converting existing documentation in a
consistent corporate style, but it is ill-suited to converting the typical
wild-west mix that most companies tend to have. Given the effort involved,
I think one is better off using old-school, purpose-built tools in the spirit
of OmniMark, or something like ClearForest, or any of a number of auto-tagging
and content-categorization solutions in between. It's an industry with a host
of companies specialized in converting data to XML using these and
their own proprietary tools.
As I wrote earlier: sucking in the metadata and marking up sentences,
paragraphs and pages can be done with good quality in a relatively generic
manner (adequate enough, I found, to be applied to all-purpose PDF
indexing). You really need to decide what you need.
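To make the generic approach concrete, here is a minimal sketch, assuming
the text has already been pulled out of the PDF page by page by some
extraction tool (not shown). The function name markup_pages, the element
names (document, metadata, field, page, para, s) and the splitting
heuristics are all illustrative assumptions, not what any particular
product does:

```python
import re
import xml.etree.ElementTree as ET


def markup_pages(pages, metadata=None):
    """Wrap already-extracted page text in page/para/s elements.

    `pages` is a list of strings, one per page. `metadata` is an optional
    dict of string keys and values. Both are assumed to come from a
    separate PDF extraction step.
    """
    root = ET.Element("document")
    meta = ET.SubElement(root, "metadata")
    for key, value in (metadata or {}).items():
        ET.SubElement(meta, "field", name=key).text = value

    for number, text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", n=str(number))
        # Heuristic: one or more blank lines separate paragraphs.
        for block in re.split(r"\n\s*\n", text.strip()):
            if not block.strip():
                continue
            para = ET.SubElement(page, "para")
            # Undo hard line wraps inside the paragraph.
            flat = " ".join(block.split())
            # Heuristic: a sentence ends at . ! or ? followed by whitespace.
            # This will mis-split abbreviations such as "e.g. this".
            for sentence in re.split(r"(?<=[.!?])\s+", flat):
                if sentence:
                    ET.SubElement(para, "s").text = sentence
    return root
```

Calling ET.tostring(markup_pages(pages), encoding="unicode") yields the
tagged text as a string. Anything beyond this level (tables, headings,
domain-specific fields) is where the purpose-built tools earn their keep.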
> does a pretty respectable job, getting you paragraph and character
> tagging, tables and images. You can also batch process, converting
> entire directories or what have you. The results are at least as
> good as saving the PDF to something like Word first and you could be
> forgiven for expecting that they might even be better.
Using Word as an in-between is like flying through Mogadishu to get to
Los Angeles from Boston. It may get you there, but chances are that you'll
lose some luggage.
--
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
http://www.nonmonotonic.net