OASIS Mailing List Archives


Re: [xml-dev] Re: Structured from/within unstructured documents

Sounds good. Thanks Marcus.

On 17/12/2007, Marcus Carr <mcarr@allette.com.au> wrote:
> Stephen Green wrote:
> > What methods are there, these days, for extracting structured data from
> > unstructured documents (such as PDF)?
> Maybe I'm missing something, but I didn't see anyone suggest saving the
> PDF as XML straight from Acrobat. If you have a full licence, it does a
> pretty respectable job, getting you paragraph and character tagging,
> tables and images. You can also batch process, converting entire
> directories or what have you. The results are at least as good as saving
> the PDF to something like Word first and you could be forgiven for
> expecting that they might even be better.
> Once you're that far, you can get on your XSLT boots...
> Marcus
> _______________________________________________________________________
> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
> to support XML implementation and development. To minimize
> spam in the archives, you must subscribe before posting.
> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
> subscribe: xml-dev-subscribe@lists.xml.org
> List archive: http://lists.xml.org/archives/xml-dev/
> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
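
For anyone following the thread, here is a minimal sketch of the step after the Acrobat export: pulling paragraphs and table rows out of the saved XML. Marcus suggests XSLT for this; the sketch uses Python's standard xml.etree.ElementTree instead, purely for illustration. The element names (TaggedPDF-doc, Part, P, Span, Table, TR, TH, TD) and the sample content are assumptions, not taken from a real export, so check them against what your copy of Acrobat actually produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a file saved as XML from Acrobat.
# The tag names and content here are assumptions for illustration only.
SAMPLE = """<TaggedPDF-doc>
  <Part>
    <H1>Invoice 1001</H1>
    <P>Payment is due within <Span>30 days</Span>.</P>
    <Table>
      <TR><TH>Item</TH><TH>Qty</TH></TR>
      <TR><TD>Widget</TD><TD>4</TD></TR>
      <TR><TD>Gadget</TD><TD>2</TD></TR>
    </Table>
  </Part>
</TaggedPDF-doc>"""


def flat_text(elem):
    """Join all text inside an element, including character-level
    children such as Span, and collapse runs of whitespace."""
    return " ".join("".join(elem.itertext()).split())


def extract(xml_text):
    """Return the paragraphs and tables found in the exported XML."""
    root = ET.fromstring(xml_text)
    paragraphs = [flat_text(p) for p in root.iter("P")]
    tables = []
    for table in root.iter("Table"):
        rows = [
            [flat_text(cell) for cell in row if cell.tag in ("TH", "TD")]
            for row in table.iter("TR")
        ]
        tables.append(rows)
    return {"paragraphs": paragraphs, "tables": tables}


result = extract(SAMPLE)
print(result["paragraphs"])
print(result["tables"])
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring, and the tag names in extract would need adjusting to match the export.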

Stephen Green

SystML, http://www.systml.co.uk
Tel: +44 (0) 117 9541606

http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice


Copyright 1993-2007 XML.org. This site is hosted by OASIS