Re: [xml-dev] Structured from/within unstructured documents

From: "Greg Hunt" <greg@firmansyah.com>
To: "Stephen Green" <stephengreenubl@gmail.com>
Date: Sun, 16 Dec 2007 18:15:05 +1100

Stephen,
The problem with processing the physical PDF file is precisely its presentation orientation. A PDF can contain multiple instances of a paragraph, and the physical order of those instances is not necessarily chronological, so you have to process the PDF format in order to work out which instance is live. If you go down a parsing route and want complete reliability, you have to interpret the PDF file structure before you can linearise the content so that it can be parsed (obviously this is not the case with linearised PDFs as defined by the Adobe PDF Reference). In that sense I am not sure that LR(1) is going to help without a very, very complex lexer.
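
To make the "which instance is live" point concrete, here is a minimal Python sketch. It assumes a toy byte string standing in for a PDF body with one incremental update; TOY_PDF, scan_objects and live_objects are hypothetical names, and "last definition wins" is only a stand-in for what a real reader does, which is to follow the cross-reference tables and trailer chain.

```python
import re

# Toy stand-in for a PDF body with one incremental update (NOT a valid PDF).
# Object 1 appears twice: the original and a later revision appended at the end.
TOY_PDF = (
    b"1 0 obj\n(First draft of the paragraph)\nendobj\n"
    b"2 0 obj\n(Unchanged paragraph)\nendobj\n"
    b"1 0 obj\n(Revised paragraph)\nendobj\n"
)

# Matches "<number> <generation> obj ... endobj" in the toy layout above.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\n(.*?)\nendobj", re.DOTALL)


def scan_objects(data):
    """Return every (object number, body) pair in physical file order."""
    return [(int(m.group(1)), m.group(3)) for m in OBJ_RE.finditer(data)]


def live_objects(data):
    """Keep only the last definition of each object number.

    Assumption: later definitions supersede earlier ones. A real reader
    resolves this through the xref tables rather than by scanning.
    """
    live = {}
    for num, body in scan_objects(data):
        live[num] = body
    return live
```

A naive scan in physical order sees three paragraphs, including the stale first draft; only after resolving the live objects are there two.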

A perverse document can mix image and text, or even embed the text in the reverse of the order in which it is displayed.

A simple parsing approach will work most of the time, but not all of the time. PDF is a fun format, but extracting text with 100% reliability requires the equivalent of an Acrobat reader to do the parsing and the placement of glyphs on the page. Have a look in the Adobe PDF Reference for the gory details.
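
As a rough illustration of why glyph placement matters, the following Python sketch recovers reading order from position rather than from file order. The (x, y, text) runs are a hypothetical data model, as are the name reading_order and the line_tolerance parameter; in a real PDF those positions only become known after interpreting the content stream, text matrices and font encodings.

```python
# Hypothetical positioned text runs: (x, y, text), with the origin at the
# bottom-left of the page. They are deliberately stored out of reading order.
RUNS = [
    (120.0, 700.0, "world"),
    (72.0, 680.0, "second line"),
    (72.0, 700.0, "Hello"),
]


def reading_order(runs, line_tolerance=2.0):
    """Group runs into lines by y position, then order each line by x."""
    # Top of the page first (largest y), then left to right.
    ordered = sorted(runs, key=lambda r: (-r[1], r[0]))
    lines = []  # list of (y of the line, [(x, text), ...])
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]
```

Reading the runs in stored order gives "world", "second line", "Hello"; sorting by position gives the two lines a reader would actually see. This simple sort still fails on multi-column layouts and rotated text, which is the "not all of the time" case.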

Greg

On 12/16/07, Stephen Green <stephengreenubl@gmail.com> wrote:

Many thanks for these very prompt and useful pointers.

Looks like tools like Abbyy and LR(1) as a technology are
potential ways to go. I hope there are others too or that
others are developed soon to fill an obvious gap.

So parsing or OCR'ing the essentially visual representation
of unstructured data and documents makes sense as the
first step toward structured documents, lossy though these
methods are likely, it seems, to be.

I guess I was hoping we were further ahead. So it seems
so far output to PDF, etc. is a one-way, dead-end street.
Pity.

I note you can highlight text in a PDF in readers and copy
it to clipboard. Maybe tools based on such methods exist
for creating XML. Maybe one pass would create a template,
say, and then further documents of the same format (such
as in a form) could be handled automatically based on the
template - like OCR but adapted to natively handle electronic
paper. Any tools like that already which can output XML?
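
The template idea above could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the TEMPLATE mapping of element names to page regions, the (x, y, text) runs supposedly produced by some PDF text extractor, and the function name apply_template. It is not a description of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical template built from one sample document:
# element name -> bounding box (x0, y0, x1, y1) on the page.
TEMPLATE = {
    "invoiceNumber": (400, 700, 560, 720),
    "total": (400, 100, 560, 120),
}

# Hypothetical positioned text runs (x, y, text) from a further document
# that uses the same layout.
RUNS = [(410, 705, "INV-001"), (72, 705, "Invoice"), (420, 105, "99.50")]


def apply_template(template, runs, root_name="form"):
    """Build an XML element holding the text found inside each region."""
    root = ET.Element(root_name)
    for name, (x0, y0, x1, y1) in template.items():
        inside = [(x, t) for x, y, t in runs if x0 <= x <= x1 and y0 <= y <= y1]
        # Order the runs within a region left to right before joining them.
        ET.SubElement(root, name).text = " ".join(t for _, t in sorted(inside))
    return root
```

Calling ET.tostring(apply_template(TEMPLATE, RUNS), encoding="unicode") serialises the result; text outside every region, such as the "Invoice" label here, is simply dropped.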

Best regards and thanks for these and any further pointers.

--
Stephen Green

Partner
SystML, http://www.systml.co.uk
Tel: +44 (0) 117 9541606

http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice


Follow-Ups:
- Re: [xml-dev] Structured from/within unstructured documents
  - From: "Edward C. Zimmermann" <edz@bsn.com>

References:
- Structured from/within unstructured documents
  - From: "Stephen Green" <stephengreenubl@gmail.com>
- Re: [xml-dev] Structured from/within unstructured documents
  - From: Jonathan Robie <jonathan.robie@redhat.com>
- Re: [xml-dev] Structured from/within unstructured documents
  - From: "Stephen Green" <stephengreenubl@gmail.com>
