Re: [xml-dev] Structured from/within unstructured documents
- From: "Edward C. Zimmermann" <edz@bsn.com>
- To: "Greg Hunt" <greg@firmansyah.com>, "Stephen Green" <stephengreenubl@gmail.com>
- Date: Sun, 16 Dec 2007 11:43:02 +0100
On Sun, 16 Dec 2007 18:15:05 +1100, Greg Hunt wrote
> Stephen,
> The problem with processing the physical PDF file is precisely its
> presentation orientation.
You have to render PDF (at least internally, into a buffer). It's a format with
a graphical "language" not unlike (and built upon) PostScript in which,
among a host of other features, each individual character can be positioned.
Popular "freely" available PDF tools that can be used to "extract text"
include Adobe's Acrobat Reader, Derek Noonburg's Xpdf, Poppler
and Ghostscript. MS Windows includes a "filter mechanism" called IFilter
for its own search. Among the filters available for it is, apparently, one
supplied by Adobe and intended for the extraction of text from PDF.
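
As a rough illustration of driving one of these tools from a script, here is
a minimal Python sketch around the pdftotext program that ships with Xpdf and
Poppler. The function names are my own, and it assumes pdftotext is installed
and on the PATH.

```python
import subprocess

def pdftotext_command(pdf_path, keep_layout=False):
    """Build an argument list for the pdftotext tool (Xpdf / Poppler)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout (columns etc.)
        cmd.append("-layout")
    # "-" as the output file name sends the text to standard output
    cmd += [pdf_path, "-"]
    return cmd

def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(pdftotext_command(pdf_path, keep_layout),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Without -layout the tool tries to emit text in reading order; with it, the
output mimics the page, which is closer to a screen scrape.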
> A perverse document can mix image and text or even embed the text in the
> reverse order that it would be displayed in.
Not really all that uncommon: calculating glyph positions from the right.
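
To make that concrete, a small Python sketch: if the glyphs of a line arrive
in right-to-left drawing order, the displayed order can be recovered from
their positions. The Glyph record and the coordinates are made up for the
example, and it assumes a single line of a left-to-right script.

```python
from collections import namedtuple

# Hypothetical record: one placed character with its page coordinates.
Glyph = namedtuple("Glyph", "x y char")

def line_in_display_order(glyphs):
    """Return the characters of one text line sorted left to right by x."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))

# The stream places the glyphs from the right, so it reads "FDP" as stored.
stream = [Glyph(30, 700, "F"), Glyph(20, 700, "D"), Glyph(10, 700, "P")]
print(line_in_display_order(stream))  # PDF
```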
>
In rendering, however, you need or want to keep paragraph blocks together
and **not** (as is the case with a "screen scrape" of a display-rendered page)
preserve the columns and visual flow elements, as these not only make it
much more difficult to extract simple things like sentences but also
don't deliver any contextual information. That a text was set in two columns
with a center picture is a result of its chosen style and not of its content
structure: recall that the same document can look different on different
output devices. Linking structural semantics to many of these style elements
is tenuous.
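
A toy Python sketch of the difference, assuming the renderer already hands us
paragraph blocks with bounding boxes on a simple two-column page. The Block
record, the coordinates and the "split at the page middle" rule are all
assumptions for illustration; real layouts need far more care.

```python
from collections import namedtuple

# Hypothetical paragraph block; y grows downward from the top of the page.
Block = namedtuple("Block", "x0 y0 x1 y1 text")

def reading_order(blocks, page_width):
    """Order blocks column by column, keeping each column's text together."""
    mid = page_width / 2
    def key(b):
        column = 0 if (b.x0 + b.x1) / 2 < mid else 1
        return (column, b.y0)
    return [b.text for b in sorted(blocks, key=key)]

def screen_scrape_order(blocks):
    """Order blocks row by row, as a naive scrape of the rendered page would."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0))]

page = [
    Block(320, 100, 550, 200, "B1"), Block(50, 220, 280, 320, "A2"),
    Block(50, 100, 280, 200, "A1"), Block(320, 220, 550, 320, "B2"),
]
print(reading_order(page, page_width=600))  # ['A1', 'A2', 'B1', 'B2']
print(screen_scrape_order(page))            # ['A1', 'B1', 'A2', 'B2']
```

The scrape order interleaves the two columns, which is exactly what breaks
sentence extraction.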
--
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
http://www.nonmonotonic.net