Re: Structured from/within unstructured documents
- From: "Dimitre Novatchev" <dnovatchev@yahoo.com>
- To: xml-dev@lists.xml.org
- Date: Sat, 15 Dec 2007 12:03:29 -0800
Any instance of an LR(1) language can be processed in pure XSLT, and one
possible result is an XML document.
See, for example, the json-document() function of FXSL. This function uses
the generic LR(1) parsing system of FXSL: the lr-parse() function.
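To make the "text in, XML document out" idea concrete, here is a rough Python sketch of the same kind of transformation. It is not FXSL and it is not an LR(1) parser: it swaps in Python's built-in json module for the parsing step and only illustrates what an XML result could look like. The element and attribute names (json, member, item, type, name) and the function names are my own assumptions, not the vocabulary that json-document() produces.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(value, name="value"):
    """Recursively map a parsed JSON value onto an XML element.

    The element/attribute vocabulary here is hypothetical, chosen
    only for illustration.
    """
    elem = ET.Element(name)
    if isinstance(value, dict):
        elem.set("type", "object")
        for key, child in value.items():
            member = to_xml(child, "member")
            member.set("name", key)
            elem.append(member)
    elif isinstance(value, list):
        elem.set("type", "array")
        for child in value:
            elem.append(to_xml(child, "item"))
    elif value is None:
        elem.set("type", "null")
    elif isinstance(value, bool):
        # bool must be tested before int: bool is a subclass of int.
        elem.set("type", "boolean")
        elem.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        elem.set("type", "number")
        elem.text = repr(value)
    else:
        elem.set("type", "string")
        elem.text = value
    return elem


def json_text_to_xml(text):
    """Parse JSON text (stdlib parser, not LR(1)) and serialize it as XML."""
    return ET.tostring(to_xml(json.loads(text), "json"), encoding="unicode")


if __name__ == "__main__":
    print(json_text_to_xml('{"a": [1, "x", null, true]}'))
```

The point is only that once the unstructured text has been parsed according to a grammar, emitting a structured XML equivalent is a straightforward tree walk.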
More information can be found here:
http://dnovatchev.spaces.live.com/Blog/cns!44B0A32C2CCF7488!367.entry
http://www.stylusstudio.com/xsllist/200711/post20640.html
Cheers,
Dimitre Novatchev
"Stephen Green" <stephengreenubl@gmail.com> wrote in message
news:92040e120712151004n13dec762x770cbe02afa1abb8@mail.gmail.com...
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
>
> I'm aware it is quite straightforward to extract data from semi-structured
> documents such as spreadsheets (as previous XML-Dev discussions have
> shown, such as via ODF with XSLT and macros/Ant/Ant Contrib, etc).
>
> As yet, the only way I'm aware of for doing the same from PDF would be to
> print out to paper and use OCR (sounds a little ridiculous) or maybe to
> convert PDF, etc to some XML-based or other text-based print/archive
> file somehow and go from there (perhaps with something akin to a screen-
> scraper?).
>
> Is this all there is?
>
> Plus how does one then convert the data as, say XML into some XML
> or equivalent document and embed that in, say, the PDF or equivalent
> unstructured document file (for later extraction, say)?
> I'd very much appreciate any light on this. Thank you. I'm interested not
> so much in metadata but actual data or full structured equivalents of the
> unstructured documents rather than just enough data to create an index.
>
> E.g what about patient records held in PDF and in XML formats and how
> to turn the first into the latter and/or embed the latter in the first.
>
> Best regards
>
> --
> Stephen Green
>
> Partner
> SystML, http://www.systml.co.uk
> Tel: +44 (0) 117 9541606
>
> http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice