Re: Structured from/within unstructured documents


From: "Dimitre Novatchev" <dnovatchev@yahoo.com>
To: xml-dev@lists.xml.org
Date: Sat, 15 Dec 2007 12:03:29 -0800

Any instance of an LR(1) language can be parsed in pure XSLT, and one 
possible result of the parse is an XML document.

See, for example, the
   json-document()
function of FXSL. This function uses FXSL's generic LR(1) parsing system: 
the lr-parse() function.
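
To make the idea concrete, here is a minimal sketch of table-driven
shift-reduce parsing that emits XML. It is written in Python rather than
XSLT and is not FXSL code: the toy grammar (nested lists of integers), the
hand-built tables, the element names <array> and <number>, and the function
name parse_to_xml() are all assumptions made for illustration only.

```python
import re

# Toy grammar (an assumption for illustration, not FXSL's JSON grammar):
#   1: V -> num      2: V -> [ ]      3: V -> [ L ]
#   4: L -> V        5: L -> L , V
# production number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {1: ("V", 1), 2: ("V", 2), 3: ("V", 3), 4: ("L", 1), 5: ("L", 3)}

# Hand-built SLR(1) tables for the grammar above.
SHIFT = {
    (0, "num"): 2, (0, "["): 3,
    (3, "num"): 2, (3, "["): 3, (3, "]"): 4,
    (5, "]"): 7, (5, ","): 8,
    (8, "num"): 2, (8, "["): 3,
}
REDUCE = {2: 1, 4: 2, 6: 4, 7: 3, 9: 5}          # state -> production
FOLLOW = {"V": {"$", "]", ","}, "L": {"]", ","}}  # legal lookaheads per reduce
GOTO = {(0, "V"): 1, (3, "V"): 6, (3, "L"): 5, (8, "V"): 9}
ACCEPT_STATE = 1

# Semantic actions: build an XML fragment from the values of the RHS symbols.
ACTIONS = {
    1: lambda a: f"<number>{a[0]}</number>",
    2: lambda a: "<array/>",
    3: lambda a: f"<array>{a[1]}</array>",
    4: lambda a: a[0],
    5: lambda a: a[0] + a[2],
}

TOKEN = re.compile(r"\s*(?:(\d+)|([\[\],]))")


def tokenize(text):
    """Split the input into (kind, lexeme) pairs, ending with the '$' marker."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.group(1) is not None:
            tokens.append(("num", m.group(1)))
        else:
            tokens.append((m.group(2), m.group(2)))
        pos = m.end()
    tokens.append(("$", ""))
    return tokens


def parse_to_xml(text):
    """Run the shift-reduce loop and return the XML built by the actions."""
    tokens = tokenize(text)
    states, values, i = [0], [], 0
    while True:
        state = states[-1]
        kind, lexeme = tokens[i]
        if (state, kind) in SHIFT:
            # Shift: push the next state and the token's text.
            states.append(SHIFT[(state, kind)])
            values.append(lexeme)
            i += 1
        elif state in REDUCE and kind in FOLLOW[PRODUCTIONS[REDUCE[state]][0]]:
            # Reduce: pop the RHS, run the action, then follow the GOTO entry.
            prod = REDUCE[state]
            lhs, n = PRODUCTIONS[prod]
            args = values[-n:]
            del values[-n:]
            del states[-n:]
            values.append(ACTIONS[prod](args))
            states.append(GOTO[(states[-1], lhs)])
        elif state == ACCEPT_STATE and kind == "$":
            return values[0]
        else:
            raise SyntaxError(f"unexpected {kind!r} at token {i}")
```

Calling parse_to_xml("[1, [2, 3], []]") walks the tables and returns one
nested <array> element; malformed input such as "[1,]" raises SyntaxError.
The same loop can in principle be driven by tables for any LR(1) grammar,
which is the sense in which a generic parser can turn text into XML.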

More information can be found here:

   http://dnovatchev.spaces.live.com/Blog/cns!44B0A32C2CCF7488!367.entry

   http://www.stylusstudio.com/xsllist/200711/post20640.html


Cheers,
Dimitre Novatchev


"Stephen Green" <stephengreenubl@gmail.com> wrote in message 
news:92040e120712151004n13dec762x770cbe02afa1abb8@mail.gmail.com...
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
>
> I'm aware it is quite straightforward to extract data from semi-structured
> documents such as spreadsheets (as previous XML-Dev discussions have
> shown, such as via ODF with XSLT and macros/Ant/Ant Contrib, etc).
>
> As yet, the only way I'm aware of for doing the same from PDF would be to
> print out to paper and use OCR (sounds a little ridiculous) or maybe to
> convert PDF, etc. to some XML-based or other text-based print/archive
> file somehow and go from there (perhaps with something akin to a screen-
> scraper?).
>
> Is this all there is?
>
> Plus, how does one then convert the data into, say, XML or an
> equivalent document and embed that in, say, the PDF or equivalent
> unstructured document file (for later extraction, say)?
>
> I'd very much appreciate any light on this. Thank you. I'm interested not
> so much in metadata but in actual data or full structured equivalents of
> the unstructured documents, rather than just enough data to create an index.
>
> E.g., what about patient records held in PDF and in XML formats, and how
> to turn the former into the latter and/or embed the latter in the former?
>
> Best regards
>
> -- 
> Stephen Green
>
> Partner
> SystML, http://www.systml.co.uk
> Tel: +44 (0) 117 9541606
>
> http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice
