Re: [xml-dev] Re: Structured from/within unstructured documents


From: Marcus Carr <mcarr@allette.com.au>
To: "Edward C. Zimmermann" <edz@bsn.com>
Date: Wed, 19 Dec 2007 10:05:44 +1100

Edward C. Zimmermann wrote:

> To be honest I've not looked at it for years-- I don't have Acrobat,
> only the reader--- but, if I recall, the "save as XML" functionality
> was part of their XML-architecture (borrowed, I think, from 
> Framemaker+SGML which I do have). This means that either the data was
> pre-tagged or one defined an appropriate mapping table.

Nope, it's a simple "save as". If the document was tagged, it will use 
whatever information it has about styles, but it doesn't depend on 
tagging to produce valid XML.

> This can be good in a controlled environment where one is converting
> from existing documentation in a consistent corporate style but is
> ill-suited for conversion of the typical wild-west mix that most
> companies tend to have.

My objective is always to get out of any proprietary file format and into 
some form of XML as quickly as possible, then assess what I've got and 
move forward from there. You have to tame the data somewhere - I prefer 
to do it in XML, but before reaching my target structure.
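
The "assess what I've got" step can be illustrated with a minimal Python sketch that tallies how often each element path occurs in the intermediate XML. The function name and the sample markup are hypothetical, invented for illustration; they are not what Acrobat's "save as" actually emits.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_path_counts(xml_text):
    """Count how often each element path occurs in an XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, prefix):
        # Build a slash-separated path such as /doc/sect/p and count it.
        path = prefix + "/" + elem.tag
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return counts


# Hypothetical intermediate XML, made up for this example.
sample = (
    "<doc><sect><h1>Title</h1><p>One</p><p>Two</p></sect>"
    "<sect><p>Three</p></sect></doc>"
)

for path, n in sorted(element_path_counts(sample).items()):
    print(f"{n:3d}  {path}")
```

A tally like this gives a quick picture of which structures dominate the raw conversion before any mapping to the target structure is attempted.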

> With the effort, I think, one is better off using old-school,
> mission-designed tools in the spirit of Omnimark or something like 
> ClearForest or any of a number of auto-tagging and content 
> categorization solutions in between.

Yep, I started coding with OmniMark when it was still XTran from 
Exoterica and it's a great tool. If you're doing whatever the modern 
equivalent is to a cross- or up-translate though, the additional 
information in the form of the XML tagging is only going to assist, I 
would have thought. I don't see the two approaches as being incompatible 
at all.

> It's an industry with a host of companies specialized in the
> conversion of data to XML using these and their own proprietary
> tools.

Sure, if you're willing to send your data out you don't care what tools 
they're using, but that wasn't what the original poster was after.

> As I wrote earlier: sucking in the metadata and marking up sentences,
> paragraphs and pages can be done with good quality in a relatively
> generic manner (adequate enough, I found, to be applied to
> all-purpose PDF indexing). You really need to decide what you need.

Agreed - identifying sentences and pages is a very different task to 
anything more concentrated on the information.
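
For a sense of how generic that kind of markup is, here is a rough Python sketch. It assumes the text has already been extracted from the PDF with pages separated by form feeds and paragraphs by blank lines. The element names (document, page, p, s) and the naive sentence splitter are assumptions made for illustration, not the approach of any tool mentioned in this thread.

```python
import re
import xml.etree.ElementTree as ET

# Naive sentence boundary: whitespace that follows ., ! or ?
# (abbreviations such as "e.g." will be split incorrectly).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def mark_up(text):
    """Wrap extracted text in page, paragraph and sentence elements."""
    root = ET.Element("document")
    for page_no, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue
        page_el = ET.SubElement(root, "page", n=str(page_no))
        # Paragraphs are assumed to be separated by blank lines.
        for para in re.split(r"\n\s*\n", page.strip()):
            para_text = " ".join(para.split())
            if not para_text:
                continue
            p_el = ET.SubElement(page_el, "p")
            for sentence in SENTENCE_END.split(para_text):
                if sentence:
                    ET.SubElement(p_el, "s").text = sentence
    return root


# Made-up sample: two pages separated by a form feed.
sample = (
    "First sentence. Second one!\n\nNew paragraph here."
    "\fPage two starts. It ends."
)
print(ET.tostring(mark_up(sample), encoding="unicode"))
```

Nothing here looks at what the text means, which is why this level of markup generalises across documents while anything concentrated on the information does not.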

> Using Word as in-between is like flying through Mogadishu to get to 
> Los Angeles from Boston. It may get you there but chances are that
> you'll lose some luggage.

I wasn't advocating it, but saving as RTF from PDF and then going to XML 
was mentioned. I was offering an alternative.

Marcus
