- From: Tony McDonald <email@example.com>
- To: firstname.lastname@example.org
- Date: Thu, 11 Feb 1999 08:54:58 +0000
>> From: "Rick Jelliffe" <email@example.com>
>> Date: Sun, 24 Jan 1999 16:15:36 +1100
>> Subject: Re: Word and XML (was: XML standards coherency and so forth)
>> From: Biron,Paul V <Paul.V.Biron@kp.ORG>
> Wow! I've been so busy lately that I haven't been able to keep up with
> XML-DEV and had no idea my "innocent" post on Word and HTML/XML had been so
> long lived!
> In truth, we've spent a great deal of time writing tools (a big daisy chain
> of FrontPage v1.1 -> hand-rolled Perl script 1 -> hand-rolled Perl script 2 ->
> etc.) just to clean up the HTML output from Word '97. What has made this all
> the more frustrating for us is that the HTML is not really what we want in the end.
> We just want a "clean" HTML version so that the transformation to the XML
> DTD that we're interested in is "easier". The BOLD and ITALIC that our
> authors see actually represent more "semantic" XML elements, e.g., <allergy>
> and <medication>. Such is life.
I don't know how far down this route you've gone, Byron, but can I
suggest using rtf2xml (http://www.sesha.com/omlette/rtf2xml/)? It
uses the limited version of Omnimark (http://www.omnimark.com) as an
engine and does a very good job of RTF -> XML conversion.
It uses Word paragraph and character styles to convert the RTF into
well-formed and valid XML, e.g.
<p stylename="List Bullet"
& Administration Information </string><string charstyname="URL"
(you can see that the additional formatting information that was in the
original Word document is provided too).
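
To illustrate the style-driven idea (this is not rtf2xml or Omnimark code, just a minimal Python sketch), assume the RTF has already been parsed into a paragraph style name plus a list of (character style, text) runs. That input shape and the function name paragraph_to_xml are hypothetical; the element and attribute names (p, stylename, string, charstyname) are taken from the sample above.

```python
import xml.etree.ElementTree as ET

def paragraph_to_xml(para_style, runs):
    """Build a <p> element from a paragraph style and (char_style, text) runs.

    The input shape is an assumption for illustration; a real RTF
    converter would produce it from the document itself.
    """
    p = ET.Element("p", {"stylename": para_style})
    for char_style, text in runs:
        # Only emit charstyname when the run actually has a character style.
        attrs = {"charstyname": char_style} if char_style else {}
        run = ET.SubElement(p, "string", attrs)
        run.text = text  # ElementTree escapes '&' and '<' on serialisation
    return ET.tostring(p, encoding="unicode")

xml_text = paragraph_to_xml(
    "List Bullet",
    [(None, "Almanack & Administration Information "),
     ("URL", "http://example.org/")],  # placeholder URL, not from the original
)
print(xml_text)
```

Building the tree through a library rather than by string concatenation means the ampersand in the text comes out escaped, so the result stays well-formed.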
I then pass this through another Omnimark program (be aware that it's
perfectly possible to create invalid and badly-formed XML at this
stage!) to get to:
<titleinfo class='subsubsection' level='3'>
<title class='subsubsection'>On-line Resources</title>
<sg_title>Organisation of Tissues</sg_title>
<subheading>Student Support and Tutoring (Computer Mediated
<item><text>Almanack & Administration Information
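
A rough Python sketch of what such a second pass might look like, with a well-formedness check on the result, is below. It is not the Omnimark program described here: the mapping table, the fallback element name para, and the function names are all hypothetical, and the check covers well-formedness only (validating against a DTD would need a validating parser, which the Python standard library does not provide).

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word paragraph style names to target elements.
STYLE_TO_ELEMENT = {
    "List Bullet": "item",
}

def upconvert(p_xml):
    """Turn one style-tagged <p> into a target element wrapping <text>."""
    src = ET.fromstring(p_xml)  # raises ET.ParseError on badly-formed input
    name = STYLE_TO_ELEMENT.get(src.get("stylename"), "para")  # 'para' is assumed
    out = ET.Element(name)
    text = ET.SubElement(out, "text")
    text.text = "".join(src.itertext())  # flatten all character runs
    return ET.tostring(out, encoding="unicode")

def is_well_formed(xml_text):
    """Return True if the text parses as XML (well-formedness only)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

source = ('<p stylename="List Bullet">'
          '<string>Almanack &amp; Administration Information</string></p>')
result = upconvert(source)
print(result, is_well_formed(result))
```

Running a check like is_well_formed on every output fragment is a cheap way to catch the badly-formed cases early, for example a bare ampersand written out by hand.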
From this XML, the conversion to another HTML (or RTF etc.) format is
I tried using the 'HTML' that Word 'emits' and had to have a lie
down...this scheme of using RTF and well marked up original documents
seems to be helping us along in our up-conversion process (whoever
chose that term knew what they were talking about - it's like
climbing, rather inching up, a vertical cliff face going backwards
with no ropes...great fun)
Dr Tony McDonald, FMCC, Networked Learning Environments Project
The Medical School, Newcastle University Tel: +44 191 222 5888
Fingerprint: 3450 876D FA41 B926 D3DD F8C3 F2D0 C3B9 8B38 18A2
xml-dev: A list for W3C XML Developers. To post, mailto:firstname.lastname@example.org
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1