   XML cleanup for Word 2K documents

  • From: "Simon St.Laurent" <simonstl@simonstl.com>
  • To: xml-dev@lists.xml.org
  • Date: Mon, 31 Jul 2000 12:24:23 -0400

At 08:12 AM 7/31/00 -0700, Chris Lovett wrote:
>General XML authoring was not a stated goal.  It does however embed some
>islands of well-formed XML inside the HTML pages.  This is intended for
>Office use only.  If you can figure out how to post-process the HTML to
>extract and manipulate this XML then more power to you.

It may be a bit early to announce, but since everyone's talking about it...


I've been working on exactly such a post-processor, which filters Word 2K
(and I think Office 2K) files before they go through an XML parser.
Technically, it's a Java FilterReader.  By wrapping your parser input in
this filter, you allow the code to track the characters as they come in, making
the necessary syntactical modifications to turn them into legitimate XML.
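
To give a feel for the FilterReader approach, here is a minimal sketch.  It
is not the filter described above: the class name QuoteAttributesReader and
its filter() helper are made up for illustration, and it assumes just one
kind of fix, putting quotes around unquoted attribute values such as
class=MsoNormal.  It makes no attempt to handle comments, conditionals, or
script content, and it does not override skip() or mark().

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical example class, not the actual Word 2K filter.
class QuoteAttributesReader extends FilterReader {
    private enum State { TEXT, TAG, AFTER_EQUALS, QUOTED, UNQUOTED }

    private State state = State.TEXT;
    private char quote;                                   // delimiter of the current quoted value
    private final StringBuilder pending = new StringBuilder(); // output not yet handed to the caller
    private int pendingPos = 0;

    QuoteAttributesReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        while (pendingPos >= pending.length()) {
            pending.setLength(0);
            pendingPos = 0;
            int c = in.read();
            if (c == -1) {
                return -1;
            }
            transform((char) c);                          // appends one or more output characters
        }
        return pending.charAt(pendingPos++);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + count++] = (char) c;
        }
        return count == 0 ? -1 : count;
    }

    // Tracks where we are in the markup and adds quotes where a value lacks them.
    private void transform(char c) {
        switch (state) {
            case TEXT:
                if (c == '<') state = State.TAG;
                pending.append(c);
                break;
            case TAG:
                if (c == '=') state = State.AFTER_EQUALS;
                else if (c == '>') state = State.TEXT;
                pending.append(c);
                break;
            case AFTER_EQUALS:
                if (c == '"' || c == '\'') {
                    quote = c;
                    state = State.QUOTED;
                    pending.append(c);
                } else if (c == '>') {                    // "attr=>": emit an empty value
                    pending.append("\"\"").append(c);
                    state = State.TEXT;
                } else if (Character.isWhitespace(c)) {
                    pending.append(c);
                } else {                                  // start of an unquoted value
                    pending.append('"').append(c);
                    state = State.UNQUOTED;
                }
                break;
            case QUOTED:
                if (c == quote) state = State.TAG;
                pending.append(c);
                break;
            case UNQUOTED:
                if (Character.isWhitespace(c) || c == '>') {
                    pending.append('"').append(c);        // close the value we opened
                    state = (c == '>') ? State.TEXT : State.TAG;
                } else {
                    pending.append(c);
                }
                break;
        }
    }

    // Convenience helper for trying the filter on a string.
    static String filter(String html) {
        try (Reader r = new QuoteAttributesReader(new StringReader(html))) {
            StringBuilder out = new StringBuilder();
            for (int c = r.read(); c != -1; c = r.read()) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints: <p class="MsoNormal" align='left'>a=b</p>
        System.out.println(filter("<p class=MsoNormal align='left'>a=b</p>"));
    }
}
```

In real use the wrapped Reader would be handed to the parser (for instance
through an org.xml.sax.InputSource) rather than drained into a string.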

Apart from a few empty HTML elements, Word does a pretty good job of
presenting clean structures in its HTML output, but not clean syntax.
Pretty good, of course, doesn't make it XML, but that's what this filter
is for.

This isn't a general XHTML clean-up program like Tidy - it only works for
O2K files, and may introduce problems in well-formed XHTML documents.  It
preserves all of the information stored in the O2K file, including the
strange conditionals Microsoft uses, though these are converted into an
element with an attribute.
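
As a rough illustration of that conversion, the sketch below rewrites the
<![if ...]> / <![endif]> pairs into an element.  The element name
"conditional" and the attribute name "test" are placeholders I picked for
this example, not necessarily what the filter emits, and the code works on
a whole string with regular expressions rather than on a stream.  It only
touches the bare <![if ...]> form and leaves the <!--[if ...]> comment form
alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; element and attribute names are assumptions.
class ConditionalRewriter {
    // Matches <![if expr]> and captures expr.
    private static final Pattern IF = Pattern.compile("<!\\[if\\s+([^\\]]*)\\]>");
    // Matches <![endif]> but not the comment form <![endif]-->.
    private static final Pattern ENDIF = Pattern.compile("<!\\[endif\\]>");

    static String rewrite(String html) {
        Matcher m = IF.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String expr = escape(m.group(1).trim());
            String open = "<conditional test=\"" + expr + "\">";
            m.appendReplacement(sb, Matcher.quoteReplacement(open));
        }
        m.appendTail(sb);
        return ENDIF.matcher(sb).replaceAll("</conditional>");
    }

    // Escapes characters that are not allowed raw inside an attribute value.
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // prints: <conditional test="!supportLists"><span>1.</span></conditional>
        System.out.println(rewrite("<![if !supportLists]><span>1.</span><![endif]>"));
    }
}
```

Keeping the original expression in an attribute means nothing is thrown
away, so later processing can still decide what to do with each block.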

The filter doesn't remove any of Microsoft's XML or HTML, leaving it all
there for later processing with XSLT, the DOM, or the XML tool of your choice.
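
For example, once the markup is well-formed, the DOM can pull values out of
the Office-specific elements.  The sketch below is only an illustration:
the class and method names are made up, the sample document and author name
are invented test data, and it assumes the Office elements sit in the
urn:schemas-microsoft-com:office:office namespace with the usual "o" prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class for post-processing already-filtered output.
class OfficeIslandDemo {
    // Assumed namespace URI for the "o:" elements.
    static final String OFFICE_NS = "urn:schemas-microsoft-com:office:office";

    // Returns the text of the first Office element with the given local name, or null.
    static String firstOfficeText(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList hits = doc.getElementsByTagNameNS(OFFICE_NS, localName);
            return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse filtered document", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample standing in for filtered Word output.
        String filtered =
            "<html xmlns:o=\"" + OFFICE_NS + "\"><head>"
            + "<o:DocumentProperties><o:Author>A. Writer</o:Author></o:DocumentProperties>"
            + "</head><body/></html>";
        System.out.println(firstOfficeText(filtered, "Author")); // prints: A. Writer
    }
}
```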

You don't need to have Office 2000 to use the code - it only requires Java
1.1 or higher.  I haven't tested it extensively, just a few dozen Word
files, but so far it seems to do okay.  It's not pretty code, though I'm
cleaning and documenting as I find time.  I only got Word 2K last week, so
this is early, probably too early.

Test reports are welcome, as are code contributions.  

Simon St.Laurent
XML Elements of Style / XML: A Primer, 2nd Ed.
http://www.simonstl.com - XML essays and books

