Oh that’s interesting to note. We took a very similar approach with a Word-to-S1000D conversion tool that we wrote. We invented some intermediate “FlatXML” schema that resolved all the cruft from
the Word XML and provided a springboard for the S1000D data module outputs. The final process is: Word XML -> FlatXML -> S1000D data modules. For that project there was a fairly normalised set of inputs so we got away with hardcoding most of the style->tag mappings in the XSLT, but it should definitely be configurable as you suggest. One thing we did which I found very useful was to implement a FlatXML -> HTML transformation. This let us quickly visualise problems in the Word source data (mis-styled or unstyled text, missing images,
hidden text fields, etc.).

// Gareth Oakes // Chief Architect, GPSL // www.gpsl.co

From: Eliot Kimber <ekimber@contrext.com>

See https://github.com/dita4publishers/org.dita4publishers.word2dita

This is an XSLT2 framework for producing sophisticated DITA document sets from styled Word documents. The input is Word OOXML.
It makes heavy use of for-each-group to deal with the flat nature of OOXML. Basically the first thing I do is convert the OOXML into a vastly simplified, but still structurally flat, XML that I call “word processing markup language”. It captures the essential information from the OOXML that is then needed to drive the remaining transformation. The transformation itself is driven by a style-to-tag map specification that defines the mapping from Word styles to DITA elements, including the hierarchical nesting of things. It is not necessarily pretty XSLT code but it does work.

I do appreciate that, as a standard, OOXML is at least documented more or less completely and that the markup design is fairly stable, as opposed to RTF, which was completely undocumented, unspecified, and inconsistent from Word version to Word version.

Cheers,

E.

-- Eliot Kimber http://contrext.com

From: Rick Jelliffe <rjelliffe@allette.com.au>

Michael Kay: But who would process OOXML using XSLT in that way? I have built several systems that generate OOXML, and one that reads it and substitutes some values, but I think XSLT (certainly
2.0) is often the wrong technology for complex processing of OOXML inputs, for example because of the flatness, the ZIP packaging, MCE, versions, and the indirection that the relationships files add. I don't think support for expressing "semantic relationships" was ever
a goal for OOXML (especially since the i4i case when MS had to disable some XML support).
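
As a rough illustration of the ZIP-plus-relationships indirection mentioned above, here is a minimal Python sketch. It is not taken from any of the tools discussed in this thread. It assumes the main part sits at word/document.xml (the conventional location, which a robust reader would look up through the package-level relationships instead), and the function names and the toy package are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and the Open Packaging Conventions.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def hyperlink_targets(docx_bytes):
    """Return (link text, target) pairs for w:hyperlink elements in the main part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        # Assumption: conventional part names; not looked up via _rels/.rels.
        document = ET.fromstring(pkg.read("word/document.xml"))
        rels = ET.fromstring(pkg.read("word/_rels/document.xml.rels"))

    # Step 1: index the relationships part by Id.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_REL}}}Relationship")
    }

    # Step 2: resolve each r:id in the document part through that index.
    links = []
    for link in document.iter(f"{{{W}}}hyperlink"):
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        links.append((text, targets.get(link.get(f"{{{R}}}id"))))
    return links


def build_toy_package():
    """Build a tiny two-part package in memory (hypothetical, not a complete .docx)."""
    document = (
        f'<w:document xmlns:w="{W}" xmlns:r="{R}"><w:body><w:p>'
        '<w:hyperlink r:id="rId1"><w:r><w:t>Example</w:t></w:r></w:hyperlink>'
        "</w:p></w:body></w:document>"
    )
    rels = (
        f'<Relationships xmlns="{PKG_REL}">'
        f'<Relationship Id="rId1" Type="{R}/hyperlink" '
        'Target="https://example.org/" TargetMode="External"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("word/document.xml", document)
        pkg.writestr("word/_rels/document.xml.rels", rels)
    return buf.getvalue()


print(hyperlink_targets(build_toy_package()))
```

The point of the sketch is only that the link target never appears in the document part itself: it has to be fetched from a second file inside the ZIP and joined on the relationship Id.
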
Doesn't it just mean that, in a general-purpose language with JSON, you would have to adopt a particular programming pattern when iterating through the JSON tree: you maintain your visitation stack to allow parent::* access and build indexes to allow keyed references?

I know what you mean by bottom-up versus top-down, and I am not sure that is exactly the case. The original XML format for Word 2003 was top-down and more like early ODF (like a neater RTF
in XML), but when they got down to the nitty gritty it became unworkable to proceed, so they had to start again. What they did the second time around was to adopt stronger top-down design patterns (Open Packaging/ZIP, macros (Markup Compatibility and Extensibility), versions, relationships, separation of concerns with stylesheets and graphics etc. in separate files, the properties pattern of attributes) and then try to pour their binary format into that, top-down. It may look like bottom-up chaos if you are just
expecting a single file, but it is systematic. (And then, the way of all flesh, when these extractions failed or were not developed in time, you ended up with lots of bottom-up carbuncles. SNAFU.)

I think we have corresponded before that I think XSD should not even be classed as a "web" technology because it does not allow validation of webs of documents; it is a file technology. XSLT 1 at least had the document() function, and XSLT 3 has a much stronger story as a web technology with xsl:source-document, the collection() function, etc. Consequently when you get to something like OOXML, the schemas provide no validation between documents in what
is a highly linked collection.

Murata-san: Oh, I am not suggesting OOXML be replaced by JSON now! Yikes!
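
On the JSON traversal pattern raised earlier in this message (a visitation stack for parent::* access plus an index for keyed references), a minimal Python sketch might look like the following. The document shape, the "id" and "ref" field names, and the helper names are all hypothetical and chosen only for illustration.

```python
import json


def walk(node, ancestors=()):
    """Yield (node, ancestors) for every dict in the tree, depth first.

    `ancestors` is the visitation stack: the chain of containing dicts,
    outermost first, so ancestors[-1] plays the role of parent::*.
    """
    if isinstance(node, dict):
        yield node, ancestors
        for value in node.values():
            yield from walk(value, ancestors + (node,))
    elif isinstance(node, list):
        # Lists are transparent: their items keep the enclosing dict as parent.
        for item in node:
            yield from walk(item, ancestors)


def build_id_index(root, key="id"):
    """Map each value of `key` to the dict carrying it (for keyed references)."""
    return {node[key]: node for node, _ in walk(root) if key in node}


doc = json.loads("""
{
  "type": "body",
  "children": [
    {"type": "para", "id": "p1", "style": "Heading1", "text": "Intro"},
    {"type": "para", "id": "p2", "text": "See the heading.", "ref": "p1"}
  ]
}
""")

index = build_id_index(doc)
for node, ancestors in walk(doc):
    if "ref" in node:
        parent = ancestors[-1]        # rough equivalent of parent::*
        target = index[node["ref"]]   # rough equivalent of a key() lookup
        print(parent["type"], "->", target["text"])
```

Passing the ancestor chain down as an immutable tuple keeps the generator free of shared mutable state; an explicit push/pop stack would do the same job with less allocation on deep trees.
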
It seems there is a schema language for JSON, JSchema, and there is a converter from XSD to JSchema. And, yes, a large data structure or document needs some method of validation. The work and
information required would not be much different. But does anyone who implements OOXML consumer applications actually read it into a DOM with the XSD and use the PSVI? (Or does anyone use the schemas for dynamic
data binding?) I suspect developers would use the schemas to generate code (i.e. stub classes for import functions) and then maintain the code by hand. Do you have a feel for this?

Regards
Rick