Re: [xml-dev] Was OOXML's problem that it should have used JSON not XML?

Gareth—generating HTML from the simplified XML is an interesting idea that I had not considered.

 

The approach I took for debugging was to capture the XPath in the OOXML source of the paragraph that ultimately resulted in some DITA element in the final output.
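
For anyone wanting to reproduce that kind of traceability, here is a minimal sketch of the idea (not the actual word2dita code): stamp each generated element with a rough XPath for its source node. The element names, the source-xpath attribute, and the helper function are all invented for illustration.

<!-- Illustrative only: records a rough XPath for the source node on the
     generated element so errors can be traced back to the Word source. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:local="urn:example:local"
  exclude-result-prefixes="xs local">

  <!-- Builds a positional path such as document[1]/body[1]/p[3]. -->
  <xsl:function name="local:node-path" as="xs:string">
    <xsl:param name="node" as="node()"/>
    <xsl:sequence select="string-join(
      for $a in $node/ancestor-or-self::*
      return concat(name($a), '[',
                    count($a/preceding-sibling::*[name() = name($a)]) + 1, ']'),
      '/')"/>
  </xsl:function>

  <!-- Each source paragraph becomes an output element that remembers where it came from. -->
  <xsl:template match="p">
    <p source-xpath="{local:node-path(.)}">
      <xsl:apply-templates/>
    </p>
  </xsl:template>

</xsl:stylesheet>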

 

From that I could then go back to the original Word document, using a little scripting, to highlight the original source of the error.

 

Or rather, I could have, were it not for a design flaw in Xerces whereby it reports the *parent* of the invalid element rather than the invalid element itself (a design flaw corrected in OxygenXML’s patched version of Xerces).

 

Since the parent of all paragraphs in Word is the document, this turned out to be not very useful.

 

For my primary driving use case, transforming manuscripts for books and magazine articles into DITA, the Word documents are highly controlled, usually by an editor whose job is to prepare the Word file from whatever the author has provided, so debugging was less of an issue in practice.

 

Cheers,

 

Eliot

 

--

Eliot Kimber

http://contrext.com

 

 

From: Gareth Oakes <goakes@gpsl.co>
Date: Tuesday, January 23, 2018 at 10:09 PM
To: Eliot Kimber <ekimber@contrext.com>, "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Was OOXML's problem that it should have used JSON not XML?

 

Oh, that’s interesting to note. We took a very similar approach with a Word-to-S1000D conversion tool that we wrote. We invented an intermediate “FlatXML” schema that stripped away all the cruft from the Word XML and provided a springboard for the S1000D data module outputs. The final process is: Word XML -> FlatXML -> S1000D data modules.

 

For that project there was a fairly normalised set of inputs so we got away with hardcoding most of the style->tag mappings in the XSLT, but it should definitely be configurable as you suggest.

 

One thing we did that I found very useful was to implement a FlatXML -> HTML transformation. This let us quickly visualise problems in the Word source data (mis-styled or unstyled text, missing images, hidden text fields, etc.).
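
A minimal sketch of such a diagnostic transform might look like the following. FlatXML is not a published schema, so the element names, style names, and overall shape here are assumptions made purely for illustration.

<!-- Illustrative only: renders an assumed FlatXML shape as HTML and
     highlights paragraphs whose style is unrecognised. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="/flatxml">
    <html>
      <body>
        <xsl:apply-templates select="p"/>
      </body>
    </html>
  </xsl:template>

  <!-- Known styles render normally, carrying the style name as a CSS class. -->
  <xsl:template match="p[@style = ('Heading1', 'Heading2', 'Body')]">
    <p class="{@style}"><xsl:value-of select="."/></p>
  </xsl:template>

  <!-- Anything else is flagged so it stands out during review. -->
  <xsl:template match="p">
    <p style="background-color: #fdd">UNMAPPED STYLE
      '<xsl:value-of select="@style"/>': <xsl:value-of select="."/></p>
  </xsl:template>

</xsl:stylesheet>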

 

// Gareth Oakes

// Chief Architect, GPSL

// www.gpsl.co

 

From: Eliot Kimber <ekimber@contrext.com>
Date: Wednesday, 24 January 2018 at 13:46
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Was OOXML's problem that it should have used JSON not XML?

 

See https://github.com/dita4publishers/org.dita4publishers.word2dita

 

This is an XSLT2 framework for producing sophisticated DITA document sets from styled Word documents. The input is Word OOXML.

 

It makes heavy use of for-each-group to deal with the flat nature of OOXML. Basically the first thing I do is convert the OOXML into a vastly simplified, but still structurally flat, XML that I call “word processing markup language”. It captures the essential information from the OOXML that is then needed to drive the remaining transformation.
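
For readers who have not used it, here is a minimal, self-contained illustration of the grouping technique (not the word2dita code itself): xsl:for-each-group turns a flat run of styled paragraphs into nested sections. The wp namespace, element names, and style names are invented for the example.

<!-- Illustrative only: groups flat, styled paragraphs into sections
     starting at each Heading1 paragraph. -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:wp="urn:example:simple-wpml">

  <xsl:template match="wp:document">
    <topic>
      <!-- Each group starts at a Heading1 paragraph and runs until the next one. -->
      <xsl:for-each-group select="wp:p"
                          group-starting-with="wp:p[@style = 'Heading1']">
        <section>
          <title><xsl:value-of select="current-group()[1]"/></title>
          <!-- The remaining paragraphs in the group become ordinary blocks. -->
          <xsl:for-each select="current-group()[position() gt 1]">
            <p><xsl:value-of select="."/></p>
          </xsl:for-each>
        </section>
      </xsl:for-each-group>
    </topic>
  </xsl:template>

</xsl:stylesheet>

Real documents of course need deeper nesting (Heading2 within Heading1, lists, tables), which is where applying the same grouping idea recursively comes in.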

 

The transformation itself is driven by a style-to-tag map specification that defines the mapping from Word styles to DITA elements, including the hierarchical nesting of things.
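
As a rough illustration of the kind of declarative mapping that implies, a style-to-tag map could look something like the following; the element and attribute names are hypothetical and are not the actual word2dita configuration vocabulary.

<!-- Hypothetical style-to-tag map: names are illustrative only. -->
<style2tagmap>
  <!-- Which DITA element a Word paragraph style becomes, and how it nests. -->
  <style wordStyle="Heading1" ditaTag="title" structureRole="topicTitle" level="1"/>
  <style wordStyle="Heading2" ditaTag="title" structureRole="topicTitle" level="2"/>
  <style wordStyle="Body"     ditaTag="p"     structureRole="block"/>
  <style wordStyle="Bullet"   ditaTag="li"    structureRole="block" container="ul"/>
  <style wordStyle="Emphasis" ditaTag="i"     structureRole="inline"/>
</style2tagmap>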

 

It is not necessarily pretty XSLT code, but it does work.

 

I do appreciate that, as a standard, OOXML is at least documented more or less completely and that the markup design is fairly stable, as opposed to RTF, which was completely undocumented, unspecified, and inconsistent from Word version to Word version.

 

Cheers,

 

E.

--

Eliot Kimber

http://contrext.com

 

 

From: Rick Jelliffe <rjelliffe@allette.com.au>
Date: Tuesday, January 23, 2018 at 8:15 PM
To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Was OOXML's problem that it should have used JSON not XML?

 

Michael Kay: But who would process OOXML using XSLT in that way? I have built several systems that generate OOXML, and one that reads it and substitutes some values, but I think XSLT (certainly 2.0) is often the wrong technology for complex processing of OOXML inputs, for example because the flatness, the ZIP packaging, MCE, versioning, and the relationships files all add to the indirection. I don't think support for expressing "semantic relationships" was ever a goal for OOXML (especially since the i4i case, when MS had to disable some XML support).

 

Doesn't it just mean that, in a general-purpose language with JSON, you would have to adopt a particular programming pattern when iterating through the JSON tree: you maintain your own visitation stack to allow parent::* access and build indexes to allow keyed references?
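
Purely to illustrate that pattern, here is a sketch in XSLT 3.0 over XDM maps (which is what parse-json() produces, and which likewise have no parent axis): the walker carries the ancestor key path itself through each recursive call. The function name and namespace are invented, and arrays are ignored to keep it short; a keyed index would just be a similar first pass that collects values into a map.

<!-- Illustrative only: maps/arrays have no parent axis, so the walker
     threads the ancestor key path through each recursive call. -->
<xsl:stylesheet version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:map="http://www.w3.org/2005/xpath-functions/map"
  xmlns:ex="urn:example:json-walk"
  exclude-result-prefixes="#all">

  <!-- Returns one 'path = value' string per leaf, e.g. 'a/b = 1'. -->
  <xsl:function name="ex:walk" as="xs:string*">
    <xsl:param name="value" as="item()"/>
    <xsl:param name="path" as="xs:string*"/>
    <xsl:sequence select="
      if ($value instance of map(*))
      then map:for-each($value, function($k, $v) { ex:walk($v, ($path, string($k))) })
      else string-join($path, '/') || ' = ' || string($value)"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:value-of select="ex:walk(map { 'a': map { 'b': 1, 'c': 2 } }, ())"
                  separator="&#10;"/>
  </xsl:template>

</xsl:stylesheet>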

 

I know what you mean by bottom-up versus top-down, and I am not sure that is exactly the case. The original XML format for Word 2003 was top-down and more like early ODF (like a neater RTF in XML), but when they got down to the nitty gritty it became unworkable to proceed, so they had to start again. What they did the second time around was apply stronger top-down design patterns [Open Packaging/ZIP, MCE (Markup Compatibility and Extensibility), versions, relationships, separation of concerns with stylesheets and graphics etc. in separate files, the properties pattern of attributes] and then try to pour their binary format into that, top-down. It may look like bottom-up chaos if you are just expecting a single file, but it is systematic. (And then, the way of all flesh, when these abstractions failed or were not developed in time, you ended up with lots of bottom-up carbuncles. SNAFU.)

 

I think we have corresponded before about this: XSD should not even be classed as a "web" technology because it does not allow validation of webs of documents; it is a file technology. XSLT 1 at least had the document() function, and XSLT 3 has a much stronger story as a web technology with xsl:source-document, the collection() function, and so on. Consequently, when you get to something like OOXML, the schemas provide no validation between documents in what is a highly linked collection.

 

Murata-san:  Oh, I am not suggesting OOXML be replaced by JSON now!  Yikes! 

 

It seems there is a schema language for JSON, JSchema, and there is a converter from XSD to JSchema.  And, yes, a large data structure or document needs some method of validation.  The work and information required would not be much different. 

 

But does anyone who implements OOXML consumer applications actually read it into a DOM with the XSD and use the PSVI? (Or does anyone use the schemas for dynamic data binding?) I suspect developers would use the schemas to generate code (i.e., stub classes for import functions) and then maintain the code by hand. Do you have a feel for this?

(XML-DEVers may not be aware, but one of Murata-san's jobs for the last 10 years has been diligently working through the QA on ISO OOXML: trying to keep up with a moving target, correcting places where the initial documentation was wrong, speculative, or incomplete, and making sure it has the information that stakeholders, such as non-MS developers, require. Very important work, IMHO.)

Regards

Rick

 

 

 


