xml-dev - RE: [xml-dev] XSL for non-XML input (Was: Re: [xml-dev] XML Hangover)


To: "'Pete Cordell'" <petexmldev@tech-know-ware.com>,<xml-dev@lists.xml.org>
Subject: RE: [xml-dev] XSL for non-XML input (Was: Re: [xml-dev] XML Hangover)
From: "Michael Kay" <mike@saxonica.com>
Date: Wed, 13 Jul 2005 09:43:47 +0100
In-reply-to: <012f01c58784$9462c160$b700a8c0@RW>
Thread-index: AcWHhOSPtkZKzKBZSgmjAx+zH/RjSQAAVLIw

Does that imply that XSL could be used on any hierarchical, text-based data input, given a suitable front end that can extract events mirroring XML's elements, attributes, etc.?

Yes. I demonstrated this way back in the first edition of XSLT Programmer's Reference with an example that front-ended XSLT with a SAX-compliant parser taking the (non-XML) GEDCOM file format as its input. More commonly, XSLT is used with John Cowan's TagSoup parser as a front-end, allowing it to take ill-formed HTML as input.
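To make the front-end idea concrete, here is a minimal Python sketch (not the code from the book, which is not reproduced in this message). It reads GEDCOM-style lines of the form "level [xref] tag [value]" and fires SAX-style events at a standard content handler, so anything downstream sees what looks like an XML document. The function name gedcom_to_sax, the GED wrapper element and the ID attribute are assumptions made for illustration, and the input is assumed to be well formed.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def gedcom_to_sax(lines, handler):
    """Turn GEDCOM-style lines into SAX events (hypothetical helper)."""
    handler.startDocument()
    handler.startElement("GED", AttributesImpl({}))  # assumed wrapper element
    open_tags = []  # stack of open element names; its depth equals the level
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        level_text, rest = line.split(" ", 1)
        level = int(level_text)
        attrs = {}
        if rest.startswith("@"):
            # Cross-reference form: "0 @I1@ INDI"
            xref, rest = rest.split(" ", 1)
            attrs["ID"] = xref.strip("@")
        tag, _, value = rest.partition(" ")
        # A line at level N closes every element still open at level >= N.
        while len(open_tags) > level:
            handler.endElement(open_tags.pop())
        handler.startElement(tag, AttributesImpl(attrs))
        open_tags.append(tag)
        if value:
            handler.characters(value)
    while open_tags:
        handler.endElement(open_tags.pop())
    handler.endElement("GED")
    handler.endDocument()


if __name__ == "__main__":
    sample = ["0 @I1@ INDI", "1 NAME John /Smith/", "2 GIVN John", "1 SEX M"]
    out = io.StringIO()
    gedcom_to_sax(sample, XMLGenerator(out, encoding="utf-8"))
    print(out.getvalue())
```

XMLGenerator is used here only to make the events visible; in a real setup the handler would be whatever the XSLT processor supplies for receiving its source document.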

As an alternative, XSLT 2.0 allows you to take plain text files as input using the unparsed-text() function, and analyze them within your stylesheet code using new regular-expression handling and grouping instructions. This makes most text-based data file formats accessible to XSLT processing.
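The same idea can be sketched outside XSLT. The Python fragment below is only an analogue of the unparsed-text() plus regular-expression approach, not XSLT 2.0 code: it takes the raw text of a "key = value" file, matches each line against a regular expression, and builds a tree from the matches. The function name text_to_tree and the properties/property vocabulary are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# One "key = value" pair per line; anything else is ignored.
LINE = re.compile(r"^(?P<key>[^=\s]+)\s*=\s*(?P<value>.*)$")


def text_to_tree(text):
    """Build an element tree from plain 'key = value' text (hypothetical helper)."""
    root = ET.Element("properties")
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # non-matching lines, such as comments, are skipped
        prop = ET.SubElement(root, "property", {"name": match.group("key")})
        prop.text = match.group("value")
    return root


if __name__ == "__main__":
    tree = text_to_tree("host = example.org\n# a comment\nport=8080\n")
    print(ET.tostring(tree, encoding="unicode"))
```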

See http://www.idealliance.org/proceedings/xml04/papers/111/mhk-paper.htm

Michael Kay

Going further, and bearing in mind the idea of using out-of-band data (e.g. a schema) to supply the extra information needed to complete 'binary XML', could XSL (with suitable front ends) work on, say, an ASN.1-encoded X.509 certificate (plus its ASN.1 message definition) and produce, say, PDF output?

Not that I have a need to do that right now! I'm just interested to know whether XSL can be used as a kind of universal data translator.

Thanks,

Pete.
--
=============================================
Pete Cordell
Tech-Know-Ware Ltd
-----------------------------------------------------------------
                         for XML to C++ data binding visit
                         http://www.tech-know-ware.com/lmx
                         (or http://www.xml2cpp.com)
=============================================

----- Original Message -----

From: Michael Kay

To: 'Joe Schaffner' ; xml-dev@lists.xml.org

Sent: Monday, July 11, 2005 9:00 PM

Subject: RE: [xml-dev] XML Hangover

I've been reading the XML literature. It's great. Just a few comments:

Welcome on board. It's refreshing to get thoughtful comments from someone who's new to the game.

XSL - XML Stylesheets is divided into two parts, XSL-T and XSL-FO.

The T part deals with templates and translation. Since HTML is valid XML, I guess I can parse my HTML using XSL-T to produce XML and vice versa. I don't understand why XSL-T refers to "nodes in an output tree". This suggests some kind of internal representation, but XML is a perfectly good representation language. Don't <templates> merely write XML text to stdout?

No, the result tree is completely abstract; there is no suggestion of an internal representation. In fact, for many XSLT processors, the "result tree" is represented internally as a stream of events, not as a linked collection of objects in memory. The concept of writing a tree, rather than writing text, is nevertheless extremely important. Firstly, it defines a separation of the information content of an XML document from the accidental aspects of its lexical representation - something that is sadly missing from the XML spec itself. In turn, this gives you a basis for defining a concise set of operators that are in some sense complete and composable, and that exhibit closure. In practical terms, it gives you the ability to write a series of transformations - a pipeline - in which the expensive steps of serializing and parsing intermediate results can be eliminated.
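The pipeline point can be illustrated with a small Python sketch. This is an analogue rather than XSLT: each step takes a tree and returns a new tree, so steps compose freely and the result is serialized only once, at the very end. The step names drop_empty and sort_by_text and the helper run_pipeline are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def drop_empty(root):
    """Step 1: return a new tree without children that have no text."""
    out = ET.Element(root.tag, root.attrib)
    out.extend([copy.deepcopy(c) for c in root if (c.text or "").strip()])
    return out


def sort_by_text(root):
    """Step 2: return a new tree with children ordered by their text."""
    out = ET.Element(root.tag, root.attrib)
    out.extend(sorted((copy.deepcopy(c) for c in root),
                      key=lambda c: c.text or ""))
    return out


def run_pipeline(root, *steps):
    """Feed the tree through each step; nothing is serialized in between."""
    for step in steps:
        root = step(root)
    return root


if __name__ == "__main__":
    source = ET.fromstring("<list><i>pear</i><i/><i>apple</i></list>")
    result = run_pipeline(source, drop_empty, sort_by_text)
    # Serialization happens once, after the last step.
    print(ET.tostring(result, encoding="unicode"))
```

Because every step maps a tree to a tree, any step can be added, removed or reordered without touching the others, which is the practical benefit of closure described above.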

Roughly, the process seems to work like this: the T processor does a recursive descent of the source XML. At each node it evaluates the set of templates. Those templates which match the name of the "current" tag are processed, in some order. The template writes text; that's why it's called a "template". The recursive descent is continued with an <apply-templates> tag inside the template. This allows you to balance output.

It doesn't have to do a recursive descent of the source XML: that's up to the application, though a recursive descent is the most common design pattern. And it definitely doesn't write text: people who create a mental model of writing text eventually get a rude awakening, usually when they first try to tackle grouping problems.
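A toy model may help here. The Python sketch below is not how any XSLT processor is implemented; it only mimics the shape of the processing model: rules are selected by element name, each rule adds nodes to a result tree (never text to a stream), and the descent continues only where a rule chooses to apply templates to the children. The rule names and the sample vocabulary are invented, and the fallback rule is a simplification that descends into child elements without copying text.

```python
import xml.etree.ElementTree as ET


def apply_templates(node, rules, parent):
    """Pick the rule for this node, falling back to the default rule."""
    rules.get(node.tag, default_rule)(node, rules, parent)


def default_rule(node, rules, parent):
    # Simplified stand-in for a built-in rule: emit nothing, keep descending.
    for child in node:
        apply_templates(child, rules, parent)


def chapter_rule(node, rules, parent):
    # Adds a node to the result tree, then chooses to continue the descent.
    section = ET.SubElement(parent, "section")
    for child in node:
        apply_templates(child, rules, section)


def title_rule(node, rules, parent):
    # Adds a node and does not descend any further.
    ET.SubElement(parent, "h1").text = node.text


def transform(source, rules):
    result = ET.Element("result")
    apply_templates(source, rules, result)
    return result


if __name__ == "__main__":
    src = ET.fromstring(
        "<book><chapter><title>One</title><para>skipped</para></chapter></book>"
    )
    out = transform(src, {"chapter": chapter_rule, "title": title_rule})
    print(ET.tostring(out, encoding="unicode"))
```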

If no matches are found, the T processor continues the descent.

There is a <template> tag (I forget what) which will select arbitrary paths in the source tree, and there are tags which iterate through the result.

Again, it's best to think of the stylesheet as containing nodes (representing instructions) rather than tags. Consider

<xsl:element name="x"><xsl:value-of select="."/></xsl:element>

There are three tags there, but four nodes, and only two instructions. The semantics of the language are described in terms of the two instructions, not the three tags.
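The counts can be checked mechanically. The Python sketch below parses the fragment with the standard library's minidom and counts tags in the text versus nodes in the tree. It assumes the four nodes are the two element nodes plus their two attribute nodes, and it wraps the fragment in an xsl:stylesheet element only so that the xsl prefix is declared; the helper name count_nodes is made up.

```python
import re
from xml.dom import minidom

FRAGMENT = '<xsl:element name="x"><xsl:value-of select="."/></xsl:element>'

# Wrapper added only to declare the xsl prefix; it is not counted below.
WRAPPED = (
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    + FRAGMENT +
    '</xsl:stylesheet>'
)


def count_nodes(node):
    """Return (element nodes, attribute nodes) in the subtree rooted at node."""
    elements, attributes = 1, node.attributes.length
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            e, a = count_nodes(child)
            elements += e
            attributes += a
    return elements, attributes


# Tags are a property of the text: start tag, empty-element tag, end tag.
tags = len(re.findall(r"<[^>]+>", FRAGMENT))

# Nodes are a property of the tree built from that text.
doc = minidom.parseString(WRAPPED)
elements, attributes = count_nodes(doc.documentElement.firstChild)

if __name__ == "__main__":
    print("tags:", tags)
    print("nodes:", elements + attributes)
    print("instructions (element nodes):", elements)
```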

This will allow me to build up a result "tree" which is not a mirror image of the source, something I need to do if I'm rearranging sections of the input document. Rather than buffering intermediate structures, the T processor does multiple passes based on these tags, and creates the output on-the-fly. Cool.

... .

I assume there is nothing stopping me from using XSL-T to transform my HTML to PDF, but it seems best to output XSL-FO then create a PDF using some kind of tool. What is that tool?

It's an XSL-FO processor. Examples are FOP, RenderX, Antenna House.

Are there FO plug-ins available for my browsers?

No, people are by and large using (X)HTML/CSS for the browser and XSL-FO/PDF for the printed page.

Does this technology work?

Absolutely yes.

Michael Kay

http://www.saxonica.com/
