RE: [xml-dev] Saxon and Sun Serializer problems?


From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
To: <xml-dev@lists.xml.org>
Date: Sat, 30 May 2009 19:26:16 -0400

Thank you for engaging me on these details, Jim.

At 2009-05-30 15:10 -0700, Jim Tivy wrote:
>Hi Ken
>
>I read what you said below.  The gist seems to be:
>
>Why would you want to do this?

I'm sorry I didn't make myself clear.  My gist was:  what feature(s)
does having the DOCTYPE give you?  Which is different.  There are so
many other reasons why an XML document cannot be round-tripped
through XSLT that just providing the DOCTYPE feature won't solve.  I
cited the lack of preservation of CDATA sections, the lack of
preservation of entity references (numeric character references,
which are not even resolved by a DOCTYPE, as well as internal and
external parsed general entities), and there are others, including
no link to NOTATION declarations for de-referencing processing
instruction targets (a very sore point of mine that the designers of
XML processing interfaces have never felt the need to support).

So given so many features for round-tripping that are not there, just 
putting in the DOCTYPE won't fix any of the ones I've cited.
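
A minimal Java sketch of the losses listed above, assuming the XSLT
processor bundled with the JDK (or any conforming one) is picked up
through JAXP.  The sample document, the entity name "co" and the class
name are made up for illustration.  The identity stylesheet copies
every node, yet the DOCTYPE, the CDATA markup and the entity and
character references should not appear in the result, because none of
them are nodes in the tree being copied.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RoundTripDemo {
    // Hypothetical sample: internal DTD subset, a CDATA section, an
    // internal general entity reference and a numeric character reference.
    static final String SOURCE =
        "<!DOCTYPE doc [<!ENTITY co \"Crane\">]>"
        + "<doc><![CDATA[a < b]]> &co; &#65;</doc>";

    // The usual XSLT 1.0 identity stylesheet.
    static final String IDENTITY_XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Parse the source, copy every node, serialize the result tree.
    static String roundTrip(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(IDENTITY_XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Compare the printed result with SOURCE by eye.
        System.out.println(roundTrip(SOURCE));
    }
}
```

Declaring the DOCTYPE in the output would change none of the other
differences between the two strings.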

>I should point out the "this" had to do with using SAX in java with the jaxp
>Identity Transform.  However, I now extend it more tentatively to include
>the "no DocType in the XDM" problem.

Yes, I saw that.  I was trying to figure out what the DOCTYPE alone
would get you when so many other things are also left out of the
infoset or XDM.

>To give you some context of what I am doing - my need is primarily pragmatic
>- I am a java programmer trying to get from A to B.

Fine ... I won't hold that against you.  :{)}

>In an Xml content management system users use a variety of Xml processors
>(or programs if you would prefer) like diverse Xml Editors - XMetal, Epic,
>XmlMind and the content management systems that have file Store and Retrieve
>capabilities as well as link extract and other Xml processing needs.  All of
>these parts "process" Xml.

Actually, they process XML syntax, they don't process the information 
in an XML document.  XSLT and XQuery were designed to build new 
structures from the information in structured sources.  They were not 
designed to process the syntax of an XML document.  XML editors, in 
particular, are designed to process the syntax of an XML document, 
and as we old (er, long-time) SGML'ers learned long ago you can't 
base an XML editor on an XML processor in the same way you can't base 
an SGML editor on an SGML processor.

Now the DOM *does* have a few features that process some (not all!) 
of the syntax of an XML document, but the perspective is 
different.  In the DOM the input tree *is* the output tree, unlike 
XSLT and XQuery where the input tree is read-only and the output tree 
is write-only: created, from scratch, in a single pass, without 
backtrack or repair or inspection.
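
To illustrate the DOM perspective, here is a small Java sketch using
the standard JAXP DocumentBuilder.  The sample document and class name
are made up, and the CDATA behaviour assumes the factory is left at
its default (non-coalescing) setting.  The tree exposes the DOCTYPE
and the CDATA section as objects, and an edit is made in place on the
same tree that was read.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomSyntaxDemo {
    // Hypothetical sample with an internal DTD subset and a CDATA section.
    static final String SOURCE =
        "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]>"
        + "<doc><![CDATA[a < b]]></doc>";

    static Document parse(String xml) throws Exception {
        // Default factory settings: not validating, not coalescing.
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SOURCE);

        // Some of the input syntax is visible as nodes in the tree.
        DocumentType doctype = doc.getDoctype();
        System.out.println("doctype name: " + doctype.getName());
        Node first = doc.getDocumentElement().getFirstChild();
        System.out.println("first child is a CDATA section: "
            + (first.getNodeType() == Node.CDATA_SECTION_NODE));

        // The input tree is the output tree: the edit happens in place
        // and the DOCTYPE is still attached afterwards.
        doc.getDocumentElement().setAttribute("edited", "yes");
        System.out.println("doctype still attached: " + (doc.getDoctype() != null));
    }
}
```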

>All of these parts rely on the DocType for
>validation or element insertion help or just need it to "round trip" the Xml
>so other processors can use that DocType.  Without the DocType, the
>serialization loses some serious part of its capability.

Well now you've lost me again, because the limited number of 
serialization features in XSLT/XQuery renders the information found 
in the DOCTYPE quite irrelevant.  The XSLT feature of adding a SYSTEM 
identifier is there as I see it really only for the validation 
bit.  Because what is serialized is the information that was used to 
build the result tree ... not the syntax borrowed from the source tree.
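
A short Java sketch of that serialization feature, assuming a JAXP
XSLT processor is available.  The system identifier "doc.dtd" and the
class name are made up.  The source here has no DOCTYPE at all; the
declaration in the output comes from the stylesheet's xsl:output
instruction, not from anything in the source tree.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DoctypeOutputDemo {
    // Identity stylesheet plus xsl:output with a hypothetical SYSTEM identifier.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" doctype-system=\"doc.dtd\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The source carries no DOCTYPE; the serializer adds one on request.
        System.out.println(transform("<doc>text</doc>"));
    }
}
```

In JAXP the same request can be made on the Transformer with
setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, ...).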

>Most of these parts operate on the serialization of the Xml from time to
>time.  Editors read serializations, users import serializations -
>serializations are the standard way of exchanging and making xml processors
>interoperable.

Ummmmm .... I can't agree for anything other than XML editors, which
are XML syntax applications, not XML information
applications.  XML-based applications are interoperable because the
XML processors all deliver the same content information to the
applications using them.  And the decision by the designers of the DOM
to include syntax-related issues (note again, not all syntax-related
issues) can enable many aspects of input syntax preservation because
the DOM is acting *on* the document.  XSLT and XQuery are not acting
*on* the document, they are acting on the information found in the document.

>Not being able to use powerful tools like XSLT and Sax to process Xml when
>"round" tripping of the serialization is required, is restrictive to say the
>least, as these technologies have their own strengths - eg: DOM is not XSLT
>is not SAX.
>
>Fortunately SAX is usable on Java - just make sure to use the Sun's Trax
>serializer which keeps the docType as the saxon one drops the docType. (see
>earlier post).
>
>Does this begin to motivate the reason why?

I hear what you are trying to say, and I had already interpreted the 
need for syntax preservation to be to round trip the syntax of an XML 
document, but I haven't yet heard a justification for adding the 
DOCTYPE to XDM.  Adding the DOCTYPE to XDM doesn't give you 
round-tripping of an arbitrary XML document because so much more 
would be needed.  And all of it would be out of scope for XSLT/XQuery.

This comes up often in the classroom from students who thought XSLT 
and XQuery could/should be used for XML document syntax 
preservation.  Because XSLT and XQuery are node-tree-transformation 
tools and not XML syntax tools, they cannot be used for syntax 
preservation.  XSLT and XQuery are not angle-bracket processors, they 
are node-tree processors.  Serialization is not needed when the 
processor is embedded in, say, an XSL-FO engine.  Serialization is a 
nice-to-have that allows one to create artefacts that can be useful 
as input to other XML-based tools.

Consider source tree data projection:  if I do an XSLT or XQuery 
transformation on a source node tree created from a non-XML source, 
what is the definition of the DOCTYPE?  More to the point, what 
information might there have been put into a DOCTYPE in the 
interpretation of the projection to be useful in the node-tree 
transformation?  I claim there is no such information.
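
A hedged Java sketch of such a projection: a node tree built directly
from comma-separated text, with no XML syntax ever involved.  The
element names "rows", "row" and "cell" and the class name are made up
for illustration.  The tree has no DOCTYPE because there was never a
document to declare one, and the JAXP identity transformer serializes
it all the same.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ProjectionDemo {
    // Project comma-separated text onto a node tree (hypothetical vocabulary).
    static Document fromCsv(String csv) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element rows = doc.createElement("rows");
        doc.appendChild(rows);
        for (String line : csv.split("\n")) {
            Element row = doc.createElement("row");
            for (String cell : line.split(",")) {
                Element c = doc.createElement("cell");
                c.setTextContent(cell.trim());
                row.appendChild(c);
            }
            rows.appendChild(row);
        }
        return doc;
    }

    // Serialize the tree with the JAXP identity transformer.
    static String serialize(Document doc) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
            .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = fromCsv("a,b\nc,d");
        System.out.println("doctype: " + doc.getDoctype());
        System.out.println(serialize(doc));
    }
}
```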

And I haven't found such information in the response that you've given.

Thank you again for trying to help me better understand what you 
need.  I really am trying to be supportive here to reveal what 
specific features of DOCTYPE you will find helpful.

. . . . . . . . . . . . . . Ken

--
XQuery/XSLT/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/x/
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson:    http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview:  http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal

