RE: [xml-dev] Saxon and Sun Serializer problems?

From: "Jim Tivy" <jimt@bluestream.com>
To: "'G. Ken Holman'" <gkholman@CraneSoftwrights.com>,<xml-dev@lists.xml.org>
Date: Sun, 31 May 2009 10:11:28 -0700
Hi Ken 

Comments below.

-----Original Message-----
From: G. Ken Holman [mailto:gkholman@CraneSoftwrights.com] 
Sent: Saturday, May 30, 2009 4:26 PM
To: xml-dev@lists.xml.org
Subject: RE: [xml-dev] Saxon and Sun Serializer problems?

Thank you for engaging me on these details, Jim.

At 2009-05-30 15:10 -0700, Jim Tivy wrote:
>Hi Ken
>
>I read what you said below.  The gist seems to be:
>
>Why would you want to do this?

I'm sorry I didn't make myself clear.  My gist was:  what feature(s) 
does having the DOCTYPE give you?  Which is different.  There are so 
many other reasons why an XML document cannot be round-tripped 
through XSLT that just providing the DOCTYPE feature won't solve.  I 
cited the lack of preservation of CDATA sections, the lack of 
preservation of the entity references (which includes numeric 
character references (not even resolved by a DOCTYPE), internal 
parsed general entities, external parsed general entities), and there 
are others including no link to NOTATION declarations for processing 
instruction target de-referencing (a very sore point of mine that the 
designers of XML processing interfaces have never felt the need to support).
[<JT>] When Xml is used as a content format, users never see the syntax of the
underlying Xml.  So none of the things you mentioned above is a problem
except dropping the DocType. If you are using the DocType to indicate the
validation rules of the document, then dropping that DocType means you can
no longer validate the document.  As well, you may use the schema
information in the DTD for context-sensitive help in an XML editor, to show
the next sibling or child element that can be inserted.

Numeric character references are not "dropped"; they are converted into their
equivalent characters according to the encoding.

If entities are expanded during parsing, they are not "lost"; rather,
they are inlined.

CDATA content is just characters, so nothing needs to be "lost" - only the
fact that those characters were wrapped in a CDATA section.
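
A minimal Java sketch of the three points above, assuming the JAXP DOM parser that ships with the JDK. The class name InlinedTextDemo and the entity "co" are made up for illustration. It shows that a character reference, an internal entity and a CDATA section all arrive as ordinary characters:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class InlinedTextDemo {

    /** Parses the XML string and returns the text content of the root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Merge CDATA sections into the adjacent text nodes.
            factory.setCoalescing(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "co" is an internal parsed general entity declared only for this demo.
        String xml = "<!DOCTYPE p [<!ENTITY co \"Acme\">]>"
                   + "<p>&#65; &co; <![CDATA[<b>]]></p>";
        System.out.println(rootText(xml)); // prints: A Acme <b>
    }
}
```

The characters survive; what is gone is only the record of how they were written in the source.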



So given so many features for round-tripping that are not there, just 
putting in the DOCTYPE won't fix any of the ones I've cited.
[<JT>] How many of these are fully lossy, and how many have a logical
equivalent?  How many are we trying to discourage for fully interoperable
Xml?  My point is that a DocType limited to Name, PublicId and SystemId is an
important thing to round trip - SAX does it.
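
As a sketch of what "SAX does it" means in practice: the SAX2 LexicalHandler extension reports the DOCTYPE name, public ID and system ID through startDTD. The class name DoctypeProbe is made up for illustration, and the empty entity resolver is an assumption added here so the parser does not try to fetch the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper class, not part of any library.
final class DoctypeProbe {

    /** Returns {name, publicId, systemId} as reported by SAX, or null if there is no DOCTYPE. */
    static String[] readDoctype(String xml) {
        final String[][] found = new String[1][];
        // DefaultHandler2 implements LexicalHandler, which carries the DTD events.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDTD(String name, String publicId, String systemId) {
                found[0] = new String[] { name, publicId, systemId };
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.setContentHandler(handler);
            // Assumption for this sketch: supply an empty external subset instead of loading the DTD.
            reader.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found[0];
    }
}
```

Whether a downstream serializer writes these three values back out is a separate question, and that is where the serializers appear to differ.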

>I should point out the "this" had to do with using SAX in Java with the JAXP
>Identity Transform.  However, I now extend it more tentatively to include
>the "no DocType in the XDM" problem.

Yes, I saw that.  I was trying to figure out what it was about the 
DOCTYPE that you would get when you can't get other things left out 
of the infoset or XDM.

>To give you some context of what I am doing - my need is primarily
>pragmatic - I am a Java programmer trying to get from A to B.

Fine ... I won't hold that against you.  :{)}

>In an Xml content management system users use a variety of Xml processors
>(or programs if you would prefer) like diverse Xml Editors - XMetal, Epic,
>XmlMind and the content management systems that have file Store and
>Retrieve capabilities as well as link extract and other Xml processing
>needs.  All of these parts "process" Xml.

Actually, they process XML syntax, they don't process the information 
in an XML document.  XSLT and XQuery were designed to build new 
structures from the information in structured sources.  They were not 
designed to process the syntax of an XML document.  XML editors, in 
particular, are designed to process the syntax of an XML document, 
and as we old (er, long-time) SGML'ers learned long ago you can't 
base an XML editor on an XML processor in the same way you can't base 
an SGML editor on an SGML processor.
[<JT>] I am not sure I agree XML editors process the syntax of Xml
serializations.  Many XML editors operate on DOMs.
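
To illustrate the DOM side of this: a DOM built by a JAXP parser keeps the DOCTYPE as a DocumentType node, so a DOM-based tool can read the name, public ID and system ID without looking at the serialized syntax. This is only a sketch. The class name DomDoctypeDemo is made up, and the empty entity resolver is an assumption so the DTD is not fetched.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class DomDoctypeDemo {

    /** Returns "name|publicId|systemId" from the DOM, or "none" if there is no DOCTYPE. */
    static String describeDoctype(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption for this sketch: supply an empty external subset instead of loading the DTD.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            DocumentType doctype = doc.getDoctype();
            if (doctype == null) {
                return "none";
            }
            return doctype.getName() + "|" + doctype.getPublicId() + "|" + doctype.getSystemId();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```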

Now the DOM *does* have a few features that process some (not all!) 
of the syntax of an XML document, but the perspective is 
different.  In the DOM the input tree *is* the output tree, unlike 
XSLT and XQuery where the input tree is read-only and the output tree 
is write-only: created, from scratch, in a single pass, without 
backtrack or repair or inspection.
[<JT>] Without backtrack is a bit unclear - since most XSLT processors are
based on DOM.

>All of these parts rely on the DocType for
>validation or element insertion help or just need it to "round trip" the
>Xml so other processors can use that DocType.  Without the DocType, the
>serialization loses some serious part of its capability.

Well now you've lost me again, because the limited number of 
serialization features in XSLT/XQuery renders the information found 
in the DOCTYPE quite irrelevant.  The XSLT feature of adding a SYSTEM 
identifier is there as I see it really only for the validation 
bit.  Because what is serialized is the information that was used to 
build the result tree ... not the syntax borrowed from the source tree.
[<JT>] Why does this feature exist in XSLT if DocTypes are irrelevant, as you
suggested in your first question above?
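
For reference, the feature in question is reachable from JAXP through the standard output properties OutputKeys.DOCTYPE_PUBLIC and OutputKeys.DOCTYPE_SYSTEM, which correspond to the doctype-public and doctype-system attributes of xsl:output. A minimal sketch with an identity Transformer follows. The class name DoctypeEmitter is made up, and the caller has to supply the identifiers because the transformer does not take them from the source document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of any library.
final class DoctypeEmitter {

    /** Runs an identity transform and asks the serializer to write a DOCTYPE declaration. */
    static String identityWithDoctype(String xml, String publicId, String systemId) {
        try {
            // newTransformer() with no stylesheet gives the identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, publicId);
            identity.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, systemId);
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Called with the DITA identifiers, the output carries a DOCTYPE declaration for the root element. The values are supplied by the caller as serialization parameters rather than read from the input tree.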

>Most of these parts operate on the serialization of the Xml from time to
>time.  Editors read serializations, users import serializations -
>serializations are the standard way of exchanging and making xml processors
>interoperable.

Ummmmm .... I can't agree for anything other than XML editors which 
are XML syntax applications not XML information 
applications.
[<JT>] By syntax I assume you mean "exact serialized form syntax". XML
Editors do not have to be "syntax" based applications - they often operate
on a DOM (XMetal, for example).

  XML-based applications are interoperable because the 
XML processors all deliver the same content information to the 
applications using them.  And the decision by designers of DOM to 
include syntax related issues (note again, not all syntax related 
issues) can enable many aspects of input syntax preservation because 
the DOM is acting *on* the document.  XSLT and XQuery are not acting 
*on* the document, they are acting on the information found in the document.

>Not being able to use powerful tools like XSLT and SAX to process Xml when
>"round" tripping of the serialization is required, is restrictive to say
>the least, as these technologies have their own strengths - eg: DOM is not
>XSLT is not SAX.
>
>Fortunately SAX is usable on Java - just make sure to use Sun's TrAX
>serializer, which keeps the docType, as the Saxon one drops the docType (see
>earlier post).
>
>Does this begin to motivate the reason why?

I hear what you are trying to say, and I had already interpreted the 
need for syntax preservation to be to round trip the syntax of an XML 
document, but I haven't yet heard a justification for adding the 
DOCTYPE to XDM.  Adding the DOCTYPE to XDM doesn't give you 
round-tripping of an arbitrary XML document because so much more 
would be needed.  And all of it would be out of scope for XSLT/XQuery.
[<JT>] I am not saying syntax should be preserved.  I am saying that
information items should not be "dropped" or lost, especially when they are
not replaced by some other "logical" equivalent.  DocType is an information
item that has a purpose in its own right, and it should not be dropped,
unlike character references, which are converted into their equivalent
underlying characters.

This comes up often in the classroom from students who thought XSLT 
and XQuery could/should be used for XML document syntax 
preservation.  Because XSLT and XQuery are node-tree-transformation 
tools and not XML syntax tools, they cannot be used for syntax 
preservation.  XSLT and XQuery are not angle-bracket processors, they 
are node-tree processors.  Serialization is not needed when the 
processor is embedded in, say, an XSL-FO engine.  Serialization is a 
nice-to-have that allows one to create artefacts that can be useful 
as input to other XML-based tools.
[<JT>] My focus is on the idea of progress. Perhaps in the name of progress
we should not use DocTypes and DTDs but instead use Xml Schema to carry our
validation information, since the schema location will not be lost: it is
an attribute in the XDM. This will not happen, since many people agree DTDs
are here to stay. So perhaps we should instead make the DocType, with its
public and system IDs, accessible in the XDM and thus accessible in the input
document:
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
"/SysSchema/dita/topic.dtd"> 

Consider source tree data projection:  if I do an XSLT or XQuery 
transformation on a source node tree created from a non-XML source, 
what is the definition of the DOCTYPE?  More to the point, what 
information might there have been put into a DOCTYPE in the 
interpretation of the projection to be useful in the node-tree 
transformation?  I claim there is no such information.
[<JT>] If I serialize it, I have validation rules available for the next Xml
Processor.

And I haven't found such information in the response that you've given.

Thank you again for trying to help me better understand what you 
need.  I really am trying to be supportive here to reveal what 
specific features of DOCTYPE you will find helpful.

. . . . . . . . . . . . . . Ken


--
XQuery/XSLT/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/x/
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson:    http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview:  http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal

