RE: [xml-dev] Saxon and Sun Serializer problems?

From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
To: <xml-dev@lists.xml.org>
Date: Sun, 31 May 2009 20:13:44 -0400

At 2009-05-31 10:11 -0700, Jim Tivy wrote:
>comments below.

Thanks, Jim.  I think you've answered my question and I'll offer some 
comments as well.

>Numeric character references are not "dropped" they are converted into their
>equivalent form according to the encoding.

Numeric character references are unrelated to the encoding and are, 
in fact, dropped when replaced with the equivalent Unicode character 
independent of the encoding.  In the data model you will find only 
the character, without any record of whether the character was 
natively included or included by means of a character reference.
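A minimal Python sketch of this, assuming the standard library's 
xml.etree.ElementTree as a stand-in for any conforming parser (the 
helper name parsed_text is hypothetical):

```python
import xml.etree.ElementTree as ET

def parsed_text(source):
    """Return the text content of the document element."""
    return ET.fromstring(source).text

# A native character, a decimal reference and a hex reference all
# arrive in the tree as the same character.
native = parsed_text("<p>A</p>")
decimal = parsed_text("<p>&#65;</p>")
hexadecimal = parsed_text("<p>&#x41;</p>")

# The declared encoding does not change what the reference means.
latin1 = parsed_text(b'<?xml version="1.0" encoding="ISO-8859-1"?><p>&#233;</p>')
utf8 = parsed_text(b'<?xml version="1.0" encoding="UTF-8"?><p>&#233;</p>')
```

Nothing in the resulting tree records which of the spellings was used.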

>If entities are inlined in the parsing process they are not "lost", rather
>they are inlined.
>
>CDATA are characters so do not need to be "lost" - just the fact that they
>were treated in a special CDATA section.

Yes to both ... my point in all of these is in regard to what I 
interpreted you to require which was input syntax preservation.  All 
three things I cited are syntax features that are dropped when they 
are processed into the content they represent.  It is the syntax that 
is dropped, not the information.  In all three cases when you look at 
the information in the data model you have no idea what syntactic 
mechanisms may have been used.
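As a hedged illustration of the same point for CDATA sections and 
internal entities, again assuming Python's xml.etree.ElementTree and 
a made-up entity named "co":

```python
import xml.etree.ElementTree as ET

# Three different input syntaxes for the same character data.
plain = "<m>Crane Softwrights</m>"
cdata = "<m><![CDATA[Crane Softwrights]]></m>"
entity = '<!DOCTYPE m [<!ENTITY co "Crane Softwrights">]><m>&co;</m>'

# After parsing, only the characters remain in the tree.
texts = [ET.fromstring(src).text for src in (plain, cdata, entity)]
all_same = len(set(texts)) == 1
```

Looking only at texts, there is no way to tell which syntactic 
mechanism produced each value.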

>So given so many features for round-tripping that are not there, just
>putting in the DOCTYPE won't fix any of the ones I've cited.
>[<JT>] How many of these are fully lossy and how many have a logical
>equivalent.

Forgive me for not understanding what you mean by "fully lossy".  If 
you are talking about syntax, sure, many things are lost, but that is 
just syntax (the means to the end), not information (the end).

>How many are we trying to discourage for fully interoperable
>Xml.  My point is DocType limited to Name, PublicId and SystemId is an
>important thing to round trip - sax does it.

*There* is the answer to my question:  you want three items expressed 
in the data model from the DOCTYPE declaration and no aspect of the 
internal declaration subset that is part of the DOCTYPE.  Thank you.

>[<JT>] I am not sure I agree XML editors process the syntax of Xml
>serializations.  Many XML editors operate on DOMs.

I understand a number of editors work on their own private extensions 
to the DOM, but I also understand that the DOM as standardized does 
not support all of the syntax of an XML document, as I acknowledged 
in my earlier email.

>In the DOM the input tree *is* the output tree, unlike
>XSLT and XQuery where the input tree is read-only and the output tree
>is write-only: created, from scratch, in a single pass, without
>backtrack or repair or inspection.
>[<JT>] Without backtrack is a bit unclear - since most XSLT processors are
>based on DOM.

That's merely an implementation perspective.  The XSLT and XQuery 
language definitions do not allow a transformation to backtrack, 
repair or inspect any part of the result tree that has been 
constructed to that point (thus, none at all).  A processor is 
allowed during serialization to serialize and forget an element's 
start tag once the element's content begins.  There are no aspects of 
the language that give the stylesheet writer any information about 
the result tree they've created.
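A loose analogy, not an XSLT processor: Python's 
xml.sax.saxutils.XMLGenerator is a forward-only writer whose handler 
methods emit markup and offer no way to inspect or revise it.  The 
function name emit_forward_only is hypothetical.

```python
import io
from xml.sax.saxutils import XMLGenerator

def emit_forward_only():
    """Write <doc id="1">hi</doc> one event at a time."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8", short_empty_elements=False)
    gen.startElement("doc", {"id": "1"})
    # The start tag is already in the output stream at this point.
    after_start = out.getvalue()
    gen.characters("hi")
    gen.endElement("doc")
    return after_start, out.getvalue()

after_start, final = emit_forward_only()
```

The start tag is written before any content exists, and the event 
methods used here provide no access to what was already emitted.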

>The XSLT feature of adding a SYSTEM
>identifier is there as I see it really only for the validation
>bit.  Because what is serialized is the information that was used to
>build the result tree ... not the syntax borrowed from the source tree.
>[<JT>] Why does this feature exist in XSLT if DocTypes are irrelevant as you
>suggested in your first question above.

Because one is creating an XML document from scratch and may want to 
ascribe a DOCTYPE declaration to it, as I said, for validation purposes.
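The same idea outside XSLT, sketched with Python's xml.dom.minidom 
under the assumption that a DOM stands in for the result tree; the 
helper build_topic and the identifiers passed to it are for 
illustration only:

```python
from xml.dom.minidom import getDOMImplementation

def build_topic(public_id, system_id):
    """Create a new document from scratch and ascribe a DOCTYPE to it."""
    impl = getDOMImplementation()
    doctype = impl.createDocumentType("topic", public_id, system_id)
    return impl.createDocument(None, "topic", doctype)

doc = build_topic("-//OASIS//DTD DITA Topic//EN", "/SysSchema/dita/topic.dtd")
serialized = doc.toxml()
```

The declaration appears in the output because it was supplied when 
the document was built, not because it was carried over from any input.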

>Ummmmm .... I can't agree for anything other than XML editors which
>are XML syntax applications not XML information
>applications.
>[<JT>] By syntax I assume you mean "exact serialized form syntax". XML
>Editors do not have to be "syntax" based applications - they operate on DOM
>many times (XMetal).

Again, I think you'll find that XMetal works on their own extensions 
to the DOM and not purely on the DOM as standardized.  Citing 
http://www.w3.org/TR/DOM-Level-3-Core/core.html I read "Note that 
character references and references to predefined entities are 
considered to be expanded by the HTML or XML processor so that 
characters are represented by their Unicode equivalent rather than by 
an entity reference."  So, right there, an XML editor based solely on 
the DOM cannot preserve the user's typing of a numeric character 
reference.  And yet XML editors do preserve that information ... so 
they are working on the syntax of the XML document and not solely on 
the DOM model of the XML document because that information isn't in 
the DOM model.  Which is what I've been trying to say: these 
standardized interfaces to XML documents are not designed to support 
general purpose XML syntax editors.
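A small round trip through a DOM shows the loss.  This is a sketch 
assuming Python's xml.dom.minidom as the DOM implementation; the 
function name round_trip is hypothetical.

```python
from xml.dom.minidom import parseString

def round_trip(source):
    """Parse into a DOM, then serialize the DOM back to text."""
    doc = parseString(source)
    data = doc.documentElement.firstChild.data
    return data, doc.toxml()

# The author typed a numeric character reference for the copyright sign.
data, serialized = round_trip("<p>&#169; 2009 &amp; onward</p>")
```

The text node holds the character U+00A9 itself, so the serializer 
has no basis for writing "&#169;" again.  The ampersand is escaped on 
output only because well-formedness requires it.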

>[<JT>] I am not saying syntax should be preserved.  I am saying that
>information items should not be "dropped" or lost especially when it is not
>replaced by some other "logical" equivalent.  And DocType is an information
>item that has a purpose in its own right and it should not be dropped.
>Unlike character references which are converted into their equivalent
>underlying character.

You've narrowed it down to the parts of the DOCTYPE that you are 
interested in:  the name of the document element (which is already 
there), the PUBLIC identifier and the SYSTEM identifier.

>[<JT>] My focus is on the idea of progress. Perhaps in the name of progress
>we should not use DocTypes and DTDs but instead use Xml Schema to store our
>validation information  since the schema location will not be lost as it is
>an attribute in the XDM.

Why W3C Schema and not RELAX NG or NVDL?  Anyway, many people don't 
subscribe to embedding schema references in XML documents because 
the choice of validation constraints is arbitrary:  any XML document 
should be validatable against any set of constraints, not just the 
constraints that are referenced from within it.

I grant there is a convenience to some, and there is a project in ISO 
to standardize a processing instruction pointing to a document model 
independent of the model syntax.  And that will show up in the XDM.

>This will not happen since many people agree DTDs
>are here to stay. Then, perhaps we should make the DocType with its public
>and systemIds accessible in the XDM and thus accessible in the input
>document.
><!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
>"/SysSchema/dita/topic.dtd">

Fine ... if all you want are those two identifiers and not anything 
from the internal declaration subset of the DOCTYPE, then you've 
answered my question.
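For comparison, a hedged Python sketch of the difference between a 
model that exposes those identifiers and one that does not, using 
xml.dom.minidom and xml.etree.ElementTree from the standard library 
on the DOCTYPE quoted above:

```python
import xml.etree.ElementTree as ET
from xml.dom.minidom import parseString

SOURCE = (
    '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" '
    '"/SysSchema/dita/topic.dtd"><topic/>'
)

# The DOM exposes the name and both identifiers on the doctype node.
doctype = parseString(SOURCE).doctype
identifiers = (doctype.name, doctype.publicId, doctype.systemId)

# ElementTree keeps no record of the declaration, so it cannot
# write it back out.
without_doctype = ET.tostring(ET.fromstring(SOURCE))
```

Neither parse reads the external DTD here; only the declaration 
itself is examined.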

Thank you, Jim, for taking the time to clarify your needs.  I don't 
have any further questions in this regard.

. . . . . . . . . . . . Ken

--
XQuery/XSLT/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/x/
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson:    http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview:  http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
