RE: [xml-dev] Saxon and Sun Serializer problems?
- From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
- To: <xml-dev@lists.xml.org>
- Date: Sun, 31 May 2009 20:13:44 -0400
At 2009-05-31 10:11 -0700, Jim Tivy wrote:
>comments below.
Thanks, Jim. I think you've answered my question and I'll offer some
comments as well.
>Numeric character references are not "dropped" they are converted into their
>equivalent form according to the encoding.
Numeric character references are unrelated to the encoding and are,
in fact, dropped when replaced with the equivalent Unicode character
independent of the encoding. In the data model you will find only
the character, without any record of whether the character was
natively included or included by means of a character reference.
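As a minimal sketch of that point (my own illustration, assuming Python's standard xml.etree.ElementTree parser; the three sample documents are made up), the same character arrives in the data model whether it was typed natively in UTF-8, typed natively in ISO-8859-1, or written as a numeric character reference in a pure ASCII file:

```python
import xml.etree.ElementTree as ET

# Three byte streams, three encodings, one character (U+00E9).
literal_utf8 = '<?xml version="1.0" encoding="UTF-8"?><a>\u00e9</a>'.encode("utf-8")
literal_latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\u00e9</a>'.encode("latin-1")
reference_ascii = '<?xml version="1.0" encoding="US-ASCII"?><a>&#233;</a>'.encode("ascii")

# The parsed text content carries no record of how the character was written.
texts = [ET.fromstring(doc).text
         for doc in (literal_utf8, literal_latin1, reference_ascii)]
print(texts)
```

The bytes on disk differ in all three cases, but the parsed text is identical, which is why the reference itself cannot be recovered afterwards.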
>If entities are inlined in the parsing process then they are not "lost", rather
>they are inlined.
>
>CDATA are characters so do not need to be "lost" - just the fact that they
>were treated in a special CDATA section.
Yes to both ... my point in all of these is in regard to what I
interpreted you to require which was input syntax preservation. All
three things I cited are syntax features that are dropped when they
are processed into the content they represent. It is the syntax that
is dropped, not the information. In all three cases when you look at
the information in the data model you have no idea what syntactic
mechanisms may have been used.
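To illustrate with a small sketch (again assuming Python's xml.etree.ElementTree; the entity name "co" and the sample text are invented for the example), four different spellings of the same content yield one and the same text in the parsed tree:

```python
import xml.etree.ElementTree as ET

# Four syntactic routes to the same character content.
variants = {
    "plain": "<d>Crane Softwrights</d>",
    "char_ref": "<d>Crane&#32;Softwrights</d>",
    "cdata": "<d><![CDATA[Crane Softwrights]]></d>",
    # "co" is a hypothetical internal entity declared only for this example.
    "entity": '<!DOCTYPE d [<!ENTITY co "Crane Softwrights">]><d>&co;</d>',
}

texts = {name: ET.fromstring(src).text for name, src in variants.items()}
for name, text in texts.items():
    print(name, repr(text))
```

Looking only at the parsed text, there is no way to tell which of the four spellings the author used.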
>So given so many features for round-tripping that are not there, just
>putting in the DOCTYPE won't fix any of the ones I've cited.
>[<JT>] How many of these are fully lossy and how many have a logical
>equivalent.
Forgive me for not understanding what you mean by "fully lossy". If
you are talking syntax, sure, many things are lost, but that's just
syntax (the means to the end), not information (the end).
>How many are we trying to discourage for fully interoperable
>Xml. My point is DocType limited to Name, PublicId and SystemId is an
>important thing to round trip - sax does it.
*There* is the answer to my question: you want three items expressed
in the data model from the DOCTYPE declaration and no aspect of the
internal declaration subset that is part of the DOCTYPE. Thank you.
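For what it is worth, here is a sketch of what exposing just those three items looks like in one existing API. This assumes Python's xml.dom.minidom, which is only one DOM implementation and not the XDM under discussion; the DOCTYPE is the DITA one quoted later in this message and the id value is made up:

```python
from xml.dom import minidom

source = (
    '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" '
    '"/SysSchema/dita/topic.dtd"><topic id="t1"/>'
)

# The external DTD is not fetched; only the declaration itself is recorded.
doctype = minidom.parseString(source).doctype

print(doctype.name)      # document element name
print(doctype.publicId)  # PUBLIC identifier
print(doctype.systemId)  # SYSTEM identifier
```

Nothing from an internal declaration subset is needed to obtain these three values.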
>[<JT>] I am not sure I agree XML editors process the syntax of Xml
>serializations. Many XML editors operate on DOMs.
I understand a number of editors work on their own private extensions
to the DOM, but I also understand that using the DOM as standardized
does not support all of the syntax of an XML document. Which I
acknowledged in my earlier email.
>In the DOM the input tree *is* the output tree, unlike
>XSLT and XQuery where the input tree is read-only and the output tree
>is write-only: created, from scratch, in a single pass, without
>backtrack or repair or inspection.
>[<JT>] Without backtrack is a bit unclear - since most XSLT processors are
>based on DOM.
That's merely an implementation perspective. The XSLT and XQuery
language definitions do not allow a transformation to backtrack,
repair or inspect any part of the result tree that has been
constructed to that point (thus, none at all). A processor is
allowed during serialization to serialize and forget an element's
start tag once the element's content begins. There are no aspects of
the language that give the stylesheet writer any information about
the result tree they've created.
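A rough analogy, not XSLT itself: the sketch below uses Python's xml.sax.saxutils.XMLGenerator as a stand-in for a write-only result stream. The element name and attribute are invented. Once the start tag has been written, the writer holds nothing the caller could inspect or repair:

```python
import io
from xml.sax.saxutils import XMLGenerator

out = io.StringIO()
writer = XMLGenerator(out)  # write-only sink: no API to read back what was emitted

writer.startElement("doc", {"id": "1"})
after_start = out.getvalue()  # the start tag is already serialized at this point

writer.characters("hi")
writer.endElement("doc")
result = out.getvalue()

print(after_start)
print(result)
```

The only way to see the output here is to read the underlying stream, which is outside the writer's interface, much as the result tree is outside the reach of the stylesheet.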
>The XSLT feature of adding a SYSTEM
>identifier is there as I see it really only for the validation
>bit. Because what is serialized is the information that was used to
>build the result tree ... not the syntax borrowed from the source tree.
>[<JT>] Why does this feature exist in XSLT if DocTypes are irrelevant as you
>suggested in your first question above.
Because one is creating an XML document from scratch and may want to
ascribe a DOCTYPE declaration to it, as I said for validation purposes.
>Ummmmm .... I can't agree for anything other than XML editors which
>are XML syntax applications not XML information
>applications.
>[<JT>] By syntax I assume you mean "exact serialized form syntax". XML
>Editors do not have to be "syntax" based applications - they operate on DOM
>many times (XMetal).
Again, I think you'll find that XMetal works on its own extensions
to the DOM and not purely on the DOM as standardized. Citing
http://www.w3.org/TR/DOM-Level-3-Core/core.html I read "Note that
character references and references to predefined entities are
considered to be expanded by the HTML or XML processor so that
characters are represented by their Unicode equivalent rather than by
an entity reference." So, right there, an XML editor based solely on
the DOM cannot preserve the user's typing of a numeric character
reference. And yet XML editors do preserve that information ... so
they are working on the syntax of the XML document and not solely on
the DOM model of the XML document because that information isn't in
the DOM model. Which is what I've been trying to say: these
standardized interfaces to XML documents are not designed to support
general purpose XML syntax editors.
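A quick sketch of that limitation, assuming Python's xml.dom.minidom as the DOM implementation (the sample markup is made up): the user's typed reference is gone as soon as the document is parsed, so serializing the DOM cannot bring it back.

```python
from xml.dom import minidom

typed = "<p>caf&#233;</p>"  # what the user typed, with a numeric character reference
doc = minidom.parseString(typed)

# The DOM text node holds only the Unicode character.
text_in_dom = doc.documentElement.firstChild.data

# Serializing from the DOM writes the character, not the reference.
reserialized = doc.documentElement.toxml()

print(repr(text_in_dom))
print(reserialized)
```

An editor that must write "&#233;" back out exactly as typed therefore has to keep that fact somewhere other than the standardized DOM.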
>[<JT>] I am not saying syntax should be preserved. I am saying that
>information items should not be "dropped" or lost especially when it is not
>replaced by some other "logical" equivalent. And DocType is an information
>item that has a purpose in its own right and it should not be dropped.
>Unlike character references which are converted into their equivalent
>underlying character.
You've narrowed it down to the parts of the DOCTYPE that you are
interested in: the name of the document element (which is already
there), the PUBLIC identifier and the SYSTEM identifier.
>[<JT>] My focus is on the idea of progress. Perhaps in the name of progress
>we should not use DocTypes and DTDs but instead use Xml Schema to store our
>validation information since the schema location will not be lost as it is
>an attribute in the XDM.
Why W3C Schema and not RELAX NG or NVDL? Anyway, many people don't
subscribe to embedding schema references in XML documents because
validation constraints are arbitrary and any XML document should be
validatable against any set of validation constraints not just the
constraints that are embedded.
I grant there is a convenience to some, and there is a project in ISO
to standardize a processing instruction pointing to a document model
independent of the model syntax. And that will show up in the XDM.
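As a sketch of why a processing instruction survives where DOCTYPE syntax does not, the following assumes Python's xml.dom.minidom. The target name "doc-model" and its href pseudo-attribute are placeholders invented for this example, not the name the ISO project uses:

```python
from xml.dom import Node, minidom

# "doc-model" is a hypothetical PI target used only for illustration.
source = '<?doc-model href="topic.rng"?><topic/>'
doc = minidom.parseString(source)

# Processing instructions are ordinary nodes in the parsed tree.
pis = [node for node in doc.childNodes
       if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE]

for pi in pis:
    print(pi.target, pi.data)
```

Because the processing instruction is a node in its own right, it is available to the application and is written back out on serialization.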
>This will not happen since many people agree DTDs
>are here to stay. Then, perhaps we should make the DocType with its public
>and systemIds accessible in the XDM and thus accessible in the input
>document.
><!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
>"/SysSchema/dita/topic.dtd">
Fine ... if all you want are those two identifiers and not anything
from the internal declaration subset of the DOCTYPE, then you've
answered my question.
Thank you, Jim, for taking the time to clarify your needs. I don't
have any further questions in this regard.
. . . . . . . . . . . . Ken
--
XQuery/XSLT/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/x/
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson: http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview: http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Male Cancer Awareness Nov'07 http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers: http://www.CraneSoftwrights.com/legal