xml-dev - Is XSLT's handling of CDATA sections too aggressive?


From: "Nils Klarlund" <klarlund@research.att.com>
To: <xml-dev@xml.org>
Date: Wed, 9 Feb 2000 09:17:54 -0500

Maybe somebody would like to comment on the question below, which I sent
to the official XSLT mailing list?

The question is related to the XML 1.0 errata and to the general problem of
whitespace handling in XML 1.0 without the guiding hand of a DTD.

I discovered this problem while using the
Alphaworks XSLT processor to process marked-up text.
It's more than a practical issue; I think there are some
important principles involved: we should be able to make XML 1.0
express trees that are not full of whitespace garbage (to be filtered
out by extraneous programs, DTDs, whatever, ...).

Thanks in advance,

/Nils

Excerpts:

Message-ID: <04bb01bf7273$90095740$b2e3cf87@research.att.com>
From: "Nils Klarlund" <klarlund@research.att.com>
To: <xsl-editors@w3.org>
Cc: "klarlund" <klarlund@research.att.com>
Date: Tue, 8 Feb 2000 15:32:03 -0500
Subject: Is it right to remove whitespace nodes stemming from CDATA
sections? (No, I think!)

I believe that the way CDATA sections are treated in XPath/XSLT is not
compatible with the latest errata to XML 1.0
(http://www.w3.org/XML/xml-19980210-errata).


Moreover, the way CDATA sections are treated makes it impossible to
adopt a simple view of XML, namely to remove all whitespace-only nodes,
without a provable loss of expressive power!  This radical pruning
view is desirable for many applications, especially database
applications, but also for document-oriented processing, where the
usual semantics, which introduces tons of whitespace nodes, is an
aesthetic and practical problem.

The problem with XSLT is that even very explicitly marked whitespace such
as

<![CDATA[ ]]>

is eaten up if not in company with non-whitespace characters.  So, I
can't insert spaces between nodes!
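
To illustrate the loss of information, here is a minimal sketch in
Python. It is not the Alphaworks processor; it assumes only that the
standard library's xml.etree.ElementTree, like the XPath data model,
folds CDATA sections into ordinary text. The element names a, b and c
are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same single space between <b/> and <c/>, written two ways.
plain = ET.fromstring("<a><b/> <c/></a>")
marked = ET.fromstring("<a><b/><![CDATA[ ]]><c/></a>")

# ElementTree stores the text that follows an element in its .tail.
plain_tail = plain[0].tail
marked_tail = marked[0].tail

# Once the tree is built, the CDATA marking is gone: both tails are
# the same one-character string.
same = plain_tail == marked_tail
print(repr(plain_tail), repr(marked_tail), same)
```

Any rule that strips whitespace-only text from such a tree must therefore
treat both documents alike.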

In other words, assuming that it is unreasonable that a DTD or
application should make decisions about which whitespace nodes are for
real and which are not, I'm in trouble: I want to prune all whitespace
nodes, except those that I mark as important.

Clearly, as indicated in the section below, XML 1.0 makes a semantic
distinction between ' ' and <![CDATA[ ]]>.  Thus, XSLT cannot be used
to determine whether some content is "element content".  Is it not
an error to water down XPath to that point?

I suggest that the stripping of whitespace nodes explicitly exclude
nodes derived from, or involving, CDATA sections.
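
The suggested rule can be sketched outside XSLT, assuming a tree API
that still records where CDATA sections were. Python's
xml.dom.minidom does so by default (CDATA sections become
CDATASection nodes rather than Text nodes). The function name
prune_whitespace and the sample document are hypothetical; this is an
illustration of the idea, not a proposal for spec wording.

```python
from xml.dom import minidom, Node

XML_WHITESPACE = " \t\r\n"  # the characters of the XML 1.0 nonterminal S


def prune_whitespace(node):
    """Remove whitespace-only Text children, but keep CDATA sections."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            # Ordinary text consisting only of whitespace: prune it.
            if child.data.strip(XML_WHITESPACE) == "":
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            prune_whitespace(child)
        # CDATA_SECTION_NODE children are left alone, even if they
        # contain nothing but whitespace.


doc = minidom.parseString("<a>\n  <b/><![CDATA[ ]]><c/>\n</a>")
prune_whitespace(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

The indentation around the child elements is dropped, while the space
explicitly marked with a CDATA section survives between b and c.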

Thanks

/Nils

From the Errata:

Section 3

Change item number 2 of the list of valid cases for the "Element Valid" VC
to read:

The declaration matches children and the sequence of child elements
belongs to the language generated by the regular expression in the
content model, with optional white space (characters matching the
nonterminal S) between the start tag and the first child element,
between child elements or between the last child element and the end
tag. Note that a CDATA section containing only white space does not
match the nonterminal S, and hence cannot appear in these positions.
