Re: [xml-dev] [Summary] How can the content of a leaf element be multiple

On Sat, 12 Feb 2022 at 13:39, Roger L Costello <costello@mitre.org> wrote:

Thank you again Michael, Ken, and Liam for your outstanding explanations! Here is my summary of all that I learned:

An XML leaf element can contain more than one text node. For example, suppose that a <Test> leaf element contains "abc" and "def" and they are separated by a comment:

<Test>abc<!-- comment -->def</Test>

The <Test> element contains two text nodes:

text[1] = "abc"
text[2] = "def"

A leaf element will never contain two adjacent text nodes.

That depends on the data model being used. The XPath data model does not have adjacent text nodes, but the W3C/WHATWG DOM, for example, does. An XPath query over a DOM implementation needs to hide the underlying implementation and return results as if the nodes were merged.
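To make the DOM side of this concrete, here is a minimal sketch using Python's standard-library xml.dom.minidom (my choice of DOM implementation for illustration; nobody in the thread used it). It builds two adjacent text nodes by hand and then merges them with normalize():

```python
from xml.dom import minidom

doc = minidom.parseString("<Test/>")
test = doc.documentElement

# Appending two text nodes one after the other leaves them adjacent in the DOM.
test.appendChild(doc.createTextNode("abc"))
test.appendChild(doc.createTextNode("def"))
adjacent_count = len(test.childNodes)

# normalize() merges adjacent text nodes, giving the shape the XPath data model expects.
test.normalize()
merged_count = len(test.childNodes)
merged_value = test.firstChild.data

print(adjacent_count, merged_count, merged_value)  # 2 1 abcdef
```

A parser will not normally hand you adjacent text nodes for this input, but later DOM edits such as the appendChild calls above can create them.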

For example, in this <Test> element "abc" and "def" are separated by a space:

<Test>abc def</Test>

There is only one text node, and its value is "abc def"

If "abc" and "def" are separated by a processing instruction (PI):

<Test>abc<?foo test?>def</Test>

then again there are two text nodes.
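The space, comment, and PI cases can be checked with a short sketch, again assuming Python's xml.dom.minidom as the DOM. The helper name text_children and the comment text "note" are made up for this example:

```python
from xml.dom import minidom

def text_children(xml):
    """Return the data of each text node that is a direct child of the root element."""
    root = minidom.parseString(xml).documentElement
    return [n.data for n in root.childNodes if n.nodeType == n.TEXT_NODE]

space_case = text_children("<Test>abc def</Test>")
comment_case = text_children("<Test>abc<!-- note -->def</Test>")
pi_case = text_children("<Test>abc<?foo test?>def</Test>")

print(space_case)    # ['abc def']
print(comment_case)  # ['abc', 'def']
print(pi_case)       # ['abc', 'def']
```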

However, if "abc" and "def" are separated by a CDATA section:

<Test>abc<![CDATA[blah]]>def</Test>

then there is only one text node, and its value is: abcblahdef

Again, some data models represent CDATA sections in the node tree, e.g.

https://dom.spec.whatwg.org/#cdatasection

An XPath query cannot address such a CDATA section.

The CDATA section is simply a wrapper around text; the wrapper is removed by the XML parser.

It may or may not be removed by the parser, but if it isn't removed, the XPath query engine needs to hide it.
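As an illustration of "may or may not be removed", here is a sketch comparing two Python standard-library APIs on the same input. As far as I know, xml.dom.minidom keeps the CDATA section as its own node by default, while xml.etree.ElementTree reports only the merged character data. Both are my choice of tools, not ones used in this thread:

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

xml = "<Test>abc<![CDATA[blah]]>def</Test>"

# DOM view: the CDATA section survives as a separate child node.
root = minidom.parseString(xml).documentElement
dom_children = [(n.nodeType == n.CDATA_SECTION_NODE, n.data) for n in root.childNodes]

# ElementTree view: the wrapper is gone and the text is one string.
et_text = ET.fromstring(xml).text

print(dom_children)  # (is_cdata, data) for each child of <Test>
print(et_text)       # abcblahdef
```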

If "abc" and "def" are separated by an entity:

<Test>abc&amp;def</Test>

Well, an entity reference, not an entity (the entity is an ampersand character, and &amp; is a predefined reference to that character).

then there is only one text node, and its value is: abc&def

Again, that is a feature of the XPath data model, not necessarily a feature of an XML parser. The DOM and the XML infoset can both represent entity reference nodes.

One way to display the text node(s) is to create an XPath expression and then execute the expression. This XPath expression can be used to count the number of text nodes in the <Test> element:

count(Test/text())

This XPath expression can be used to show the content of the first text node:

Test/text()[1]

This can be used to show the content of the second text node (if there is one):

Test/text()[2]

And this can be used to show the sequence of text nodes:

Test/text()
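For anyone without an XPath tool to hand, the four expressions above can be approximated over a DOM. This is only a sketch: xpath_text_values and parse_root are hypothetical helpers written for this example, not a real XPath engine, and they assume Python's xml.dom.minidom. The helper merges adjacent Text and CDATASection children, which is the "hiding" an XPath engine has to do over a DOM:

```python
from xml.dom import minidom

def xpath_text_values(element):
    """Approximate the string values that text() would select among element's children.

    Adjacent DOM Text and CDATASection children are merged; any other child
    (comment, processing instruction, element) ends the current run.
    """
    values, run = [], []
    for node in element.childNodes:
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            run.append(node.data)
        else:
            if "".join(run):
                values.append("".join(run))
            run = []
    if "".join(run):
        values.append("".join(run))
    return values

def parse_root(xml):
    """Parse an XML string and return its root element."""
    return minidom.parseString(xml).documentElement

with_comment = xpath_text_values(parse_root("<Test>abc<!-- note -->def</Test>"))
with_cdata = xpath_text_values(parse_root("<Test>abc<![CDATA[blah]]>def</Test>"))

print(len(with_comment))  # count(Test/text())  -> 2
print(with_comment[0])    # Test/text()[1]      -> abc
print(with_comment[1])    # Test/text()[2]      -> def
print(with_cdata)         # Test/text()         -> ['abcblahdef']
```

Returning a list rather than one joined string keeps the node boundaries visible.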

When you execute any of these XPath expressions, you will see a visual representation of the result. That visual representation might be misleading! For example, recall the case where "abc" and "def" are separated by a comment:

<Test>abc<!-- comment -->def</Test>

We now know that the <Test> element contains two text nodes. However, when I executed this XPath expression:

Test/text()

I saw this result:

abcdef

Liam executed (using a different XPath tool) the same XPath expression on the same <Test> element and got this result:

-- NODE --

abc

-- NODE --

def

My XPath tool misled me into thinking that the <Test> element has only one text node.

Similarly, when I ran the same XPath expression on this <Test> element:

<Test>abc&amp;def</Test>

I saw this result:

abc&amp;def

Again, my XPath tool misled me into thinking that the XML entity was not resolved (i.e., &amp; was not converted to &). In fact, however, the actual result of executing the XPath expression is this:

abc&def

The entity is resolved.

references are resolved.

Important lesson: Distinguish the content of the text node from its visual representation. The XPath spec doesn't say anything about the visual representation.

Also distinguish the in-memory node tree representing your document and the view of that tree represented by XPath queries.
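Both distinctions can be seen in a few lines. This is a sketch assuming Python's xml.dom.minidom: the data attribute holds the content of the text node, while toxml() writes the same node back out as markup, which is one possible visual representation:

```python
from xml.dom import minidom

root = minidom.parseString("<Test>abc&amp;def</Test>").documentElement
text_node = root.firstChild

child_count = len(root.childNodes)  # how many nodes the reference produced
content = text_node.data            # the characters held by the text node
serialized = text_node.toxml()      # the same node written back out as XML markup

print(child_count)  # 1
print(content)      # abc&def
print(serialized)   # abc&amp;def
```

A tool that prints the serialized form is showing markup, not the value of the node.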

/Roger

David