This is an interesting example.
Out of curiosity, I tried the following XSD 1.1 validation example with your sample, and it works fine.
XML instance document:
<?xml version="1.0"?>
<x>
<p>Harper &amp; Row. Equation <![CDATA[A < B]]> done. John, <!-- blah, blah --> Paul, and Ringo</p>
</x>
XSD 1.1 document (which considers the XML instance document above valid):
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="x">
<xs:complexType>
<xs:sequence>
<xs:element name="p">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:assertion test="$value = 'Harper &amp; Row. Equation A &lt; B done. John, Paul, and Ringo'"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Hi Folks,
What's the content of this leaf element:
<Test>Hello, world</Test>
The content is "Hello, world"
Easy, right?
Not so fast.
Let's look at some other leaf elements.
What's the content of this leaf element:
<Test>Harper &amp; Row</Test>
The text inside the tag is interrupted by an XML entity. The XML entity must be resolved and then spliced together with the text before and after the entity. The content is this: "Harper & Row"
How about this, what's its content:
<Test>Equation <![CDATA[A < B]]> done</Test>
The text inside the tag is interrupted by a CDATA section. The data inside the CDATA section must be extracted, the CDATA syntax discarded, and then the remaining items spliced together. The content is this: "Equation A < B done"
Here's a leaf element that has a comment:
<Test>John, <!-- blah, blah -->Paul, and Ringo</Test>
The text inside the tag is interrupted by a comment. The comment must be discarded, and the remaining items spliced together. The content is this: "John, Paul, and Ringo"
Now let's mix things together:
<Test> Harper &amp; Row. Equation <![CDATA[A < B]]> done. John, <!-- blah, blah --> Paul, and Ringo</Test>
The text inside the tag is interrupted by an entity, a CDATA section, and a comment. The entity must be resolved, the data in the CDATA section extracted, the CDATA syntax discarded, the comment must be discarded, and then the remaining items spliced together. The content is this: "Harper & Row. Equation A < B done. John, Paul, and Ringo"
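For what it's worth, here is a minimal sketch of that splicing as an off-the-shelf parser performs it. It assumes Python's standard xml.etree.ElementTree with its default settings, under which comments are dropped rather than kept as nodes. The helper name leaf_content is made up for illustration, and the samples put no spaces around the comment so the expected strings are unambiguous.

```python
import xml.etree.ElementTree as ET

def leaf_content(xml_text: str) -> str:
    """Return the character content of a leaf element as the parser reports it.

    Hypothetical helper: the parser resolves entity references, unwraps
    CDATA sections, and (by default) discards comments, so the pieces
    arrive already spliced into one string.
    """
    elem = ET.fromstring(xml_text)
    # itertext() yields every text fragment inside the element, in order.
    return "".join(elem.itertext())

samples = [
    "<Test>Harper &amp; Row</Test>",
    "<Test>Equation <![CDATA[A < B]]> done</Test>",
    "<Test>John, <!-- blah, blah -->Paul, and Ringo</Test>",
]

for sample in samples:
    print(repr(leaf_content(sample)))
# 'Harper & Row'
# 'Equation A < B done'
# 'John, Paul, and Ringo'
```

The point is only that the parser already hands back one string; it says nothing about how hard it is to write that parser yourself.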
There are also numeric character references (such as &#38;) and processing instructions (PIs) to handle. Anything else?
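Those two cases can be checked the same way. This is again only a sketch, assuming Python's standard xml.etree.ElementTree with default settings (processing instructions are not kept in the tree); the helper name text_of and the PI target "note" are invented for the example.

```python
import xml.etree.ElementTree as ET

def text_of(xml_text: str) -> str:
    """Hypothetical helper: concatenate all character data in the element."""
    return "".join(ET.fromstring(xml_text).itertext())

# &#38; (decimal) and &#x26; (hexadecimal) are both numeric character
# references for '&'; the processing instruction contributes no text.
sample = "<Test>Harper &#38; Row<?note ignore me?> &#x26; Sons</Test>"
print(repr(text_of(sample)))
# 'Harper & Row & Sons'
```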
Imagine trying to write a lexical analyzer (scanner) to handle all these cases, and generate a single text node. Not a trivial task. It will be wicked hard.