Re: [xml-dev] Determining the text of a leaf node is wicked hard

Defining the rules might be hard (for example, deciding whether comments are significant or not). There are at least 3 specs that do it very thoroughly (DOM, Infoset, and XDM) and they all do it slightly differently, but each is completely clear.
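As a rough illustration of how one of these models exposes the pieces, here is a minimal sketch that uses Python's standard xml.dom.minidom as a stand-in for a DOM implementation. The helper names child_kinds and dom_text are invented for this example, and skipping comments and PIs when joining is just one possible rule, assumed here for illustration.

```python
from xml.dom import minidom, Node


def child_kinds(xml):
    """Return the DOM node names of the root element's children."""
    root = minidom.parseString(xml).documentElement
    return [child.nodeName for child in root.childNodes]


def dom_text(xml):
    """Join text and CDATA children; skip comments and PIs (an assumed rule)."""
    root = minidom.parseString(xml).documentElement
    wanted = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    return "".join(c.data for c in root.childNodes if c.nodeType in wanted)


# The comment and the CDATA section show up as separate child nodes.
print(child_kinds("<Test>John, <!-- blah, blah -->Paul, and Ringo</Test>"))
print(child_kinds("<Test>Equation <![CDATA[A < B]]> done</Test>"))
# Applying the assumed rule yields one string.
print(dom_text("<Test>John, <!-- blah, blah -->Paul, and Ringo</Test>"))
```

In this model the tree keeps comments and CDATA sections as distinct nodes, so whether they count towards "the text" is decided by whoever walks the children, which is where the rule definition comes in.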

Once you've defined the rules clearly and unambiguously, implementing them in a parser is not difficult at all.
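To make that concrete, here is a small sketch that relies on an existing parser rather than a hand-written scanner. It assumes Python's standard xml.etree.ElementTree, whose default tree builder drops comments and PIs and reports CDATA as ordinary character data; the function name leaf_text is made up for this example.

```python
import xml.etree.ElementTree as ET


def leaf_text(xml):
    """Parse one element and return its character content as a single string."""
    elem = ET.fromstring(xml)
    # itertext() yields the text held by the element (and any descendants).
    return "".join(elem.itertext())


samples = [
    "<Test>Hello, world</Test>",
    "<Test>Harper &amp; Row</Test>",
    "<Test>Equation <![CDATA[A < B]]> done</Test>",
    "<Test>John, <!-- blah, blah -->Paul, and Ringo</Test>",
    "<Test>&#65;<?skip me?>BC</Test>",  # numeric character reference and a PI
]

for sample in samples:
    print(repr(leaf_text(sample)))
```

Under these assumptions the parser resolves the references, unwraps the CDATA section, and skips the comment and PI, so each sample comes back as one string. Whitespace on either side of a dropped comment is kept as written.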

Michael Kay
Saxonica

> On 9 Feb 2022, at 23:31, Roger L Costello <costello@mitre.org> wrote:
> 
> Hi Folks,
> 
> What's the content of this leaf element:
> 
> <Test>Hello, world</Test>
> 
> The content is "Hello, world"
> 
> Easy, right?
> 
> Not so fast.
> 
> Let's look at some other leaf elements.
> 
> What's the content of this leaf element:
> 
> <Test>Harper &amp; Row</Test>
> 
> The text inside the tag is interrupted by an entity reference. The reference must be resolved and then spliced together with the text before and after it. The content is this: "Harper & Row"
> 
> How about this, what's its content:
> 
> <Test>Equation <![CDATA[A < B]]> done</Test>
> 
> The text inside the tag is interrupted by a CDATA section. The data inside the CDATA section must be extracted, the CDATA syntax discarded, and then the remaining items spliced together. The content is this: "Equation A < B done"
> 
> Here's a leaf element that has a comment:
> 
> <Test>John, <!-- blah, blah -->Paul, and Ringo</Test>
> 
> The text inside the tag is interrupted by a comment. The comment must be discarded, and the remaining items spliced together. The content is this: "John, Paul, and Ringo"
> 
> Now let's mix things together:
> 
> <Test> Harper &amp; Row. Equation <![CDATA[A < B]]> done. John, <!-- blah, blah --> Paul, and Ringo</Test>
> 
> The text inside the tag is interrupted by an entity reference, a CDATA section, and a comment. The reference must be resolved, the data in the CDATA section extracted, the CDATA syntax discarded, the comment discarded, and then the remaining items spliced together. The content is this: " Harper & Row. Equation A < B done. John,  Paul, and Ringo" (the leading space and the spaces on both sides of the comment are part of the content)
> 
> There are also numeric character references and PIs to handle. Anything else?
> 
> Imagine trying to write a lexical analyzer (scanner) to handle all these cases, and generate a single text node. Not a trivial task. It will be wicked hard.
> 
> /Roger
> 


