Re: [xml-dev] Why isn't the semicolon a reserved character?


From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Sat, 15 Mar 2014 19:33:31 -0400

At 2014-03-15 21:41 +0000, Costello, Roger L. wrote:

This XML document is not well-formed:

<Document>
]]>
</Document>

Why? Because the XML parser sees that and thinks that the > symbol marks the end of a CDATA section;

False. The "]]>" marks the end of a CDATA section:

http://www.w3.org/TR/2008/REC-xml-20081126/#NT-CDEnd

A simple ">" in parsed character data is not a problem when it is not preceded by two right square brackets. This comes up in my XML syntax class (which, since December, has been available for streaming on Pluralsight).

The following is well-formed as the simple greater-than symbol does not mark the end of a CDATA section:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
This is a > greater-than symbol.
</doc>

the XML parser throws an error since there is no preceding <![[CDATA

To be precise in a way that answers a later question below, it throws an error because, at the point where the CDATA end delimiter "]]>" was encountered, the parser was not inside a CDATA section. BTW, you mistyped the opening delimiter ... the start of a CDATA section is <![CDATA[ per:

http://www.w3.org/TR/2008/REC-xml-20081126/#NT-CDStart

The > symbol must be escaped like so:

<Document>
]]&gt;
</Document>
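
The three cases discussed so far (a bare "]]>", a lone ">", and the escaped "]]&gt;") can be checked by handing them to a parser. This is only a sketch, assuming Python's standard-library xml.etree.ElementTree (which sits on top of expat); the helper name is_well_formed is made up for illustration, and other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Hypothetical helper: True if the string parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# "]]>" in content, outside any CDATA section: rejected.
bare_cdend = "<Document>\n]]>\n</Document>"
# A lone ">" in content: accepted.
plain_gt = "<doc>\nThis is a > greater-than symbol.\n</doc>"
# The same three characters with the ">" escaped: accepted.
escaped = "<Document>\n]]&gt;\n</Document>"

print(is_well_formed(bare_cdend))
print(is_well_formed(plain_gt))
print(is_well_formed(escaped))
```

Only the three-character sequence is the problem; the ">" on its own never is.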

Now consider the ; symbol. It marks the end of an entity reference.

This is a well-formed XML document:

<Document>
A;B
</Document>

Why doesn't the XML parser see that and think that the ; marks the end of an entity reference; why doesn't the XML parser throw an error since there is no preceding & symbol?

Because an entity reference is not a "section" of parsed data ... it is a concise markup construct. It is easy to detect the end of an entity reference:

http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EntityRef

Note how the content of an entity reference is a simple name.
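
To illustrate that the semicolon only matters once an "&" has opened a reference, here is a small sketch, again assuming Python's standard-library xml.etree.ElementTree; parse_text is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

def parse_text(text):
    """Hypothetical helper: the root element's text, or None if not well-formed."""
    try:
        return ET.fromstring(text).text
    except ET.ParseError:
        return None

# A semicolon with no preceding "&" is ordinary character data.
semicolon_only = parse_text("<Document>A;B</Document>")
# "&" Name ";" is a complete entity reference and is expanded.
full_reference = parse_text("<Document>A&amp;B</Document>")
# "&" that never reaches its ";" is the error case.
unterminated = parse_text("<Document>A&B</Document>")

print(semicolon_only)
print(full_reference)
print(unterminated)
```

It is the "&" that puts the parser into "reading a reference" mode; the ";" merely closes what the "&" opened.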

The content of a CDATA section is far more complex and so is described using a wildcard:

http://www.w3.org/TR/2008/REC-xml-20081126/#NT-CData

Note the interesting quirk that within a CDATA section there is no such thing as an embedded CDATA section ... the following is well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
This is a <![CDATA[ section <![CDATA[ <![CDATA[ <![CDATA[ <![CDATA[ ]]>
</doc>
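
A sketch of what a parser reports for that document, assuming Python's standard-library xml.etree.ElementTree: the extra "<![CDATA[" strings come back as plain text, and a second "]]>" (the variable two_ends is my own illustration, not from the example above) makes the document fail.

```python
import xml.etree.ElementTree as ET

doc = ("<doc>\n"
       "This is a <![CDATA[ section <![CDATA[ <![CDATA[ <![CDATA[ <![CDATA[ ]]>\n"
       "</doc>")
root = ET.fromstring(doc)

# Only the first "<![CDATA[" is markup; the other four are character data.
print(repr(root.text))
nested_starts = root.text.count("<![CDATA[")
print(nested_starts)

# There is no nesting, so a second "]]>" has nothing left to close.
two_ends = "<doc><![CDATA[ <![CDATA[ ]]> ]]></doc>"
try:
    ET.fromstring(two_ends)
    two_ends_ok = True
except ET.ParseError:
    two_ends_ok = False
print(two_ends_ok)
```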

CDATA sections are not allowed in attributes, while entity references are.
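
A short sketch of that difference, assuming Python's standard-library xml.etree.ElementTree; the attribute name "note" is made up for the example. An attribute value may not contain a literal "<", so the CDATA start delimiter can never be recognized there.

```python
import xml.etree.ElementTree as ET

# An entity reference inside an attribute value is expanded as usual.
with_entity = '<doc note="A &amp; B"/>'
note_value = ET.fromstring(with_entity).get("note")
print(note_value)

# A CDATA section inside an attribute value is not well-formed.
with_cdata = '<doc note="<![CDATA[A & B]]>"/>'
try:
    ET.fromstring(with_cdata)
    cdata_in_attribute_ok = True
except ET.ParseError:
    cdata_in_attribute_ok = False
print(cdata_in_attribute_ok)
```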

Parsed character data and character data sections are simply "different" and so are treated differently when parsing.

Why isn't the ; symbol a reserved symbol?

What do you mean by "reserved"?

It isn't available as a built-in character entity because it isn't needed to disambiguate otherwise ambiguous strings found in parsed character data.
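
XML predefines only five entities (lt, gt, amp, apos, quot), and none of them stands for the semicolon. As a sketch, assuming Python's standard-library xml.sax.saxutils.escape (which by default rewrites "&", "<" and ">"), the semicolon passes through untouched and the text still round-trips:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "A;B < C & D ]]> E"

# Only "&", "<" and ">" are rewritten; the ";" in "A;B" is left alone.
escaped = escape(raw)
print(escaped)

# Parsing the escaped text gives back the original string.
root = ET.fromstring("<Document>" + escaped + "</Document>")
print(root.text == raw)
```

The only semicolons the escaping introduces are the ones that terminate the references it writes.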

And it just is, as it was in SGML and so is in XML.

I hope this helps.

. . . . . . Ken

--
Public XSLT, XSL-FO, UBL & code list classes: Melbourne, AU May 2014 |
Contact us for world-wide XML consulting and instructor-led training |
Free 5-hour lecture: http://www.CraneSoftwrights.com/links/udemy.htm |
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/x/ |
G. Ken Holman mailto:gkholman@CraneSoftwrights.com |
Google+ profile: http://plus.google.com/+GKenHolman-Crane/about |
Legal business disclaimers: http://www.CraneSoftwrights.com/legal |


References:
- Why isn't the semicolon a reserved character?
  - From: "Costello, Roger L." <costello@mitre.org>
