Hi David,
The double escaping is needed because the literal entity value of an internal entity is actually parsed twice: once when the entity declaration is processed to obtain the entity value, and once when the entity is referenced.
There is an informal explanation here:
https://www.w3.org/TR/REC-xml/#sec-entexpand
This means that the value of the entity amp, after a first round of parsing, will be &#38;, and that of the entity lt will be &#60;. When the entity is referred to with &amp; or &lt;, the character reference will then be resolved to & and to < in the content.
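To make the two rounds concrete, here is a minimal Python sketch that mimics them with a regular expression. It is not how a real XML processor is implemented; the helper name expand_char_refs is hypothetical, and it only handles numeric character references.

```python
import re

# Matches a numeric character reference: decimal (&#60;) or hexadecimal (&#x3C;).
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def expand_char_refs(text):
    """Hypothetical helper: one parsing pass that replaces each numeric
    character reference with the character it denotes."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return CHAR_REF.sub(repl, text)

# Literal entity value as written in the declaration of lt.
literal_lt = "&#38;#60;"

# Round 1: processing the declaration yields the replacement text.
replacement_lt = expand_char_refs(literal_lt)

# Round 2: when the entity is referenced, the remaining character
# reference is resolved to character data.
in_content = expand_char_refs(replacement_lt)

print(repr(replacement_lt), repr(in_content))
```

The replaced text is not rescanned within a single pass, which is why the first round stops at a character reference instead of going all the way to the bare character.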
Without the double escaping, the entity values would already be & and <, meaning that they would be recognized, quoting the spec, "as though it were part of the document at the location the reference was recognized", i.e., they would be confused with the start of a reference or of a tag. Here is an example of a document that is not well-formed:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY bar "&#60;">
]>
<foo>&bar;</foo>
As for the difference of treatment, my mental picture is that a parsed external entity already looks like an XML document (a fragment), which eliminates the need for the first round of parsing.
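If you want to try this yourself, the following sketch feeds both variants to Python's standard xml.etree.ElementTree parser (which is backed by Expat). The helper parse_with_bar and the entity name bar are only there for illustration, and the exact error message depends on the parser version, so I am not quoting one.

```python
import xml.etree.ElementTree as ET

def parse_with_bar(literal_value):
    """Hypothetical helper: parse a tiny document in which the entity bar
    is declared with the given literal entity value, and return the text
    content of the root element."""
    doc = f'<!DOCTYPE foo [ <!ENTITY bar "{literal_value}"> ]><foo>&bar;</foo>'
    return ET.fromstring(doc).text

# Double escaping: the replacement text is a character reference, which is
# resolved to character data when &bar; is referenced.
double_escaped = parse_with_bar("&#38;#60;")
print("double escaped:", repr(double_escaped))

# Single escaping: the replacement text is already a bare "<", which is
# taken as the start of a tag when &bar; is referenced.
try:
    parse_with_bar("&#60;")
    single_escaped_ok = True
except ET.ParseError as err:
    single_escaped_ok = False
    print("single escaped, not well-formed:", err)
```

The declaration itself is accepted in both cases; the error only appears once the entity is referenced in content.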
My explanations may not use the exact terminology, but I hope I could share my mental picture accurately. :-)
Kind regards,
Ghislain
From: David John Burrowes [biede0@gmail.com]
Sent: Friday, November 04, 2016 7:03 AM
To: xml-dev@lists.xml.org
Subject: [xml-dev] Why the double escape for lt ? (that is <!ENTITY lt "&#38;#60;"> )
Hello my dear XML friends,
In the XML spec, it says that the entities lt and amp must be declared (as internal entities) with double escaping, e.g.
<!ENTITY lt "&#38;#60;">
I see that this follows from the way the replacement text for internal entities is defined (internal entities have character references expanded, but external entities do not).
This leads to two questions for me:
1) Why are internal and external entities’ replacement text handled differently?
2) What is an example of a case where, if the double escaping wasn’t done, you would get a non-well-formed result (as implied by 4.6 ¶ 2)?
Thank you,
David
Appendix:
Red highlighting added by me to emphasize the key points above.
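4.5 Construction of Entity Replacement Text
In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding white space) if there is one but without any replacement of character references or parameter-entity references.]
4.6 Predefined Entities
[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]
All XML processors must recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, like any others, before using them. If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is optional but harmless). For example:
<!ENTITY lt "&#38;#60;">
<!ENTITY gt "&#62;">
<!ENTITY amp "&#38;#38;">
<!ENTITY apos "&#39;">
<!ENTITY quot "&#34;">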