Re: [xml-dev] Why the double escape for lt ? (that is <!ENTITY lt "&#38;#60;"> )

Thank you Ghislain,

Your comment was useful, and helped pull my mind out of the rut it was in. Still, it doesn’t really seem to offer a definitive reason why internal and external entities are processed differently.

Without the double escaping, the entity values would already be & and <, meaning that they would be recognized, quoting the spec, "as though it were part of the document at the location the reference was recognized", i.e., they would be confused with the start of a reference or of a tag. Here is an example of a document that is not well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY bar "&#60;">
]>
<foo>&bar;</foo>

This is a great example. I think I’ve been caught up, partially, on why &q; doesn’t thus barf in the following.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY bar "&#38;#60;">
  <!ENTITY q "&#34;">
]>
<foo blah="&q;">&bar;</foo>

But, I guess this makes sense if one takes the algorithm for attribute value normalization as the primary way of processing the attribute values. This does make parsing of AttValue substantially different than that of content... I’m not sure I’d gotten that pounded deeply enough into my head until now, so thank you.
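
One way to check this is to run the document above through a parser that expands internal entities. The following is only a sketch, assuming Python's standard-library xml.etree.ElementTree (which uses expat underneath); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The document from this message, passed as bytes so that the
# encoding declaration is handled by the parser itself.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY bar "&#38;#60;">
  <!ENTITY q "&#34;">
]>
<foo blah="&q;">&bar;</foo>"""

root = ET.fromstring(DOC)

# In the attribute value, the replacement text of q (a bare quotation
# mark) is included as data and does not terminate the attribute.
attr_value = root.get("blah")

# In content, the replacement text of bar is "&#60;", which resolves
# to a literal "<" when the reference is processed.
text_value = root.text

print(repr(attr_value), repr(text_value))
```

The attribute comes back as a single quotation mark and the element content as a single less-than sign, which matches the reading that AttValue is handled by attribute-value normalization rather than by the content rules.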

As for the difference of treatment, my mental picture is that a parsed external entity already looks like an XML document (a fragment), which eliminates the need for the first round of parsing.

I agree with you. Yet it still seems like an “arbitrary” decision, even given that. If one simply assumed that an internal entity looked like an XML document, there’s no harm that I can see. In both cases the replacement text is just text that then needs to be processed. (The only difference I see is that external entities can’t be used in attribute values, while internal ones can. But that, also, doesn’t lead me to any contradictions.)

I’d be happier if there were a really specific reason this is needed. Perhaps the answer is located in the greater land of SGML (where I’ve not investigated yet)?

Thanks again for the quick response!

david






On Nov 4, 2016, at 12:32 AM, Ghislain Fourny <gfourny@inf.ethz.ch> wrote:

Hi David,

The double escaping is needed because the literal entity value of an internal entity is actually parsed twice: once when the entity declaration is processed to obtain the entity value, and once when the entity is referenced.

There is an informal explanation here:

This means that the value of the entity amp, after a first round of parsing, will be &#38;, and that of the entity lt will be &#60;. When referred to with &amp; or &lt;, the character reference will then be resolved to & and to < in the content.
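
The two rounds can be imitated with a small sketch. This is not how a real XML processor is structured (the second round in a real parser also recognizes markup and entity references); it only models "expand character references once per pass". The helper name expand_char_refs is made up for the illustration.

```python
import re

# Matches decimal (&#60;) and hexadecimal (&#x3C;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def expand_char_refs(text: str) -> str:
    """Expand each character reference in `text` exactly once (one pass)."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    # re.sub scans left to right and never rescans its own output,
    # so the "&" produced from "&#38;" is not expanded again here.
    return CHAR_REF.sub(repl, text)

literal_value = "&#38;#60;"                          # as written in the declaration
replacement_text = expand_char_refs(literal_value)   # round 1: declaration processed
content = expand_char_refs(replacement_text)         # round 2: reference processed

print(replacement_text, content)
```

After round 1 the value is the five characters &#60; and only round 2 turns it into <. A single-escaped "&#60;" would already be a bare < after round 1, which is what a real parser then sees as the start of markup.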

Without the double escaping, the entity values would already be & and <, meaning that they would be recognized, quoting the spec, "as though it were part of the document at the location the reference was recognized", i.e., they would be confused with the start of a reference or of a tag. Here is an example of a document that is not well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY bar "&#60;">
]>
<foo>&bar;</foo>
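
The behaviour can be observed with any parser that expands internal entities. Below is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree (expat underneath); parse_with_bar is a hypothetical helper written for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

def parse_with_bar(entity_value: bytes):
    """Parse <foo>&bar;</foo> with bar declared as `entity_value`.

    Returns the text content of <foo>, or None if the parser
    rejects the document as not well-formed.
    """
    doc = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
           b'<!DOCTYPE foo [\n'
           b'  <!ENTITY bar "' + entity_value + b'">\n'
           b']>\n'
           b'<foo>&bar;</foo>')
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None

single = parse_with_bar(b"&#60;")      # replacement text is a bare "<"
double = parse_with_bar(b"&#38;#60;")  # replacement text is "&#60;"

print(single, double)
```

With the single escape the parser reports an error, because the replacement text is an unfinished tag; with the double escape the content of foo is the one character <.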

As for the difference of treatment, my mental picture is that a parsed external entity already looks like an XML document (a fragment), which eliminates the need for the first round of parsing.

My explanations may not use the exact terminology, but I hope I could share my mental picture accurately. :-)

Kind regards,
Ghislain


From: David John Burrowes [biede0@gmail.com]
Sent: Friday, November 04, 2016 7:03 AM
To: xml-dev@lists.xml.org
Subject: [xml-dev] Why the double escape for lt ? (that is <!ENTITY lt "&#38;#60;"> )

Hello my dear XML friends,

In the XML spec, it says that the entities lt and amp must be declared (as internal entities) with double-escaping, e.g.
<!ENTITY lt "&#38;#60;">
I see that this follows from the manner that the replacement text for internal entities is defined (that internal entities have character references expanded, but external entities do not).

This leads to two questions for me:
1) Why are internal and external entities’ replacement text handled differently?
2) What is an example of a case where, if the double-escaping weren’t done, you would get a non-well-formed result (as implied by 4.6 ¶ 2)?

Thank you,

David


Appendix:

Red highlighting added by me to emphasize the key points above.

4.5 Construction of Entity Replacement Text

In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding white space) if there is one but without any replacement of character references or parameter-entity references.]


4.6 Predefined Entities

[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]

All XML processors must recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, like any others, before using them. If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is optional but harmless). For example:

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">


