Re: [xml-dev] Double escaping

From: Peter Flynn <peter@silmaril.ie>
To: xml-dev@lists.xml.org
Date: Sun, 6 Nov 2016 21:20:21 +0000

On 11/06/2016 08:32 PM, Tim Bray wrote:
> Lauren said “why are entity declarations for < and & double-escaped?” I
> said “huh?” She said “xml-dev is arguing about it”.  So, hi again.  
> 
> This is sort of interesting.  If you go all the way back to the first
> edition of the spec and look at section 4.6, after the examples there's
> a one-liner that says:
> 
> Note that the |<| and |&| characters in the declarations of "|lt|" and
> "|amp|" are doubly escaped to meet the requirement that entity
> replacement be well-formed.

As you say below, interoperability. The entity files which accompanied
most SGML DTDs declared non-latin-alphanumerics as SDATA, which
sidestepped the issue by pushing the onus for instantiation onto the
local implementors.

> This is gone, and the paragraph before the examples is re-written, in
> all subsequent revisions starting with the Second Edition.  It’s a long
> time since I’ve been inside an XML parser, and I confess I don’t fully
> grok why the recommendation says:
> 
> <!ENTITY lt     "&#38;#60;">
> Rather than just 
> <!ENTITY lt "&#60;">

With no SDATA, the second form fails: the character reference is
expanded when the declaration is read, so it gets seen as
<!ENTITY lt "<">. But <!ENTITY lt "&#38;#60;"> gets seen as
<!ENTITY lt "&#60;">, by which time (one pass) it's too late to
re-resolve it to a bare <. Maybe.
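
A minimal sketch of that one-pass behaviour, in Python. The helper names
(replacement_text, is_well_formed, parse_with) and the entity name "less"
are made up for illustration; "less" is used instead of "lt" so the result
does not depend on how a given parser treats redeclarations of the
predefined entities. The regex only imitates the expansion of decimal
character references in the entity literal; the second half hands the same
declarations to the expat-based parser in the standard library.

```python
import re
import xml.etree.ElementTree as ET


def replacement_text(literal: str) -> str:
    """Imitate the single pass made when the declaration is read:
    decimal character references in the literal are expanded once,
    and the result is not scanned again."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), literal)


def parse_with(literal: str) -> str:
    """Declare a hypothetical entity 'less' with the given literal,
    reference it in character data, and return the parsed text."""
    doc = f'<!DOCTYPE doc [<!ENTITY less "{literal}">]><doc>a &less; b</doc>'
    return ET.fromstring(doc).text


def is_well_formed(literal: str) -> bool:
    """True if a document using the declaration parses without error."""
    try:
        parse_with(literal)
        return True
    except ET.ParseError:
        return False


# Single escaping: the stored replacement text is a bare "<", which is
# then parsed as markup wherever the entity is referenced.
single = replacement_text("&#60;")

# Double escaping: only "&#38;" is expanded at declaration time, so the
# stored replacement text is the character reference "&#60;".
double = replacement_text("&#38;#60;")

if __name__ == "__main__":
    print(repr(single), is_well_formed("&#60;"))
    print(repr(double), is_well_formed("&#38;#60;"))
```

With the doubly escaped form the reference ends up as a literal < in the
character data; with the singly escaped form the parser should reject the
document at the point of reference.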

> Now in fact, this is a “for interoperability” which was code for “to
> work with SGML parsers”, and I have never encountered an XML document
> which actually declares &lt; or &amp;  

This is the foundation of a long-standing irritation if you need to
transform XML to TeX, because most parsing software will not allow you
to redeclare lt or amp in the local subset (which you would want to do
so that, for example, &amp; in character data resolves to \& or &lt;
resolves to $<$).
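
The mapping wanted here can at least be sketched as a step applied after
parsing, assuming the redeclaration route is closed. This is only an
illustration in Python: TEX_ESCAPES and tex_text are hypothetical names,
and the table covers just the two characters mentioned above, not the rest
of TeX's special characters.

```python
import xml.etree.ElementTree as ET

# Only the two characters discussed above; a real converter would need
# to handle the other TeX specials (%, $, #, _, {, }, ~, ^, \) as well.
TEX_ESCAPES = str.maketrans({"&": r"\&", "<": "$<$"})


def tex_text(xml_source: str) -> str:
    """Parse the XML, collect its character data, and rewrite the
    characters that &amp; and &lt; resolved to into TeX-safe forms."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext()).translate(TEX_ESCAPES)


if __name__ == "__main__":
    print(tex_text("<p>Flynn &amp; Bray: 1 &lt; 2</p>"))
```

The cost of doing it this way is that the escaping happens on every piece
of character data after the fact, rather than once in the entity
declarations.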

> It’s a long time ago, but I’m pretty sure I didn’t work on any revision
> of the spec after the First Edition; I certainly don’t remember the
> discussion that led up to this change.  By that time, there would have
> been several seasoned XML Processor implementors in the discussion, and
> this would presumably reflect their experience.  

I'm sure there is a good reason for this, but it's a pain in the digital
assets. I assume it's to prevent other character entities (and numeric
entities) being lexed out of existence by the ampersand being
prematurely replaced.

///Peter

References:
- Double escaping
  - From: Tim Bray <tbray@textuality.com>