xml-dev - Re: [xml-dev] UTF-8+names


To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] UTF-8+names
From: Joe English <jenglish@flightlab.com>
Date: Sun, 19 Oct 2003 13:42:12 -0700
In-reply-to: <3F92DDDE.6050609@textuality.com>
References: <3F92DDDE.6050609@textuality.com> <200310191630.RAA23060@e3000>

Tim Bray wrote:

> Well, there's no doubt that +names is optimized for the needs of XML
> users, in that it defines lots of things like &eacute; but *doesn't*
> define the XML magic 5; this means that &lt; and &amp; and so on go
> through untouched, which is what you need for the purposes of XML users.

That's the biggest problem I have with the proposed encoding,
actually.

A human reader staring at a chunk of UTF-8+names-encoded
text can't readily tell if "&xxx;" is really (1) part of
the encoding, to be replaced by a real character before being
fed to the parser, or (2) not part of the encoding, to be
passed through to the parser unchanged, where it will then
either (2a) be interpreted as an XML entity reference or
(2b) signal an error.
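
The three cases can be made concrete with a rough sketch. The name
table below is a hypothetical two-entry stand-in, not the actual
UTF-8+names list, and classify() and classify_all() are illustrative
helpers rather than part of any proposal or library. The point is that
the answer depends on a table the reader cannot see in the text itself.

```python
import re

# Hypothetical subset of the names a UTF-8+names decoder would replace.
# The real table is defined by the proposal; these entries are assumptions.
ENCODING_NAMES = {"eacute": "\u00e9", "nbsp": "\u00a0"}

# The five entities predefined by XML, which the encoding leaves untouched.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Simplified pattern for "&name;" (real XML names allow more characters).
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def classify(name, declared=frozenset()):
    """Say which of the three cases a reference "&name;" falls into."""
    if name in ENCODING_NAMES:
        return "1: replaced by the decoder before parsing"
    if name in XML_PREDEFINED or name in declared:
        return "2a: passed through, XML entity reference"
    return "2b: passed through, parser signals an error"


def classify_all(text, declared=frozenset()):
    """Classify every "&name;" occurrence found in text, in order."""
    return [(m.group(0), classify(m.group(1), declared))
            for m in REFERENCE.finditer(text)]


for ref, case in classify_all("caf&eacute; &lt; &foo;"):
    print(ref, "->", case)
```

Passing a set of names as `declared` models entities declared in a DTD,
which moves a reference from case (2b) to case (2a) without changing a
single byte of the text.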

I get the feeling this is just asking for trouble.

Just think of what will happen when people start publishing
RSS feeds with UTF-8+names-encoded double-escaped XHTML
in the <description> element ('cause you know they will).
Now pretend you're a DPH and you see "&amp;&amp;&semicolon;"
in one of these feeds.  Answer fast: what does that mean?
Can you write a regexp that will process it correctly?
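
As a minimal sketch of why the question has no quick answer, the same
string can be run through two layers under two different assumptions.
Whether "semicolon" is actually one of the names defined by the
encoding is not stated here, so both readings are shown. decode_names()
and parse_predefined() are hypothetical helpers; the second is only a
stand-in for an XML parser and, unlike a real one, does not reject
undeclared entities.

```python
import re
from xml.sax.saxutils import unescape

# Simplified pattern for "&name;" (real XML names allow more characters).
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_names(text, names):
    """Layer 1: replace only references whose name is in the given table."""
    return REFERENCE.sub(lambda m: names.get(m.group(1), m.group(0)), text)


def parse_predefined(text):
    """Layer 2: stand-in for the parser; resolves &lt; &gt; &amp; only."""
    return unescape(text)


sample = "&amp;&amp;&semicolon;"

# Reading A: assume the encoding defines "semicolon" (an assumption).
with_name = parse_predefined(decode_names(sample, {"semicolon": ";"}))

# Reading B: assume it does not, so the reference reaches the parser,
# where a real XML parser would report an undeclared entity.
without_name = parse_predefined(decode_names(sample, {}))

print(repr(with_name))
print(repr(without_name))
```

Under reading A the element content comes out as "&&;", and under
reading B the reference survives to the parser. Neither result is
sensible double-escaped XHTML, and a single regexp cannot tell the two
readings apart without knowing the name table.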

--Joe English

  jenglish@flightlab.com

References:
- Re: [xml-dev] UTF-8+names
  - From: Tim Bray <tbray@textuality.com>
