> -----Original Message-----
> From: Tim Bray [mailto:email@example.com]
> Sent: Saturday, October 18, 2003 22:46
> To: Simon St.Laurent
> Cc: firstname.lastname@example.org
> Subject: Re: [xml-dev] UTF-8+names
> Simon St.Laurent wrote:
> >>Of course it's cunningly designed to look like an architectural
> >>change, that allows such syntax as: <é/>
> Yow. I hadn't thought of that. (Hmm, somehow I missed
> David's message;
> xml-dev acting up again?)
> > That is therefore an enormous processing model change. This is way
> > beyond surrogates. The potential for further disruption on this
> > precedent seems downright boundless.
> Hmm, it's just an idiotically simple filter that replaces a bunch of
> hardwired patterns with hardwired Unicode code points. Hardly feels
> like a processing model change.
I have another problem with it.
In UTF-8 and UTF-16, there is a single bit pattern for each Unicode
character. UTF-8+names introduces a kind of non-canonicality in the
encoding itself, which concerns me a little.
There are two cases:
1) a character such as NON-BREAK SPACE can be encoded in two different
ways, either as in UTF-8, or as the replacement 0x26 0x6E 0x62 0x73 0x70 0x3B
2) AMPERSAND can be encoded in two different ways, either as in UTF-8, or as
the replacement 0x26 0x26 0x3B
While (1) is always true for all characters that have a replacement defined
for them (except AMPERSAND), (2) is true if and only if the AMPERSAND is NOT
followed by certain characters and then by a SEMICOLON, the entire sequence
being the same as one of the defined replacements.
This lack of canonicality in the encoding implies that a conversion from
UTF-8 (or UTF-16) to UTF-8+names does not always produce the same result for
the same input.
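To make the two cases concrete, here is a minimal sketch of a decoder in Python. It is only an illustration under assumptions: the one-entry NAMES table and the function name decode_utf8_names are hypothetical, and the treatment of an ampersand that does not start a defined replacement (passed through as a literal) is my reading of the rules described above, not the text of the proposal.

```python
# Hypothetical one-entry name table; a real table would define many names.
NAMES = {b"nbsp": "\u00a0"}


def decode_utf8_names(data: bytes) -> str:
    """Decode UTF-8 in which '&name;' and '&&;' are byte-level replacements."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0x26:  # '&'
            end = data.find(b";", i + 1)
            if end != -1:
                name = data[i + 1:end]
                if name == b"&":  # '&&;' stands for a literal ampersand
                    out.append("&")
                    i = end + 1
                    continue
                if name in NAMES:  # '&nbsp;' and friends
                    out.append(NAMES[name])
                    i = end + 1
                    continue
            # Assumption: an '&' that starts no defined replacement is literal.
            out.append("&")
            i += 1
            continue
        # Copy the run of plain UTF-8 up to the next '&'. The byte 0x26 never
        # occurs inside a multi-byte UTF-8 sequence, so the split is safe.
        nxt = data.find(b"&", i)
        if nxt == -1:
            nxt = len(data)
        out.append(data[i:nxt].decode("utf-8"))
        i = nxt
    return "".join(out)
```

With this sketch, the byte strings b"a\xc2\xa0b" and b"a&nbsp;b" decode to the same three characters, and so do b"x & y" and b"x &&; y", which is exactly the many-to-one mapping I am worried about.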
Also, I wonder about current XML tools. If a program uses an internal
representation of Unicode characters, how should it generate a UTF-8+names
encoding? Unlike the characters that make up entity references and numeric
character references (which are individual Unicode characters), the *bytes*
that make up the replacement names of UTF-8+names are not individual Unicode
characters and so don't have a representation as such.
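A sketch of the generating side shows where the choice comes from. Again this is an assumption-laden illustration, not anything the proposal defines: the NAMES table, the function name encode_utf8_names and the prefer_names switch are all hypothetical, and always writing AMPERSAND as 0x26 0x26 0x3B is just one conservative policy an encoder could pick.

```python
# Hypothetical one-entry table mapping characters to replacement names.
NAMES = {"\u00a0": b"nbsp"}


def encode_utf8_names(text: str, prefer_names: bool) -> bytes:
    """Encode text, optionally writing named replacements instead of UTF-8."""
    out = bytearray()
    for ch in text:
        if ch == "&":
            # Conservative policy (an assumption): always escape AMPERSAND,
            # so it can never be mistaken for the start of a replacement.
            out += b"&&;"
        elif prefer_names and ch in NAMES:
            out += b"&" + NAMES[ch] + b";"
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

The same input string yields different bytes depending on a switch that exists only in the encoder: the bytes of the replacement are produced below the level of Unicode characters, so nothing in the program's internal string can say which form is wanted.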
If you view an XML document as a string of Unicode characters, the entity
references and numeric character references are there, but the UTF-8+names
replacements are not there (they are resolved on decoding and are generated
on encoding).
I think the introduction of UTF-8+names would be a *big* change indeed, with
a serious impact on existing XML tools and (some) applications.
> > I wrote a piece on XML as a disruptive technology a few
> years ago,
> > but I can't say I expected XML to drill into the Unicode layer and
> > modify the very notion of a character encoding.
> UTF-8+names doesn't depend on XML, I can think of other
> applications for
> it. Anyhow Unicode character encodings in widespread use have been
> cooked up by ANSI, ISO, JIS, and even Bell Labs (that's where
> UTF-8 came
> from). The notion of inventing a new encoding to better serve
> application needs is hardly radical. The bar to entry is
> that you have
> to have a clear and transparent mapping to Unicode code points, which
> UTF-8+names does.
> Cheers, Tim Bray (http://www.tbray.org/ongoing/)
> The xml-dev list is sponsored by XML.org
> <http://www.xml.org>, an initiative of OASIS