   RE: [xml-dev] UTF-8+names

> -----Original Message-----
> From: Tim Bray [mailto:tbray@textuality.com] 
> Sent: Saturday, October 18, 2003 22:46
> To: Simon St.Laurent
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] UTF-8+names
> Simon St.Laurent wrote:
> >>Of course it's cunningly designed to look like an architectural 
> >>change, that allows such syntax as: <&eacute;/>
> Yow.  I hadn't thought of that.  (Hmm, somehow I missed David's message;
> xml-dev acting up again?)
> > That is therefore an enormous processing model change.  This is way 
> > beyond surrogates. The potential for further disruption on this 
> > precedent seems downright boundless.
> Hmm, it's just an idiotically simple filter that replaces a bunch of 
> hardwired patterns with hardwired Unicode code points.  Hardly feels 
> like a processing model change.

I have another problem with it.

In UTF-8 and UTF-16, there is a single bit pattern for each Unicode
character.  UTF-8+names introduces a kind of non-canonicality in the
encoding itself, which concerns me a little.

There are two cases:

1) a character such as NON-BREAK SPACE can be encoded in two different
ways, either as in UTF-8 (0xC2 0xA0), or as the replacement
0x26 0x6E 0x62 0x73 0x70 0x3B ("&nbsp;")

2) AMPERSAND can be encoded in two different ways, either as in UTF-8, or as
the replacement   0x26 0x26 0x3B

While (1) holds for every character that has a replacement defined for it
(except AMPERSAND), (2) holds if and only if the AMPERSAND is NOT followed
by characters and then a SEMICOLON such that the entire sequence is the same
as one of the defined replacements.  In that case the plain UTF-8 form is
not available and the AMPERSAND has to be written as its replacement.

This lack of canonicality in the encoding implies that a conversion from
UTF-8 (or UTF-16) to UTF-8+names is not uniquely determined: two encoders
can produce different byte sequences for the same input.
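
To make the point concrete, here is a minimal sketch of such a decoder.  The
two-entry table and the name decode_utf8_names are assumptions made up for
this illustration (the real list of names is not quoted here); the AMPERSAND
replacement is taken to be the bytes 0x26 0x26 0x3B as in case (2) above.

```python
# Hypothetical replacement table: byte pattern -> Unicode character.
# Only two entries, for illustration; not the actual list of names.
REPLACEMENTS = {
    b"&nbsp;": "\u00a0",  # NON-BREAK SPACE
    b"&&;": "&",          # AMPERSAND, bytes 0x26 0x26 0x3B as in case (2)
}


def decode_utf8_names(data: bytes) -> str:
    """Replace the hardwired patterns, then decode the rest as UTF-8."""
    out = []
    i = 0
    while i < len(data):
        for pattern, char in REPLACEMENTS.items():
            if data.startswith(pattern, i):
                # A replacement stands for exactly one code point.
                out.append(char.encode("utf-8"))
                i += len(pattern)
                break
        else:
            # No pattern matched here: pass the byte through unchanged.
            out.append(data[i:i + 1])
            i += 1
    return b"".join(out).decode("utf-8")


# Two different byte sequences, one and the same character string.
print(decode_utf8_names(b"a&nbsp;b") == decode_utf8_names(b"a\xc2\xa0b"))
```

Under these assumptions a bare 0x26 in "AT&T" decodes to itself, while the
literal text "&nbsp;" can only be written with the AMPERSAND replacement in
front of "nbsp;".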

Also, I wonder about current XML tools.  If a program uses an internal
representation of Unicode characters, how should it generate a UTF-8+names
encoding?  Unlike the characters that make up entity references and numeric
character references (which are individual Unicode characters), the *bytes*
that make up the replacement names of UTF-8+names are not individual Unicode
characters and so don't have a representation as such.

If you view an XML document as a string of Unicode characters, the entity
references and numeric character references are there, but the UTF-8+names
replacements are not there (they are resolved on decoding and are generated
on encoding).
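
A rough sketch of the encoding side, again under assumptions: the table, the
function encode_utf8_names and its use_names switch are hypothetical, and
"&chap1;" stands for some ordinary entity reference in the document.  It
shows that the entity reference is part of the character string, while the
choice of a replacement exists only in the bytes.

```python
# Hypothetical name table for the encoder: character -> replacement bytes.
NAMES = {"\u00a0": b"&nbsp;"}
AMP_REPLACEMENT = b"&&;"  # 0x26 0x26 0x3B, as in case (2) above


def encode_utf8_names(text: str, use_names: bool) -> bytes:
    """Encode text; use_names is an encoder policy, not part of the text."""
    out = bytearray()
    for ch in text:
        if ch == "&":
            # Always use the replacement here: the simplest safe policy.
            out += AMP_REPLACEMENT
        elif use_names and ch in NAMES:
            out += NAMES[ch]
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# The entity reference "&chap1;" is seven Unicode characters in the string.
doc = "x\u00a0&chap1;"
print(encode_utf8_names(doc, use_names=True))
print(encode_utf8_names(doc, use_names=False))
```

Both outputs stand for the same string of Unicode characters; nothing in
that string tells the program which of the two it should write.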

I think the introduction of UTF-8+names would be a *big* change indeed, with
a serious impact on existing XML tools and (some) applications.


> > I wrote a piece on XML as a disruptive technology a few years ago [1],
> > but I can't say I expected XML to drill into the Unicode layer and
> > modify the very notion of a character encoding.
> UTF-8+names doesn't depend on XML, I can think of other applications for
> it.  Anyhow Unicode character encodings in widespread use have been
> cooked up by ANSI, ISO, JIS, and even Bell Labs (that's where UTF-8 came
> from).  The notion of inventing a new encoding to better serve
> application needs is hardly radical.  The bar to entry is that you have
> to have a clear and transparent mapping to Unicode code points, which
> UTF-8+names does.
> -- 
> Cheers, Tim Bray (http://www.tbray.org/ongoing/)