xml-dev - Re: [xml-dev] UTF-8+names


To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] UTF-8+names
From: James Clark <jjc@jclark.com>
Date: 20 Oct 2003 10:26:38 +0700
In-reply-to: <3F92E11F.4050409@textuality.com>
References: <000c01c3966d$7fbb9f70$42a7c044@aldebaran> <3F92E11F.4050409@textuality.com>

On Mon, 2003-10-20 at 02:08, Tim Bray wrote:

> For most 
> encodings of Unicode I know of, if you're editing a text file, any 
> characters that can be displayed are displayed as themselves, not as the 
> underlying UTF-8 bit patterns or whatever.  Characters that *can't* be 
> displayed show up as diamonds or squiggles or boxes.  +names is 
> different in that sometimes a human might want to work with the encoding 
> not the actual Unicode characters

But with +names you don't want to work at the encoding level.  For
example, if you have a ü in your text file, that will be two bytes in
UTF-8+names, but you would want to work with it as a single character.
To edit a UTF-8+names text file, you need to make your text editor treat
it as if it were encoded in UTF-8. In other words, to make things work
you have to edit it in the wrong encoding.  This will be extremely
confusing to users.
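
As a small illustration of the character-versus-byte distinction in the paragraph above, here is a minimal Python sketch. It uses plain UTF-8, which (per the text above) is how a ü would also be stored in UTF-8+names; the variable names are only for illustration.

```python
# U+00FC (ü) is one character to the user, but two bytes once encoded as UTF-8.
text = "\u00fc"
encoded = text.encode("utf-8")

print(len(text))      # 1
print(encoded.hex())  # c3bc
print(len(encoded))   # 2
```

An editor working at the byte level would show two units here; one that decodes the file as UTF-8 shows the single character the user expects.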

I would want to encourage text editors to acquire some XML smarts, and,
in particular, correct handling of the XML encoding declaration.  I want
a text editor to understand the XML encoding declaration and, by
default, use the declared encoding for editing the file.  UTF-8+names is
not going to play well with this.
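
A rough sketch of the kind of "XML smarts" meant here: read the encoding name out of the XML declaration at the start of the file and fall back to a default when there is none. The function name `declared_encoding` and the regular expression are assumptions made for this example, not part of any editor or proposal. It also assumes an ASCII-compatible encoding with no byte order mark, which a real implementation could not rely on.

```python
import re

# Matches encoding="..." (or '...') inside an XML declaration at the very
# start of the data. Simplification: assumes the declaration itself is
# readable as ASCII and that no byte order mark precedes it.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)

def declared_encoding(head: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or `default`."""
    match = _DECL.match(head)
    return match.group(1).decode("ascii") if match else default

sample = b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'
print(declared_encoding(sample))     # ISO-8859-1
print(declared_encoding(b"<doc/>"))  # utf-8
```

An editor that opens the file using whatever this returns is doing what the declaration says, which is exactly the behavior that becomes a problem if the declared encoding is not the one the user should actually be editing in.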

Overall, this doesn't seem like a step in the right direction to me.

> Bear in mind that the initial problem was the ongoing clamor from 
> communities of people who really want to use the ISO entity sets but 
> don't want to use DTDs.  So far, the standards community has failed to 
> come up with an option that is attractive to them.  +names is just a 
> trial balloon.

Shooting down other people's trial balloons is a lot easier than coming
up with constructive proposals.  Since I'm not on the W3C's XML Plenary
list these days, I'll give my two satangs' worth here.

I think there are at least two different communities (roughly
corresponding to the two different entity sets the UTF-8+names proposal
references) and I think different solutions are appropriate for each
community.

1. General publishing. This community wants the HTML entity sets.  I
think the problem here is a software/education problem which is
decreasing all the time.  Almost all modern systems have fonts that can
display almost all the characters in these entity sets. The desktop
environments that I'm familiar with all offer a character map applet
which is sufficient (albeit not very efficient) for entry of characters
for which you have fonts. The quality of Unicode support offered by standard
text editors is improving all the time.

CJK users have long dealt with the problem of how to enter characters
for which their keyboard has no key. CJK software typically provides
"input methods" to allow efficient, user-friendly entry of such
characters. This sort of technology should be applied for entering
Unicode characters.  Input methods can easily leverage the standard
Unicode names, rather than having to invent and maintain a competing set
of shorter names.
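
To show that the standard Unicode names are already usable as a lookup key, here is a minimal sketch using Python's `unicodedata` module. The helper name `char_from_name` is an assumption for the example; a real input method would add search, completion and a UI on top of something like this.

```python
import unicodedata

def char_from_name(name: str) -> str:
    """Return the character with the given standard Unicode name.

    Raises KeyError if the name is unknown.
    """
    return unicodedata.lookup(name)

print(char_from_name("LATIN SMALL LETTER U WITH DIAERESIS"))  # ü
print(unicodedata.name("\u00fc"))  # LATIN SMALL LETTER U WITH DIAERESIS
```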

I think this community will discover that character entity names are
another bit of SGML-legacy cruft that, with just a little bit of effort,
it is possible to manage without.

2. Math.  I think math users have special requirements.  There's a long
tradition in the math community of using short mnemonic names for
representing math symbols.  There's been a lot of effort in
standardizing a set of names for all the needed symbols.  Some of the
needed entities do not correspond to a single character, but to a
character plus a variant selector, or to a character plus a negating
slash.  Standard text editors/desktop environments are unlikely to
provide the facilities needed for displaying and inputting all the
needed characters in the foreseeable future.  So I don't think telling
the math folks to throw away their list of character names is a
reasonable solution.  Rather I think the right solution is not to use
entities but instead to use an element with an attribute value
specifying the name of the character (IIRC, MathML doesn't need to put
math symbols inside XML attribute values).
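
A minimal sketch of how a consumer might handle the element-with-attribute approach. Everything specific here is hypothetical: the element name `ch`, the attribute `name`, the wrapper element `expr`, and the three-entry table, whose values are illustrative and should be checked against the published entity definitions. The point is only that a name can map to more than one code point.

```python
import xml.etree.ElementTree as ET

# Illustrative table only; a real one would be generated from the
# published character name lists. Values may be several code points.
NAMED_CHARS = {
    "alpha": "\u03b1",            # a single character
    "lvertneqq": "\u2268\ufe00",  # character plus a variation selector
    "nsubE": "\u2ac5\u0338",      # character plus a combining negating slash
}

def expand(elem: ET.Element) -> str:
    """Return the text of `elem` with hypothetical <ch name="..."/> expanded."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "ch":
            parts.append(NAMED_CHARS[child.get("name")])
        else:
            parts.append(expand(child))
        parts.append(child.tail or "")
    return "".join(parts)

doc = ET.fromstring('<expr>x <ch name="nsubE"/> y</expr>')
print(expand(doc))
```

Because the name travels in an attribute of an ordinary element, no DTD or entity declaration is needed. The trade-off is that such an element cannot appear inside an attribute value, which is why the parenthetical above matters.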

James

Follow-Ups:
- RE: [xml-dev] UTF-8+names
  - From: "Alessandro Triglia" <sandro@mclink.it>
- Re: [xml-dev] UTF-8+names
  - From: Tim Bray <tbray@textuality.com>
