Re: [xml-dev] UTF-8+names

To: "Tim Bray" <tbray@textuality.com>,"Miles Sabin" <miles@milessabin.com>
Subject: Re: [xml-dev] UTF-8+names
From: "Bob Foster" <bob@objfac.com>
Date: Sun, 19 Oct 2003 17:06:10 -0500
Cc: <xml-dev@lists.xml.org>
References: <20031019034635.91162.qmail@web13703.mail.yahoo.com> <3F92A169.60903@propylon.com> <20031019171750.GI20059@mercury.ccil.org> <200310192101.05193.miles@milessabin.com> <3F92F00E.50603@textuality.com>

From: "Tim Bray" <tbray@textuality.com>
> Miles Sabin wrote:
>
> >   <?xml encoding&=;"UTF-8+names"?>
> >
> > How would that get along with Appendix F style encoding detection?
>
> Since there is no replacement named '&=;', this would be passed to the
> XML processor exactly as you see it there.  The XML processor would
> (correctly) throw it on the floor, because it's not well-formed.
> Where's the problem? -Tim

1. Bad example, but the underlying issue is real. Specifically, the spec
should guarantee, and say so explicitly, that no character in UTF-8+names can
validly appear in an XML declaration and thus cannot interfere with encoding
detection. I mean an explicit guarantee, not just that the list doesn't happen
to include any of them at this stage of development.

2. You complained about not hearing from the constituencies whose problems
this proposal was intended to address. As an XML editor provider, I'm in
that group once removed. I hear their requirements in this area, which
usually surface as questions about how to use DTD entities along with their
favorite non-DTD schema language (and frustration when they find they often
can't because XML does not require parsers to support entities apart from
validation). If they could do this, I doubt you would ever hear from this
constituency again. Conversely, the main requirement of this constituency
that UTF-8+names doesn't address is user-defined entities, so if you do
UTF-8+names instead, the constituency will still be unsatisfied.

3. The proposal is certainly a pain for an editor provider. It effectively
introduces an extra stage of processing between b and c in the sequence
a) detect encoding, b) translate encoding to Unicode, c) parse XML. An editor
must make this extra stage optional. Many users will prefer to skip the name
translation, so they can see the "entities" as entities and edit them as
such.
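
To make the extra stage concrete, here is a minimal Python sketch of where it
would sit in an editor's load path. Everything in it is an assumption made for
illustration: the two-entry NAMES table stands in for the proposal's real list,
the function names are hypothetical, and the &&; escape is not handled. The
only behavior taken from the thread is that a reference with no known
replacement is passed through unchanged.

```python
import re

# Hypothetical two-entry subset; the real list is defined by the proposal.
NAMES = {"eacute": "\u00e9", "alpha": "\u03b1"}

# Assumed reference syntax for this sketch: '&', a name, ';'.
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_names(text, names=NAMES):
    """Replace known &name; references; leave everything else untouched."""
    return _REF.sub(lambda m: names.get(m.group(1), m.group(0)), text)


def load(raw, expand=True):
    """Stages a and b are collapsed here: the bytes are assumed to be UTF-8."""
    text = raw.decode("utf-8")
    if expand:  # the optional extra stage between b and c
        text = expand_names(text)
    return text  # stage c (parsing XML) would start from this string
```

With expand=False the user keeps seeing &alpha; as typed, but the text is then
no longer what an XML parser would be handed.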

But since these entities are not seen by an XML processor and can go into
XML names, the editor must either disallow this, and so fail to conform to
the UTF-8+names specification, or implement a new flavor of XML parsing that
accommodates them in XML names. A number of interesting quality issues arise
from this. How is name comparison defined? The UTF-8+names spec does not
restrict users from using a name in one instance with embedded "entities"
and in another instance without. If the two instances are to be seen as
equal, the editor must keep a separate, internal representation of names for
comparison purposes. I hope this scenario sounds familiar to other
editor-writers; it parallels what one had to go through for editors _not_
based on Unicode, a fairly giant step backward.
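
A rough Python sketch of the "separate, internal representation" for
comparison follows. It assumes the two spellings of a name are meant to
compare equal, which the proposal does not state; the one-entry NAMES table and
both function names are hypothetical.

```python
import re

# Hypothetical one-entry subset of the proposal's name list.
NAMES = {"eacute": "\u00e9"}
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def canonical_name(name, names=NAMES):
    """Internal form used only for comparison; the as-typed form is kept elsewhere."""
    return _REF.sub(lambda m: names.get(m.group(1), m.group(0)), name)


def same_name(a, b):
    """Compare two element or attribute names by their internal form."""
    return canonical_name(a) == canonical_name(b)
```

The editor then has to carry two strings per name, the displayed one and the
compared one, and keep them in sync on every edit.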

Many editors offer on-the-fly validation, but of course documents with
entity references in names are not even well-formed XML, so the document must
be translated from the editor's representation to the parser's for every
validation.

Since the pseudo-entities can also appear where DTD-defined entities can
appear, the editor must be able to tell them apart. When a user has
explicitly defined a character entity with the same name as one of the
UTF-8+names entities but a different definition, and the user hovers the
cursor over the entity, which definition should the editor display? Both?
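
The hover question can be sketched in Python as a lookup that reports every
source defining a name and leaves the display decision to the editor. The
PLUS_NAMES table, the function names, and the source labels are hypothetical;
nothing here comes from the proposal itself.

```python
# Hypothetical one-entry subset of the proposal's name list.
PLUS_NAMES = {"eacute": "\u00e9"}


def hover_definitions(name, dtd_entities, plus_names=PLUS_NAMES):
    """Return (source, replacement) pairs for every definition of the name."""
    found = []
    if name in dtd_entities:
        found.append(("DTD", dtd_entities[name]))
    if name in plus_names:
        found.append(("UTF-8+names", plus_names[name]))
    return found


def conflicting_names(dtd_entities, plus_names=PLUS_NAMES):
    """Names the user's DTD defines differently from the +names table."""
    return sorted(
        n for n, v in dtd_entities.items()
        if n in plus_names and plus_names[n] != v
    )
```

For a DTD that declares eacute as "e", hover_definitions returns both
definitions, which is exactly the case where the editor has no obvious answer.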

The UTF-8+names list of names is rather large (and probably growing, as
other constituencies weigh in), arguably much larger than most documents
will require. In an editor, when the user types an & or requests code
assist, what list of names should they be shown? The union of the DTD
entities and the +names? It is well known that very long lists do not work
well with popup lists; users have trouble navigating them without the cursor
falling off the list and can't read them to prompt themselves for plausible
entries. Should the editor therefore add three different ways for the user
to ask for entity name assistance? More options mean more confusion.

I have no doubt that some programmer can code up a bunch of Emacs macros
that address these concerns (except the ease-of-use parts). But here's a
news flash: most people don't use Emacs. The list of editors applied to XML
(not to mention the non-XML uses of UTF-8+names) is quite large and some
have quite large constituencies themselves who are highly resistant to
changing editors. Specifications like this should take into account the
amount of grief implied for editor providers and the consequent slow
introduction of satisfactory tools support.

On the other hand, if you want to make this easier for tools to support,
here are three suggestions: Use a different character than &. In your Use
With XML section specifically disparage the use of UTF-8+names in XML
element and attribute names. Drop the &&; escape; you don't need it and
editors certainly don't need any more complexity in entity expansion.

Bob Foster
