xml-dev - Re: [xml-dev] Character Entities: An XML Core WG View

To: "'xml-dev'" <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Character Entities: An XML Core WG View
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Fri, 1 Nov 2002 18:38:51 +1100
References: <001301c28170$5d1a9a60$6601a8c0@blackie>

From: "Jelks Cabaniss" <jelks@jelks.nu>

> It does, but &#xnnn;'s scattered throughout a document are hard to
> proof.  That's the only reason people want names (and not as
> elements!:).

I think it is more than that.  

Very few fonts have all the characters of Unicode.  Probably no font
has all the characters of Unicode 3.2!  Many systems don't have
synthetic fonts to allow fallback, and don't check that a glyph is actually
present.  Having an entity mechanism allows system-local mappings
that select a font as well as a character.

We need to be very careful when talking about Unicode that we
don't expect that it solves any problems w.r.t. making display
systems have all glyphs available. People whose task is to move
data from A to B can treat XML's Unicode support as relieving
them of lots of difficult problems.  But people whose responsibility it is
to make sure that all the characters that they send appear in 
a final rendered form have to get down and dirty with partial
fonts.

Furthermore, there is no agreement over the best characters to
use for each entity.  Indeed, in a few cases there is positive disagreement
and certain entities changed their characteristic glyph between the
8879 sets, the HTML sets and the newer ISO sets. 

This all springs from SGML's emphasis, which was not on guaranteed
interoperability but on rigorous description: adequate details
of the conventions used, to allow a recipient (a person) to know
what they would need to map on their own system.

> > Once again, sigh.  I haven't seen a better idea, but one would be
> > welcome.

The only approach that I have seen that makes sense is to build
a fixed standard set of characters into XML, with known mappings.
Then, for some open-source mapping libraries to be made, so that
developers can trivially add the mapping to their weeny parsers.

Or, so that we can build certain entities into parsers.  Or so that
vendors of typesetting systems can ensure that the characters that
are standardly mapped to are supported in all fonts (by fallback if
needed).
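
A minimal sketch, in Python, of what such a mapping library might look
like.  The table contents, the name STANDARD_ENTITIES and the function
expand_named_refs are assumptions made for illustration; they are not an
agreed standard set or an existing API.  Only a handful of well-known
names are included, and XML's five predefined entities are left for the
parser itself to handle.

```python
import re

# Illustrative fixed table: entity name -> Unicode string.
# A real standard set would be far larger; these few entries are
# placeholders chosen only to show the shape of the mapping.
STANDARD_ENTITIES = {
    "nbsp": "\u00A0",
    "copy": "\u00A9",
    "alpha": "\u03B1",
    "mdash": "\u2014",
}

# XML's predefined entities are left untouched for the parser.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches named references only; numeric references such as &#x3B1;
# start with '#' and are therefore not matched.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def expand_named_refs(text, table=STANDARD_ENTITIES):
    """Replace each known named reference with its mapped string."""

    def replace(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        if name not in table:
            raise KeyError(f"no standard mapping for &{name};")
        return table[name]

    return _NAMED_REF.sub(replace, text)
```

A parser front end could run its input through a function like this
before parsing.  Raising an error on an unknown name, rather than
passing it through silently, is one possible design choice; it keeps
unmapped entities from disappearing unnoticed.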

Now to do this requires an agreement on what the best mappings for
entities to Unicode strings are.  I have been involved in a project
to do just this for the last few months, with the intent of taking it
to ISO: the task mainly involves cross-checking DOCBOOK's mappings with
W3C MathML's mappings, and then going through issues from other sources.
XML-DEV-ers may be interested in the status of this.

There were various mappings of the ISO entities from different sources
before XML. Notable among those were those of the Maler and El Andaloussi
book, from vendors, from HTML, and from TEI.  The Unicode Consortium 
had a checklist of the SGML entities too.  When XML came out, I made
a mapping to Unicode, and John Cowan wrote up the Unicode mapping too,
but these were in terms of Unicode 2.0.  The TEI sets were revised, as
were the HTML sets, including for ISO HTML, I believe. 

But the two main modern mapping efforts have been the DOCBOOK
mappings at OASIS, associated largely with Norm Walsh, and the
MathML mappings at W3C, associated largely with David Carlisle,
who has been particularly helpful. I have been going through these,
and the other mappings, to see how much agreement there is, and what 
ways forward there might be.
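
The kind of cross-check described above could be sketched in Python as
follows.  This is only an illustration: SET_A and SET_B are made-up
tables, not the actual DOCBOOK or MathML mappings, and the entity names
"xdemo", "onlya" and "onlyb" are invented purely to show a disagreement
and two one-sided entries.

```python
def cross_check(first, second):
    """Compare two entity-name -> Unicode-string tables."""
    shared = first.keys() & second.keys()
    return {
        # Names both tables define with the same mapping.
        "agree": sorted(n for n in shared if first[n] == second[n]),
        # Names both tables define, but with different mappings.
        "differ": sorted(n for n in shared if first[n] != second[n]),
        # Names defined in only one of the two tables.
        "only_first": sorted(first.keys() - second.keys()),
        "only_second": sorted(second.keys() - first.keys()),
    }


# Hypothetical sample tables (NOT real published entity sets).
SET_A = {
    "alpha": "\u03B1",
    "copy": "\u00A9",
    "xdemo": "\u2218",   # invented name, mapped differently below
    "onlya": "\u2020",   # invented name, present only here
}
SET_B = {
    "alpha": "\u03B1",
    "copy": "\u00A9",
    "xdemo": "\u25E6",
    "onlyb": "\u2021",   # invented name, present only here
}

report = cross_check(SET_A, SET_B)
```

The "differ" list is the interesting part of such a report: those are
the entities where communities would have to agree on one mapping or
have the divergence documented.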

Anyway, my point is that ditching the entity declaration method is a separate
question from ditching standard entity references.  A future version of XML could
keep named character references with defined mappings, while ditching
user-definable entities.  Getting a standard mapping, or pointing out
which entities have different usages in different communities, seems to
me the first step that would be needed in any direction.

(As to whether ditching entities is desirable, let's not fool ourselves that
the pros and cons of standard entities for characters, for internal entities,
for external parsed entities and for external unparsed entities are all the same.
They would have to be replaced by four different technologies, making
XML much more complicated. Consider that parameter entities in
WXS needed to be replaced/reconstituted by about 6 different mechanisms:
redefine, import, include, substitution groups, the tag/type distinction,
attribute groups, without even attempting INCLUDE parameters or
variable schemas.  We shouldn't expect to get rid of a generic mechanism
without being saddled with a handful of technologies to take its place.
Mind you, we are almost there with XLink/XBase/XInclude "reconstructing"
many entity functions, though unusable outside specific processing models.)


Cheers
Rick Jelliffe
