xml-dev - Re: [xml-dev] Character Entities: An XML Core WG View

To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Character Entities: An XML Core WG View
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Tue, 5 Nov 2002 20:01:27 +1100
References: <200211041622.LAA04660@mail2.reutershealth.com>

From: "John Cowan" <jcowan@reutershealth.com>

> But character entity names are all drawn from the same Unicode character space.  
> I have yet to see a principled defense for supporting inconsistency here.

I am not sure I get what you are saying, but the ISO, AMS, HTML and seemingly the
MathML entities were developed separately and sometimes prior to Unicode.

Unicode Consortium has used some rules for unifying characters, but sometimes
the entities serve the purpose of getting specialized variants: for example, the
ISOGrk4 entities have no equivalents in Unicode (still, I believe).  They are
supposed to be bold versions of the Greek characters for maths use.  So a real 
mapping of these characters should include some PI or html:font (no flames)
element to select the style (or something to map them to the PUA).     

This is a class of publishing characters that the Unicode Consortium says
are variants, to be handled by a higher layer.  Entities allow these characters
to be named, and specific-purpose mappings made.  And that is where it
falls apart: you cannot provide these kinds of mappings without resorting to
elements or PIs.  PIs being out of fashion, it means that these publishing characters
can only be used in concert with a specific element vocabulary: MathML is
probably the exemplar here.    

Another example of this problem is where Unicode has unified a character, but
there have been regional variants: then it becomes essential to have an indication
of the locale (xml:lang).

But every avenue to cope with this is being blocked off:
  * The W3C I18n WG deprecates PUA characters, or ways to make use of them in public
  * Various W3C groups deprecate the use of PIs, for example to allow CSS properties
    embedded in text for font selection
  * W3C Schemas split out schemas to be post-parse and have not provided, say, an
    annotation for allowing entity definitions to be bundled with schemas
  * The Unicode Consortium has helped/hindered things by providing a variant
    selector character, but this is as yet disconnected from any standard to make
    use of it, and such a standard (in the markup world) would ultimately
    map to a PUA, an element, or a PI anyway.
  * Almost no non-core W3C XML specification treats attribute inheritance
    seriously: there is no way to say for an element type "my ancestor's attribute
    xxx is in scope for me; if I am cut out from my context I need to take
    that attribute with me". (A defaulting type similar to SGML's #CURRENT but only
    working on ancestors would be a big step forward.)
  * The infoset simplification that treats entities as macros, to be forgotten after
    parsing.  Imagine, for example, how much simpler life would be if there were
    an XSLT mode that silently shoved through undeclared entity references as part
    of text.   This is certainly no criticism of the XML Infoset spec.
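
To make the attribute-inheritance point concrete, here is a minimal Python sketch
of the "take my ancestor's attribute with me" behaviour, using xml:lang as the
inherited attribute.  The function name extract_with_inherited and the sample
document are made up for illustration; no W3C specification defines this
behaviour, which is exactly the complaint.

```python
import copy
import xml.etree.ElementTree as ET

# ElementTree's expanded name for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_with_inherited(root, target, attr=XML_LANG):
    """Return a copy of `target` that carries the nearest in-scope value of `attr`.

    Hypothetical helper: it emulates an ancestor-only #CURRENT-style default
    by walking up from `target` until some element declares the attribute.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    node, value = target, None
    while node is not None:
        value = node.get(attr)
        if value is not None:
            break
        node = parent_of.get(node)

    fragment = copy.deepcopy(target)
    if value is not None:
        fragment.set(attr, value)
    return fragment


doc = ET.fromstring(
    '<book xml:lang="ja"><chapter><para>text</para></chapter></book>'
)
para = doc.find("./chapter/para")
fragment = extract_with_inherited(doc, para)
```

The original para element is left untouched; only the cut-out copy gains the
attribute, so the fragment still knows its locale once it is out of context.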

(I believe this shows a systematic problem in Unicode. Of course, we have to play with
the cards we have been dealt, but another approach would have been that used
by the CCCII format:  every character is made from a base and a variant selector,
potentially allowing systems to fall back to a close glyph if the desired glyph is
not available, and avoiding the need for higher-level protocols to supply variant
information. But recognizing that an approach has certain problems is not to say
Unicode is not correct for XML: XML has to work with fairly unified characters,
while publishers and technical people have to work with very specific characters,
and we need the glue.)

So it is quite possible that two entities could be mapped to the same Unicode
string (e.g. "-"), or that an entity could be mapped by different people to different
strings (e.g. should &heart; be the filled or unfilled character?), or that there
is no corresponding Unicode string for an entity (e.g. &fjlig;, which has the trivial
fallback "fj"), or that a mapping for an entity requires some kind of variant
selector or markup, and therefore a higher-level protocol.
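
These four cases can be laid out as data.  The following Python sketch is only
an illustration: the two "publisher" tables, the Mapping record and the helper
functions are hypothetical, and the particular code points chosen for each
entity are assumptions, not an agreed mapping.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Mapping:
    text: Optional[str]             # Unicode string, or None if there is none
    fallback: Optional[str] = None  # plain-text approximation, if any
    needs_markup: bool = False      # True if a higher-level protocol is needed


# Hypothetical mapping tables maintained by two different parties.
PUBLISHER_A: Dict[str, Mapping] = {
    "hyphen": Mapping("-"),
    "dash": Mapping("-"),                              # same string as &hyphen;
    "heart": Mapping("\u2665"),                        # filled heart
    "fjlig": Mapping(None, fallback="fj"),             # no Unicode equivalent
    "b.alpha": Mapping("\u03b1", needs_markup=True),   # bold needs styling
}
PUBLISHER_B: Dict[str, Mapping] = {
    "heart": Mapping("\u2661"),                        # unfilled heart
}


def collisions(table: Dict[str, Mapping]) -> Dict[str, List[str]]:
    """Group entity names that map to the same Unicode string."""
    by_text: Dict[str, List[str]] = {}
    for name, mapping in table.items():
        if mapping.text is not None:
            by_text.setdefault(mapping.text, []).append(name)
    return {t: sorted(ns) for t, ns in by_text.items() if len(ns) > 1}


def conflicts(a: Dict[str, Mapping], b: Dict[str, Mapping]) -> List[str]:
    """List entity names that the two tables map to different strings."""
    return sorted(n for n in a.keys() & b.keys() if a[n].text != b[n].text)


same_string = collisions(PUBLISHER_A)
disputed = conflicts(PUBLISHER_A, PUBLISHER_B)
```

Once an entity reference has been replaced by its string, none of the
distinctions recorded in these tables can be recovered from the text alone.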

The standard entity sets were designed to allow workarounds to system-specific
issues. "The system" in XML's case is Unicode.   The "XML as atomic strings
in trees" view, which is particularly associated with database people, assumes
that entity = Unicode string, and so keeps solutions from being
developed.

The thing is that this can almost all be coped with by
  * catalogs, which let a terminal system provide the mappings it understands
    for the standard entity sets, including PIs
  * transformation systems that allow standard entity references to emerge
    after the transformation as entity references again, transparently
    to the user (i.e. as part of data)
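
As a rough sketch of both ideas together, the Python below resolves entity
references from a per-system catalog and lets any reference the catalog does
not know pass through unchanged.  It works on plain strings with a regular
expression, not on a real XML parser or an actual catalog format, and the names
resolve, SCREEN_CATALOG and PRINT_CATALOG (and the "font" PI) are invented for
the example.

```python
import re

# Matches a general entity reference such as &heart; or &b.alpha;
# (character references like &#x3b1; are deliberately not matched).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")

# XML's predefined entities are always left alone.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Hypothetical catalogs: each terminal system supplies what it understands.
SCREEN_CATALOG = {"fjlig": "fj"}
PRINT_CATALOG = {
    "fjlig": "fj",
    "b.alpha": "<?font bold?>\u03b1<?font normal?>",  # a mapping that uses PIs
}


def resolve(text: str, catalog: dict) -> str:
    """Replace references found in `catalog`; pass all others through as-is."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return catalog.get(name, match.group(0))

    return ENTITY_REF.sub(replace, text)


source = "&b.alpha; &fjlig; &heart; &amp;"
on_screen = resolve(source, SCREEN_CATALOG)
in_print = resolve(source, PRINT_CATALOG)
```

Here the screen system leaves &b.alpha; and &heart; as references for some
later stage to deal with, while the print system maps &b.alpha; to a PI-wrapped
character.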

These were fairly commonplace things for SGML systems, and XML needs
to catch up in order to support MathML and publishing.  

We hear very often that XML needs adjustment to cope with the needs of
data exchange, but it also needs adjustment to cope with quality 
document production issues. (Actually, not so much XML itself as
infoset-manipulating systems such as XSLT.)

Cheers
Rick Jelliffe

Follow-Ups:
- Re: [xml-dev] Character Entities: An XML Core WG View
  - From: Elliotte Rusty Harold <elharo@metalab.unc.edu>

References:
- Re: [xml-dev] Character Entities: An XML Core WG View
  - From: John Cowan <jcowan@reutershealth.com>
