xml-dev - Re: [xml-dev] Character Entities: An XML Core WG View


To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Character Entities: An XML Core WG View
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Sat, 2 Nov 2002 02:30:04 +1100
References: <200211011407.JAA09889@mail2.reutershealth.com> <200211011418.OAA22650@penguin.nag.co.uk>

From: "David Carlisle" <davidc@nag.co.uk>

> One of the differences between mathml and docbook is that I treated all
> entites as in 1.4 eg defining jnodot to be "j" whereas Norm couldn't
> bring himself to abuse quite as many names... the other difference is
> that MathML uses more of Unicode 3, long arrows, VS1 etc.

This issue is one that won't be solved by appeals to authority (even though
some kind of joint ISO/OASIS/W3C effort would certainly have
some kind of standing).  Nor can it be addressed solely by technical
considerations: many entities have plausible alternative mappings.

For example, take the hyphen character. If the aim of a mapping was to
be universally printable, we would choose the simple "-" character
to map to.  But if our aim was to look different from "-" (otherwise
why would people use it?) on many web systems, we could map it to
an en dash because that is widely available in many fonts. And if
we wanted it for its typographical properties, we would map it to
the actual unambiguous hyphen character that Unicode provides.
But some Japanese company might have mapped it to the
fullwidth hyphen-minus.   And do you map to Unicode 2,
3, 3.2 or the non-surrogate 3.2?   Do you allow composed
characters (i.e. to get negated characters)?   Furthermore,
if you assume that the MathML, AMS and extended
ISO entities are all in the set, you might choose a more
specific character knowing the entity would be used
unequivocally.
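
To make the alternatives concrete, here is a minimal Python sketch
that lists the candidate code points mentioned above for a hyphen
entity. The dictionary name and the goal labels are only my
illustration and are not taken from any published entity set; the
code points themselves are standard Unicode.

```python
import unicodedata

# Candidate targets for a "hyphen" entity, keyed by the goal each one serves.
# The goal labels are illustrative; only the code points are standard.
HYPHEN_CANDIDATES = {
    "universally printable": "\u002D",
    "visibly distinct, widely available in fonts": "\u2013",
    "typographically exact": "\u2010",
    "East Asian fullwidth": "\uFF0D",
}


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for goal, ch in HYPHEN_CANDIDATES.items():
    print(f"{goal}: {describe(ch)}")
```

Four defensible targets for one entity name is exactly the kind of
ambiguity a survey would need to record.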

We have to live with people providing their own mappings
for the standard entities, according to what their specific
systems can cope with.  But for most people, an agreed basic and
unified mapping would be helpful.

That is why I think the first step is to survey the issues and
see which sets (if any) are pretty well universally agreed
on in their mappings.  I think the following sets can be
pinned down without controversy:

1) The MathML sets, which are under the control of W3C.
2) The ISO sets corresponding to existing character sets,
  for which I am not aware of controversy on the mappings:
    ISOLat1
    ISOLat2
    ISOGrk1
    ISOGrk2
    ISOCyr1
    ISOCyr2
    ISOBox

The following sets have variations that I think are resolvable,
perhaps by coin toss or by deciding on a set of rules:
    ISONum
    ISOPub
    ISOTech (note many entities added in the ISO TR version)
    ISOGrk4
    ISO American Mathematical Society entities

The following sets have a couple of characters for which there
may be unresolvable disagreement:
    ISOgrk3
    ISOdia
  
Where there is unresolvable disagreement, the answer is probably
to deprecate the ambiguous characters and clearly establish new
ones, IMHO. 
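
The survey step could be sketched roughly as follows: collect the
mapping each source gives for every entity name, then separate the
names on which all sources agree from those in dispute. The function
name `survey`, the source labels and the sample data are all
hypothetical and only illustrate the idea; real input would come from
the published entity files.

```python
def survey(sources: dict[str, dict[str, str]]):
    """Split entity names into agreed mappings and disputed ones."""
    agreed: dict[str, str] = {}
    disputed: dict[str, dict[str, str]] = {}
    names = set().union(*(m.keys() for m in sources.values()))
    for name in sorted(names):
        # Record what each source says; a source may not define the entity.
        seen = {src: m[name] for src, m in sources.items() if name in m}
        if len(set(seen.values())) == 1:
            # All sources that define the entity give the same mapping.
            agreed[name] = next(iter(seen.values()))
        else:
            disputed[name] = seen
    return agreed, disputed


# Hypothetical sample data, not copied from any real entity file.
SOURCES = {
    "set_a": {"eacute": "\u00E9", "heart": "\u2661", "star": "\u2605"},
    "set_b": {"eacute": "\u00E9", "heart": "\u2665", "star": "\u2606"},
}

agreed, disputed = survey(SOURCES)
print("agreed:", sorted(agreed))
print("disputed:", sorted(disputed))
```

Names that land in the disputed table are the candidates for a rule,
a coin toss, or deprecation.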

I mentioned that establishing rules can help: what might these
be?   I suggest that a standard mapping for XML should be
concerned with availability: so where there are multiple candidate
mappings for non-AMS/MathML entity sets, we should choose the 
one which

    1) Is commonly implemented in fonts for the major platforms
    at the current time: e.g. the Symbol font, Zapf Dingbats,
    Arial MS, Verdana, the Helvetica on the Mac, and so on.

    2) Is in the same Unicode block as other entities in the same
    set: e.g. use the Western characters for lang and rang, not
    the Eastern.

    3) Uses the most specific Unicode character from, say,
    Unicode 3.0 vintage: i.e. no surrogates.

    4) Is consistent: e.g. DocBook has &heart;
    white but the other suits black.

    5) Follows the ISO explanatory text: e.g. DocBook
    has &star; filled, but the ISO text says it is an open star
    and provides &starf; for the filled one.

    6) Follows the XML 1.1/Charmod rules for entity
    normalization: e.g. all the accent marks must be spacing.
    (This has an impact on the question that Mark Doyle asked!)

    7) Provides mappings for all characters: e.g. &fjlig; has the
    mapping "fj" and not the not-found character.
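
Some of these rules are mechanically checkable. The following Python
sketch tests a single mapping against rules 3, 6 and 7 only; the other
rules need font data or human judgment. The function name
`check_mapping` is hypothetical, and I am assuming that "the not-found
character" can be modelled as U+FFFD REPLACEMENT CHARACTER and that
"spacing" means "not in a Unicode mark category".

```python
import unicodedata


def check_mapping(name: str, value: str) -> list[str]:
    """Return the rule violations for one entity mapping (empty if none)."""
    problems = []
    # Rule 7: every entity needs a usable mapping.
    # Assumption: U+FFFD stands in for "the not-found character".
    if not value or "\uFFFD" in value:
        problems.append(f"{name}: rule 7, no usable mapping")
    for ch in value:
        # Rule 3: stay inside the BMP so no surrogate pairs are needed.
        if ord(ch) > 0xFFFF:
            problems.append(f"{name}: rule 3, U+{ord(ch):04X} is outside the BMP")
        # Rule 6: accents must be spacing, so reject mark categories Mn/Mc/Me.
        if unicodedata.category(ch).startswith("M"):
            problems.append(f"{name}: rule 6, U+{ord(ch):04X} is a combining mark")
    return problems


print(check_mapping("fjlig", "fj"))       # two plain letters
print(check_mapping("acute", "\u0301"))   # combining acute accent
print(check_mapping("acute", "\u00B4"))   # spacing acute accent
```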

Unlike Ann, I don't think the W3C is the best group to figure out
a good mapping. This is because they are just one of the stakeholders,
and they must have an interest in just rubberstamping whatever
HTML has done. 

> I find it frankly astonishing that Unicode 3 didn't take
> as a _requirement_  that it support all the characters that had ISO
> entity definitions.

On the other hand, part of the point of the ISO entities is that they
are intentionally flexible, so that people can choose the best possible
mappings from characters that actually have counterparts in fonts.
It might have seemed perverse to them to add characters intended
to be ambiguous. :-)



Cheers
Rick Jelliffe
