From: "David Carlisle" <firstname.lastname@example.org>
> One of the differences between mathml and docbook is that I treated all
> entities as in 1.4 eg defining jnodot to be "j" whereas Norm couldn't
> bring himself to abuse quite as many names... the other difference is
> that MathML uses more of Unicode 3, long arrows, VS1 etc.
This issue is one that won't be solved by appeals to authority (even though
having some kind of joint ISO/OASIS/W3C effort would certainly have
some kind of standing). Nor can it be addressed solely by technical
considerations: many entities have plausible alternative mappings.
For example, the hyphen character. If the aim of a mapping was to
be universally printable, we would choose the simple "-" character
to map to. But if our aim was to look different from "-" (otherwise
why would people use it?) on many web systems, we could map it to
an ndash because that is widely available in many fonts. And if
we wanted it for its typographical properties, we would map it to
the actual unambiguous hyphen character that Unicode provides.
But some Japanese company might have mapped it to the
fullwidth hyphen-minus. And do you map to Unicode 2,
3, 3.2 or the non-surrogate 3.2? Do you allow composed
characters (i.e. to get negated characters)? Furthermore,
if you assume that the MathML, AMS and extended
ISO entities are all in the set, you might choose a more
specific character knowing the entity would be used
We have to live with people providing their own mappings
for the standard entities, according to what their specific
systems can cope with. But for most people, an agreed basic and
unified mapping would be helpful.
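To make the hyphen example above concrete, here is a minimal Python
sketch that lists the four candidate targets mentioned and prints the
Unicode name of each. The dictionary keys and the describe() helper are
my own labels for illustration; they do not come from any published
entity set.

```python
import unicodedata

# Candidate targets for the "hyphen" entity, one per goal discussed above.
# The goal labels are illustrative, not part of any standard.
HYPHEN_CANDIDATES = {
    "universally printable": "\u002D",
    "visibly distinct, widely available": "\u2013",
    "typographically exact": "\u2010",
    "fullwidth (East Asian)": "\uFF0D",
}


def describe(candidates):
    """Return {goal: 'U+XXXX NAME'} for each candidate character."""
    return {
        goal: f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for goal, ch in candidates.items()
    }


for goal, desc in describe(HYPHEN_CANDIDATES).items():
    print(f"{goal}: {desc}")
```

All four are defensible targets for the same entity name, which is the
point: the choice depends on the goal, not on a technical fact.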
That is why I think the first step is to survey the issues and
see which sets (if any) are pretty well universally agreed
on in their mappings. I think the following sets can be
pinned down without controversy:
1) The MathML sets are under the control of W3C.
2) The ISO sets corresponding to existing character sets
and I am not aware of controversy on the mappings
The following sets have variations that I think are resolvable,
perhaps by coin toss or by deciding on a set of rules:
ISOTech (note many entities added in the ISO TR version)
ISO American Mathematical Society entities
The following sets have a couple of characters for which there
may be unresolvable disagreement.
Where there is unresolvable disagreement, the answer is probably
to deprecate the ambiguous characters and clearly establish new
I mentioned that establishing rules can help: what might these
be? I suggest that a standard mapping for XML should be
concerned with availability: so where there are multiple candidate
mappings for non-AMS/MathML entity sets, we should choose the one that:
1) Is commonly implemented in fonts for the major platforms
at the current time: e.g. the Symbol font, Zapf Dingbats,
Arial MS, Verdana, the Helvetica on the Mac, and so on.
2) Is in the same Unicode block as other entities in the same
set: e.g. use the Western characters for lang and rang not
3) Uses the most specific Unicode character from, say,
Unicode 3.0 vintage: i.e. no surrogates
4) Should be consistent: e.g. DOCBOOK has &heart;
white but the other houses black.
5) Should follow ISO explanatory text: e.g. DOCBOOK
has ☆ filled but the ISO text says it is an open star
and provides ★ for the filled.
6) Should follow the XML 1.1/Charmod rules for entity
normalization: e.g. all the accent marks must be spacing.
(This has an impact on the question that Mark Doyle asked!)
7) Provides mappings for all characters: e.g. fj has the
mapping "fj" and not the not-found character.
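Some of the rules above are mechanical enough to check in code. The
following is only a rough Python sketch of rules 3, 6 and 7, under my
own simplifying assumptions: "no surrogates" is read as "every code
point is U+FFFF or below", and "spacing" is read as "the mapping does
not begin with a character in a Unicode Mark category". The function
names and the candidate lists in the usage lines are hypothetical.

```python
import unicodedata


def is_bmp(text):
    """Rule 3 (simplified): no code point needs a surrogate pair."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def is_spacing(text):
    """Rule 6 (simplified): mapping is non-empty and does not start with a mark."""
    return bool(text) and not unicodedata.category(text[0]).startswith("M")


def choose_mapping(candidates, fallback):
    """Return the first candidate passing both checks, else the fallback (rule 7)."""
    for text in candidates:
        if is_bmp(text) and is_spacing(text):
            return text
    return fallback


# Hypothetical candidate lists, in order of preference:
acute = choose_mapping(["\u0301", "\u00B4"], "'")  # combining acute is rejected
fjlig = choose_mapping([], "fj")                   # no candidate, so spell it out
```

Rules 1, 2, 4 and 5 need outside data (font coverage, block tables, the
ISO descriptive text), so they are not attempted here.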
Unlike Ann, I don't think the W3C is the best group to figure out
a good mapping. This is because they are just one of the stakeholders,
and they must have an interest in just rubberstamping whatever
HTML has done.
> I find it frankly astonishing that Unicode 3 didn't take
> as a _requirement_ that it support all the characters that had ISO
> entity definitions.
On the other hand, part of the point of the ISO entities is that they
are intentionally flexible, so that people can choose the best possible
mappings from characters that actually have counterparts in fonts.
It might have seemed perverse to them to add characters intended
to be ambiguous. :-)