James Clark <email@example.com> writes:
> Before getting into the details of a schema for an XML syntax for
> declaring character entities, I think we should step back and ask what the
> real requirements are.
For sure. I think there are a number of obvious use cases, from which
we might derive requirements:
1) Hand-authoring an XML document and needing to include a few
well-known useful non-ASCII characters, e.g. é, •;
2) Post-processing arbitrary XML to make it encoding='ISO-646' or
similar;
3) Authoring MathML, with or without helpful UI.
4) Marshalling implementation data, e.g. from a database, whose string
fields may have arbitrary Unicode, where e.g. ISO-8859-1 is the
required encoding (similar to (2)).
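Use cases (2) and (4) above amount to a serialization-time transform; a minimal sketch in Python (the tiny NAMES table and the function name are mine for illustration, not any proposed API):

```python
# Hypothetical mini entity table; a real one would come from the
# HTML/MathML entity sets discussed below.
NAMES = {0x00E9: "eacute", 0x2022: "bull"}

def escape_for_encoding(text: str, encoding: str = "ascii") -> str:
    """Replace characters outside the target encoding with a named
    entity reference where a name is known, else a numeric character
    reference."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            cp = ord(ch)
            name = NAMES.get(cp)
            out.append(f"&{name};" if name else f"&#x{cp:X};")
    return "".join(out)

print(escape_for_encoding("caf\u00e9 \u2022 \u0e01"))
# caf&eacute; &bull; &#xE01;
```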
> - if you have user-defined character entity names, then users will
> start demanding the ability to preserve those names, which means that
> the DOM/SAX/Infoset will need to record which entity name if any was
> used for a character
As now, that demand can sensibly be met by pointing out that editors
are not vanilla applications.
> So I'm wondering whether a more constrained approach to character
> entities would work. Suppose for example there is a standard
> W3C-defined builtin entity set; this would have a version number and
> would add new characters from time to time (but never change existing
> entity names). There would be a standard mapping from a version
> number to a URI where an XML specification of the entity set would be
> available. However, parsers wouldn't have to fetch and parse this,
> they could just recognize the version number and refer to an
> appropriate compiled-in table. The XML declaration would declare the
> version number of the builtin entity set that was being used; if the
> XML declaration didn't specify a version number, only the 5 XML 1.0
> builtin entities could be used. Just as now, the SAX/DOM/infoset
> wouldn't record whether a particular character was entered literally
> or using a builtin entity reference. Instead programs that serialize
> XML (like XSLT) would have options saying when to use builtin entity
> references to represent characters.
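For concreteness, a document under that scheme might look something like this on the wire; the `entities` pseudo-attribute name and the version string are my invention for illustration, not part of the quoted proposal:

```xml
<?xml version="1.0" encoding="US-ASCII" entities="1.0"?>
<para>Caf&eacute; &bull; &thai_character_ko_kai;</para>
```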
I think this works for use-cases (2) and (4) above, but at a pretty
high cost. Conformant parsers will have no choice but to read or
build in the complete set (40K names or so, at the moment, is it?) in
order to handle any entity references at all. This seems too high a
cost for cases (1) and (3) above.
> For the first version of the standard builtin entity set we could start with
> - HTML entities
> - MathML entities
> - maybe a set of entity names algorithmically generated from the
> standard Unicode names in Unicode 3.2; 0xe01, which has a Unicode name
> of "THAI CHARACTER KO KAI", might be entered as &thai_character_ko_kai;.
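The algorithmic generation James sketches is easy to prototype against the Unicode character database; the lowercase-and-underscore rule here is my guess from his single example:

```python
import unicodedata

def entity_name(ch: str) -> str:
    """Derive an entity name from a character's standard Unicode name
    by lowercasing and replacing spaces and hyphens with underscores
    (assumed rule; the post gives only one example)."""
    return unicodedata.name(ch).lower().replace("-", "_").replace(" ", "_")

print(entity_name("\u0e01"))  # thai_character_ko_kai
```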
I'm also concerned that centralising maintenance and updating of this
mechanism is a recipe for frustration and interop nightmares.
What about a middle way, combining the two proposals:
1) Some document type for entity definitions is adopted by W3C;
2) XML n.m is appropriately modified to provide for exploitation of
such definitions;
3) W3C publishes definitions of at least the three sets you name above
at stable URIs with a public versioning policy;
4) Then full-featured parsers that want to can build in tables for the
published URIs, but light-weight parsers that don't want to can
operate a "read only what's required" policy, thereby handling the
simple cases simply.
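As a strawman for point (1), an entity-definition document type could be as simple as the following; every element and attribute name here is invented for illustration, not anything W3C has proposed:

```xml
<!-- Hypothetical entity-set definition document; all names invented. -->
<entityset version="1.0">
  <entity name="eacute" codepoint="U+00E9"/>
  <entity name="bull" codepoint="U+2022"/>
  <entity name="thai_character_ko_kai" codepoint="U+0E01"/>
</entityset>
```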
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2001, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: firstname.lastname@example.org