James Clark <email@example.com> writes:
> Before getting into the details of a schema for an XML syntax for
> declaring character entities, I think we should step back and ask what the
> real requirements are.
For sure. I think there are a number of obvious use cases, from which
we might derive requirements:
1) Hand-authoring an XML document and needing to include a few
well-known useful non-ASCII characters, e.g. é, •;
2) Post-processing arbitrary XML to make it encoding='ISO-646' or
similar;
3) Authoring MathML, with or without helpful UI.
4) Marshalling implementation data, e.g. from a database, whose string
fields may have arbitrary Unicode, where e.g. ISO-8859-1 is the
required encoding (similar to (2)).
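Use cases (2) and (4) above amount to a serialization-time transform; a minimal sketch in Python (the tiny NAMES table and the function name are mine for illustration, not any proposed API):

```python
# Hypothetical mini entity table; a real one would come from the
# HTML/MathML entity sets discussed below.
NAMES = {0x00E9: "eacute", 0x2022: "bull"}

def escape_for_encoding(text: str, encoding: str = "ascii") -> str:
    """Replace characters outside the target encoding with a named
    entity reference where a name is known, else a numeric character
    reference."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            cp = ord(ch)
            name = NAMES.get(cp)
            out.append(f"&{name};" if name else f"&#x{cp:X};")
    return "".join(out)

print(escape_for_encoding("caf\u00e9 \u2022 \u0e01"))
# caf&eacute; &bull; &#xE01;
```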
> - if you have user-defined character entity names, then users will
> start demanding the ability to preserve those names, which means that
> the DOM/SAX/Infoset will need to record which entity name if any was
> used for a character
As now, that demand can sensibly be met by pointing out that editors
are not vanilla applications.
> So I'm wondering whether a more constrained approach to character
> entities would work. Suppose for example there is a standard
> W3C-defined builtin entity set; this would have a version number and
> would add new characters from time to time (but never change existing
> entity names). There would be a standard mapping from a version
> number to a URI where an XML specification of the entity set would be
> available. However, parsers wouldn't have to fetch and parse this,
> they could just recognize the version number and refer to an
> appropriate compiled-in table. The XML declaration would declare the
> version number of the builtin entity set that was being used; if the
> XML declaration didn't specify a version number, only the 5 XML 1.0
> builtin entities could be used. Just as now, the SAX/DOM/infoset
> wouldn't record whether a particular character was entered literally
> or using a builtin entity reference. Instead programs that serialize
> XML (like XSLT) would have options saying when to use builtin entity
> references to represent characters.
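For concreteness, a document under that scheme might look something like this on the wire; the `entities` pseudo-attribute name and the version string are my invention for illustration, not part of the quoted proposal:

```xml
<?xml version="1.0" encoding="US-ASCII" entities="1.0"?>
<para>Caf&eacute; &bull; &thai_character_ko_kai;</para>
```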
I think this works for use-cases (2) and (4) above, but at a pretty
high cost. Conformant parsers will have no choice but to read or
build in the complete set (40K names or so, at the moment, is it?) in
order to handle any entity references at all. This seems too high a
cost for cases (1) and (3) above.
> For the first version of the standard builtin entity set we could start with
> - HTML entities
> - MathML entities
> - maybe a set of entity names algorithmically generated from the
> standard Unicode names in Unicode 3.2; 0xe01, which has a Unicode name
> of "THAI CHARACTER KO KAI", might be entered as &thai_character_ko_kai;.
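The algorithmic generation James sketches is easy to prototype against the Unicode character database; the lowercase-and-underscore rule here is my guess from his single example:

```python
import unicodedata

def entity_name(ch: str) -> str:
    """Derive an entity name from a character's standard Unicode name
    by lowercasing and replacing spaces and hyphens with underscores
    (assumed rule; the post gives only one example)."""
    return unicodedata.name(ch).lower().replace("-", "_").replace(" ", "_")

print(entity_name("\u0e01"))  # thai_character_ko_kai
```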
I'm also concerned that centralising maintenance and updating of this
mechanism is a recipe for frustration and interop nightmares.
What about a middle way, combining the two proposals:
1) Some document type for entity definitions is adopted by W3C;
2) XML n.m is appropriately modified to provide for exploitation of
such definitions;
3) W3C publishes definitions of at least the three sets you name above
at stable URIs with a public versioning policy;
4) Then full-featured parsers that want to can build in tables for the
published URIs, but light-weight parsers that don't want to can
operate a "read only what's required" policy, thereby handling the
simple cases simply.
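As a strawman for point (1), an entity-definition document type could be as simple as the following; every element and attribute name here is invented for illustration, not anything W3C has proposed:

```xml
<!-- Hypothetical entity-set definition document; all names invented. -->
<entityset version="1.0">
  <entity name="eacute" codepoint="U+00E9"/>
  <entity name="bull" codepoint="U+2022"/>
  <entity name="thai_character_ko_kai" codepoint="U+0E01"/>
</entityset>
```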
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2001, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: firstname.lastname@example.org