xml-dev - Re: [xml-dev] A heavier-weight proposal for character entity definition

To: "Henry S. Thompson" <ht@cogsci.ed.ac.uk>, xml-dev@lists.xml.org
Subject: Re: [xml-dev] A heavier-weight proposal for character entity definition
From: James Clark <jjc@jclark.com>
Date: Wed, 06 Feb 2002 12:18:41 +0700

Before getting into the details of a schema for an XML syntax for declaring 
character entities, I think we should step back and ask what the real 
requirements are.

What XML did to SGML was preserve SGML's extensibility where it was really 
needed (for elements and attributes) but remove it where people could get 
by without it (e.g. delimiter syntax). Which category do character entity 
names fall into? It is not obvious to me that there is a requirement that 
character entities be user extensible to the same extent that elements and 
attributes are. Consider the following points:

- in SGML days most people used the standard entity sets

- at any point in time the set of things that are being referenced by 
character entities is closed (i.e. the set of Unicode characters) modulo 
private use characters (which are typically deprecated on the Web), 
although it may evolve over time; this is quite different from the 
situation with elements and attributes

- Unicode provides a standard set of names for all Unicode characters

- I don't see the compelling user requirement for different users to be 
able to use different names for the same character

- having the 5 builtin entities in XML has worked out pretty well; in 
particular, there is no need to clutter the infoset or DOM with them; they 
are just generated as needed on output

- if you have user-defined character entity names, then users will start 
demanding the ability to preserve those names, which means that the 
DOM/SAX/Infoset will need to record which entity name if any was used for a 
character

So I'm wondering whether a more constrained approach to character entities 
would work.  Suppose for example there is a standard W3C-defined builtin 
entity set; this would have a version number and would add new characters 
from time to time (but never change existing entity names).  There would be 
a standard mapping from a version number to a URI where an XML specification 
of the entity set would be available.  However, parsers wouldn't have to 
fetch and parse this, they could just recognize the version number and 
refer to an appropriate compiled-in table.  The XML declaration would 
declare the version number of the builtin entity set that was being used; 
if the XML declaration didn't specify a version number, only the 5 XML 1.0 
builtin entities could be used. Just as now, the SAX/DOM/infoset wouldn't 
record whether a particular character was entered literally or using a 
builtin entity reference. Instead programs that serialize XML (like XSLT) 
would have options saying when to use builtin entity references to 
represent characters.
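
As a rough illustration of the approach described above, here is a minimal
Python sketch. Everything in it beyond the 5 XML 1.0 builtin entities is an
assumption made for the example: the version string "1", the two extra
entries in that table, and the function names resolve_entity and
serialize_text are all hypothetical.

```python
# Hypothetical compiled-in tables keyed by entity-set version.
# Only the five XML 1.0 builtins are real; version "1" and its extra
# entries are invented for illustration.
XML10_BUILTINS = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

ENTITY_SETS = {
    None: XML10_BUILTINS,  # no version declared: XML 1.0 builtins only
    "1": {**XML10_BUILTINS, "nbsp": "\u00a0", "eacute": "\u00e9"},
}


def resolve_entity(name, version=None):
    """Look up a builtin entity in the table for the declared version."""
    try:
        table = ENTITY_SETS[version]
    except KeyError:
        raise ValueError(f"unknown builtin entity set version: {version!r}")
    try:
        return table[name]
    except KeyError:
        raise ValueError(f"entity &{name}; is not in entity set {version!r}")


def serialize_text(text, version=None, use_entities=True):
    """Serialize character data, optionally using builtin entity names.

    The parsed data does not record how a character was entered, so the
    choice of entity reference versus literal is made here, on output.
    """
    table = ENTITY_SETS[version]
    names = {char: name for name, char in table.items()}
    out = []
    for ch in text:
        if ch in "<&":
            # Markup-significant characters are always escaped.
            out.append(f"&{names[ch]};")
        elif use_entities and ord(ch) > 0x7F and ch in names:
            out.append(f"&{names[ch]};")
        else:
            out.append(ch)
    return "".join(out)
```

With these assumed tables, resolve_entity("eacute", "1") returns the
character U+00E9, while resolve_entity("eacute") fails because no version
was declared. The parser never fetches anything; it only checks the version
against tables it already has.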

For the first version of the standard builtin entity set we could start with

- HTML entities
- MathML entities
- maybe a set of entity names algorithmically generated from the standard 
Unicode names in Unicode 3.2; for example, U+0E01, which has a Unicode name 
of "THAI CHARACTER KO KAI", might be entered as &thai_character_ko_kai;.
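
One possible reading of "algorithmically generated" in the last item is
sketched below in Python. The only rule the example above implies is
lowercasing and replacing spaces with underscores; mapping hyphens to
underscores as well is my assumption, and it makes the reverse lookup
unreliable for names that contain hyphens. The function names are
hypothetical, and Python's unicodedata module uses whatever Unicode version
the interpreter ships with rather than 3.2.

```python
import unicodedata


def entity_name_for(char):
    """Derive an entity name from a character's standard Unicode name.

    Assumed rule: lowercase the name and turn spaces and hyphens into
    underscores. Raises ValueError for characters that have no name.
    """
    name = unicodedata.name(char)
    return name.lower().replace(" ", "_").replace("-", "_")


def char_for_entity_name(entity_name):
    """Reverse the mapping for names whose Unicode name had no hyphens."""
    return unicodedata.lookup(entity_name.replace("_", " ").upper())


if __name__ == "__main__":
    print("&" + entity_name_for("\u0e01") + ";")
```

Because the names are derived mechanically, a parser would need only the
Unicode name table and no separate entity table for this part of the set.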

James

Follow-Ups:
- Re: [xml-dev] A heavier-weight proposal for character entity definition
  - From: "Rick Jelliffe" <ricko@allette.com.au>
- Re: [xml-dev] A heavier-weight proposal for character entity definition
  - From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
