I have just sent this off to the XML Editor mail list. I encourage
anyone who thinks it is good or bad (or who just thinks there should
be something but doesn't care what) to send their views there as well.
It also raises an interesting question: the XML spec is written in
draconian terms with, nominally, very few options. Yet SAX 2, the
almost universally deployed parser interface, is highly parameterizable
with features, handlers and properties. So it cannot be too tragic to
accept that some systems may need to bend certain rules, without
altering the basic definitions.
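To make that parameterizability concrete: SAX 2 features are simply named boolean switches that a parser may accept or reject. A minimal sketch using Python's standard xml.sax bindings (the same feature URIs as the Java interface, though the document's examples concern Java parsers such as Xerces):

```python
import xml.sax
import xml.sax.handler

# Create a SAX 2 parser and toggle standard features by their URIs.
parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)
# e.g. decline to fetch external general entities at all:
parser.setFeature(xml.sax.handler.feature_external_ges, False)

# A feature is just a named switch the parser may honour or refuse.
print(parser.getFeature(xml.sax.handler.feature_namespaces))  # prints True
```

Whether a given parser supports a given feature is itself negotiable at runtime (it raises SAXNotSupportedException otherwise), which is exactly the kind of per-system flexibility the paragraph above describes.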
Rick
===============================================================
Request for Erratum to XML 1.0 and 1.1 Specs
----------------------------------------------
Rick Jelliffe, ricko@topologi.com, 2003-10-21
I request the XML Working Group please consider the following erratum
to XML 1.0 which should also apply to XML 1.1.
The following two paragraphs, or something to the same effect, should be
appended to section 5.1 "Validating and Non-Validating Processors"
"A non-validating processor may, at user option, imply definitions for
all the character entities defined by HTML 4[1]. A document or entity
for which definitions are implied is not well-formed. The processor must
report a non-fatal error. NOTE: The document is 'not well-formed but
processed'. Reliance on this feature by specifications is deprecated;
this option may be withdrawn at some
future time should it prove dangerous."
"A non-validating processor which provides the HTML 4
definitions may, at user option, also imply definitions for other
MathML and ISO standard sets[2]. A processor must report a non-fatal
error. The document is 'not well-formed but processed'. NOTE: Reliance
on this feature by specifications is deprecated; this option may be
withdrawn at some future time should it prove dangerous."
[1] http://www.w3.org/TR/html401/sgml/entities.html
[2] http://www.w3.org/TR/MathML2/chapter6.html#chars_entity-tables
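To illustrate the effect of the proposed option (this is not a conforming processor, just a sketch of what "implying definitions" means in practice), one can expand the HTML 4 named references into numeric character references before parsing, reporting each expansion as a non-fatal error. Python's stdlib `html.entities.name2codepoint` table happens to hold exactly the HTML 4 names:

```python
import re
from html.entities import name2codepoint  # the HTML 4 named characters

def imply_html4_entities(text):
    """Expand HTML 4 named references (other than the XML built-ins)
    into numeric character references, recording a non-fatal error
    for each implied definition."""
    builtins = {"lt", "gt", "amp", "apos", "quot"}
    errors = []
    def repl(m):
        name = m.group(1)
        if name in name2codepoint and name not in builtins:
            errors.append("non-fatal: implied definition for &%s;" % name)
            return "&#%d;" % name2codepoint[name]
        return m.group(0)  # leave built-ins and unknown names alone
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text), errors

doc, errs = imply_html4_entities("<p>caf&eacute; &mdash; 3&nbsp;km</p>")
print(doc)   # <p>caf&#233; &#8212; 3&#160;km</p>
print(errs)  # one non-fatal error per implied definition
```

The result is then parseable by any unmodified XML processor, matching characteristic (1) below: no change to existing parsers is required, only an optional pre-step.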
This suggested erratum has the following characteristics:
1) It does not require any change to any XML processor
2) It does not change the basic XML characteristic that the
only way to guarantee information is received at the other
end is to use a UTF-* encoding, no entities and no attribute
defaulting.
3) It maintains the current layering, so no re-architecting
or change in design is needed
4) It keeps the XML specification as the sole definition of how to
go from characters to data+markup.
5) It does not make any existing valid XML document invalid
6) It does not make any existing invalid XML document valid
7) It does not make any existing WF document or entity non-WF
8) It does not make any existing non-WF document formally WF
9) It does allow the continued non-validating processing of
documents which are non-WF only because they contain standard
references
10) It limits this to user option
11) It does not allow other specifications to use this as
their default
12) It can be withdrawn
13) I believe it is practical and would be simple to implement.
I believe the beneficiaries of such an erratum include:
* Users typing in editors with no adequate input methods
for non-ASCII characters. I note that although Unicode
editors can display many characters, not all operating
systems have input methods to allow convenient data entry
even of Latin1 characters. (I believe this is better provided
by using decent XML markup editors, without prejudice.)
* XHTML users who are used to named references without declarations
in HTML.
* Potential XInclude users, who may wish
to treat a WF parsed entity from a document that uses
standard character references as a microdocument
* Potential XML Schemas, Schematron and RELAX NG users who
may wish to upgrade from DTDs.
* Potential XQuery users who are being hindered by the lack
of XML Schemas.
* XML pipeline systems which can pass XML without requiring
tricky prologs
* SOAP, RSS and RDF systems which must cope with embedded data
fragments from externally-generated documents
* Programmers serializing data to XML, especially for internal
systems, who may prefer to generate "&mdash;" or "&nbsp;"
rather than the numeric or literal equivalents.
* Vendors who make products for the above
* Low-sight or motion-impaired users whose speech synthesizers
or input methods only support ASCII characters. Aged, enraged
or diminished capacity users who may be frustrated at having
to lookup the number for something they know the name for.
(Though I do not want to suggest that "entity rage" is a hidden
problem.)
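The serialization point above can be sketched as a hypothetical escaping helper (the function name and policy are illustrative, not part of the proposal) that prefers named references over numeric ones, using the stdlib `html.entities.codepoint2name` reverse table:

```python
from html.entities import codepoint2name

XML_BUILTINS = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}

def serialize_text(s):
    """Escape text for XML, preferring HTML 4 named references such as
    &mdash; and &nbsp; over numeric ones. More readable, but the result
    is only usable by consumers that imply those definitions."""
    out = []
    for ch in s:
        if ch in XML_BUILTINS:
            out.append(XML_BUILTINS[ch])
        elif ord(ch) > 127 and ord(ch) in codepoint2name:
            out.append("&%s;" % codepoint2name[ord(ch)])
        else:
            out.append(ch)
    return "".join(out)

print(serialize_text("fish & chips \u2014 caf\u00e9"))
# fish &amp; chips &mdash; caf&eacute;
```

This is the trade the erratum makes explicit: such output is not well-formed for a strict processor, which is why the option is confined to internal systems and flagged with a non-fatal error.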
I suggest its benefits over other suggested approaches include:
* It does not require change to subsequent processes, as PSVI
processing would, nor any changes or additions to schema
specifications
* It does not require pre-processing, as a macro processor would
* It does not require the introduction and deployment of new
transcoders, as would Tim Bray and John Cowan's recent thought
experiment "UTF-8+Names"
* It does not require interaction with other standards groups, notably
XML Schemas EG or IANA or IETF.
* By providing it at user option, it can succeed or fail; if it is
popular and successful, that is good; if it is unpopular or unsafe,
the option can simply be withdrawn.
* By limiting itself to the HTML and the MathML/ISO entities, it
avoids issues of user-defined entities, and the need to enumerate
the entities.
* It does not define mappings for those characters, but defers to
HTML and MathML/ISO, who may provide standard mappings.
This gives it a very wide constituency.
I note that Xerces' SAX 2 implementation provides features by which a
parser can continue processing after an error. This proposal could be
seen as a very limited nod of recognition to that kind of practice.
Cheers
Rick Jelliffe