   Request for Erratum to XML 1.0 and 1.1 Specs


I have just sent this off to the XML Editor mail list. I encourage
anyone who thinks it is good or bad (or who just thinks there should
be something but doesn't care what) to also send to them.

It also raises an interesting question: the XML spec is written in
draconian terms with, nominally, very few options. Yet SAX 2, the
almost universally deployed interface, is highly parameterizable with
features, handlers and properties. So it cannot be too tragic to
accept that some systems may need to bend certain rules, without
altering the basic definitions.
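
As a rough illustration of that parameterizability, here is a minimal
sketch using Python's standard xml.sax binding of SAX 2. The class and
function names (CollectingErrors, TextCollector, parse_text) are
invented for this example; only the xml.sax calls are library API. It
sets two features, installs a content handler, and installs an error
handler that records warnings and recoverable errors while still
treating fatal errors as fatal.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    ErrorHandler,
    feature_external_ges,
    feature_namespaces,
)


class CollectingErrors(ErrorHandler):
    """Record non-fatal reports; let fatal errors stop the parse."""

    def __init__(self):
        self.recoverable = []

    def warning(self, exception):
        self.recoverable.append(("warning", str(exception)))

    def error(self, exception):
        # SAX "error" is the recoverable (non-fatal) category.
        self.recoverable.append(("error", str(exception)))

    def fatalError(self, exception):
        raise exception


class TextCollector(ContentHandler):
    """Collect character data from the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(xml_bytes):
    """Parse xml_bytes and return (text, list of non-fatal reports)."""
    parser = xml.sax.make_parser()
    # Features are switches the application sets before parsing.
    parser.setFeature(feature_namespaces, True)
    parser.setFeature(feature_external_ges, False)
    content = TextCollector()
    errors = CollectingErrors()
    parser.setContentHandler(content)
    parser.setErrorHandler(errors)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(content.chunks), errors.recoverable
```

The point is only that an application already chooses, per parse, which
features are on and how each class of error is handled; the sketch does
not implement the erratum itself.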



Request for Erratum to XML 1.0 and 1.1 Specs
Rick Jelliffe, ricko@topologi.com, 2003-10-21

I request the XML Working Group please consider the following erratum
to XML 1.0 which should also apply to XML 1.1.

The following two paragraphs, or something to the same effect, should be 
appended to section 5.1 "Validating and Non-Validating Processors":

"A non-validating processor may, at user option, imply definitions for
all the character entities defined by HTML 4[1]. A document or entity 
for which definitions are implied is not well-formed. The processor must 
report a non-fatal error. NOTE: The document is 'not well-formed but 
processed'. Reliance on this feature by specifications is deprecated; 
this option may be withdrawn at some future time should it prove 
dangerous."

"A non-validating processor which provides the HTML 4
definitions may, at user option, also imply definitions for other
MathML and ISO standard sets[2]. A processor must report a non-fatal
error. The document is 'not well-formed but processed'. NOTE: Reliance 
on this feature by specifications is deprecated; this option may be 
withdrawn at some future time should it prove dangerous."

[1] http://www.w3.org/TR/html401/sgml/entities.html
[2] http://www.w3.org/TR/MathML2/chapter6.html#chars_entity-tables
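
To make the proposed behaviour concrete, here is a minimal sketch in
Python of how the first paragraph might look at the point where a
processor resolves a general entity reference. It is not a parser and
not part of the proposal: the names resolve_reference, imply_html4,
report and UndefinedEntity are hypothetical, and the HTML 4 table is
taken from Python's standard html.entities.name2codepoint.

```python
from html.entities import name2codepoint  # HTML 4 entity names

# The five entities every XML processor must recognise.
XML_PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}


class UndefinedEntity(Exception):
    """Fatal: the reference has no declared or implied definition."""


def resolve_reference(name, declared, imply_html4=False, report=None):
    """Return replacement text for the entity reference &name;.

    declared    -- entities declared by the document (name -> text)
    imply_html4 -- the user option described in the proposed paragraph
    report      -- callable that receives each non-fatal error message
    """
    if name in XML_PREDEFINED:
        return XML_PREDEFINED[name]
    if name in declared:
        return declared[name]
    if imply_html4 and name in name2codepoint:
        # The definition is implied, so the document is still not
        # well-formed; report that, then carry on processing.
        if report is not None:
            report(
                f"entity '{name}' implied from HTML 4: "
                "document is not well-formed but processed"
            )
        return chr(name2codepoint[name])
    # Option off, or name unknown: behaviour is unchanged from today.
    raise UndefinedEntity(name)
```

With the option off the function behaves as a conforming processor does
now, which is the sense in which no existing processor would need to
change.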

This suggested erratum has the following characteristics:

1) It does not require any change to any XML processor
2) It does not change the basic XML characteristic that the
only way to guarantee information is received at the other
end is to use a UTF-* encoding, no entities and no attribute
3) It maintains the current layering, so no re-architecting
or change in design is needed
4) It keeps the XML specification as the location on how to
go from characters to data+markup.

5) It does not make any existing valid XML document invalid
6) It does not make any existing invalid XML document valid
7) It does not make any existing WF document or entity non-WF
8) It does not make any existing non-WF document formally WF

9) It does allow the continued non-validating processing of
documents which are non-WF only because they contain standard
10) It limits this to user option
11) It does not allow other specifications to use this as
its default
12) It can be withdrawn

13) I believe it is practical and would be simple to implement.

I believe the beneficiaries of such an erratum include:

 * Users typing in editors with no adequate input methods
 for non-ASCII characters. I note that although Unicode
 editors can display many characters, not all operating
 systems have input methods to allow convenient data entry
 even of Latin1 characters. (I believe this is better provided
 by using decent XML markup editors, without prejudice.)

 * XHTML users who are used to named references without declarations
 in HTML.

 * Potential XInclude users, who may wish
 to treat a WF parsed entity from a document that uses
 standard character references as a microdocument

 * Potential XML Schemas, Schematron and RELAX NG users who
 may wish to upgrade from DTDs.

 * Potential XQuery users who are being hindered by the lack
 of XML Schemas.

 * XML pipeline systems which can pass XML without requiring
  tricky prologs

 * SOAP, RSS and RDF systems which must cope with embedded data
 fragments from externally generated documents

 * Programmers serializing data to XML, especially for internal
  systems, who may prefer to generate "&mdash;" or "&nbsp;"
  rather than the numeric or literal equivalents.

 * Vendors who make products for the above

 * Low-sight or motion-impaired users whose speech synthesizers
  or input methods only support ASCII characters. Aged, enraged
  or diminished capacity users who may be frustrated at having
  to look up the number for something they know the name for.
  (Though I do not want to suggest that "entity rage" is a hidden

I suggest its benefits over other suggested approaches include:

 * It does not require change to subsequent processes, as PSVI
  processing would, nor any changes or additions to schema

 * It does not require pre-processing, as a macro processor would

 * It does not require the introduction and deployment of new
  transcoders, as would Tim Bray and John Cowan's recent thought
  experiment "UTF-8+Names"

 * It does not require interaction with other standards groups, notably
  XML Schemas EG or IANA or IETF.

 * By providing it at user option, it can succeed or fail; if it is
 popular and successful, that is good; if it is unpopular or unsafe,
 it can be withdrawn.

 * By limiting itself to the HTML and the MathML/ISO entities, it
  avoids issues of user-defined entities, and the need to enumerate
  the entities.

 * It does not define mappings for those characters, but defers to
  HTML and MathML/ISO, who may provide standard mappings.

This gives a very wide constituency:

I note that Xerces' SAX 2 implementation provides features by which a
parser can continue processing after an error. This proposal could be
seen as a very limited nod of recognition of that kind of practice.

Rick Jelliffe

