OASIS Mailing List Archives

   Re: Why the Infoset?

  • From: "W. E. Perry" <wperry@fiduciary.com>
  • To: XML DEV <xml-dev@lists.xml.org>
  • Date: Sat, 29 Jul 2000 03:03:14 -0400

"Paul W. Abrahams" wrote:

> Which is the horse and which the cart here?  Especially given its ancestry as a
> more civilized form of SGML, XML is seen by the world as a set of textual
> conventions for recording documents.  The Infoset is related to an
> implementer's view of the abstract syntax tree.  But even then, I believe that
> people were writing XML parsers, and therefore creating abstract syntax trees,
> before the Infoset ever existed.
> Looking at it another way, how would the XML world be poorer if the Infoset did
> not exist?

It is far worse than that, I fear. The Infoset is the cuckoo's egg in the XML nest.
The fundamental innovation of XML 1.0 was the concept of well-formedness, which as
a radical insight amounts to this: the instance text--that is, content plus
markup--is entirely self-sufficient both as syntax and as the basis for derived or
elaborated semantics. The inherent bias of SGML is toward a pre-ordained content
model. The DTD-based validation which XML inherited from SGML imposes as a first
and principal demand on the instance document that it be a proper concrete
expression of an established form. I call such a priori expectations 'intent', and
the XML family of specifications abounds with often mutually exclusive and
mutually contradictory attempts to impose such preconceptions. They range from
DTD-based validation at the milder end of the spectrum to attempts such as SOAP to
force an XML document to mandate specific processing at the time of its use--to
become, in effect, an executable.

By contrast, the concept of well-formedness introduced by XML 1.0 permitted that
original XML definition to be understood as a specification of syntax rather than
of expected semantics. It offered the possibility of XML which, as fundamentally
distinct from SGML, might have no expectations of an instance document other than
well-made syntax. That, in turn, offered the possibility that the true content
model of an instance document might be uniquely derived at the time and place of
its use. The intent of the document creator for the interpretation of the instance
document--whether that intent was expressed as a content model in a DTD, or as a
schema imposed upon the instance document, or as a stylesheet specifying a
pre-ordained transformation, or even a presentation, of that document--might
legitimately be ignored, partially ignored, or modified in ways appropriate to the
unique local circumstances where the document consumer processed or otherwise made
use of that document. This is the closest that we have come in the field of markup
to realizing the separation of content (which cannot be more minimally conveyed
than as syntax) from presentation (in its larger sense of the elaboration of
semantics from that syntax).
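
A minimal sketch of that idea, using Python's standard library: one instance document carries its creator's intent (a DOCTYPE and a stylesheet instruction), and two consumers each derive their own view from the syntax alone. The document, the file names invoice.dtd and invoice.xsl, and both function names are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical instance. The DOCTYPE and the xml-stylesheet instruction
# express the creator's intent; neither referenced file is ever fetched here.
DOC = """<?xml-stylesheet type="text/xsl" href="invoice.xsl"?>
<!DOCTYPE invoice SYSTEM "invoice.dtd">
<invoice>
  <line desc="widgets" amount="40.00"/>
  <line desc="freight" amount="2.50"/>
  <ship-to>12 Elm St</ship-to>
</invoice>"""


def accounting_view(text):
    """One consumer's local semantics: an invoice is a sum of amounts."""
    root = ET.fromstring(text)
    return sum(Decimal(line.get("amount")) for line in root.iter("line"))


def shipping_view(text):
    """Another consumer's local semantics: an invoice is a destination."""
    root = ET.fromstring(text)
    return root.findtext("ship-to")


total = accounting_view(DOC)
destination = shipping_view(DOC)
```

Neither consumer consults the DTD or applies the stylesheet; each elaborates only the semantics it needs from the same well-formed text.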

This understanding of radically simple well-formed XML leads to other wonderful
discoveries as well. For example, just as the XML name promises, the language or
markup vocabulary of a document is extensible on the spot, in the instance, through
nothing other than the application of markup itself. Since neither a DTD nor any
other content model or pre-ordained schema is required for the parsing, and therefore the
interpretation, of the resulting instance document, it is not necessary to secure
anyone's agreement to the extension of the content model before simply extending
the markup vocabulary of the instance document. XML 1.0 is wonderfully silent on
how that novel markup is to be understood by a consumer of the document, thereby
leaving the question of what the local semantics of the document will be in the
circumstances of its use quite properly in the hands of each of its users.
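
As a rough illustration, assuming Python's standard-library parser stands in for any non-validating XML processor: the element <rush-handling> below is invented on the spot, declared nowhere, and the document still parses, because well-formedness is the only requirement. The document and every name in it are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: <rush-handling> was added by the author of this one
# document, with no DTD or schema anywhere declaring it.
INSTANCE = """<order>
  <item sku="A-100" qty="2"/>
  <rush-handling by="2000-08-01"/>
</order>"""

root = ET.fromstring(INSTANCE)  # succeeds on well-formedness alone

# A consumer that cares only about <item> simply never looks at the new element.
quantities = {item.get("sku"): int(item.get("qty")) for item in root.iter("item")}

# A consumer that does care assigns the novel markup a purely local meaning.
rush = root.find("rush-handling")
deadline = rush.get("by") if rush is not None else None

# The one thing the parser does insist on is syntax: mismatched tags fail.
try:
    ET.fromstring("<order><item></order>")
    malformed_accepted = True
except ET.ParseError:
    malformed_accepted = False
```

What <rush-handling> means is decided entirely by the code that reads it; the parser contributes nothing beyond the element's name, attributes, and position.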

The Infoset is the unfortunate standard to which those in retreat from the radical
and most useful implications of well-formedness have rallied. At its core the
Infoset insists that there is 'more' to XML than the straightforward syntax of
well-formedness. By imposing its canonical semantics the Infoset forecloses the
infinite other semantic outcomes which might be elaborated in particular unique
circumstances from an instance of well-formed XML 1.0 syntax. The question we
should be asking is not whether the Infoset has chosen the correct canonical
semantics, but whether the syntactic possibilities of XML 1.0 should be curtailed
in this way at all.


Walter Perry



Copyright 2001 XML.org. This site is hosted by OASIS