xml-dev - Dizzy (was Re: Why the Infoset?)

From: Rick JELLIFFE <ricko@geotempo.com>
To: xml-dev@xml.org
Date: Wed, 02 Aug 2000 04:14:46 +0800

"Simon St.Laurent" wrote:

> I'm getting kind of dizzy here.  You've objected rather violently to Common
> XML and Minimal XML's subsetting of XML syntax, but you seem to insist on
> the Infoset only providing an abstraction of just such a subset,
> deliberately ignoring the rest.

(Off-topic: Flannery O'Connor used to refer to her excellent book "The
Violent Bear It Away" as "The Violant Bear".)

First, let me start by saying that the new draft of Common XML is really
excellent. I read it last night and I was not subject to any of the fits
of violence that must so terrify other, more mild-mannered,
correspondents.  Anyone starting to build a system that publishes XML
blind (i.e., with no control over the receiving systems) is well
advised to read it. I believe Simon commented on SML-DEV that he is
preparing to release the new version soon, and he will no doubt publish
the URL.

As for Minimal XML, if I don't want Microsoft to extend XML
or subset XML syntax, it is only consistent to hold that SML-DEV should
not either (and still call it XML).   Anyone reading the SML-DEV
archives will see that almost all the absolute statements there (PIs
bad, notations bad, attributes bad, comments burdensome) were found
over time to be too extreme to be general, or to apply only to
particular use-models or implementation techniques.

So yes I do not agree with subsetting XML syntax. But I certainly agree
that W3C specs should have a consistent view of what information markup
should have. And this view should be consonant with XML as SGML: if we
disconnect XML from SGML it will not fly free like a beautiful bird, it
will be captured for hideous genetic experiments by the rich and
powerful, or their hunchbacked .org fronts.  I probably have written on
this list before that I think it is naive to think that large companies
will not use any ammunition the public gives them to justify
embrace-and-extend (or subset); agreeing on a conservative information
set to be used in W3C specs is one way to corral them. In SGML,
whitespace in markup is "ignored"; consequently it should be ignored in
general-purpose XML information sets. I hope the information set can be
a way to capture some of the valuable semantics of markup that are
clear in ISO 8879 but were left out of XML 1.0 (to simplify it).

So if I say that the number of whitespace characters between x and y in
 <x y="z"/>
should not be part of the information set, what do I mean? I mean that
the standard DOM should not support it with extra nodes on an Element
node, that there should not be a special axis invented for XPath for
it, that there need not be a way to XLink to it, that the c14n draft
should not be constrained to keep it, that XSchemas should not have a
way to constrain how many spaces can appear there or to give a regular
expression for whether newlines can be put there, that XSLT should not
need extra elements or attributes to handle it, that CSS does not need
extra selectors for it, etc.
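
As a small illustration of the point (my sketch, not part of any spec):
Python's standard xml.etree.ElementTree is used here only as an example
of a general-purpose API, and the names "variants" and "info" are
invented for the example. Three spellings of the same start tag, which
differ only in the whitespace inside the markup, come out as the same
element name and the same attribute.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same element. Only the whitespace inside
# the tag differs (spaces, a newline and tab, spaces around "=").
variants = [
    '<x y="z"/>',
    '<x     y="z"/>',
    '<x\n\ty = "z" />',
]

def info(doc):
    """Return what this API exposes for the root: name and attributes."""
    el = ET.fromstring(doc)
    return (el.tag, dict(el.attrib))

results = [info(v) for v in variants]

# Every variant yields the same information.
for r in results:
    print(r)
```

Nothing in the returned data records how many spaces separated x from
y, which is the behaviour argued for above.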

The XML Infoset has a specific purpose, given in the requirements
document:
http://www.w3.org/TR/NOTE-xml-infoset-req
"It will provide a common reference set that other specifications
can use and extend to construct their underlying data models, and will
help to ensure interoperability among the various XML-based
specifications and among XML software tools in general."

The Infoset is aimed at XML specifications and software in general. It
is not its intent to state all the information that anyone could encode
in their document. I would say that in particular it is setting a policy
that W3C XML specs should not operate as if the formatting of the XML
markup was significant. 

This is not a new issue: I remember it being discussed 3 years ago or
so.  It is good for XML editors to regenerate edited documents with the
original formatting of the markup. That is why it is useful if SAX
reports rather than collapses whitespace, and why a DOM implementation
for an interactive editor should subclass the W3C DOM to provide this
information.  That is their infoset, but it is not the one that W3C
Working Groups should start from.
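
A rough sketch of the SAX side of this, assuming Python's standard
xml.sax module with its default expat-based reader and a document with
no DTD; the handler name WhitespaceCollector is invented for the
example. The whitespace between elements is delivered to the
application through characters() rather than being collapsed, so an
editor that wants to regenerate the original layout has something to
work from.

```python
import xml.sax

class WhitespaceCollector(xml.sax.ContentHandler):
    """Collect every whitespace-only chunk reported as character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may split text into several chunks, so collect
        # them all and join them afterwards.
        if content.strip() == "":
            self.chunks.append(content)

doc = b"<a>\n  <b/>\n</a>"

handler = WhitespaceCollector()
xml.sax.parseString(doc, handler)

# The line breaks and indentation between the elements, as reported.
whitespace = "".join(handler.chunks)
print(repr(whitespace))
```

Note that this is whitespace in content. Whitespace inside the tags
themselves is not reported by SAX at all, so an editor that needs that
as well has to keep its own, richer record.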

Perhaps John should consider retitling it "XML Information Set for W3C
Specifications"; its scope would then be clearer.

To be glib, Simon & I are mutually dizzy: I would have thought that if
interoperability and simplicity are both desired together, roping in the
information set from niche requirements and questionable uses should be
a Good Thing in Simon's book (not a real book).

Rick Jelliffe
