
Re: typing (was RE: Personal reply)

From: Sean McGrath <sean@digitome.com>
To: "Simon St.Laurent" <simonstl@simonstl.com>,"David E. Cleary" <davec@progress.com>, XML DEV <xml-dev@lists.xml.org>
Date: Tue, 13 Mar 2001 11:11:43 +0000

At 03:23 PM 3/12/01 -0500, Simon St.Laurent wrote:

>Telling me "don't use the feature if you don't like it" isn't a reasonable
>answer to the kinds of problems we're addressing here.

This is one of the most important questions we have to address.

We are all conscientious software developers; we like to build
things that work reliably. To do that, we have to be very careful with
optional features in XML and related technologies. It is oh, so easy
to say that software is "100 per cent XML compliant" but
fiendishly difficult to live up to that promise in anything but
marketing bumph.

Pipeline processing is a good example of a technique where
optional features of XML bite and bite hard. The work you need to
do to do the right thing in the presence of validating XML 1.0
parsers is orders of magnitude larger than if you just work
with WF XML.  I am not talking about the parsing act
itself - I am talking about the infoset that is yielded
which needs to be nurtured through the processing.

I may want to use DTDs (I often do!) but specifying a DTD
opens a wasps' nest in the infoset. All of a sudden - just
to get content model validation in my pipeline - I need to worry about
general entities, internal document type declaration subsets,
include marked sections, entity resolution, public/system
identifiers, defaulted attribute values etc. etc. I need to
worry about these because I may well need to reflect
their presence in the XML my processing produces.
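
To make the round-tripping worry concrete, here is a minimal
sketch using Python's standard xml.etree.ElementTree (expat
underneath). The memo element, its status attribute and the co
entity are made up for illustration. The first instance carries an
internal subset with a defaulted attribute and a general entity;
the second has the same element content and no DOCTYPE.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: the internal subset declares a defaulted
# attribute and a general entity.
WITH_SUBSET = """<!DOCTYPE memo [
  <!ATTLIST memo status CDATA "draft">
  <!ENTITY co "Digitome">
]>
<memo>From &co;</memo>"""

# Same element content, but no DOCTYPE at all.
WITHOUT_SUBSET = "<memo>From Digitome</memo>"


def summarize(xml_text):
    """Return (attributes, text, reserialized XML) for the root element."""
    root = ET.fromstring(xml_text)
    return dict(root.attrib), root.text, ET.tostring(root, encoding="unicode")


attrs_a, text_a, out_a = summarize(WITH_SUBSET)
attrs_b, text_b, out_b = summarize(WITHOUT_SUBSET)

# The parser has silently folded the declarations into the data:
# the first root carries a defaulted attribute the second lacks,
# the entity reference has been replaced by its text, and the
# reserialized form no longer has a DOCTYPE to say where either
# came from.
print(attrs_a, attrs_b)
print(out_a)
print(out_b)
```

A pipeline stage that wants to hand the original declarations on
to the next stage has to carry them through by some other means,
which is exactly the nurturing I am complaining about.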

Compounding this is the fact that as my pipeline progresses,
I am typically morphing document structures from one
form to the next. Most of the time, as a pipeline is in
progress, there is no content model in the XML 1.0 sense
of the word. I can certainly use intermediate content model
validation to great benefit but XML 1.0 actually gets in the
way of doing so.

Here's why. With SGML, I could keep all the content model stuff in
separate entities from the document instances. I can
feed them as separate things to an SGML parser.
With XML today, the situation is at once better and a lot
worse. I love the stuff that is going on with the
alternative schema languages/validators and plan
to use 'em all to varying degrees in pipeline processing.

Unfortunately, DTD schemata have a privileged place *in*
the document instances. This creates no end of
round-tripping-the-important-stuff problems. A solution
to this is to be found in the abstract infoset paradigm.
This is the road the grove men took in the SGML days.
A simpler solution is also to be found, I believe, by looking
at the problem differently.

What if schema stuff including DTDs is *always* outside
the instance?

Does this simplify the infoset issues? Yes.
Does this allow a variety of schema approaches to be used on
a mix and match basis during pipeline processing? Yes.
Does it allow the same instance to be viewed through
the eyes of both local and global semantics via different
schemata? Yes.
Does it appeal to simpletons? Yes.
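
A rough sketch of what "always outside the instance" could look
like in a pipeline stage, again in plain Python. This is not a DTD
validator or any real schema language: check_children and the two
dictionary "models" are hypothetical stand-ins, there only to show
one DOCTYPE-free instance being checked against two different
models that live elsewhere.

```python
import xml.etree.ElementTree as ET


def check_children(root, model):
    """Return a list of violations of a toy content model.

    `model` maps an element name to the set of child element names
    allowed directly beneath it. It lives entirely outside the
    instance being checked.
    """
    problems = []
    for parent in root.iter():
        allowed = model.get(parent.tag)
        if allowed is None:
            continue  # this model has no opinion about this element
        for child in parent:
            if child.tag not in allowed:
                problems.append(f"{child.tag} not allowed in {parent.tag}")
    return problems


# The instance carries no DOCTYPE and no schema reference.
INSTANCE = "<memo><to>Simon</to><body><p>hi</p></body></memo>"

# Two hypothetical models for the same instance.
LOCAL_MODEL = {"memo": {"to", "body"}, "body": {"p"}}
STRICT_MODEL = {"memo": {"to", "from", "body"}, "body": {"para"}}

root = ET.fromstring(INSTANCE)
local_problems = check_children(root, LOCAL_MODEL)
strict_problems = check_children(root, STRICT_MODEL)
print(local_problems)
print(strict_problems)
```

The instance never changes and never mentions either model, so
each stage of a pipeline is free to apply whichever check suits
the shape the document has reached at that point.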

The optionality of DTD validation, coupled with its explicit binding
to document instances, coupled with its explosive effect
on the complexity of infosets, is the nub of the problem
in my opinion.

I know analogies with SGML/HTML have been flogged to
death but... HTML is an SGML application  - or so
some would have it. Ever tried using general entities in
HTML? How about DTD subsets? CDATA sections?
Ever tried to declare your JPGs as unparsed entities?

HTML parsers don't do any of that stuff. There is no
point in putting them in your HTML even though
SGML says you can. As a consequence of simply
ignoring all these "optional" features, HTML parsers
yield a simple infoset. Yes, I know that the absence
of start and end-tags makes the DAG variable from
one parser to another but the core infoset is
*simple*.

We want the DAG to be formally defined - not only
for HTML but for all pointy bracket tag languages.
WF XML takes care of that. But the continuing
optionality of all the embedded DTD stuff makes
the infoset surrounding the DAG complicated.

A lot more complicated than I, for one, feel happy
with. XML in its mission statement set out to
have as few optional features as possible - preferably
zero. The mother of all optional features has
unfortunately crept under the radar.

What would it take (I am addressing this question to
those with an intimate knowledge of the XML 1.0 spec.)
to allow validating XML 1.0 parsers to be handed
two URIs: one for the DTD and one for the instance?

This, I believe, would be a great first step towards
separating the expression of data and model.
It would also make DTD level validation a peer
of other validation/mapping/transclusion
technologies rather than an eminence.

regards,
Sean (deprecate DOCTYPE) McGrath.

Follow-Ups:
- RE: typing
  - From: Jonathan Borden <jborden@mediaone.net>

References:
- RE: Personal reply to Edd Dumbill's XML Hack Article wrt W3C XML Schema
  - From: "David E. Cleary" <davec@progress.com>
- RE: Personal reply to Edd Dumbill's XML Hack Article wrt W3C XML Schema
  - From: "Simon St.Laurent" <simonstl@simonstl.com>
- typing (was RE: Personal reply)
  - From: "Simon St.Laurent" <simonstl@simonstl.com>
