Syntax Sugar and XML information models
- From: Michael Champion <firstname.lastname@example.org>
- To: xml-dev <email@example.com>
- Date: Wed, 28 Mar 2001 10:22:12 -0500
The CDATA Sections and the W3C Infoset thread was interesting to me because
we've wrestled with this on the DOM WG, we talked about it on SML-DEV a year
ago, and I just don't have a good sense of how all this works together.
There really is a good case to be made for having various bits of "syntax
sugar" in the XML serialization format, but it complicates the internal
information model very significantly. (I personally feel the same way about
mixed content ... but I won't go there today!).
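
To make the CDATA case concrete, here is a minimal sketch (not from the original thread) using Python's standard xml.dom.minidom, which happens to keep CDATA sections as a distinct node type. The helper names char_data_kinds and char_content are made up for illustration. The same character content comes back either way; only the sugar-aware view can tell the two documents apart.

```python
from xml.dom import minidom, Node

# DOM node types that carry character data.
_CHAR_DATA = {Node.TEXT_NODE: "text", Node.CDATA_SECTION_NODE: "cdata"}

def char_data_kinds(xml_source):
    """List the kinds of character-data nodes directly under the root element."""
    root = minidom.parseString(xml_source).documentElement
    return [_CHAR_DATA[n.nodeType] for n in root.childNodes
            if n.nodeType in _CHAR_DATA]

def char_content(xml_source):
    """Concatenate the root's character data, ignoring how it was written."""
    root = minidom.parseString(xml_source).documentElement
    return "".join(n.data for n in root.childNodes
                   if n.nodeType in _CHAR_DATA)

with_cdata = "<a><![CDATA[x < y]]></a>"
with_escape = "<a>x &lt; y</a>"

# Sugar-aware view differs; sugar-free character content is identical.
print(char_data_kinds(with_cdata), repr(char_content(with_cdata)))
print(char_data_kinds(with_escape), repr(char_content(with_escape)))
```

An API that exposes only char_content is simpler to program against; one that also exposes char_data_kinds has to carry the extra node type through every interface.
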
Conceptually, perhaps we have:

  - The "Syntax Sugar InfoSet" (SSIS) that exposes everything worth
    in the XML syntax... [even different quote characters and whitespace???]
  - The "Core Infoset" that is more or less what the W3C proposes.
  - The various flavors of the "Post Schema Validation Infoset":
      - Post-attribute declaration InfoSet (defaulted attributes applied)
      - The type-aware infoset (after some type system is applied)
      - The REAL post-schema-validation infoset (validity constraints

In this scheme, a non-validating XML parser conceptually operates in two
phases -- a "preprocessor" that expands the full syntax into the Syntax
Sugar Infoset (horrible name, I know!), and an InfoSet builder that produces
the canonical view with the syntax sugar dissolved. A validating parser (or
post-parse validator, or whatever) successively builds up the other Infosets
from DTD and/or Schema information. The DOM must be able to understand *all*
these infosets ... and map one to another. This is definitely ugly ... but
at least ugly in a layered way that lets developers and users deal with only
what they need to know. And let me emphasize that I'm talking about
abstract objects and interfaces; it's unlikely that anyone would IMPLEMENT
this with different concrete data structures or objects for each flavor of
InfoSet.

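
As a rough illustration of one step in that layering, the sketch below (an assumption-laden example, not anything proposed in the thread) uses Python's standard xml.parsers.expat binding to view the same document with and without DTD attribute defaults applied. The function name root_attributes and the sample document are hypothetical; specified_attributes is the documented pyexpat switch for reporting only attributes written in the instance.

```python
import xml.parsers.expat

def root_attributes(xml_source, apply_defaults):
    """Return the root element's attributes as a dict.

    apply_defaults=False approximates the view before attribute
    declarations are applied; True includes DTD-defaulted attributes.
    """
    parser = xml.parsers.expat.ParserCreate()
    # When set to 1, expat reports only attributes present in the instance.
    parser.specified_attributes = 0 if apply_defaults else 1
    seen = []

    def start_element(name, attrs):
        seen.append(dict(attrs))

    parser.StartElementHandler = start_element
    parser.Parse(xml_source, True)
    return seen[0]  # first start tag is the root element

# Hypothetical document: the internal subset declares a default "currency".
doc = ('<!DOCTYPE order [<!ATTLIST order currency CDATA "USD">]>'
       '<order id="7"/>')

print(root_attributes(doc, apply_defaults=False))
print(root_attributes(doc, apply_defaults=True))
```

Both views come from one parser and one data structure; the flag only selects which abstract layer the caller sees, which is the sense in which the layers need not be separate implementations.
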
Random comment: Much of the flavor of "Minimal XML" and "Common XML" can be
captured by the caveat: "Focus on the Core InfoSet and a canonical
serialization of it; avoid syntax sugar and APIs that expose the SSIS;
avoid the PSVI and the syntaxes that generate it." (And, not to re-start
any wars, "avoid" means "unless you have a compelling need for this stuff,
don't use it.").
Being kinda stupid, I'm just trying to make sense out of all the "expertise"
out there in a way that doesn't make my poor brain hurt too much. What am I
missing here? Does this add order to the chaos, or is it even more likely
to cause a newcomer to run away screaming from XML, back to the sane world
of Word/Excel, EDI, and SQL?