xml-dev - Re: The Power of Groves

Re: The Power of Groves
From: "W. Eliot Kimber" <eliot@isogen.com>
To: xml-dev@xml.org
Date: Sun, 06 Feb 2000 15:03:03 -0600
Steve Schafer wrote:

> I was rereading some old material on groves, and came across the
> following in a post by Eliot Kimber to comp.text.sgml (it was at the
> end of a paragraph discussing the definition of customized property
> sets for various kinds of data; the full context is available at
> http://www.oasis-open.org/cover/grovesKimber1.html):
> 
> "However, there is no guarantee that the property set and grove
> mechanism is capable of expressing all aspects of any notation other
> than SGML."
> 
> (Notes 440 and 442 in section A.4 of the HyTime spec say much the same
> thing.)
> 
> On the face of it, this is a perfectly sensible thing to say. At the
> same time, however, it is rather disturbing, because it suggests that
> there might exist data sets for which the grove paradigm is wholly
> unsuited. I would certainly hate to expend a lot of effort building a
> grove-based data model for a data set, only to discover part way
> through that groves and property sets simply won't work for that data
> set.

The point of this statement is that we could not at the time *guarantee*
that groves could express all aspects of a given notation. In fact I'm
quite sure that, just as with XML, there is no form of data for which a
usable grove representation could not be defined. We did not
have the time or skills to mathematically prove that groves could be
used for everything. I for one did not want to make an absolute claim I
couldn't prove.

It is likely that a grove-based representation would not be *optimal*
for many kinds of data.

But that doesn't really matter, because the purpose of groves is to
enable processing of data for specific purposes (addressing, linking,
transforming). A grove therefore does not necessarily need to express
all aspects of any particular notation, only those aspects that are
needed by the processing for which the grove has been constructed. Different
types of processing might even use different grove representations of
the same notation to suit their own specific needs.

It's important to remember that a grove is an abstraction of data (or
the result of processing data), not the data itself.

Also, whether or not a grove representation is useful or appropriate
depends as much on the implementation as it does on the details of
groves themselves. For example, it might not seem reasonable to
represent a movie as a grove where every frame is a node, but in fact a
clever grove implementation could make that representation about as
efficient as some more optimized format. You need not preconstruct all
the nodes, for instance; you can construct them only when necessary. Also, as
computers become faster, the cost of abstraction goes down for the same
volume of data. Ten years ago streaming media had to be superoptimized
just to be playable at all. Today we don't need that level of
optimization (what we have been doing is putting more and more
information into the same presentation time (MPEG movies) or doing more
and more compression (MP3)). 
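
To make the lazy-construction point concrete, here is a minimal Python
sketch of a movie exposed as a sequence of frame nodes that are built only
when something actually asks for them. Everything in it is an assumption
made for illustration: the names FrameNode and LazyMovieGrove, the
"index" and "image" properties, and the fake_decode callback standing in
for a real decoder. None of it comes from any grove standard or
implementation.

```python
class FrameNode:
    """One frame of the movie, exposed as a node with properties."""

    def __init__(self, index, raw):
        self.properties = {"index": index, "image": raw}


class LazyMovieGrove:
    """Behaves like a list of frame nodes but builds each node on first access."""

    def __init__(self, frame_count, decode_frame):
        self._frame_count = frame_count
        self._decode_frame = decode_frame  # hypothetical decoder callback
        self._cache = {}                   # index -> FrameNode built so far

    def __len__(self):
        return self._frame_count

    def __getitem__(self, index):
        if not 0 <= index < self._frame_count:
            raise IndexError(index)
        if index not in self._cache:
            # The node is constructed only now, when it is first needed.
            self._cache[index] = FrameNode(index, self._decode_frame(index))
        return self._cache[index]

    def nodes_built(self):
        return len(self._cache)


decoded = []  # records which frames the stand-in decoder was asked for


def fake_decode(i):
    decoded.append(i)
    return f"frame-{i}".encode()


movie = LazyMovieGrove(frame_count=100_000, decode_frame=fake_decode)
frame = movie[42]
print(frame.properties["index"], movie.nodes_built())
```

The grove claims 100,000 nodes, but only the one frame that was addressed
has been decoded and wrapped in a node.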

It's also important to remember that any form of data representation,
standardized or not, will be optimized for some things and non-optimized
for others. Groves were explicitly optimized for representing data that
is like SGML and XML. It happens that SGML and XML data is more
complicated and demanding than most other kinds of data, so it's likely
that anything that satisfies those requirements will ably satisfy the
requirements of most types of data, certainly most types of structured
data.

But it's no guarantee, at least not without some mathematical proof that
I am not qualified or able to provide (not being a mathematician).

> So the first question is this:
> 
> 1) Does a Universal Data Abstraction exist?
 
> Note that, like a Universal Turing Machine, such an abstraction need
> not be particularly efficient or otherwise well suited to any specific
> task. The only requirement is that it be universal in the sense of
> being capable of representing any conceivable data set (or at least
> any "reasonable" data set). (And no, I don't have a formal definition
> of what "reasonable" would mean in this context; all I can say is that
> the definition itself should be reasonable....) The real importance of
> a Universal Data Abstraction is that it would provide a formal basis
> for the construction of one or more Practical Data Abstractions.

First, let me stress the importance of the last sentence: that is, I
think, the key motivator for things like groves. I want things like the
DOM, which are extremely practical, but I want them bound to a formal,
testable, standardized abstraction.

I know of two standardized universal, implementation-independent data
abstractions: groves and the EXPRESS entities (ISO 10303 Part 11).  Both
of these standards provide a simple but complete data abstraction that
is completely divorced from implementation details. For groves it's nodes
with properties. For EXPRESS it's entities with attributes. Both can be
used to represent any kind of data structure. These two representations
have different characteristics and were designed to meet different
purposes. There is currently an active preliminary work item within the
ISO 10303 committee (ISO TC184/SC4) to define a formal mapping between
groves and EXPRESS entities so that, for example, one can automatically
provide a grove view of EXPRESS data or an EXPRESS view of groves.
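
As a rough illustration of "nodes with properties", the following Python
sketch uses one generic node type to hold both an XML-like element and a
non-XML, VRML-like shape. It is a simplification under stated assumptions:
the Node class, the subnodes and walk helpers, and the property names
("attributes", "content", "geometry", and so on) are chosen for the
example and are not the property set defined by any standard.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_class: str  # e.g. "element", "shape"; illustrative class names
    properties: dict = field(default_factory=dict)


def subnodes(node):
    """Yield nodes held directly in properties, singly or in lists."""
    for value in node.properties.values():
        if isinstance(value, Node):
            yield value
        elif isinstance(value, list):
            yield from (v for v in value if isinstance(v, Node))


def walk(node):
    """Depth-first traversal of a node and everything below it."""
    yield node
    for child in subnodes(node):
        yield from walk(child)


# An XML-like element: its structure is expressed purely as properties.
para = Node("element", {
    "gi": "para",
    "attributes": [Node("attribute", {"name": "id", "value": "p1"})],
    "content": [Node("data", {"chars": "Hello"})],
})

# A non-XML object uses the same abstraction with different node classes.
shape = Node("shape", {
    "geometry": Node("box", {"size": (2.0, 2.0, 2.0)}),
    "color": (1.0, 0.0, 0.0),
})

print([n.node_class for n in walk(para)])
print([n.node_class for n in walk(shape)])
```

The same walk function handles both structures because it depends only
on the node-and-property abstraction, not on either notation.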

XML *appears* to be a universal data abstraction, but it's not quite,
because it is already specialized from nodes with properties to
elements, attributes, and data characters. This is why Len's recent
comment about an XML representation of VRML not working well with the
DOM is not at all surprising. Of course it doesn't. The DOM reflects the
data model of XML (elements, attributes, and data characters) not the
data model of VRML. This is always the case for XML.

I have observed that the world desperately needs a universal data
abstraction. I think that one of the reasons that XML has gotten so much
attention is that it *looks like* such an abstraction (even though it's
not).

I also don't think it really matters what the abstraction looks like in
detail--what's important is that we agree on what it is as a society.
Once we have that we can stop worrying about stupid details like how to
specify the abstract model for XML or RDF or XLink or XSL or what have
you: you'll just do it.  

It doesn't matter whether we use groves as is or EXPRESS entities as is
or make something up that we can all agree on. What's important is that
we do it and stick to it.  I think that groves are a pretty good first
cut, but we could certainly improve on them. The advantage that groves
have at the moment is that they are standardized, they have been
implemented in a number of tools, including James Clark's Jade, HyBrick
from Fujitsu, the Python grove stuff from STEP Infotek, my PHyLIS tool,
TechnoTeacher's GroveMinder product, Alex Milowski's now-unavailable
code he wrote before he got bought by CommerceOne, and others I'm sure.
The grove model satisfies immediate requirements well, has at least two
useful standards built around it, and is a reasonably good base for future
refinement (about to get under way with the DSSSL 2 project being led by
Didier Martin).

> Assuming that the answer is "yes" (and I have no real justification
> other than optimism to believe that it is), the second question
> follows immediately:
> 
> 2) Does the grove paradigm, or something similar to the grove
> paradigm, constitute a Universal Data Abstraction?

Yes, obviously.
 
> 3) Does there exist any "reasonable" data set for which the grove
> paradigm inherently cannot provide an adequate representation?

You'd have to define adequate, but I don't think so. Groves obviously do
hierarchical stuff quite well. Relational tables are just shallow
hierarchies. Streaming media is more of a problem, but even it can be
decomposed into groups of frames or data units (e.g., movie goes to
scenes, scene goes to frames, frames carry sound and image properties).
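
The "shallow hierarchy" view of a relational table can be sketched in a
few lines of Python. This is only one possible mapping, offered as an
assumption-laden example: the Node class, the table_to_grove and depth
helpers, and the "table", "row", and "cell" node classes are all invented
here for illustration.

```python
class Node:
    """A generic node: a class name plus named properties."""

    def __init__(self, node_class, **properties):
        self.node_class = node_class
        self.properties = properties


def table_to_grove(name, columns, rows):
    """Map a relational table onto three levels: table -> row -> cell."""
    row_nodes = []
    for row in rows:
        cells = [Node("cell", column=col, value=val)
                 for col, val in zip(columns, row)]
        row_nodes.append(Node("row", cells=cells))
    return Node("table", name=name, columns=list(columns), rows=row_nodes)


def depth(node):
    """Number of node levels from this node down through list-valued properties."""
    children = [v
                for values in node.properties.values()
                if isinstance(values, list)
                for v in values
                if isinstance(v, Node)]
    return 1 + max((depth(c) for c in children), default=0)


grove = table_to_grove("parts", ("id", "name"), [(1, "bolt"), (2, "nut")])
second_name = grove.properties["rows"][1].properties["cells"][1].properties["value"]
print(second_name, depth(grove))
```

However many rows the table has, the hierarchy never gets deeper than
three levels, which is what makes it "shallow".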
 
> When attempting to answer this third question, it is important to
> avoid getting caught up in unwarranted topological arguments. The
> topology of groves may not map onto the topology of a particular data
> set, but that does not mean that that data set is unrepresentable as a
> grove. Consider XML: An XML document consists of a linear, ordered
> list of Unicode characters, yet the XML format is quite capable of
> representing any arbitrary directed acyclic graph.

This is a very important point and it's well worth stressing again. Any
"universal" data abstraction will be suboptimal for many types of data
or data structures. That's what implementations are for, getting the
optimization characteristics needed by specific applications or use
environments. 

The main purpose, in my mind, for a universal abstraction like groves is
to enable reliable addressing (because you have some common basis on
which to define and predict the structures of things) and to enable the
creation of data access APIs that may be individually optimized for
different use scenarios but that are all provably consistent because
they all reflect the same underlying data model.
 
> ========
> 
> On a somewhat related note, I've noticed that in discussions regarding
> the Power of Groves, the arguments by the proponents seem to fall into
> two distinct groups. On the one hand, some people see groves as being
> quite universal in their applicability. On the other, some people talk
> about groves almost exclusively within the context of SGML, DSSSL
> and/or HyTime. As an outsider and relative latecomer to the party, I
> find it difficult to determine whether this dichotomy of viewpoints is
> real, or merely reflects the differences in the contexts in which the
> discussions have taken place. If the schism _is_ real, it would be
> helpful if those sitting on either side of the fence could add their
> thoughts regarding why the schism is there, and why the people on the
> other side are wrong. :)

I think it's largely a function of context. But it's important to
remember that groves were defined as part of a larger standards
framework of which SGML, DSSSL, and HyTime are the chief parts. There is
a sense in which these three standards cover pretty much all of data
representation and access at the abstract level (as opposed to the
implementation level, where we rely on things like APIs, programming
languages, communications protocols, and other building blocks of
working systems).  But groves certainly have general application outside
the use of the DSSSL and HyTime standards. It's just that the ability to
implement those standards is what has motivated most of us who have
implemented groves.

Because groves can be applied to any kind of data (per the discussion
above) it follows that the DSSSL and HyTime standards can be applied to
any kind of data. That is, I can do generalized, consistent linking,
addressing, styling, and transforming of anything I can put into a
grove, which is anything. That covers almost all of what one needs to do
to data in an application. This provides tremendous leverage once you
have the layers of infrastructure built up.

> An example of why I am concerned by this question is given by the
> property set definition requirements in section A.4 of HyTime. The
> definition of property sets is given explicitly in terms of SGML. That
> is, a property set definition _is_ an SGML document. But it seems to
> me that if property sets have any sort of widespread applicability
> outside of SGML, then a property set definition in UML or IDL or some
> other notation would serve just as well (assuming that those other
> notations are sufficiently expressive; I'm fairly confident that UML
> is, but I'm not so sure about IDL).

I agree completely. That is one reason we're working on rationalizing
EXPRESS and groves. As part of that effort, we have created EXPRESS
models for the SGML and HyTime property sets, providing an example of
using a more generalized formal modeling language to specify the data
models the groves reflect. You could, of course, do the same thing with
UML and define a generic algorithm for going from a UML model to a grove
representation of the data objects conforming to that model. One key
problem we ran into with EXPRESS (and would run into with UML) is that
groves have the explicit and essential notion of name spaces (for
addressing nodes by name, not disambiguating names). EXPRESS has no
formal notion of grove-style name spaces, nor does UML. You can define
the appropriate constraints using population constraints (OCL in UML),
but it's not nearly as convenient as in a property set definition
document.
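
The kind of name space meant here (addressing nodes by name within a
list, with names unique in that list) can be approximated by the
following Python sketch. It is a deliberately simplified stand-in, not
the mechanism defined in the HyTime property set requirements: the
NamedNodeList class, its name_property parameter, and the use of plain
dicts as nodes are assumptions made for the example.

```python
class NamedNodeList:
    """An ordered list of nodes that can also be addressed by name.

    Names must be unique within the list; that uniqueness rule is the
    name space constraint this sketch enforces.
    """

    def __init__(self, name_property):
        self._name_property = name_property  # which property supplies the name
        self._nodes = []
        self._by_name = {}

    def append(self, node):
        name = node[self._name_property]
        if name in self._by_name:
            raise ValueError(f"duplicate name in name space: {name!r}")
        self._by_name[name] = node
        self._nodes.append(node)

    def __getitem__(self, key):
        # Integer keys address by position; any other key addresses by name.
        if isinstance(key, int):
            return self._nodes[key]
        return self._by_name[key]

    def __len__(self):
        return len(self._nodes)


attributes = NamedNodeList(name_property="name")
attributes.append({"name": "id", "value": "p1"})
attributes.append({"name": "lang", "value": "en"})
print(attributes["lang"]["value"], attributes[0]["name"])

try:
    attributes.append({"name": "id", "value": "p2"})
except ValueError as err:
    print("rejected:", err)
```

In a modeling language without this notion, the uniqueness check in
append is the part that has to be written out as an explicit constraint.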

> Of course, it can be argued that _some_ notation had to be used, so
> why not SGML? My response to that is that I believe that the
> mathematical approach of starting with a few extremely basic axioms
> and building on those as required to develop a relevant "language" for
> expressing a model would be far superior, as it would allow people to
> fully visualize the construction of the property set data model (or
> "metamodel," if you prefer), without getting bogged down in arcane
> SGML jargon. After all, SGML can hardly be described as minimalist.

Again, I couldn't agree more. We have what we have largely because we
were in a hurry and it was expedient (and because it's what James Clark
did and, at the time, the rest of the editors didn't have anything
better to offer). It's too bad that we didn't appreciate the existence
or applicability of EXPRESS at the time, because if we had we very well
might have used it. 

But in any case, it would be easy enough to revise the spec to provide a
more complete and modern formalism. There's no particular magic to the
property set definition document except that, being in SGML/XML form, it
was easy for us to process and work with. 

> (An aside: I believe that a lot of the resistance to acceptance of
> SGML and HyTime has its basis in the limitation of identifiers to
> eight characters, leading to such incomprehensible abominations as
> "rflocspn" and "nmndlist." Learning a completely new body of ideas is
> hard enough without having to simultaneously learn a foreign--not to
> mention utterly unpronounceable--language.)

Almost certainly true. We felt that we had an obligation for backward
compatibility with legacy SGML, which meant that we had to have names
that could be used with the reference concrete syntax.  Not sure that we
could have done otherwise. It's a historical legacy just like 525 scan
lines for NTSC TV signals. In practice it probably wouldn't have caused
anyone harm if we had required support for longer names. 
 
Cheers,

Eliot
Follow-Ups:
- Re: The Power of Groves
  - From: "Steven R. Newcomb" <srn@techno.com>
- Re: The Power of Groves
  - From: Len Bullard <cbullard@hiwaay.net>
References:
- The Power of Groves
  - From: Steve Schafer <pandeng@telepath.com>