Re: [xml-dev] Fixing what's broke

This is a note about simplicity and power, and associated tradeoffs for future XML development.

Objectives

Simplicity is an easy objective for all to agree upon. However, simplicity lies largely in the eye of the beholder. Some find COBOL simple; others, LISP. Albert Einstein found simplicity in the General Theory of Relativity.

A view of simplicity can be more precisely defined in terms of what is to be accomplished.

If the goal is to deal with just data content, particularly if it is simply text, then, as some would suggest, an XML subset can be very simple indeed. JSON is somewhat more complicated than that, but it is a step towards considerably more power.

If the goal is to deal with specifications about data content, such as XHTML, then more is required.

Finally, to find a way of bringing the power of many existing XML capabilities to a wide audience of application developers, a still different view of simplicity is needed. This note addresses a question much larger than the base XML language alone: the integration of consistent support for all the useful standards that derive from it.

So, I contend that any approach to “simplification” is meaningful only in the context of some goal to be supported.

Approach

In simplest terms here, I would judge simplicity by readability, and power by capabilities to develop applications for the Web.

Why an application focus? Simply put, application developers are the users of XML, along with the piecemeal XML solutions to issues of data presentation, data declaration and validation, data content, data manipulation, syntax, and, fundamentally, modularity. And the current capabilities fall far short of the realizable potential for fundamental enhancement of the application development paradigm.

Of course, the stakeholders who cannot be ignored are the commercial browser developers. It is hard to tell what motivates them, other than a reluctance to pursue new ventures when something else is going well for them. However, it would seem that, of the possible motivations, a foundation that integrates and fosters application development should be of interest.

The approach that I am advocating is to view XML as a specification language for models, as a powerful tool for Web application development. This approach needs to analyze carefully the capabilities that have been developed over the last decade, to abstract from them, and to understand basic simplicities. These notions can then be generalized to provide even more powerful capabilities with no added cost in complexity. And current XML standards are full of such starting points.

That said, I also believe that much of what is being debated on this list for minimal XML foundations can be a significant part of this effort, especially for the crowd focused on data content.

Although this might seem to be a “new XML”, its focus is largely on consolidating and systematizing what exists already, thereby laying foundations for more rapid growth of further extensions and capabilities.

In particular, however, I tend to be extremely suspicious of incremental attempts to graft simplicity on top of complexity. These can appear seductively useful in limited contexts, but in the long run they generally tend to produce redundancy and inconsistency with the broader base. (To illustrate this, see the notes on comments in this list.)

Some Observations

XML started as a markup language for text, with an angle bracket notation useful for text markup.

· In terms of syntax - embedding markup in text with an angle bracket syntax is natural, while embedding text in an angle bracket syntax is, perhaps we can agree, at least awkward.

· In terms of semantics - XML has evolved, unsystematically and perhaps even chaotically, to provide an impressive and powerful set of models that support various application development capabilities. It is just that they are hard to understand and use.

If the focus is to be on the user, i.e., the application developer, then what is needed is a language for specification of models, especially for physical and logical data structures, for presentation, for control and for communication.

This suggests what would appear at first to be a somewhat radical notion: a recognition that even the terminology derived from the concepts of a markup language has long been obsolete, and that it should be reinvented to reflect foundations for a language for models. For instance,

· “document” is at best an awkward reference to the more general concept of an identifiable resource that can serve a data stream (not necessarily of text) that can be parsed.

· “attribute” and contained “element” have arbitrary distinctions that are more specific and constrained than the fundamental concepts of “object”, “property” and “behavior”.

· “namespaces” should imply no more than a modular scope of unique names that can be referenced with some simple extensions, such as a “using” statement. (See more below.)

· “schema” should be dropped as such, and replaced with other constructs such as “metadata”, “declarations and constraints” needed for parsing and validation, and sets of “properties” for processing (such as presentation).

Secondly, any development needs to provide a clear separation of concerns:

· Syntax and semantics are separate issues, particularly if it is recognized that infosets that work as objects can be a middle ground.

· Data content and specifications for the use of data content are different.

In particular, the above two points combine with the observation that there is no single syntax possible for data content.

This is partly for compatibility reasons. But fundamentally, data exists on the Web in many forms and representations, and all need to be accessible. A simple example is data that results from a SQL query. More general is the possibility for support for application specific parsers that can extract useful data from complex documents. And as advocated in this list, there are contexts where a minimal subset is useful.

· Applications rely on models for logical and physical data structures, for presentation, for communication, and for control.

Specification models, as evidenced by HTML for presentation, provide impressive capabilities for application development without the need, or with minimal needs, for procedural code.

These models are clearly separate but interdependent. What is also separate is what they have in common. This implies an approach to generalize from these and other models, and thereby to discover the fundamental capabilities of a language with support for all of them.

· Modularization support in XML standards has evolved in strange and wonderful ways.

The fundamental capability is for a set of specifications that can be easily used and integrated with each other. Notes below suggest that this can be accomplished in a more complete and straightforward manner than through existing specifications for namespaces, CSS, XLink, etc.

· A specification language is useful for programmers and also for those who have no programming experience. This implies that any specification should have a “primer” that describes a complete set of capabilities that are easy to use and understand, leaving more powerful capabilities to other documentation.

Thirdly, extensibility is fundamental. If for no other reason than compatibility, the new must be extensible so that it can be easily used and integrated with existing data and specifications. Several starting points for extensibility include:

· Given basic support of fundamentals in a new language, old syntax and semantics can be largely convertible to it.

· A module can specify its own parser, a semantic analyzer and a processor (e.g., intelligent “CDATA”).

· An element can have a related executable library to support properties and behavior.

This, along with reasonable “constraint” expressions, can allow many new standards to evolve without dependencies on new browser support.

Starting Points

Some starting points that both simplify and provide power:

· Basic names need to be simple, uniform and uncluttered with punctuation.

Thus they can consist of alpha characters, numeric digits and the ever popular underscore. Other characters such as & : ” ‘ and . can be used to create name expressions in specific contexts.

Some conventions, such as “camel notation”, are useful for applications and can be used consistently in specifications.

Some restrictions, such as a leading underscore only for “key words”, might be necessary.

Existing names can be escaped, typically with quotes, or, more generally, with some construct such as - &Name( existing name ) .
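
As a rough illustration of the naming rule above, the following Python sketch checks whether a string is a basic name and wraps anything else in the escape construct. The function names, the restriction to ASCII letters, the treatment of a leading underscore as reserved, and the exact spacing of the `&Name( ... )` output are all assumptions made for illustration, not a proposed definition.

```python
import re

# Assumed reading of the rule: a basic name is made of ASCII letters,
# digits, and underscores, with no other punctuation.
BASIC_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_basic_name(name: str) -> bool:
    """Return True if the name uses only letters, digits, and underscores."""
    return BASIC_NAME.fullmatch(name) is not None


def is_reserved(name: str) -> bool:
    """Assumed convention: a leading underscore marks a key word."""
    return is_basic_name(name) and name.startswith("_")


def escape_name(name: str) -> str:
    """Leave basic names alone; wrap any other existing name in the escape construct."""
    if is_basic_name(name):
        return name
    return f"&Name( {name} )"
```

Under these assumptions an existing qualified name such as `xs:element` would not count as a basic name and would need the escape, while `orderTotal` or `item_1` would pass through unchanged.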

· Standard base data types from a variety of sources need to be consolidated.

· A fundamental data type for reference, which can be specialized in a variety of ways, is critical.

In particular a reference can be a name (including a URI), a link, an expression, such as path or query, a function that returns a reference, etc.

The syntax needs to distinguish the reference from the referent.

· Parameters are typed values which can derive from either the specification context or referenced data values or both.

Parameters can be generally substituted for any syntactic unit.

· Adaptation support, especially for non-programmers.

Examples range from parameters in configuration files, which allow specifications to be easily adapted to particular environments or users, to skeletons such as an interactive table that mimics what was once called “query by example”, to “wizards” that prompt users to complete a specification.

· Basic expressions, including arithmetic, comparison and boolean expressions, need to be consolidated.

Extended expressions are useful for selection, query, etc.

Reference expressions can look somewhat like

name.node . . . node

where node is

name | link | name(qualifier) | name(qualifier)(selected_contents) |
join(condition) | node(function) | node(pre-function, post-function) |
function // (that returns a node)

The pre and post functions above support navigation with node entry and exit functions for the traversal.

The result of the above can be viewed as a hierarchy or a table join.
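
The node entry and exit functions mentioned above can be sketched as a depth-first walk that calls one function on the way into each node and another on the way out. This is only a minimal Python sketch: the `Node` class, `traverse`, and `trace` are hypothetical names, and the sketch does not attempt to parse the reference expression notation itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A hypothetical tree node with a name and ordered children."""
    name: str
    children: List["Node"] = field(default_factory=list)


def traverse(node: Node,
             pre: Callable[[Node], None],
             post: Callable[[Node], None]) -> None:
    """Depth-first walk: call pre on node entry and post on node exit."""
    pre(node)
    for child in node.children:
        traverse(child, pre, post)
    post(node)


def trace(root: Node) -> List[str]:
    """Record the entry and exit events produced by a traversal."""
    events: List[str] = []
    traverse(root,
             lambda n: events.append("enter " + n.name),
             lambda n: events.append("exit " + n.name))
    return events
```

For a hypothetical `order` node with children `item` and `total`, `trace` would record entering `order`, then entering and leaving each child in turn, and finally leaving `order`.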

· Templates are syntactic units that can be parameterized with substitution and selection.
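
One way to picture a template with substitution and selection is a row pattern that is filled in only for the data items that pass a condition. The Python sketch below uses the standard library `string.Template`; the row markup, the `render_rows` function, and the `name` and `price` fields are hypothetical and chosen only for illustration.

```python
from string import Template

# A hypothetical template: one table row with two parameters.
ROW = Template("<tr><td>$name</td><td>$price</td></tr>")


def render_rows(items, min_price=0.0):
    """Select items at or above min_price, then substitute each into the row template."""
    selected = [item for item in items if item["price"] >= min_price]
    return "\n".join(
        ROW.substitute(name=item["name"], price=item["price"])
        for item in selected
    )
```

Here the selection is a simple price filter and the substitution fills two named parameters; a real specification language would presumably allow richer selection expressions.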

· Namespaces need not exist as such.

A “namespace” simply specifies a property of constructs, such as data types and packages, that requires the names in the context to be unique, unless explicitly overridden. Context names include those that are inherited and those that are explicitly included.

Namespaces have names and aliases that allow them and their context names to be referenced. These names can be imported and aliased with “using” statements.
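
A minimal Python sketch of this idea follows: a named scope that insists on unique names unless an override is explicit, and a “using” operation that imports a name from another scope under an optional alias. The `Scope` class and its method names are hypothetical, and inheritance of context names is left out.

```python
from typing import Any, Dict, Optional


class Scope:
    """A named scope whose names must be unique unless explicitly overridden."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._names: Dict[str, Any] = {}

    def declare(self, name: str, value: Any, override: bool = False) -> None:
        """Add a name; a clash is an error unless override is requested."""
        if name in self._names and not override:
            raise KeyError(f"{name!r} is already declared in {self.name!r}")
        self._names[name] = value

    def lookup(self, name: str) -> Any:
        """Return the value bound to a name in this scope."""
        return self._names[name]

    def using(self, other: "Scope", name: str, alias: Optional[str] = None) -> None:
        """Import one name from another scope, optionally under an alias."""
        self.declare(alias or name, other.lookup(name))
```

With this sketch, a scope that already declares `circle` can still import another scope's `circle` by giving it an alias such as `shapeCircle`; importing it without an alias raises an error because the name would no longer be unique.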

· A module is a set of specifications that can be referenced and used in a variety of ways.

Modules are namespaces. Modules include data types, data structures, and packages which contain data and/or specifications.

Modules can be extended globally (such as for metadata) or within a local context (such as to provide presentation properties).

Modules can be referenced, specialized, parameterized, nested, created, merged, extended, restricted, extracted from, transformed, etc.

· Packages are sets of specifications.

A large application would probably have libraries of packages of similar types, such as data types, available resources, data units, metadata, presentation structures, validity specifications, etc. These would then be integrated with sets of packages that use them to specify particular functional capabilities. Finally, there would be packages that combine functional units into applications that can be parameterized for particular environments and users.

Also packages could be organized into hierarchical models from conceptual (standardization level) to abstract (application models) to concrete (with implementation details).

· Data types specify fundamental properties and behavior, which define the application concept they model.

Data types also specify general properties and behavior, which are used to adapt them to specific environments, such as messages, storage, and presentation.

Properties and behavior can be extended explicitly to support inheritance which allows polymorphism.

Properties and behavior can have restrictions. This creates “twins” that can be substituted for their parents but not for each other (e.g., a circle is a restricted ellipse, but both are shapes).

Fundamental properties and behavior can be implemented in executable libraries. Behavior can include operators for expressions.
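
The circle and ellipse example can be sketched in Python as follows. This reflects one reading of the “twins” idea, which is an assumption on my part: the circle is defined by restricting an ellipse to equal axes, both can stand in for their common parent `Shape`, but neither is accepted where the other is required. The class and field names are hypothetical.

```python
import math
from dataclasses import dataclass


class Shape:
    """Common parent: anything that can report an area."""

    def area(self) -> float:
        raise NotImplementedError


@dataclass(frozen=True)
class Ellipse(Shape):
    semi_major: float
    semi_minor: float

    def area(self) -> float:
        return math.pi * self.semi_major * self.semi_minor


@dataclass(frozen=True)
class Circle(Shape):
    """A restricted ellipse: both axes are constrained to the same radius."""
    radius: float

    def as_ellipse(self) -> Ellipse:
        # The restriction made explicit: semi_major == semi_minor == radius.
        return Ellipse(self.radius, self.radius)

    def area(self) -> float:
        return self.as_ellipse().area()


def total_area(shapes) -> float:
    """Accepts any Shape, so circles and ellipses substitute for the parent."""
    return sum(shape.area() for shape in shapes)
```

The design choice here is that `Circle` does not inherit from `Ellipse`, so code that needs to change the two axes of an ellipse independently cannot be handed a circle by mistake, while code written against `Shape` works with both.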

· Application models support a Model / View / Control Paradigm.

Presentation models include HTML, Open Office, etc., in an integrated framework.

Data models support generic operations on physical and logical data structures and elements for create, insert, remove, copy, index, sort, compare, query, delete, etc. Augmented operations might include transform or execute.

Control models provide execution frameworks, for instance pipe, work flow, and state machines.

Communication models support generic protocols to allow interactions and collaboration.

· A Process is an abstraction that performs an action (i.e., a response to an event, with specifications or scripts), either synchronously or asynchronously, in a specified (and possibly restricted) environment.

· Syntax can be considerably enhanced based on parameterized templates (to reduce redundancy), JSON- or Java-like structure (to improve readability), and through simplification of existing constructs such as namespaces.

· Tools are always an issue for new XML capabilities. Basically, tools provide disparate and, by their very nature, limited support for various XML capabilities. Thus, they can be a constraint on introduction and use of new capabilities.

However, the above suggests that fundamental tools can be a direct extension of the XML base. In particular, specifications are data and, for a tool, specifications are the data content. Given the capability for an interactive environment to have a native capability to manipulate graphic structures, it should be easy to map sets of specifications to diagrams for review, edit, and test.

Tool makers, however, do not get left out as they can still support total application integration and process integration. They just get more to work with.

Summary

It is the contention here that very powerful capabilities can be developed in a significantly simpler context than existing capabilities, that this requires new language features, and that these features extend and can be made compatible with what exists (or is proposed and lacks implementation).