xml-dev - Specifying formal semantics in XML languages

To: xml-dev@lists.xml.org
Subject: Specifying formal semantics in XML languages
From: peter murray-rust <pm286@cam.ac.uk>
Date: Tue, 20 Jun 2006 09:07:53 +0100
Cc: henry Rzepa <h.rzepa@imperial.ac.uk>
[I do not, I'm afraid, read xml-dev regularly so please forgive me if 
this covers recent discussions or I am simply out of touch].

I am struggling with how to continue to formalize the semantics of 
Chemical Markup Language (CML). The issues are generic and no 
chemistry is required to understand them. They bear some relationship 
to "microformats" but involve issues of strong typing and code 
generation, so "context-free objects" is more descriptive.

Currently CML (in XSD) consists of about 100 elements, 100 attributes 
and about 100 simpleTypes (e.g. elementTypeType is one of 117 symbols 
("H", "He", ...) and angleType is an xsd:double in the range 
0-180).  The current components (in XSD syntax) are at:
http://cml.cvs.sourceforge.net/cml/schema25/
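
To make the two simpleType examples concrete, here is a minimal Python sketch of the checks they imply. The names parse_element_type and parse_angle are hypothetical, the symbol set is truncated for brevity, and the 0-180 range is assumed to be inclusive at both ends; none of this is generated from the real CML schema.

```python
# Hypothetical stand-ins for two CML simpleTypes; not derived from the real schema.
# Truncated on purpose: the real enumeration has 117 symbols.
ELEMENT_SYMBOLS = {"H", "He", "Li", "Be", "B", "C", "N", "O"}


def parse_element_type(text: str) -> str:
    """Accept only an enumerated element symbol (cf. elementTypeType)."""
    if text not in ELEMENT_SYMBOLS:
        raise ValueError(f"not an element symbol: {text!r}")
    return text


def parse_angle(text: str) -> float:
    """Accept a double in the range 0-180, assumed inclusive (cf. angleType)."""
    value = float(text)
    if not 0.0 <= value <= 180.0:
        raise ValueError(f"angle out of range 0-180: {value}")
    return value
```

A generated accessor could call such a function before storing the value, so that the type constraint is enforced in code as well as at validation time.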

Chemistry is a largely context-free discipline in that we can locate 
(say) <molecule> in many places in a document. There are a very 
large number of ways of using CML components but the major ones in 
current practice are:
* compound documents (e.g. scientific publications) composed of a 
range of markup languages (XHTML, SVG, MathML, CML, etc.). Many 
publishers are now actively starting to adopt this approach. A key 
approach is that data and text are mixed ("datument") so that we can 
transmit data in primary publications. Machines can now start to 
understand scientific publications.
* storage of (fairly) well-defined data objects (molecules, spectra, 
etc.) in databases
* management of the chemical computational process including formal 
semantics of objects in the program.

This is sufficiently broad that it is impossible to create a 
traditional XSD schema which allows for all uses. Since CML continues 
to evolve we cannot guess a complete schema description and then 
constrain people to use it. However since almost all CML must be 
processed by specialist software at some stage we require conformance 
to a specification. Moreover there are no user communities who 
require all CML functionality at once and so we assume that 
particular groups will use subsets of the language.

We have used XSD rather than Relax because the conventional world 
feels happier with it (sorry),  and because we have to provide 
software support for CML - there are more reusable components 
generally available for XSD. However we only use a small subset of 
XSD syntax (basically the stuff I can understand), limited to:
* definition of elements containing explicit complexType and 
references to element children
* definitions of types
* definition of attributes
There is no single schema, but users can choose which subset of CML 
elements they wish to use. This is simply done by concatenation of 
the components (we deliberately do not use xsd:import).
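
The concatenation step can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes that each component is a standalone fragment holding one top-level declaration, and the COMPONENTS table and build_schema function are hypothetical names rather than part of the CML tooling.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD_NS)

# Hypothetical component fragments keyed by name; the real components are separate files.
COMPONENTS = {
    "molecule": f'<xsd:element xmlns:xsd="{XSD_NS}" name="molecule"/>',
    "atom": f'<xsd:element xmlns:xsd="{XSD_NS}" name="atom"/>',
    "angleType": (
        f'<xsd:simpleType xmlns:xsd="{XSD_NS}" name="angleType">'
        '<xsd:restriction base="xsd:double"/></xsd:simpleType>'
    ),
}


def build_schema(names):
    """Concatenate the chosen components under a single xsd:schema root."""
    schema = ET.Element(f"{{{XSD_NS}}}schema")
    for name in names:
        schema.append(ET.fromstring(COMPONENTS[name]))
    return schema
```

Calling build_schema(["molecule", "angleType"]) yields a schema element containing just those two declarations, which can then be serialised with ET.tostring.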

The specification is used for the following:
* validation of documents
* (complete semantic) documentation of the language (IOW the 
specification should be a machine-understandable description of the 
language.) It is inspired by the ideas of literate programming and 
will use <appinfo> etc. This is not complete and this mail is to seek guidance.
* generation of code. This is critically important as all elements 
have to have classes, and all attributes have to have typed accessors 
and mutators. Although we could use Castor, XMLBeans, etc. for Java 
we have to support Python, C++ and FORTRAN, so I have written our 
own code generator to provide this.

Of course there is much chemical functionality that is not provided 
by a semantic specification and this has to be handcoded on a 
per-element basis.

At present XSD is used for the specification of CML although we have 
also attempted to use schematron and XSLT-like expressions for some 
of the constraints that cannot be expressed in XSD. (XSD is good for 
formal documents such as tax-forms but it is poor for the evolution 
of a scientific language). Currently we find:
* most of the datatyping can be done with simpleType and this works 
well - there is no reason to change most of this
* we find little use (at present) for re-usable complexTypes.
* XSD content models are effectively useless for validation. They 
rapidly become enormous for some elements and no-one would use them.
* there are many simple relationships that cannot be expressed in XSD.

There are no cases where we insist on the order of child elements (I 
can never remember the order anyway so it's unfair to require others 
to). There are also very few cases where the cardinality of children 
matters (wherever we have tried these we come up with counter 
examples). We forbid mixed content in CML and so elements are of 3 types:
* empty
* one or more element children
* one text child

(If CML requires running marked up text we use <xhtml:div> or similar)
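
The three content types, and the ban on mixed content, can be checked mechanically. The following Python sketch uses the standard xml.etree.ElementTree module; content_kind is a hypothetical name, and treating whitespace-only text as absent is an assumption.

```python
import xml.etree.ElementTree as ET


def content_kind(elem):
    """Return 'empty', 'elements' or 'text'; raise ValueError on mixed content."""
    has_children = len(elem) > 0
    # Text directly inside elem is its own .text plus the .tail of each child.
    chunks = [elem.text or ""] + [child.tail or "" for child in elem]
    has_text = any(chunk.strip() for chunk in chunks)
    if has_children and has_text:
        raise ValueError(f"mixed content in <{elem.tag}>")
    if has_children:
        return "elements"
    return "text" if has_text else "empty"
```

A real checker would apply this only to elements in the CML namespace, since foreign containers such as <xhtml:div> are exactly where running text is allowed.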

Currently the attributes and content models are used to generate 
code. Thus <propertyList> can have (say) a title attribute, and 
children such as <metadataList> and <property>. This generates code such as:

PropertyList.setTitle(String title)
MetadataList PropertyList.getMetadataList()
PropertyList.add(Property)

This is enormously valuable when programming as it helps to ensure 
strong typing and provides prompts and checking when writing code. 
Therefore we continue to need a specification that describes the 
relationship of one element to another and, where appropriate, 
supports the generation of code.
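
As an illustration of the idea (not the actual CML generator), the sketch below emits Python source for one element from a small hand-written description. The SPEC dictionary, the generate_class function and the addXxx naming are all hypothetical, and the child adders are simplified: they do not check the class of the child.

```python
# Hypothetical, hand-written description of one element; a real generator
# would derive this from the schema components.
SPEC = {
    "name": "propertyList",
    "attributes": {"title": "str"},
    "children": ["metadataList", "property"],
}


def class_name(xml_name):
    """propertyList -> PropertyList."""
    return xml_name[0].upper() + xml_name[1:]


def generate_class(spec):
    """Return Python source for a class with typed accessors and child adders."""
    lines = [
        f"class {class_name(spec['name'])}:",
        "    def __init__(self):",
        "        self.attributes = {}",
        "        self.children = []",
    ]
    for attr, type_name in spec["attributes"].items():
        cap = class_name(attr)
        lines += [
            f"    def set{cap}(self, value: {type_name}) -> None:",
            f"        if not isinstance(value, {type_name}):",
            f"            raise TypeError('{attr} must be {type_name}')",
            f"        self.attributes['{attr}'] = value",
            f"    def get{cap}(self) -> {type_name}:",
            f"        return self.attributes['{attr}']",
        ]
    for child in spec["children"]:
        lines += [
            f"    def add{class_name(child)}(self, child) -> None:",
            "        self.children.append(child)",
        ]
    return "\n".join(lines) + "\n"
```

Printing generate_class(SPEC) shows a PropertyList class with setTitle, getTitle, addMetadataList and addProperty methods, loosely mirroring the Java-style signatures above.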

Here are some examples of relationships which I currently need to 
express and which should, if possible, be enforceable in code.

* element must have a parent from (list...)
* element may have parent from (list...)
* element must not have parent from list

* element may have children from (list) (and this will generate code)
* element must not have children from list.
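
One way to picture the parent and child rules is as a table that a generic checker interprets. The Python sketch below is a guess at such a design: the RULES table, its keys and the element lists in it are illustrative only and are not CML's real constraints.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table; names and lists are illustrative, not CML's real rules.
RULES = {
    "atom": {"required_parents": {"atomArray", "molecule"}},
    "molecule": {
        "forbidden_parents": {"atom"},
        "allowed_children": {"atomArray", "atom", "molecule"},
    },
}


def check_structure(root):
    """Return a list of human-readable violations of RULES."""
    errors = []

    def visit(elem, parent):
        rule = RULES.get(elem.tag, {})
        # ElementTree keeps no parent pointers, so the parent is passed down.
        parent_tag = parent.tag if parent is not None else None
        required = rule.get("required_parents")
        if required is not None and parent_tag not in required:
            errors.append(f"<{elem.tag}> must have a parent from {sorted(required)}")
        if parent_tag in rule.get("forbidden_parents", ()):
            errors.append(f"<{elem.tag}> must not have parent <{parent_tag}>")
        allowed = rule.get("allowed_children")
        for child in elem:
            if allowed is not None and child.tag not in allowed:
                errors.append(f"<{elem.tag}> must not have child <{child.tag}>")
            visit(child, elem)

    visit(root, None)
    return errors
```

Because the rules are data rather than code, the same table could in principle also tell a code generator which add or get methods to emit for each element.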

* element may either have a foo attribute or a <foo> child accessible 
through a single getFoo() method
* element must have either a foo attribute or a bar attribute.
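
These two attribute rules might look as follows in Python. This is a sketch under assumptions: get_foo treats attribute and child as mutually exclusive, and has_foo_or_bar reads "either ... or" as "at least one"; if exactly one is intended, the test would need to change accordingly.

```python
import xml.etree.ElementTree as ET


def get_foo(elem):
    """Return foo from the attribute or the <foo> child; None if neither is present."""
    attr = elem.get("foo")
    child = elem.find("foo")
    if attr is not None and child is not None:
        # Assumption: supplying both forms at once is an error.
        raise ValueError("foo given both as attribute and as child")
    if attr is not None:
        return attr
    return child.text if child is not None else None


def has_foo_or_bar(elem):
    """True if at least one of the foo and bar attributes is present (assumed reading)."""
    return elem.get("foo") is not None or elem.get("bar") is not None
```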

* Many elements are of the form <foo ref="a1"/>. In this case an 
element <foo id="a1"/> must occur within the document. We do not use 
XML-IDs for this as we cannot rely on the documents having unique 
ids. (Some of our algorithms find the "nearest" element with a given id)
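
The meaning of "nearest" is not defined here, so the Python sketch below picks one plausible reading and labels it as an assumption: search the smallest enclosing subtree first, then widen the scope one ancestor at a time. The function name resolve_ref is hypothetical.

```python
import xml.etree.ElementTree as ET


def resolve_ref(root, referrer):
    """Find the element with referrer's tag whose id equals referrer's ref.

    Assumed meaning of "nearest": the match inside the smallest enclosing
    subtree wins. Returns None if the reference cannot be resolved.
    """
    target_id = referrer.get("ref")
    if target_id is None:
        return None
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    scope = referrer
    while scope is not None:
        for candidate in scope.iter(referrer.tag):
            if candidate.get("id") == target_id:
                return candidate
        scope = parent_of.get(scope)
    return None
```

With duplicate ids in different parts of a document, this returns the one that shares the closest ancestor with the referring element, which is why unique XML IDs are not required.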

* Values may be required to be distinct. Thus in <foo refs="a1 a2 a3 
a4"/> all values in the list must be distinct. (This sort of thing 
takes half a page in Schematron)
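
For comparison, the distinctness check itself is small when written in a general-purpose language. A Python sketch, assuming the list is whitespace-separated and using the hypothetical name duplicate_refs:

```python
def duplicate_refs(refs_value):
    """Return tokens occurring more than once in a whitespace-separated list,
    in the order in which they are first repeated."""
    seen, dupes = set(), []
    for token in refs_value.split():
        if token in seen and token not in dupes:
            dupes.append(token)
        seen.add(token)
    return dupes
```

An empty result means the constraint holds; a declarative rule such as "values of refs are distinct" would only need to name the attribute and leave this loop to the processor.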

(There is also a need for chemical restrictions and validations, but 
I omit these here).

I am therefore looking for a way of specifying semantics of this type 
in <appinfo> elements on some or all elements. It is important that 
the semantics are not procedural (we cannot assume that the users 
have Python, etc.). There is currently no requirement for speed, so 
XSLT is a possible solution although it is very difficult to evaluate 
scientific functions in it.

I believe that there could be value in a lightweight declarative 
language in <appinfo> elements which would support validation *and 
code generation*. If this already exists that would be wonderful - if 
it doesn't I hope the above makes sense.
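
To give a feel for what such a language could look like, here is a purely hypothetical sketch: a tiny <rule> vocabulary placed in <appinfo>, plus a Python interpreter for it. The rule element, its kind values ("distinct", "requireOne") and the function names are all invented for illustration and do not correspond to any existing standard or to CML's schema.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical rule vocabulary (<rule kind=...>); not an existing standard.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="foo">
    <xsd:annotation><xsd:appinfo>
      <rule kind="distinct" attribute="refs"/>
      <rule kind="requireOne" attributes="ref refs"/>
    </xsd:appinfo></xsd:annotation>
  </xsd:element>
</xsd:schema>
"""


def load_rules(schema_text):
    """Map element name -> list of rule attribute dicts found in its appinfo."""
    rules = {}
    for decl in ET.fromstring(schema_text).iter(XSD + "element"):
        found = decl.findall(f"{XSD}annotation/{XSD}appinfo/rule")
        rules[decl.get("name")] = [dict(rule.attrib) for rule in found]
    return rules


def validate(instance, rules):
    """Apply the declarative rules to every element of an instance document."""
    errors = []
    for elem in instance.iter():
        for rule in rules.get(elem.tag, []):
            if rule["kind"] == "distinct":
                tokens = elem.get(rule["attribute"], "").split()
                if len(tokens) != len(set(tokens)):
                    errors.append(f"<{elem.tag}> {rule['attribute']}: values not distinct")
            elif rule["kind"] == "requireOne":
                names = rule["attributes"].split()
                if not any(elem.get(name) is not None for name in names):
                    errors.append(f"<{elem.tag}> needs one of: {rule['attributes']}")
    return errors
```

The rules stay declarative; only the interpreter is language-specific, so equivalent interpreters (or code generators reading the same rules) could be written for Java, C++ or FORTRAN without changing the schema.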

P.



Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
+44-1223-763069
Follow-Ups:
- Re: [xml-dev] Specifying formal semantics in XML languages
  - From: Rick Jelliffe <rjelliffe@allette.com.au>