[I do not, I'm afraid, read xml-dev regularly so please forgive me if
this covers recent discussions or I am simply out of touch].
I am struggling with how to continue to formalize the semantics of
Chemical Markup Language (CML). The issues are generic and no
chemistry is required to understand them. They bear some relationship
to "microformats" but involve issues of strong typing and code
generation, so "context-free objects" is more descriptive.
Currently CML (in XSD) consists of about 100 elements, 100 attributes
and about 100 simpleTypes (e.g. elementTypeType is one of 117 symbols
("H", "He", ...) and angleType is an xsd:double in the range
0-180). The current components (in XSD syntax) are at:
http://cml.cvs.sourceforge.net/cml/schema25/
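To make the two simpleType examples concrete, here is a minimal sketch of
the checks they imply. The function names are hypothetical, the symbol set
is a two-entry illustrative subset rather than the 117 symbols of the real
enumeration, and treating the 0-180 bounds as inclusive is an assumption.

```python
import math

# Illustrative subset only; the real elementTypeType enumerates 117 symbols.
ELEMENT_SYMBOLS = frozenset({"H", "He"})


def parse_angle(text: str) -> float:
    """Parse an angleType value: a double restricted to 0..180 (assumed inclusive)."""
    value = float(text.strip())
    if math.isnan(value) or not 0.0 <= value <= 180.0:
        raise ValueError(f"angle outside the range 0-180: {text!r}")
    return value


def parse_element_type(text: str) -> str:
    """Parse an elementTypeType value: one symbol from a fixed enumeration."""
    symbol = text.strip()
    if symbol not in ELEMENT_SYMBOLS:
        raise ValueError(f"unknown element symbol: {text!r}")
    return symbol
```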
Chemistry is a largely context-free discipline in that we can locate
(say) <molecule> in many places in a document. There are a very
large number of ways of using CML components but the major ones in
current practice are:
* compound documents (e.g. scientific publications) composed of a
range of markup languages (XHTML, SVG, MathML, CML, etc.). Many
publishers are now actively starting to adopt this approach. A key
approach is that data and text are mixed ("datument") so that we can
transmit data in primary publications. Machines can now start to
understand scientific publications.
* storage of (fairly) well-defined data objects (molecules, spectra,
etc.) in databases
* management of the chemical computational process including formal
semantics of objects in the program.
This is sufficiently broad that it is impossible to create a
traditional XSD schema which allows for all uses. Since CML continues
to evolve we cannot guess a complete schema description and then
constrain people to use it. However, since almost all CML must be
processed by specialist software at some stage we require conformance
to a specification. Moreover there are no user communities who
require all CML functionality at once and so we assume that
particular groups will use subsets of the language.
We have used XSD rather than Relax because the conventional world
feels happier with it (sorry), and because we have to provide
software support for CML - there are more reusable components
generally available for XSD. However we only use a small subset of
XSD syntax (basically the stuff I can understand), limited to:
* definition of elements containing explicit complexType and
references to element children
* definitions of types
* definition of attributes
There is no single schema, but users can choose which subset of CML
elements they wish to use. This is simply done by concatenation of
the components (we deliberately do not use xsd:import).
The specification is used for the following:
* validation of documents
* (complete semantic) documentation of the language (IOW the
specification should be a machine-understandable description of the
language.) It is inspired by the ideas of literate programming and
will use <appinfo> etc. This is not complete and this mail is to seek guidance.
* generation of code. This is critically important as all elements
have to have classes, and all attributes have to have typed accessors
and mutators. Although we could use Castor, XMLBeans, etc. for Java,
we also have to support Python, C++ and FORTRAN, so I have written
our own code generator to provide this.
Of course there is much chemical functionality that is not provided
by a semantic specification and this has to be handcoded on a
per-element basis.
At present XSD is used for the specification of CML although we have
also attempted to use schematron and XSLT-like expressions for some
of the constraints that cannot be expressed in XSD. (XSD is good for
formal documents such as tax-forms but it is poor for the evolution
of a scientific language). Currently we find:
* most of the datatyping can be done with simpleType and this works
well - there is no reason to change most of this
* we find little use (at present) for re-usable complexTypes.
* XSD content models are effectively useless for validation. They
rapidly become enormous for some elements and no-one would use them.
* there are many simple relationships that cannot be expressed in XSD.
There are no cases where we insist on the order of child elements (I
can never remember the order anyway so it's unfair to require others
to). There are also very few cases where the cardinality of children
matters (wherever we have tried these we come up with counter
examples). We forbid mixed content in CML and so elements are of 3 types:
* empty
* one or more element children
* one text child
(If CML requires running marked up text we use <xhtml:div> or similar)
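The three content types can be checked mechanically. The following is a
minimal sketch using Python's standard ElementTree; the function name
content_kind is hypothetical, and whitespace-only text is assumed to be
insignificant.

```python
import xml.etree.ElementTree as ET


def content_kind(element: ET.Element) -> str:
    """Classify an element as 'empty', 'children' or 'text'.

    Raises ValueError on mixed content (non-whitespace text alongside
    element children), which CML forbids.
    """
    has_children = len(element) > 0
    # Text can sit before the first child (.text) or after any child (.tail).
    has_text = bool((element.text or "").strip()) or any(
        (child.tail or "").strip() for child in element
    )
    if has_children and has_text:
        raise ValueError(f"mixed content in <{element.tag}>")
    if has_children:
        return "children"
    if has_text:
        return "text"
    return "empty"
```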
Currently the attributes and content models are used to generate
code. Thus <propertyList> can have (say) a title attribute, and
children such as <metadataList> and <property>. This generates code such as:
PropertyList.setTitle(String title)
MetadataList PropertyList.getMetadataList()
PropertyList.add(Property)
This is enormously valuable when programming as it helps to ensure
strong typing and provides prompts and checking when writing code.
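As an illustration of the idea (not the actual CML generator), here is a
minimal sketch that turns a small element description into class source
text. The dictionary layout, the function names and the generated method
shapes are all assumptions made for this example; cardinality is ignored,
so children are only added through a single checked add() method.

```python
def class_name(element_name: str) -> str:
    """propertyList -> PropertyList."""
    return element_name[0].upper() + element_name[1:]


def generate_class(spec: dict) -> str:
    """Return Python source for one element class (hypothetical spec layout)."""
    allowed = tuple(sorted(class_name(c) for c in spec["children"]))
    lines = [
        f"class {class_name(spec['name'])}:",
        f"    ALLOWED_CHILDREN = {allowed!r}",
        "",
        "    def __init__(self):",
        "        self.children = []",
    ]
    for attr in spec["attributes"]:
        lines.append(f"        self._{attr} = None")
    # One typed accessor and mutator per attribute.
    for attr, type_name in spec["attributes"].items():
        cap = attr[0].upper() + attr[1:]
        lines += [
            "",
            f"    def get{cap}(self):",
            f"        return self._{attr}",
            "",
            f"    def set{cap}(self, value):",
            f"        if not isinstance(value, {type_name}):",
            f"            raise TypeError('{attr} must be {type_name}')",
            f"        self._{attr} = value",
        ]
    # Children are checked against the declared content model.
    lines += [
        "",
        "    def add(self, child):",
        "        if type(child).__name__ not in self.ALLOWED_CHILDREN:",
        "            raise TypeError(f'{type(child).__name__} is not an allowed child')",
        "        self.children.append(child)",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical description of <propertyList>, following the example above.
PROPERTY_LIST_SPEC = {
    "name": "propertyList",
    "attributes": {"title": "str"},
    "children": ["metadataList", "property"],
}
```

Calling generate_class(PROPERTY_LIST_SPEC) yields the source of a
PropertyList class with getTitle, setTitle and add methods.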
Therefore we continue to need a specification that describes the
relationship of one element to another and, where appropriate,
supports the generation of code.
Here are some examples of relationships which I currently need to
express and which should, if possible, be enforceable in code.
* element must have a parent from (list...)
* element may have parent from (list...)
* element must not have parent from (list...)
* element may have children from (list...) (and this will generate code)
* element must not have children from (list...)
* element may either have a foo attribute or a <foo> child accessible
through a single getFoo() method
* element must have either a foo attribute or a bar attribute.
* Many elements are of the form <foo ref="a1"/>. In this case an
element <foo id="a1"/> must occur within the document. We do not use
XML-IDs for this as we cannot rely on the documents having unique
ids. (Some of our algorithms find the "nearest" element with a given id)
* Values may be required to be distinct. Thus in <foo refs="a1 a2 a3
a4"/> all values in the list must be distinct. (This sort of thing
takes half a page in schematron)
(There is also a need for chemical restrictions and validations, but
I omit these here).
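Three of the relationships listed above can be sketched as plain checks
over a parsed document. This is a minimal illustration with hypothetical
function names. It assumes that a ref must match an id on an element of
the same name anywhere in the document, and that a <foo> child carries its
value as text; neither assumption is fixed by the description above.

```python
import xml.etree.ElementTree as ET


def unresolved_refs(root: ET.Element) -> list:
    """Return elements whose ref has no same-named element with that id."""
    # ids need not be unique, so collect (tag, id) pairs in a set.
    targets = {(el.tag, el.get("id")) for el in root.iter() if el.get("id") is not None}
    return [
        el
        for el in root.iter()
        if el.get("ref") is not None and (el.tag, el.get("ref")) not in targets
    ]


def duplicate_values(element: ET.Element, attribute: str = "refs") -> list:
    """Return values that occur more than once in a whitespace-separated list."""
    seen, duplicates = set(), []
    for value in element.get(attribute, "").split():
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def attribute_or_child(element: ET.Element, name: str):
    """Single accessor for a value held as an attribute or as a child element."""
    value = element.get(name)
    if value is not None:
        return value
    child = element.find(name)
    return None if child is None else (child.text or "").strip()
```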
I am therefore looking for a way of specifying semantics of this type
in <appinfo> elements on some or all elements. It is important that
the semantics are not procedural (we cannot assume that the users
have Python, etc.). There is currently no requirement for speed, so
XSLT is a possible solution although it is very difficult to evaluate
scientific functions in it.
I believe that there could be value in a lightweight declarative
language in <appinfo> elements which would support validation *and
code generation*. If this already exists that would be wonderful - if
it doesn't I hope the above makes sense.
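To show the kind of thing I mean, here is a minimal sketch of one
declarative rule carried in <appinfo> and applied by a generic checker.
The rule vocabulary (the r:parent element, its mustBeOneOf attribute and
the namespace URI) is entirely hypothetical and invented for this example,
and document elements are left unqualified for brevity.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
RULES = "http://example.org/hypothetical-rules"  # invented for this sketch

SCHEMA = f"""<xsd:schema xmlns:xsd="{XSD}" xmlns:r="{RULES}">
  <xsd:element name="property">
    <xsd:annotation>
      <xsd:appinfo>
        <r:parent mustBeOneOf="propertyList"/>
      </xsd:appinfo>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>"""


def load_parent_rules(schema_text: str) -> dict:
    """Map element name -> set of permitted parent names, read from appinfo."""
    rules = {}
    root = ET.fromstring(schema_text)
    path = f"{{{XSD}}}annotation/{{{XSD}}}appinfo/{{{RULES}}}parent"
    for declaration in root.iter(f"{{{XSD}}}element"):
        for rule in declaration.iterfind(path):
            rules[declaration.get("name")] = set(rule.get("mustBeOneOf", "").split())
    return rules


def parent_violations(document_root: ET.Element, rules: dict) -> list:
    """Return (child, parent) name pairs that break a 'must have parent' rule."""
    violations = []
    if document_root.tag in rules:
        violations.append((document_root.tag, None))  # root has no parent
    for parent in document_root.iter():
        for child in parent:
            allowed = rules.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                violations.append((child.tag, parent.tag))
    return violations
```

The same rule table could in principle drive a code generator, since it
records which parent/child pairings are legal.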
P.
Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road, Cambridge CB2 1EW, UK
+44-1223-763069