Sunday morning

My friends,

If you have nothing better to do, you might join me in a thought experiment.

Imagine a chaotic sequence of characters, like so:

dupojdzuasohjwpusjsusojjgsjgatreetgkjtoverwhppwugjsqtgelmingithsopatazahtz

Now add markup:

dupojdzuasohjwpusjsusojjgsjga<p81>tree</p81>tgkjtoverwhppwugjsqtgelmingithsopatazahtz

Ah - mixed content! But wait ... go on:

dupojdzuasohjwpusjsusojjgsjga<p81>tree</p81>tgkjt<p16><c>overwh</c>ppwugjsqtg<c>elming</c></p16>ithsopatazahtz

So if we imagine the sequence to be the content of an element, say <chromosome_5>, then we observe:

//chromosome_5/p81 => "tree"

//chromosome_5/p16/string-join(c, "") => "overwhelming"
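For anyone who would rather run the two lookups than read them, here is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser. The helper name read_gene is hypothetical and introduced only for illustration; the script mimics the two expressions above and is not an XPath engine.

```python
import xml.etree.ElementTree as ET

# The marked-up sequence from the thought experiment, wrapped in <chromosome_5>.
marked_up = (
    "<chromosome_5>"
    "dupojdzuasohjwpusjsusojjgsjga<p81>tree</p81>tgkjt"
    "<p16><c>overwh</c>ppwugjsqtg<c>elming</c></p16>ithsopatazahtz"
    "</chromosome_5>"
)

chromosome = ET.fromstring(marked_up)


def read_gene(chromosome, name):
    """Hypothetical helper: return the 'meaning' of the element called name.

    An element with <c> children is read by joining the text of its <c>
    children (two-level markup); otherwise its own text is returned
    (single-level markup). Returns None if no such element exists.
    """
    gene = chromosome.find(name)
    if gene is None:
        return None
    pieces = gene.findall("c")
    if pieces:
        return "".join(piece.text or "" for piece in pieces)
    return gene.text or ""


# All character data with the markup removed, i.e. the raw sequence.
raw_sequence = "".join(chromosome.itertext())

print(read_gene(chromosome, "p81"))  # tree
print(read_gene(chromosome, "p16"))  # overwhelming
```

The text between the two <c> elements ("ppwugjsqtg") is simply skipped when p16 is read, which is the point of the second level of markup.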

The thought experiment provides a fairly accurate picture of DNA sequences and their evaluation by a living cell. The experiment replaces nucleotides with Latin letters. The nucleic alphabet has only four letters, but the main thing is - it *has* letters, syntactic items, sequences of which constitute semantic units - genes, which are like chemical words. Some interesting facts:

* If we look at the genome of bacteria, we only encounter elements like <p81> - with pure text content (single-level markup).

* If we look at the genome of higher organisms, we also encounter elements like <p16> (so-called "interrupted genes" [1]) - a top-level element containing a sequence of text and simple content elements (two-level markup).

* The higher the organism, the greater the proportion of interrupted genes.

What are the equivalents of start tag and end tag? The answer is fuzzy. There are generic letter sequences delimiting the start and end of a gene - comparable to the <pxy> tags, but not quite, as tag names serve to distinguish elements, whereas those start-stop sequences are generic. In that respect they correspond to the generic parts of tags, the angle brackets and slashes in <...> and </...>. How about the specific part of tags - the name? Adjacent to the generic start sequence, there are specific (non-generic) sequences of nucleotides which are "read" by specific proteins whose response (absence or presence, perhaps also the shape assumed) controls at any given moment whether the gene is blocked, read with low intensity, or read with high intensity, depending on the requirements of the cell. So it is in fact more like p81<> than <p81>. (Note the fuzziness: where exactly does the element name start?) Concerning the "inner" <c> elements, I think their molecular definition is not yet very well understood.

One might argue that markup was invented (or published) more than four billion years ago. And it is a striking fact that interrupted genes occur only in higher organisms. Why? It looks as if the transition from single-level to two-level markup was a precondition for evolution beyond bacteria - as if it were an invention which increased the chances of creating meaningful sequences by many orders of magnitude.

But my main conclusion is that we actually do not understand markup very well, as we can say so little about its role in the evolution of life. We know that the abstract discoveries of mathematics are the foundation of physics. I wonder if there is not a deeper and more abstract understanding of markup possible - an understanding of its role in creating order out of chaos, reducing entropy, or whatnot - which might contribute to the understanding of molecular evolution. Is there something like a "mathematics of markup", which perhaps quantifies the "ordering potential" of applying markup to arbitrary sequences of items, perhaps in terms of probabilities, and compares the potential of single-level and two-level markup? If not, might there be one?
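One toy way to make the comparison between single-level and two-level markup concrete is to ask how many contiguous pieces of a sequence must be stitched together, in order, to spell a given word. The Python sketch below computes that number for the letter sequence of the thought experiment (markup removed). It is only an illustration under that assumption, not a model of molecular biology and certainly not an established "mathematics of markup"; the function name min_segments is hypothetical.

```python
from math import inf

# The letter sequence from the thought experiment, with the markup removed.
sequence = "dupojdzuasohjwpusjsusojjgsjgatreetgkjtoverwhppwugjsqtgelmingithsopatazahtz"


def min_segments(seq, word):
    """Hypothetical measure: the smallest number of non-overlapping pieces
    of seq, taken from left to right, whose concatenation equals word.
    Returns None if word cannot be spelled this way at all.
    """
    if not word:
        return 0
    n = len(seq)
    # prev[j]: fewest pieces spelling word[:i + 1] with its last letter at seq[j]
    prev = [1 if ch == word[0] else inf for ch in seq]
    for i in range(1, len(word)):
        cur = [inf] * n
        best_before = inf  # minimum of prev[0 .. j-1]
        for j in range(1, n):
            best_before = min(best_before, prev[j - 1])
            if seq[j] == word[i]:
                extend = prev[j - 1]         # continue the current piece
                start_new = best_before + 1  # open a new piece further right
                cur[j] = min(extend, start_new)
        prev = cur
    result = min(prev)
    return None if result == inf else result


print(min_segments(sequence, "tree"))          # 1
print(min_segments(sequence, "overwhelming"))  # 2
print(min_segments(sequence, "xyz"))           # None
```

In this toy reading, a word with value 1 can be marked up as a single-level element like <p81>, while a word with value 2 or more only becomes readable once inner elements like <c> are allowed. Counting how many words of a vocabulary fall into each class for random sequences would be one crude way to put numbers on the "ordering potential" asked about above.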

Sometimes I wonder if open-minded experts from the markup domain should not enter into conversations with molecular biologists, offering perspectives which would be very difficult to arrive at without much working experience with markup.

If anybody among you would like to get a better understanding of molecular genetics, let me mention a book which I have read and like, and which is very accessible to non-biologists like us (Genes IX, by Benjamin Lewin).

Hans-Jürgen Rennau

[1] More exactly: an interrupted gene is a sequence of DNA which is first read as one long coherent piece and then subjected to postprocessing which extracts several subsequences and stitches them together, in order of their occurrence, like so:

let $rawGene := //chromosome_5/p16
return
    $rawGene/string-join(c, "")   (: yields "overwhelming" :)