My friends, if you have nothing better to do, you might join me in a thought experiment.
The thought experiment provides a fairly accurate picture of DNA sequences and their evaluation by a living cell. The experiment replaces nucleotides with Latin letters. The nucleic alphabet has only four letters, but the main thing is that it *has* letters: syntactic items, sequences of which constitute semantic units - genes, which are like chemical words. Some interesting facts:

* If we look at the genome of bacteria, we encounter only elements like <p81> - with pure text content (single-level markup).
* If we look at the genome of higher organisms, we also encounter elements like <p16> (so-called "interrupted genes" [1]) - a top-level element containing a sequence of text and simple content elements (two-level markup).
* The higher the organism, the greater the proportion of interrupted genes.

What are the equivalents of start tag and end tag? The answer is fuzzy. There are generic letter sequences delimiting the start and end of a gene - comparable to the <pxy> tags, but not quite, as tag names serve to distinguish elements, whereas those start-stop sequences are generic. In this respect they correspond to the generic parts of tags, the angle brackets and slashes in <...> and </...>. How about the specific part of tags - the name? Adjacent to the generic start sequence, there are specific (non-generic) sequences of nucleotides which are "read" by specific proteins whose response (absence or presence, perhaps also the shape assumed) controls at any given moment whether the gene is blocked, read with low intensity, or read with high intensity, depending on the requirements of the cell.
So it is in fact more like p81<> than <p81>. (Note the fuzziness: where exactly does the element name start?) Concerning the "inner" <c> elements, I think their molecular definition is not yet very well understood.
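To make the "name in front of the brackets" picture a little more concrete, here is a toy sketch in Python. It is only an illustration of the analogy, not a model of real molecular biology: Latin letters stand in for nucleotides, "<" and ">" stand in for the generic start and stop sequences, and the short name directly before "<" stands in for the specific regulatory sequence. The function `read_genes` and the `activity` table (mapping a name to "blocked", "low" or "high") are hypothetical names invented for this sketch.

```python
import re

# Toy assumption: a "gene" is a name such as p81 followed directly by
# generic delimiters, i.e. p81<...> rather than <p81>...</p81>.
GENE = re.compile(r"(?P<name>[a-z]\d+)<(?P<body>[^<>]*)>")


def read_genes(sequence, activity):
    """Return (name, body, level) for every gene that is not blocked.

    `activity` is a hypothetical stand-in for the regulating proteins:
    it maps a gene name to "blocked", "low" or "high". Names missing
    from the table are treated as blocked.
    """
    result = []
    for match in GENE.finditer(sequence):
        name = match.group("name")
        level = activity.get(name, "blocked")
        if level != "blocked":
            result.append((name, match.group("body"), level))
    return result


# Text outside the delimiters is simply skipped by the reader.
genome = "xxp81<hello>yyp16<world>zz"
print(read_genes(genome, {"p81": "high", "p16": "blocked"}))
```

Note that the same sequence yields different readings depending on the `activity` table, which is the point of the analogy: the name does not merely label the element, it is where the decision to read it is made.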
One might argue that markup was invented (or published) more than four billion years ago. And it is a striking fact that interrupted genes occur only in higher organisms. Why? It looks as if the transition from single-level to two-level markup was a precondition for evolution beyond bacteria - as if it were an invention which increased the chances of creating meaningful sequences by many orders of magnitude.
But my main conclusion is that we actually do not understand markup very well, as we can say so little about its role in the evolution of life. We know that the abstract discoveries of mathematics are the foundation of physics. I wonder whether a deeper and more abstract understanding of markup is possible - an understanding of its role in creating order out of chaos, reducing entropy, or whatnot - which might contribute to the understanding of molecular evolution. Is there something like a "mathematics of markup", which perhaps quantifies the "ordering potential" of applying markup to arbitrary sequences of items, perhaps in terms of probabilities, and compares the potential of single-level and two-level markup? If not, might there be one?
Sometimes I wonder whether open-minded experts from the markup domain should not enter into conversations with molecular biologists, offering perspectives which would be very difficult to arrive at without much working experience with markup.