Re: [xml-dev] Sunday morning

Comments in-line below...

Peter Hunsberger

On Sun, Aug 25, 2013 at 5:25 AM, Hans-Juergen Rennau <hrennau@yahoo.de> wrote:

My friends,

if you have nothing better to do, you might join me in a thought experiment.

Ok, I'll play for a bit...


[...]

The thought experiment provided a pretty accurate image of DNA sequences and their evaluation by a living cell. The experiment replaces nucleic acids by Latin letters. The nucleic alphabet has only four letters, but the main thing is - it *has* letters, syntactic items, sequences of which constitute semantic units - genes, which are like chemical words. Some interesting facts:

* If we look at the genome of bacteria, we only encounter elements like <p81> - with pure text content (single-level markup).

* If we look at the genome of higher organisms, we encounter also elements like <p16> (so-called "interrupted genes" [1]) - a top-level element containing a sequence of text and simple content elements (two-level markup).

* The higher the organism, the greater the proportion of interrupted genes.

What are the equivalents of start tag and end tag? Fuzzy answer. There are generic letter sequences delimiting the start and end of a gene

There are a couple of answers to this. The parts of the genome that we consider genes have pretty well-defined start/stop sequences. Biological processes depend on them to know what portions to copy and how to produce proteins. Above that, researchers label sections of the genome, and these sections are well enough defined that we can tell when they are missing, duplicated or mutated. Any given gene may have several of these well-defined sections, so the start/stop isn't just at the gene level.
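
To make the "generic start/stop delimiter" idea concrete, here is a deliberately simplified sketch. It assumes the standard genetic code's start codon (ATG) and stop codons (TAA, TAG, TGA) as the delimiters and scans a single strand in one pass; real gene boundaries also involve promoters, terminators and splice sites, none of which are modelled here. The function name and the toy sequence are made up for the example.

```python
START = "ATG"                     # standard start codon
STOPS = {"TAA", "TAG", "TGA"}     # standard stop codons


def find_delimited_regions(dna):
    """Return (start, end) pairs for regions that open with START and close
    at the first stop codon in the same reading frame. `end` is exclusive."""
    dna = dna.upper()
    regions = []
    i = 0
    while i + 3 <= len(dna):
        if dna[i:i + 3] != START:
            i += 1
            continue
        end = None
        # Step three letters at a time so we stay in the frame set by START.
        for j in range(i + 3, len(dna) - 2, 3):
            if dna[j:j + 3] in STOPS:
                end = j + 3
                break
        if end is None:
            i += 1                # no closing delimiter in this frame; keep scanning
        else:
            regions.append((i, end))
            i = end               # resume after the closed region
    return regions


toy = "CCATGAAATTTTAGGGATGCCCTGA"
regions = find_delimited_regions(toy)
```

Note that the delimiters are the same for every region, which is the point made in the quoted text: they behave like the angle brackets, not like the element name.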


- comparable to the <pxy> tags, but not quite, as tag names serve to distinguish elements, whereas those start-stop sequences are generic. In so far they correspond to the generic parts of tags, the angle brackets and slashes in <...> and </...>. How about the specific part of tags - the name? Adjacent to the generic start sequence, there are specific (non-generic) sequences of nucleic acids which are "read" by specific proteins whose response (absence or presence, perhaps also shape assumed) controls in any given moment whether the gene is blocked, read with low intensity, or read with high intensity, dependent on the requirements of the cell.

Well, it's only fuzzy in the sense that there are variations possible. But these variations are small; between humans and other mammals you are talking about hundredths of a percent in many regions and only a couple of percent in total.


So it is in fact more like p81<>, than <p81>. (Note the fuzziness, as where does the element name start exactly?) Concerning the "inner" <c> elements, I think their molecular definition is not yet very well understood.

Definition is a sort of strange word here; the sequences have multiple functions. The definition, as much as there is one, is precisely the specific order of the nucleotides.


One might argue that markup was invented (or published) more than four billion years ago. And it is a striking fact that interrupted genes occur only in higher organisms... - why? It looks as if the transition from single-level to two-level markup was a precondition for evolution beyond bacteria - as if it were an invention which increased the chances of creating meaningful sequences by many orders of magnitude.

Don't think so. Markup is a semantic term we apply to the way we humans label certain groups of things. The biological processes have no more understanding of the internal arrangement of the genome than the car understands that the brick wall it just ran into was the end of the road.


But my main conclusion is that we actually do not understand markup very well, as we can say so little about its role in the evolution of life. We know that the abstract discoveries of mathematics are the foundation of physics. I wonder if there is not a deeper and more abstract understanding of markup possible - an understanding of its role in creating order out of chaos, reducing entropy, or what not - which might contribute to the understanding of molecular evolution. Is there something like "mathematics of markup", which perhaps quantifies the "ordering potential" of applying markup to arbitrary sequences of items, perhaps in terms of probabilities, and compares the potential of single-level and two-level markup? If not, might it be?

Sometimes I wonder if open-minded experts from the markup domain should not enter conversations with molecular biologists, offering perspectives which would be very difficult to detect without much working experience with markup.

Well, I'm not sure I'm a markup expert, even if I've been playing with it since the dawn of SGML, but I have worked on the design of generalized databases for capturing molecular data associated with genomics, proteomics, metabolomics, etc. One key to such systems is that there is actually very little variation between samples of this data. You can encode the important differences as deltas between samples, and it's the commonality (or lack of commonality) of these deltas between and across samples that provides the clues as to what parts are doing what. In the molecular biology world it's not the markup that matters, it's the one-letter difference between the two otherwise identical books.
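
As a rough illustration of the delta idea (a sketch only, not the design of any actual system): the snippet below records each sample as its differences from a reference and then counts how many samples share each difference. The function names and the toy sequences are invented for the example, and it assumes the sequences are already aligned and of equal length, which real data generally is not.

```python
from collections import Counter


def deltas(reference, sample):
    """Return {position: (reference_letter, sample_letter)} for every mismatch.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return {
        i: (r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    }


def shared_deltas(reference, samples):
    """Count how many samples carry each (position, letter) difference."""
    counts = Counter()
    for sample in samples:
        for pos, (_, letter) in deltas(reference, sample).items():
            counts[(pos, letter)] += 1
    return counts


reference = "ACGTACGTAC"
samples = ["ACGTACGTAC", "ACGAACGTAC", "ACGAACGTCC"]

per_sample = [deltas(reference, s) for s in samples]   # one small dict per sample
common = shared_deltas(reference, samples)             # which differences recur
```

Here the first sample stores nothing at all, and the T-to-A change at position 3 shows up in two of the three samples; it is that kind of recurrence, not any delimiter, that carries the signal.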

[...]