Re: [xml-dev] Markup, an abstraction

Michael, in your posting I discover several things. Above all, the question of where to start and which way to head: the concern that "abstraction of markup languages" might be the wrong place to start, being too narrow. And then the desire to focus on what is essential about genetic systems from the viewpoint of information theory: how complete is the picture of them as essentially one-dimensional information?

I heartily agree with both concerns - in particular the danger of barrenness if trying to feed on the theory and practice of markup languages.

What I have probably not yet made sufficiently clear is that my bringing the word "markup" into play is not motivated by the idea of stretching existing concepts of markup. Rather, it is a mental Kurvenlineal (curve template) helping me to draw firm lines when picturing the natural phenomena, desiring to discern contours and proportions. I really think that "remembering" conventional markup can make a difference when contemplating biological phenomena and trying to capture essential features. Hence my little model presenting notions like "info sequence effect" and "effective info subsequence". So let us start with the "one-dimensional sequence of symbols from some finite alphabet". The concept is fundamental, but still incomplete, when we look for an information-theoretical underpinning of genetic operation. The notion of a sequence of letters does not yet include the notion of units - distinct subsequences - and the role they play in some context, let alone how to delimit those subsequences and how different subsequences interact (markup / primary). If we are ready to look afresh at old routine - is it not a remarkable fact that within the primary sequence of genetic letters two different types of distinct subsequences can be discerned, one providing a unit of primary information, the other serving two purposes at a time, (a) the delimiting of those units, (b) the "treatment" of the unit by an evaluator ("express the gene or not, here and now?"), without interfering with the information content itself?

Please note how far apart my suggestions are from any preoccupation with markup theory. My proposal is to concentrate on the phenomena, to be aware of striking features (two types of subsequences, one impacting the usage of the other...) and at least to wonder about their significance.

For me, "unit" is a Zauberwort (magic word). The emergence of Life, as well as its maintenance, requires the carving of *units* from apparently unstructured blocks, soups, streams. Markup is all about the carving of distinct units from a continuous stream, isn't it? Again, again: I am NOT interested in markup, I am interested in genetic operations and genetic evolution, and I simply CANNOT HELP recognizing what I call "the markup principle", although the term in itself is irrelevant. It is a principle which has to do with the distinction between two kinds of subsequences, and their interaction in the presence of some agent. Where else can this principle be observed, apart from information technology and molecular genetics? Can it be understood more deeply, is there anything to be understood, beyond the surface? Let us not confuse the limitations of "markup theory" with limitations of the markup phenomenon.

Whenever I look at a big XML file in an XML editor, I am impressed by the effect of collapsing elements, which enables me with a few clicks to recognize an important structure (e.g. the sequence of top-level elements) which would be next to impossible to recognize without such an aid (by just looking and scrolling). This curious (and pleasant) effect is built on the unit-carving property of markup, where now, for a change, a human is the processor, rather than a program, profiting immensely. So, I notice that in a markup-sensitive context, the insertion of, say, 8 characters here and 8 characters several kilometers downstream can create a huge unit which boosts my possibilities: I can say "collapse!" and everything inside is hidden, and "expand!" and everything inside is revealed.
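
The collapsing effect can be illustrated with a minimal sketch, assuming Python's standard xml.etree.ElementTree module. The function name collapsed_outline and the sample document are hypothetical, chosen only for illustration; this is not how any particular XML editor works internally.

```python
import xml.etree.ElementTree as ET


def collapsed_outline(xml_text: str) -> list[str]:
    """Show each top-level element as one line, hiding everything inside it."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        # child.iter() yields the child itself plus all of its descendants.
        hidden = sum(1 for _ in child.iter()) - 1
        lines.append(f"<{child.tag}> ... ({hidden} nested elements hidden)")
    return lines


# Hypothetical sample document, for illustration only.
sample = "<doc><head><title>t</title></head><body><p>a</p><p>b</p></body></doc>"
for line in collapsed_outline(sample):
    print(line)
```

However large the content of each element, the outline costs one line per unit: the structure becomes visible because the delimiters, not the content, are being addressed.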

You pointed to the question "whether biological encodings of information are intrinsically one-dimensional". In this context, a thought experiment: imagine the task of switching 500 genes off (or on), depending on some physiological or developmental conditions. Approach #1: visit each one and tweak its individual regulatory behaviour. Approach #2: invent a markup rule so that two matching (complementary) subsequences (say, ATATGGGC and TATACCCG), whose distance lies in a certain range, can be used to either collapse or expand everything between them. (This is not as unrealistic as it may sound, considering DNA loops and attachments to external structures which may open or close a loop.) If such pairs of matching points existed, this would amount to additional "markup" providing a wrapper around many units of "primary" markup. In other words: secondary markup (those complementary sequences which open or close a potential loop structure) could enable the use of a hierarchy of sorts in the task of regulation (addressing container units, rather than contained units, individually). (Let us remember a trivial fact: the length of a tag, and the effort of addressing a tag, are not related to the amount of information contained by the element.)
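
Approach #2 can be sketched as a toy string model in Python. This is only an illustration of the thought experiment, not a claim about real gene regulation: the names find_loop_pairs and collapse, the parameters min_gap and max_gap, and the sample sequence are all hypothetical.

```python
# Base-wise complement: A <-> T, G <-> C.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)


def find_loop_pairs(dna: str, marker: str, min_gap: int, max_gap: int):
    """Find (inner_start, inner_end) spans enclosed by a marker and its
    complement, where the enclosed length lies in [min_gap, max_gap]."""
    partner = complement(marker)
    pairs = []
    start = dna.find(marker)
    while start != -1:
        inner_start = start + len(marker)
        pos = dna.find(partner, inner_start + min_gap)
        while pos != -1 and pos - inner_start <= max_gap:
            pairs.append((inner_start, pos))
            pos = dna.find(partner, pos + 1)
        start = dna.find(marker, start + 1)
    return pairs


def collapse(dna: str, span, placeholder: str = "[...]") -> str:
    """Hide everything between the two delimiters; the delimiters stay visible."""
    inner_start, inner_end = span
    return dna[:inner_start] + placeholder + dna[inner_end:]


# Hypothetical sequence: 8 letters of "primary" content wrapped by the pair.
dna = "GG" + "ATATGGGC" + "ACGTACGT" + "TATACCCG" + "TT"
pairs = find_loop_pairs(dna, "ATATGGGC", min_gap=4, max_gap=20)
print(pairs)
print(collapse(dna, pairs[0]))
```

The cost of addressing the container is the same whether the enclosed span holds 8 letters or 500 genes, which is the point of the thought experiment.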

These considerations demonstrate the stimulating effect which awareness of markup may have on our contemplation of genetic processes. In particular, new questions may emerge.

One more thought about the "hidden" qualities of markup, I mean those qualities about which we usually do not think. It is the element of discontinuity which it may introduce at virtually no cost - as the previous example showed, add 8 characters here, add 8 characters there (which are complementary), and you get a huge effect which is impossible without the markup effect, impossible in a setting where every change or addition to the sequence is rewarded by an effect proportional to the extent of the change made, proportional to what you invested. This may have significant implications for the evolution of character sequences.

You suggested starting with "some existing information theory". Can you perhaps make any suggestions to me (online or offline)? For a long time already I have wondered if there *is* something like an information theory which might be relevant in the context of genetic systems. I would be grateful for any suggestions.

Thank you for your insight and your care.
Hans-Jürgen

PS: Before you leave - think of this: the key principle of genetic information is complementarity - A is complementary to T, G to C. There are two classes of genetic recognition, one with digital precision, the other of a more continuous character: digital recognition is based on the matching of complementary DNA sequences - click, click, click, ... - analog recognition is based on the behaviour of proteins, like a gentle or tighter pressing. At any rate, complementarity is beyond any doubt a key principle of genetic information. Most conventional markup (not all - see CSV) is based on the fundamental principle of complementarity - e.g. start tag vs. end tag. Complementarity is fundamental, markup is secondary. You are so right in your warning of overdoing the "markup perspective". Doubtless you are right when preferring "information theory" to "markup theory". But going deeper, markup appears in new light, which may in turn reveal new properties, deserving interest.

From: Michael Kay <mike@saxonica.com>
To: Hans-Juergen Rennau <hrennau@yahoo.de>
CC: Peter Hunsberger <peter.hunsberger@gmail.com>; "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Sent: Wednesday, 28 August 2013, 9:57
Subject: Re: [xml-dev] Markup, an abstraction

What are you trying to achieve? How far do you intend to take your abstraction?

If you are trying to develop a general theory of information, then it's not clear to me that abstraction of markup languages is the best place to start. It might be better to start with some existing information theory and see how markup languages relate to it.

I would be quite interested to know how the "syntax and semantics" of genetic information relate to the classes of formal language that we know about in computer science. In fact, I'd be surprised if there aren't a few PhD theses that explore this. I think that doing such a study in relation to the overall theory of formal languages would be much more productive than doing it specifically in the context of markup languages, which are just one example of that class.

An interesting property of markup languages, and indeed of most "formal languages as used in computer science" is that they generally encode information as a one-dimensional sequence of symbols from some finite alphabet. It seems that this is a property also associated with genetic information. But information representations don't have to be one-dimensional, and we often resort to multi-dimensional representations (diagrams, tables, network data models, music notation) in order to make information more accessible. I guess all such representations can be serialized into a one-dimensional form (indeed, into multiple one-dimensional forms). It would be interesting to know whether biological encodings of information are intrinsically one-dimensional.

So I think there are some interesting avenues to be pursued here, but I think that in pursuing them, you will quickly leave the comfortable world of markup languages far behind and see them just as one not-very-interesting subset of formal languages in general.

Michael Kay
Saxonica