Markup, an abstraction

Thanks to everyone who responded to my little thought experiment! Among the responses, one passage written by Peter Hunsberger struck me as revealing a fundamental mismatch between his conception of the word "markup" and mine. Peter wrote:

"Markup is a semantic term we apply to the way we humans label certain groups of things. The biological processes have no more understanding of the internal arrangement of the genome than the car understands that the brick wall it just ran into was the end of the road."

This expresses a conventional use of the word "markup", whereas I propose an alternative: a rigorous abstraction of which the human use of markup is but one particular instance. I realize that my previous posting failed to make clear that everything I said was based on such an abstraction, so I will now try to clarify. Here is a very tentative sketch of how such an abstraction might look. Take it as indicating the direction of thought rather than giving a definitive shape.

The proposed abstraction of markup can be expressed by a small set of concepts. I will illustrate how each concept relates to the real world by referring to two real systems:
* S1 - a web service implementation receiving request messages containing bibliographic data and storing them in a relational database
* S2 - a chromosome in a living cell

Here we go.

Concept #1: info sequence
A sequence of items, each one of which can be mapped to a particular choice from a fixed set of possible choices. The information content of an info sequence may therefore be captured as a sequence of numbers, where the n-th number signifies the choice represented by the n-th item.
S1: info sequence = serialized request message; information content = sequence of unicode codepoints
S2: info sequence = chromosome; information content = sequence of nucleotides
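
To make concept #1 concrete, here is a minimal sketch in Python of how the information content of an info sequence could be written down as a sequence of numbers. The function name information_content and its alphabet parameter are my own illustrative choices, as are the sample sequences; nothing here is meant as a definitive encoding.

```python
def information_content(sequence, alphabet=None):
    """Return the n-th choice of an info sequence as the n-th number.

    Hypothetical helper: with no alphabet, each item is a character and
    the choice is its Unicode codepoint (system S1); with an alphabet,
    the choice is the item's index in that fixed set (system S2).
    """
    if alphabet is None:
        return [ord(item) for item in sequence]
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    return [index[item] for item in sequence]


# S1: a fragment of a serialized request message (invented sample data)
s1_content = information_content("<author>Miller M</author>")

# S2: a short stretch of nucleotides (invented sample data)
s2_content = information_content("ATGGCC", alphabet="ACGT")
```

In both cases the result is just a list of integers, which is the sense in which the two systems become comparable.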

Concept #2: info sequence evaluator
An agent whose behaviour is controlled by the information content of an info sequence, and whose behaviour can be modelled in terms of distinct info sequence effects (see concept #3).
S1: info sequence evaluator = web service implementation
S2: info sequence evaluator = cell nucleus and its genetic apparatus

Concept #3: info sequence effect
An effect which a particular subsequence of an info sequence has in the presence of a particular info sequence evaluator, where the effect depends in a precisely describable way on the information content of the triggering subsequence
S1: info sequence effect = the transfer of tag contents into a particular database column
S2: info sequence effect = a protein produced by expressing a particular gene

Concept #4: effective info subsequence
A subsequence of an info sequence which triggers an info sequence effect (in the presence of a particular evaluator)
S1: effective info subsequence = tag contents (e.g. 'Miller M' in <author>)
S2: effective info subsequence = gene

Concept #5: info sequence markup
Subsequences of an info sequence whose positions delimit effective info subsequences and whose contents may influence the effects of those effective info subsequences
S1: info sequence markup = markup (in the conventional sense)
S2: info sequence markup = start sequence and adjacent regulatory sequences, stop sequence, within-gene markers controlling the "splicing" of subsequences (exons)
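
The five concepts can be put together in a toy model. The following Python sketch is only an illustration under strong simplifying assumptions: markup is reduced to one fixed start delimiter and one fixed stop delimiter, and the names Effect, evaluate and respond are hypothetical. For S2 it ignores reading frames, regulatory sequences and splicing, and it treats the start codon purely as a delimiter although in a real cell it is itself translated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Effect:
    markup: Tuple[str, str]  # delimiting subsequences (concept #5)
    effective: str           # effective info subsequence (concept #4)
    result: object           # info sequence effect (concept #3)


def evaluate(sequence: str, start: str, stop: str,
             respond: Callable[[str], object]) -> List[Effect]:
    """Toy info sequence evaluator (concept #2).

    Scans the info sequence (concept #1) for start/stop markup and
    triggers one effect per delimited subsequence.
    """
    effects = []
    pos = 0
    while True:
        begin = sequence.find(start, pos)
        if begin == -1:
            break
        body_start = begin + len(start)
        end = sequence.find(stop, body_start)
        if end == -1:
            break  # unterminated markup triggers no effect in this toy model
        body = sequence[body_start:end]
        effects.append(Effect((start, stop), body, respond(body)))
        pos = end + len(stop)
    return effects


# S1: tag contents are transferred into a (hypothetical) database column
s1 = evaluate("<author>Miller M</author><author>Smith J</author>",
              "<author>", "</author>",
              lambda text: ("author_column", text))

# S2: a delimited stretch of nucleotides stands in for a gene being expressed
s2 = evaluate("CCATGGCCTAAGG", "ATG", "TAA",
              lambda gene: "protein(" + gene + ")")
```

The same evaluate function serves both systems; only the delimiters and the respond callback differ, which is the point of treating conventional markup as one instance of the abstraction.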

When I spoke about "mathematics of markup", I imagined theoretical work which defines markup systems in abstract terms (perhaps comparable to my sketch), which proceeds to define properties of the system components (info sequence, effects, etc.) and which analyses relationships between those properties, e.g. between properties of the info sequence and properties of the info sequence effects. Such work might, conceivably, investigate the emergence of markup as an evolutionary process in a stochastic system, modelling successive generations of info sequences and assessing the "value" of info sequence effects in some way. Such an approach might perhaps even attempt to answer the question of whether two-level markup leads to the discovery of meaningful genes ("valuable" effective info subsequences) significantly faster than single-level markup.

I still think that markup defined in the sense sketched above was invented or published more than four million years ago, that it is written into the very core of biological life and that this fact might kindle an interest in "markup" as a phenomenon existing beyond the limits of the conventional use of the term.

Hans-Jürgen Rennau