### From the Shannon Perspective

While the unique structure of the RNA pol II CTD has long been the object of intense interest, a formal, mathematical examination of the CTD with regard to its informational potential has not been presented. To initiate this process, I begin by providing a summary of Shannon information theory as it pertains to the CTD of RNA pol II.

A mathematical conceptualization of both information and communication was first presented by Claude Shannon in his classic 1948 paper “A Mathematical Theory of Communication” [25]. In this manuscript Shannon introduces the quantity *H* (or entropy) as a measure of information, choice, and uncertainty. Using the simplifying assumption that each symbol in a string of characters (i.e. a message) has an equal chance of appearing, *H* can be calculated using the formula

$$H = L \cdot \log M$$

where *M* is equal to the number of symbols in the alphabet used to write the message, and *L* is equal to the number of characters in the string. For example, the value of *H* for a 10-character binary string is calculated as

$$H = 10 \cdot \log_2 2 = 10\ \text{bits}$$

Thus, a receiver awaiting the communication of the string – and having no prior knowledge as to the contents of the string in question – would receive 10 bits of information upon reading the message. Similarly, the value of *H* for a 12-character string of DNA (constructed using the letters A, G, C, or T) is calculated as

$$H = 12 \cdot \log_2 4 = 24\ \text{bits}$$

The choice of logarithmic base is arbitrary and simply determines the units of measurement (i.e. bits if the base two is chosen, nats if the base *e* is chosen, and digits if the base 10 is chosen).

Using the above paradigm, it is relatively simple to apply these concepts to the RNA pol II CTD. Abstractly, at least, one could consider the CTD as a string consisting of *x* repeats of the heptapeptide sequence Y_{1}S_{2}P_{3}T_{4}S_{5}P_{6}S_{7}. The symbol, or letter, that appears at each position in the string will of course depend on the post-translational modifications of the heptad in question. The CTD can be differentially phosphorylated on Ser-2, Ser-5, and/or Ser-7 residues [4, 8–10, 12, 14, 26–28]. In addition to these modifications, P_{3} and P_{6} residues may be in either a *cis* or a *trans* configuration (controlled by peptidyl-prolyl *cis*-*trans* isomerases). Taking these facts into consideration (2^{3} serine phosphorylation states × 2^{2} proline configurations), it is apparent that one of a total of 32 possible symbols may appear in each heptad repeat. While further post-translational modifications of the CTD are possible (e.g. glycosylation, Y_{1} phosphorylation, T_{4} phosphorylation), these have not been considered for the sake of simplicity.
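The 32-symbol count above can be checked arithmetically by treating each of the three serines as phosphorylated or not, and each of the two prolines as *cis* or *trans*. A minimal sketch in Python (variable names are illustrative only):

```python
from math import log2

# Per-heptad modification states considered in the text:
# Ser-2, Ser-5, Ser-7 each phosphorylated or not; Pro-3, Pro-6 each cis or trans.
phospho_states = 2 ** 3   # three serines, two states each
proline_states = 2 ** 2   # two prolines, two states each
symbols_per_heptad = phospho_states * proline_states

print(symbols_per_heptad)        # 32 distinct "letters" per heptad
print(log2(symbols_per_heptad))  # 5.0 bits of potential per heptad
```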

Proceeding with this train of thought, we can calculate the quantity, *H*, for the human CTD (which consists of 52 heptad repeats) to be

$$H = 52 \cdot \log_2 32 = 260\ \text{bits}$$

Thus, one can reason that the human CTD has 260 bits of informational potential.
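The three equiprobable-case calculations above (binary, DNA, and the human CTD) can be reproduced with a small helper; the function name is a hypothetical choice for illustration:

```python
from math import log2

def entropy_equiprobable(length, alphabet_size):
    """H = L * log2(M) bits, assuming every symbol is equally likely."""
    return length * log2(alphabet_size)

print(entropy_equiprobable(10, 2))    # 10.0 bits: the binary example
print(entropy_equiprobable(12, 4))    # 24.0 bits: the DNA example
print(entropy_equiprobable(52, 32))   # 260.0 bits: the 52-heptad human CTD
```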

In addition to this potential, it should be noted that all elements of a general communication system, as described by Shannon, are present within the eukaryotic cell (Figure 1A). In this biological incarnation of the system, the information source would consist of the upstream signalling pathways that converge upon regulatory CTD kinases, phosphatases, and/or *cis*-*trans* isomerases. In this way the message – transmitted through the modulation of the activity of CTD effectors – could be received and decoded by Rpb1p in the form of a discrete CTD phosphorylation pattern. Critically, this decoded message could be used by the cell to influence Rpb1p transcriptional activity in an evolutionarily selectable fashion.

While I have shown that each heptad of the CTD has the potential to encode 5 bits of information, this calculation represents an idealized case in which each symbol in the alphabet has an equal chance of appearing at any given position in the string. Since the phosphorylation and *cis*-*trans* configurations of specific residues in the CTD are not necessarily independent, this idealized scenario is unlikely to hold in vivo. In cases where each symbol in the alphabet does not have an equal chance of appearing, the quantity *H*, or entropy, is defined by

$$H = -\sum_{i=1}^{n} p_i \cdot \log\left(p_i\right)$$

where *n* is equal to the number of symbols in the alphabet and *p*_{i} represents the probability of the *i*th symbol becoming part of the string. For example, in the binary case, maximum *H* will occur when the two alternate symbols are equally likely to appear. This can be seen intuitively by considering a two-symbol system in which one of the symbols never appears. Such a system would be unable to encode information, as only strings of a single symbol could be produced. At the other extreme, the same binary system in which both symbols are equally likely could encode 1 bit/symbol. Intermediate systems (i.e. where one symbol is less likely than the other, but not impossible) would be able to encode more than 0 but less than 1 bit/symbol. Furthermore, in cases where *n* is greater than 2, it can be shown that for any given *n*, *H* will be at a maximum when the probabilities of the symbols (the choice of letters) are equal (i.e. each *p*_{i} is equal to 1/*n*).
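The behaviour just described – zero entropy when one symbol never appears, a maximum of 1 bit/symbol at equal probabilities, and intermediate values in between – can be sketched as follows (the function name is illustrative):

```python
from math import log2

def entropy_per_symbol(probs):
    """Shannon entropy H = -sum(p_i * log2(p_i)) in bits/symbol;
    terms with p_i = 0 contribute nothing, by convention."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy_per_symbol([0.5, 0.5]))     # 1.0 bit: equiprobable binary
print(entropy_per_symbol([1.0, 0.0]))     # 0.0 bits: one symbol never appears
print(entropy_per_symbol([0.9, 0.1]))     # ~0.47 bits: intermediate case
print(entropy_per_symbol([1 / 32] * 32))  # 5.0 bits: the uniform 32-symbol maximum
```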

A further consideration is the potential for ambiguity in the string. If ambiguity exists, then the amount of information transmitted (*R*) is determined by the decrease in the uncertainty of the receiver, according to the equation

$$R = H_{\text{Before}} - H_{\text{After}}$$

For example, if a receiver is expecting a binary string of 20 characters (where *p*_{0} = *p*_{1} = 0.5), but 3 of the characters are ambiguous upon receipt (i.e. the receiver is unable to determine whether they are 0's or 1's), then the information received is calculated as

$$R = 20 \log_2 2 - 3 \log_2 2 = 17\ \text{bits}$$

If, on the other hand, the string is sent with no ambiguity, then *R* is simply equal to *H*_{Before}.
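The worked ambiguity example can be sketched as a small helper (a hypothetical function, assuming equiprobable symbols as in the text):

```python
from math import log2

def information_received(length, alphabet_size, n_ambiguous):
    """R = H_Before - H_After for equiprobable symbols; the residual
    uncertainty H_After comes from the characters still unresolved."""
    h_before = length * log2(alphabet_size)
    h_after = n_ambiguous * log2(alphabet_size)
    return h_before - h_after

print(information_received(20, 2, 3))  # 17.0 bits: the worked example above
print(information_received(20, 2, 0))  # 20.0 bits: no ambiguity, R = H_Before
```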

Given these mathematical constraints – and merging them with our current understanding of the biological reality – it is possible to envision a cellular communication system in which the entropy within the CTD is harnessed to transmit information to the transcriptional machinery. This hypothesis, which I refer to as the “ear of the king” hypothesis (i.e. where the CTD is thought of as a means of gaining access to the “king”, RNA pol II), posits that CTD kinases, phosphatases, and *cis*-*trans* isomerases, influenced by upstream signalling pathways, modify the CTD so that the total cellular population of Rpb1p molecules is organized into an ordered configurative state, or set of such states (i.e. a discrete set of post-translational modifications specific to a given environmental/developmental/physiological condition). These configurations could in turn modulate the activity of the RNA pol II complex and thus effect discrete changes in the expression of specific gene subsets. In this model, each set of ordered configurative states would correspond to a specific expression profile and would provide a selective advantage in a given growth environment.

In effect, this is to say that – depending on the overall activity of the regulatory kinases, phosphatases, and *cis*-*trans* isomerases affecting the CTD – RNA pol II could be “programmed” through natural selection to output a specific gene expression profile (Figure 1B). Furthermore, the longer the CTD, the greater the potential entropy and the more diverse the set of possible configurative states. Thus, the correlation between developmental complexity and CTD length is entirely consistent with – and can be logically derived from – these principles.

### From the Kolmogorov-Chaitin Perspective

In addition to the Shannon theory, an independently derived theory of information (referred to as algorithmic information theory, or simply the Kolmogorov-Chaitin theory) has also been presented [26, 29–31]. In this theory – just as in the Shannon theory – the uncertainty within a string of characters correlates with its capacity to encode information. In this case, however, the uncertainty is measured by one’s ability to compress or simplify the string.

For example, the string “abcabcabcabcabcabcabcabc” could be compressed to (abc)_{8}, whereas a random string (e.g. “arfgjkaaczxfoms”) could not be expressed in any form simpler than merely restating the string. In this paradigm we can informally state that the complexity of a string is equal to the length of the shortest string capable of describing it. Thus, the first example of a string, “abcabcabcabcabcabcabcabc”, would be considered less complex than the second example, “arfgjkaaczxfoms”. Strings are said to be Kolmogorov random if they are incompressible.
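Kolmogorov complexity itself is uncomputable, but the compressibility intuition above can be illustrated with a general-purpose compressor as a rough, practical proxy (zlib here; longer strings than those in the text are used so that compressor overhead does not dominate):

```python
import os
import zlib

# A highly regular string, analogous to (abc)_n, versus random bytes,
# which have no pattern a compressor can exploit.
repetitive = b"abc" * 1000
incompressible = os.urandom(3000)

c_rep = len(zlib.compress(repetitive))
c_rand = len(zlib.compress(incompressible))

# The regular string shrinks to a tiny description; the random one does not.
print(c_rep < c_rand)           # True
print(c_rep < len(repetitive))  # True: "abc" * 1000 has a far shorter description
```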

Along lines similar to the logic used when discussing the Shannon paradigm, the great potential complexity within the CTD could be harnessed to convey information by “programming” the CTD through the controlled and dynamic modulation of CTD effectors. For example, specific “configurative states” of the CTD (annotated, for example, as [Y_{1:OH}S_{2:PO4}P_{3:cis}T_{4}S_{5:OH}P_{6:trans}S_{7:OH}]_{38} [Y_{1:OH}S_{2:OH}P_{3:trans}T_{4}S_{5:PO4}P_{6:trans}S_{7:PO4}]_{14}) could be seen as selectable “programs” capable of being read by the transcriptional machinery, each corresponding to a specific transcriptional output. In this case, the quantity of information transmitted could be calculated by determining the difference in complexity between a randomized CTD configuration and a given non-randomized sequence. That is to say, a change from a randomized CTD configuration to a simpler configuration would constitute a message to Rpb1p to carry out a certain transcriptional program. In this way, controlled modulation of the entropy of the CTD could be used to drive broad developmental/metabolic/physiological changes to gene expression in a selectable manner.