In order to perform the tasks in our methods, code was written in perl, MATLAB and Python, and is available upon request.
Datasets
For this domain-domain MI investigation we use proteins that have two domains, rather than protein complexes, and treat each domain as a separate protein. In this manner we can be sure that accurate “protein-protein” pairings are used in the MSA. The MSAs are taken from[12], available athttp://www.stats.ox.ac.uk/research/bioinfo/resources, which in turn are based on datasets in[18, 37–39]. The proteins for which each MSA was constructed has a known pdb structure[40] of X-ray resolution 2.5Å or better, and well annotated domain boundaries[12]. This structure is henceforth referred to as the “reference structure” and is used to identify surface, buried, contact and non-contact residue columns within the MSA. The MSA was generated by using the structural protein as a BLAST query[41, 42] against the NCBI-NR database[43]. The homologs identified were made non-redundant at the 90% level using Cd-hit[44]. The final alignment was generated using MUSCLE[45] and MaxAlign[46].
Amongst the set of 67 protein cases available from[12], proteins that contain a single domain that interacts with more than one other domain in the set are disregarded. We chose to omit these proteins as domains interacting with multiple domains may have undergone correlated mutations not pertaining to the pair of domains being presently considered. We thus lose 15 of the 67 cases. In order to aid statistical analysis of the results we select only those domain pairs that have at least 20 contact and 20 non-contact residues on each domain, and the corresponding MSA columns of these residues must be ungapped and have an entropy greater than 0. Therefore a further 9 test cases are lost. Similarly, we filter out 1 test case that has less than 20 surface and buried residues respectively. These factors, along with eliminating 2 cases that have poorly annotated secondary structures in their reference pdb structure file[40], leave us with 40 inter-domain MSAs (Table5; using[12, 47]). The 80 single domains in this dataset range in size from 60 to 376 residues.
Within the 40 inter-domain MSAs there are non-standard amino acid entries, such as B, Z, X, * and ?. As there is no established method of processing these sequencing uncertainties, we choose to treat them as a gap, while the Brown and Brown pipeline code processes them as additional amino acids[11].
Identifying the surface versus buried residue pairs
For each reference structure protein in the dataset, we calculate the solvent accessibility of the residues using JOY[48]; each domain is treated as a separate entity. In the reference structure, residues that are >7% accessible to a 1.4Å radius water molecule are denoted as “surface” residues[48]. Those that do not meet this criterion are termed “buried.” This information about a residue is then annotated to the entire MSA column to which it belongs.
Employing this criterion on our 40 test cases, along with eliminating residue columns that have an entropy of 0 or contain a gap, leaves us with 5483 surface residues and 2364 buried residues. These numbers decline to 5362 and 2174 respectively when employing the 40 reduced alphabet MSAs, as the reduced alphabet MSAs have a greater number of columns with 0 entropy (Equation (1)). The ratios of surface to buried residues in the reduced and non-reduced alphabet sets are 2.5 and 2.3 respectively. The number of 0 entropy columns in the reduced and non-reduced alphabet set are 668 and 326 respectively, while the number of columns with one or more gaps in both sets are 3479.
Identifying the contact versus non-contact residues
Residues within the binding interface of a pair of interacting domains are labelled as “contact” residues (Figure3). There is not one particular accepted definition for contact residues[26, 49–51]. Here we classified a residue in the representative protein structure as a “contact” residue if:
-
1.It
is on the surface of the individual domain.
-
2.It
is < 4.5Å from a residue in the other domain[50].
-
3.The
solvent accessibility of the residue is different depending on whether the domain is viewed as a separate structural entity or whether the domain is in complex.
If not all of the above criteria are met, a residue is denoted as a “non-contact” residue.
Using these criteria, and once again ignoring residue columns having an entropy of 0 or a gap, leaves us with 1342 contact and 4141 non-contact surface residues, over all 40 test cases. These numbers decline to 1306 and 4056 respectively when employing the reduced alphabet on the 40 test cases. The ratio of contact to non-contact residues is 0.32 in both the reduced and non-reduced alphabet sets.
Calculating the Shannon Entropy
The entropy H
unstandardised
(J) of each column J in an MSA is calculated by Equation (1) with log denoting log
e
here,
(1)
We use log
e
in our MI calculations so that we may compare our results with other MI investigations[23, 27]. In this equation J is a column in the MSA with probabilities P(J = j) for the discrete set of n amino acids jε{1,…,n}. When P(J = j) = 0 then we set P(J = j)log P(J = j) = 0. The entropy is maximal when all j are equally likely to occur, i.e. P(J = j) = 1/n and[13].
In order to compare the entropies from different MSAs we standardise the entropy score as follows:
(2)
where H
unstandardised
(J) is the entropy of column J in the MSA, and and are the average entropy and estimated standard deviation, respectively, over all columns in the MSA combined. Our calculated entropies range from 0.0 to 2.8, while the standardised entropies vary from -4.2 to 3.0.
Calculating MI
The joint entropy of two columns J and K is defined as:
(3)
where column J has n different residues, and column K has m different residues.
The general MI formula is:
(4)
The MI is maximal when residues in columns J and K always covary, i.e. P(J = K) = 1 making the. The maximum MI that can be observed for protein sequences, which have 20 varying residues, is log 20 ≃ 2.9957[13].
Unstandardised MI values of 0 are omitted from any further analysis. The reason for this explained in the section titled “MI scores of 0.” The average MI and estimated standard deviation of the MI of all contact and non-contact pairs in the protein are then calculated. A “standardised MI score” is calculated as
(5)
where M I
unstandardised
(J;K) is the MI of columns J and K in the MSA, and and are the average MI and estimated standard deviation respectively, over all interacting domains’ column pairs in the MSA, excluding pairs with an MI value of 0, involving a 0 entropy residue column or a gapped residue column.
Our calculated M I
unstandardised
scores vary from 0.0 to 1.6, while the standardised MI scores range from -3.2 to 3.7.
Calculating MIp
Dunn et al. proposed a variant of MI that aims to correct for background (random and phylogenetic) noise of each pair of columns under consideration, MIp[9] . This MI correction is denoted by the equation
(6)
where M I
unstandardised
(J;K) is calculated as denoted in Equation (4). As previously, pairs involving a 0 entropy residue column or a gapped residue column, or having an MI score of 0 are ignored. APC(J;K), the average product correction, is a modification term for columns J and K in the MSA, evaluated as follows:
(7)
where is the average mutual information for column J, is the average mutual information for column K, and is the overall average mutual information.
As done previously for MI, MIp scores are also standardised (Equation (8)),
(8)
where MI p
unstandardised
(J;K) is the MIp of columns J and K in the MSA and and are the average MIp and estimated standard deviation respectively, over all calculated column pairs in the protein.
Our MIp scores vary from 0.0 to 0.4, while the standardised MIp scores range from -3.1 to 7.0.
Calculating MIc
Lee and Kim designed normalising measures that aim to reduce phylogenetic noise in MI scores[25]. They begin with the coevolutionary pattern similarity score (CPS) that measures the similarity between the MI score patterns of the two residues being considered. It is denoted as follows,
(9)
Here M I
unstandardised
(J;L) is the MI score of the columns J and L, which is calculated as described in Equation (4). The number of columns in the MSA are denoted by n. Since the CPS is the product of two MI scores, it is then normalised by the square root of the mean of all CPS scores.
(10)
To adapt the NCPS for domain-domain prediction we consider only those CPS scores that refer to domain-domain column pairs, i.e. one column from each protein, and adjust n in Equation (10) accordingly. Once again MI values of 0 are ignored, as are 0 entropy and gapped residue columns. This NCPS score is then subtracted from the corresponding MI pair score to yield Lee and Kim’s noise reduced MI variant, MIc.
(11)
MIc scores for each protein are standardised in a manner similar to MI and MIp (Equations 5 and 8), so that MIc values from different proteins can be compared.
(12)
where MI c
unstandardised
(J;K) is the MIc of columns J and K in the MSA and and are the average MIc and estimated standard deviation respectively, over all column pairs being considered in the protein.
The MIc scores calculated on our 40 test cases range from -0.02 to 0.1, while the standardised scores range from -2.4 to 7.7.
3-dimensional (3D) MI and MIp
MI[27] and MIp[9] were adapted to consider triangles of residues;
(13)
where MSA column J from domain 1 has n different residues, column K from domain 2 has m different residues, and column L from domain 2 has s different residues. Residues in the representative protein structure, corresponding to columns K and L, should be < 4.5Å from each other in order to be considered as being on the same patch in the domain.
MIp3D is defined as
(14)
In this equation MI 3D
unstandardised
(J;K;L) is calculated as denoted in Equation (13) and APC 3D(J;K;L) is calculated as
(15)
where is the average 3D mutual information for column J, is the average 3D mutual information for column K, is the average 3D mutual information for column L, and is the overall average 3D mutual information.
In order to compare the 3D mutual information scores between test cases, MI3D and MIp3D scores were standardised in a manner similar to those described in Equations (5), (8) and (12), respectively. Once again MI3D values of 0 were ignored, as were 0 entropy columns and columns containing one or more gaps.
Reduced Alphabet (RA) MI scores
We grouped the 20 amino acids into the same seven physiochemical categories successfully employed by Hamer et al. in their domain-domain contact predictor, i-Patch[12]. These seven categories include: Small (S,G,A,P), Hydrophobic (V,M,I,L,C), Negatively charged (D,E), Aromatic (F,Y,W), Polar (Q,T,N), Favoured Positively-charged (R,H), and Disfavoured Positively-charged (K). These physiochemical groups are abbreviated to S, H, N, A, P, F and D respectively. Hamer et al. introduced Favoured and Disfavoured categories because Lysine (K) was found to be rare in protein/domain interfaces (propensity 0.66), while Arginine (R) and Histidine (H) were far more common (propensities of 1.05 and 1.11, respectively)[12].
We replaced the amino acid alphabets in each MSA by their corresponding category abbreviation and recalculated MI, MIp, MIc, MI3D and MIp3D as described above. The five new MI variant scores are referred to as MIRA, MIpRA, MIcRA, MI3DRA and MIp3DRA.
We choose to employ this particular set of seven physiochemical categories as it was successfully used by i-Patch[12] in domain-domain contact prediction. We do not expect another grouping to dramatically improve the predictive capabilities of MI and its variants further.
P-ROC curves
For classification, each residue in all 40 test cases was assigned the maximum MI score that its residue column achieved with any other residue column in its MSA. When the average score of each residue was assigned instead, the performance of the MI variants decreased significantly, consequently the maximum score was employed.
When there is a disproportionate number of positive versus negative cases, P-ROC (Precision Recall Operating Characteristic) curves[52] provide an alternative to ROC (Receiver Operating Characteristic) curves[53] when attempting to evaluate the performance of a classifier. In our 40 test case dataset, contact residues constitute only 24.5% of all residues, thus the P-ROC will be more informative than the ROC curve. To calculate precision and recall the percentiles of the scores are used as cut-offs, where
(16)
and
(17)
TP in these equations denote the number of true positives, FP denotes the number of false positives and FN symbolises the number of false negatives.
Each P-ROC plot contains a flat horizontal line (green) at. This line denotes the probability of randomly discriminating positive versus negative cases. For example, in Figure2 the solid green line is at 0.245 because there are 1342 “contact” scores out of 5483 total surface scores in the non-reduced alphabet test set. In this figure the dashed green line is at 0.244 because there are 1306 “contact” scores out of 5362 total surface scores in the reduced alphabet test set. Similarly, in Additional file1: Figure S1 the solid green line is at 0.699 as there are 5483 “surface” scores out of 7847 total scores in the non-reduced alphabet test set, while the dashed green line is at 0.712 for there are 5362 “surface” scores out of 7536 total scores in the non-reduced alphabet test set.
Sub-sampling to test stability of MI scores
To test the stability of the 10 MI variant scores under minor changes in the MSA, for each test case 70% of sequences in the MSA were randomly selected and all 10 MI scores recalculated and 10 respective P-ROC curves were plotted. This sub-sampling and calculation process was repeated 100 times per test case for every MI variant. Then the average and standard error of the precision values for the 100 P-ROC curves were calculated for each MI variant. The precision values at 20% recall for each of the MI variants are listed in Table2.
This sub-alignment creation and MI recalculation process was only carried out on those 24 test cases that had ≥200 sequences to ensure that a minimum of 125 sequences were retained in each sub-alignment, the suggested minimum number of sequences required to reduce the stochastic noise in the MSA[8].
MI scores of 0
Pairing any MSA column with a fully conserved column, i.e. a column with an entropy of 0, results in a joint entropy equivalent to the entropy of the non-fully conserved column and subsequently an MI score of 0 for that pair. Since conserved columns do not give any indication of correlated mutations, MI scores involving these columns are ignored. This is standard procedure; for example,[6]. The relationship between percent of columns in an MSA with an entropy of 0 and percent MI scores of 0 computed can be observed in Figure4. This approximately linear relationship further affirms the direct influence a column with an entropy of 0 has on the MI score of pairs involving that column.