Open Access

Mutual information and variants for protein domain-domain contact prediction

  • Mireille Gomes1,
  • Rebecca Hamer1,
  • Gesine Reinert1 and
  • Charlotte M Deane1Email author
BMC Research Notes20125:472

DOI: 10.1186/1756-0500-5-472

Received: 15 May 2012

Accepted: 10 August 2012

Published: 31 August 2012



Predicting protein contacts solely based on sequence information remains a challenging problem, despite the huge amount of sequence data at our disposal. Mutual Information (MI), an information theory measure, has been extensively employed and modified to identify residues within a protein (intra-protein) that are in contact. More recently MI and its variants have also been used in the prediction of contacts between proteins (inter-protein).


Here we assess the predictive power of MI and variants for domain-domain contact prediction. We test original MI and these variants, which are called MIp, MIc and ZNMI, on 40 domain-domain test cases containing 10,753 sequences. We also propose and evaluate two new versions of MI that consider triangles of residues and the physiochemical properties of the amino acids, respectively.


We found that all versions of MI are skewed towards predicting surface residues. Since domain-domain contacts are on the surface of each domain, we considered only surface residues when attempting to predict contacts. Our analysis shows that MIc is the best current MI domain-domain contact predictor. At 20% recall MIc achieved a precision of 44.9% when only surface residues were considered. Our triangle and reduced alphabet variants of MI highlight the delicate trade-off between signal and noise in the use of MI for domain-domain contact prediction. We also examine a specific “successful” case study and demonstrate that here, when considering surface residues, even the most accurate domain-domain contact predictor, MIc, performs no better than random.


All tested variants of MI are skewed towards predicting surface residues. When considering surface residues only, we find MIc to be the best current MI domain-domain contact predictor. Its performance, however, is not as good as a non-MI based contact predictor, i-Patch. Additionally, the intra-protein contact prediction capabilities of MIc outperform its domain-domain contact prediction abilities.


Proteins are actors in a complex system, with their functions to a large extent defined by their interactions with other proteins. It is the size, shape and chemical properties of the residues on the surface of the protein that dictate the capacity of a protein to interact with other proteins. The ability to predict the residues involved in these interactions would help to identify specific functionality, structural constraints and even disease-causing mutations.

In this paper we examine the capacity for mutual information (MI) methods to predict these contact residues between proteins by assessing their ability to predict contacts between two domains of a protein. To date, MI based methods have been extensively used to predict contacts within a protein (intra-protein)[112].

MI uses a multiple sequence alignment (MSA) of homologous sequences; it measures the dependence between two columns in this alignment, with the aim of identifying correlated mutations[13]. If two residues are in close proximity, it is likely that a change in size, shape or chemistry of one will need to be compensated for by a change in the other, if the contact is to remain energetically favourable[3, 1416]. These compensatory mutations are often referred to as correlated mutations.

MI was first applied to sequence alignments by Korber et al. to identify covarying positions in a viral peptide[1]. We hypothesise that MI has since gained popularity because it is non-parametrized, i.e. the scores of MI are solely dependent on an MSA and no additional information, such as a phylogeny propensity table[12], a similarity matrix[17, 18] and so on is required. Furthermore, unlike other algorithms that predict contact “patches”[7, 19], or individual contact residues[20, 21], MI attempts to predict specific pairs of residues that are in contact with one another.

Horner et al.[22] have collated the accuracies of intra-protein contact residue prediction from several publications that employ correlated mutation analysis algorithms, and showed that MI has an accuracy between 2% and 18%. Accuracy here refers to the percentage of predictions that are correct. The low accuracy of MI has in turn precipitated many variants that attempt to improve its performance. These variants specifically attempt to correct for the following three recognised limitations of MI: highly variable columns, phylogenetic relationships and insufficient sequences in the MSA. There is evidence that columns in the MSA that have a high variability contribute to random and non-random high MI scores[8, 23], while phylogenetic relationships[5] and insufficient number of sequences in the MSA[8] weaken the signal detection ability of MI. In 1995, Clarke[2] corrected the MI score by a measure relating to the number of amino acid pairs occurring at each position to negate the influence of highly diverged sequences that may be inappropriately aligned in the MSA. Later, Wollenberg and Atchley[5] used parametric bootstrapping to adjust for evolutionary relationships; another technique for reducing phylogenetic noise in the MSA before employing MI is the evolutionary trace approach[24]. Tillier and Lui[6] designed a tool which removes columns in an MSA that carry a high phylogenetic signal and then employs MI to identify positions in the resulting MSA that coevolve with each other, but do not coevolve significantly with other positions. As performance was still disappointing, Martin et al. attempted to remove the noise caused by entropy by dividing the MI score of a pair of columns by their joint entropy[8]. These authors also suggested that a minimum of 125 sequences should be used in an MSA to reduce stochastic noise. Dunn et al. improved on this score by introducing MIp, which modified the MI value by a measure that aims to eliminate phylogenetic and entropic effects[9]. Subsequently, Lee and Kim[25] introduced two other powerful phylogenetic noise reduction MI measures, MIc and aMIc. In 2010 Brown and Brown[11] suggested yet another MI measure, ZNMI, that accounts for different alphabet sizes among columns in the MSA. These authors also proposed a pipeline to yield highly reproducible scores. Despite all of these efforts, to date no single MI measure has achieved general utility or wide acceptance for predicting intra-protein contact sites.

MI has begun to be extended to predict inter-protein contact residues. Halperin et al.[26] carried out a small study of original MI and other correlation algorithms on 15 bacteria and archea fusion protein families, and Lee and Kim[25] evaluated their newly formulated MI measures on a specialised dataset of 27 homo-trimers. There have also been several high profile case studies on a small number of examples (one to three), such as[811, 27]. However, so far there has been no systematic study on a large, general purpose, cross-species dataset of the performance of MI and its latest variants on inter-protein contact residue prediction, partly because no such large inter-protein dataset with accurately paired MSAs is available[25, 26].

Following the idea of Pazos et al.[18], here we make a step towards closing this gap by employing MI and its variants on 40 domain-domain test cases; the contacts between domains serve as a proxy for inter-protein contacts[18]. We evaluate MI measures that do not require any additional information and rely solely on the sequence alignment itself; we focus on the original MI and its most recent extensions MIp and MIc, as well as ZNMI. Unlike the other measures, the ZNMI score is embedded in an iterative pipeline designed by its authors[11]. Hence we have to analyse the performance of ZNMI differently to its competitors.

We have also attempted to strengthen the predictive capabilities of MI by introducing two new MI variants. The first variant considers triangles of residues rather than pairs to identify contacts, with the aim of enhancing the signal, and is referred to as MI3D and MIp3D. As MIc already considers a third column in its normalising term, it was not extended to triangle scores. Our second variant is designed to reduce noise by grouping residues in the MSA into seven physiochemical categories and subsequently calculating MI. This modification is indicated by the suffix RA (reduced alphabet), and the resulting five variants are: MIRA, MI3DRA, MIpRA, MIp3DRA and MIcRA. Thus altogether we examine the domain-domain contact prediction ability of 10 MI measures: MI, MI3D, MIRA, MI3DRA, MIp, MIp3D, MIpRA, MIp3DRA, MIc and MIcRA, alongside the pipeline ZNMI.

In our study we observe that all versions of MI do not perform as well as a current leading domain-domain contact predictor, i-Patch[12], which achieves a precision of 48.9% at 20% recall on the 40 test cases. The enhanced predictive ability of i-Patch may arise from its use of surface residues only, residue propensity scores, as well as the MSA. Conversely, MI variants rely solely on the MSA. Upon further investigation we find that MI, MIp and MIc scores are skewed towards predicting surface residues rather than contact. Thereafter, like i-Patch, we consider surface residues only when attempting to predict domain-domain contacts using MI. This in turn improves the contact prediction abilities of all 10 MI measures. Amongst the 10 tested MI variants and ZNMI, we find that MIc is the leading MI domain-domain contact predictor, attaining a precision of 44.9% precision at 20% recall.

Results and discussion

For this inter-protein MI investigation we use 40 proteins that have two domains, and treat each domain as a separate protein. Proteins are composed of one or more domains. A multidomain protein, protein A, will often use one of its domains to bind to protein B and another to bind to protein C, thus allowing protein A to perform multiple functions. Consequently, in reality protein-protein interactions are often domain-domain interactions[28]. Hence using domain-domain interactions within proteins as a proxy for protein-protein interactions ensures that interacting “proteins” are accurately paired in each MSA, while capturing interaction mechanisms.

Unfortunately the seemingly more straightforward approach of building MSAs by finding homologous sequences of two proteins known to be in complex, and then pairing the sequences originating from the same species, is not feasible for the following reasons. Inferring protein-protein interactions across species based on sequence homology has a low level of accuracy, requiring a sequence identity of far higher than 70%[29]. Using only sequences with >70% identity would have resulted in MSAs with a low number of sequences and few amino acid changes, not sufficient enough to yield statistically significant MI results[8]. Furthermore, within a species there may be multiple homologs of the interacting proteins, and selecting the correct pairs out of this is an unsolved problem[29]. Therefore previous protein-protein contact residue prediction investigations have chosen to employ proteins of known structure, with two domains, that have several homologous sequences available[12, 18]. Hamer et al.[12] have shown that the propensities of amino acids to occur as contact residues between two domains in a protein and between protein complexes are highly similar. While it is hence plausible to use domain-domain interactions as a proxy for inter-protein interactions, it is however possible that protein-protein interfaces may indeed differ from domain-domain interfaces.

Here we want to predict whether a residue is a contact residue or not using MI-based scores on the generated domain-domain MSAs. As MI assigns scores to pairs of columns in an MSA, first we calculate the MI score for all pairs of columns, or triangles of columns in the case of the 3D variants. To obtain a score for individual columns, each residue in all 40 test cases was assigned the maximum MI score that the residue column achieved with any other residue column in its MSA. We also tested assigning the average score of each residue column, but this resulted in a significant decrease in the performance of the MI variants. A residue is then assigned the score of its column.

When calculating MI pair or triplet scores, as in previous work[9, 25], only ungapped aligned columns were used. When allowing gapped columns there was a tendency for MI methods to underperform.

For the 40 test cases employed, the probability of randomly selecting contacts, i.e. correctly picking a contact residue from the total set of residues, without any information about the proteins involved, is 17.1%. Running i-Patch[12], a non-MI-based domain-domain contact predictor, on this dataset resulted in a precision of 48.9% at 20% recall. When assessing domain-domain contact prediction, Lee and Kim[25] found that their MIc measure outperformed their additionally normalised aMIc score[25], as well as Dunn et al.’s MIp measure[9]. We also found MIc to be the best domain-domain contact predictor of our tested MI variants. On our 40 test cases, MIc attained a precision of 34.7% at 20% recall, demonstrating that the performance of MI methods is below that of the parametrised method i-Patch.

We conjecture that the enhanced classification capabilities of i-Patch may be due to its use of residue propensities, along with its consideration of only residues on the surface of a protein when attempting to predict contacts between proteins. Consequently, we examined the effect of surface versus buried residues on domain-domain MI contact prediction.

In our 40 test cases, the probability of randomly selecting a surface residue from all residues is 69.9%. Using MI, MIp and MIc on our dataset as surface residue predictors (is the highest scoring residue on the surface?), we observed that each of the measures surpassed this random classification and achieved a precision of 86.9%, 75.5% and 74.1% respectively, at 20% recall (Additional file1: Figure S1). Thus it appears that high scores of all three variants of MI are skewed towards surface residues. This is probably due to the observed high entropy of surface residue columns.

Prior investigations have shown that MI scores strongly correlate with the entropy of the columns involved[8, 23]. Figure1 shows that MSA columns corresponding to surface residues tend to have a higher entropy than those associated with buried residues. The observed lower column entropy for buried residues is consistent with previous studies that have shown that buried residues are under greater evolutionary constraints than solvent-accessible surface residues[3033]. A slower rate of evolution of these residues is unsurprising since buried residues often play a crucial role in maintaining the 3D structure of a protein. We hypothesise that this skewness of MI towards surface residues in turn perturbs its ability to predict contact residues. With this in mind we eliminated buried residues from further evaluation of the performance of MI, MIp and MIc for domain-domain contact prediction.
Figure 1

Standardised entropy medians of surface versus buried residue columns for all domains in the dataset. Comparing the medians of the standardised entropy scores of each domain’s surface residue columns (blue) against the medians of each domain’s buried residue columns (yellow). Residue columns containing one or more gaps, or having an entropy score of 0 are not included in the median calculation.

After filtering out buried residues in the 40 test cases, the precision of MIc increases from 34.7% to 44.9% at 20% recall (Figure2C). The probability of randomly selecting a contact residue is now 24.4%. Excluding buried residues therefore clearly has a considerable effect. As can be observed in Figure2, Table1 and Additional file2: Figure S2, MIc still outperforms the other MI variants. MI and MIp achieve a precision of 24.4% and 42.3% respectively at 20% recall (Figures2A and B, and Table1).
Figure 2

Contact versus non-contact prediction P-ROC curves for MI variants on the 40 test cases. A, B and C illustrate the performance of MI, MIp and MIc variants respectively when distinguishing contact from non-contact surface residues. The solid green line in all plots depicts the chance of randomly selecting a contact residue, while the dashed green line indicates the probability of randomly selecting a contact residue when employing the reduced alphabet amino acid set.

Table 1

Precision for detecting contact versus non-contact residues at 20% recall

Contact prediction

MI variant

Precision contact vs. non-contact

























Results are given for the 40 test cases. MI variants are listed in descending order of contact versus non-contact precision, i.e. best to worst classifier of contact residues. The probability of randomly selecting a contact residue from all surface residues is 24.5%. This probability changes to 24.4% when using the reduced alphabet amino acid set because residues are lost as the entropy of their corresponding MSA column reduces to 0.

For a meaningful comparison of scores however, we need to assess their stability. In order to test this we randomly selected 70% of the sequences from each MSA 100 times, and recalculated the variant MI scores for each of the sub-alignments (Table2). We observed that the rank order of the top five MI variant scores was maintained (Table2). This sub-alignment creation and MI recalculation procedure was only carried out on those 24 test cases that had ≥200 sequences, to ensure that there would be at least 125 sequences in the sub-alignments, the suggested minimum number of sequences required to reduce the stochastic noise in the MSA[8]. Hence the results in Table1 refer to 40 test cases, while those in Table2 reflect the mentioned subset of 24 cases. Based on two-sample t-tests, with a sample size of 24, the differences between the top four scores in Table2 are highly significant at the 0.1% level.
Table 2

Precision for detecting contact versus non-contact residues at 20% recall, for sub-alignments of 70%

Contact prediction for sub-alignments

MI variant

70% AVG



















































70% of sequences in an MSA were randomly selected and the 10 MI variant scores based on the new sub-alignment were calculated. This subset selection and calculation procedure was repeated 100 times for those test cases that had ≥200 sequences to ensure ≥125 sequences in each sub-alignment[8]. Thus 24 test cases were used. For each of the 100 iterations a P-ROC curve similar to Figure2 was plotted for the 24 test cases (figures not shown), and the precision at 20% recall recorded. Columns one and two, respectively, contain the averages and standard deviations of these 100 precision values. Column three indicates the precision attained at 20% recall when all sequences in the 24 original MSAs were used. When considering this set of 24 cases, the probability of randomly selecting a contact residue from all surface residues is 24.4% generally and when using the reduced alphabet MSAs. The MI variants are listed in descending order of the average precision of the 100 70% sub- alignments. The presence of an ‘*’ at an MI variant indicates that the difference between the precision of this MI variant and the next lowest is significant at the 0.1% level, when using two-sample t-tests with a sample size of 24.

Additionally, it is worth noting that the performance of all non-3D MI variants improve when using MSAs that have ≥200 sequences (Tables1 and2).

To account for the mentioned variability in scores due to changes in the MSA, Brown and Brown[11] have designed a new MI measure, ZNMI, as well as a methodology to yield highly reproducible and accurate contact pair prediction scores. Their suggested algorithm repeatedly partitions the MSAs into 50% sub-alignments, calculates the pair scores, retains significant scoring pairs for each partition and subsequently compares all partitions to acquire consensus pair scores. It should be noted that unlike our methodology, this pipeline does not filter out buried residues. The authors provided us with code for MI[8], MIp[9], OMES[23, 34], SCA[35], ZNMI[11] and ZRES[10] measures wrapped within their proposed pipeline, but unfortunately not for MIc. Having run this code on our 40 domain-domain test cases we find that using ZNMI in conjunction with their algorithm does improve on the performance of original MI; at 20% recall the precision of ZNMI is 30.5% (Table3), as opposed to the 24.4% precision of original MI (Figure2A and Table1). ZNMI within the Brown and Brown pipeline even outperforms MIp, when MIp is incorporated into the same pipeline (27.1% precision at 20% recall; Table3). However, the performance of MIp independent of the pipeline, after filtering out buried residues and columns with one or more gaps, supersedes ZNMI and all other coevolving residue algorithms tested by the authors, as illustrated by its precision of 42.3% at 20% recall (Table3).
Table 3

Precision at 20% recall of contact prediction algorithms used within Brown and Brown[11] pipeline

Brown and Brown[11] pipeline


Precision contact vs. non-contact

MIp - original, minus buried














MI - original, minus buried


Results are given for the 40 test cases. The Brown and Brown[11] pipeline was applied to the contact residue prediction algorithms listed in column one, with the exceptions of MIp and MI “original, minus buried.” As in Table1, these two algorithms were run independently of the pipeline and buried residue columns, residue columns with one ore more gaps or an entropy of 0 were filtered out. The table is arranged in descending order of precision.

3D and Reduced Alphabet MI adjustments

To investigate methods that might further enhance the predictive power of MI variants we designed two adjustments. The first adjustment considers triangles of columns rather than pairs, based on the idea that interactions occur in patches[36]. This variant is denoted by the suffix 3D. The second adjustment, suffixed RA, reduces the 20 amino acids to seven categories based on their physical and chemical properties, with the aim of reducing noise.

The idea behind the 3D version is that protein binding involves patches of residues in contact. This idea has been previously used to predict contact residues[12, 36]. Furthermore, the success of MIc lies in its normalising factor, the coevolutionary pattern similarity (CPS) score, which estimates the coevolutionary relationship between the pair of residues currently under consideration and all other residues in the MSA[25]. We thus speculated that adding additional residue information to MI and MIp pair scores may enhance their domain-domain contact predictive capabilities. Hence we created new versions of MI and MIp that consider triangles rather than pairs of columns to identify contacts (MI3D (Equation 13) and MIp3D (Equation 14)). Increasing the dimensionality of MI and MIp in this manner surprisingly worsened performance in both cases; the precision at 20% recall of MI3D and MIp3D are 19.9% and 31.8% respectively as compared to precision of MI and MIp of 24.4% and 42.3% (Figure2 A and B, and Table1). We conjecture that adding an extra dimension to MI and MIp magnifies the noise in the MSA more than it boosts the signal.

Assuming that contact residues mutate in a correlated manner in order to maintain their interaction, it is not evident how much of a change a residue can undergo while still maintaining its contacts. Using a reduced alphabet residue set addresses this point; as it groups residues by their physiochemical properties, under the assumption that residues with the same physiochemical properties will maintain similar interactions. Grouping the 20 amino acids into seven categories only improved the performance of basic MI and MI3D, which rose in precision at 20% recall from 24.4% to 28.4% and 19.9% to 23.5% respectively (Figure2A and Table1). In all other cases the reduced alphabet (RA) appeared to reduce noise as well as signal (Figure2, Table1 and Additional file2: Figure S2).

Case study

The case study Skerker et al.[27] has received a lot of attention for successfully determining inter-protein contact specificity residues with the aid of MI. The authors used original MI (Equation (4)) to determine a subset of contact residues that allow for specific binding of a histidine kinase (HK) with its interacting response regulator (RR). The MSA provided by these authors does not contain the sequence of the structure used in their analysis. Hence we ran MI and MIc on the HK-RR MSA provided by Hamer et al.[12], which does include the sequence of this reference structure.

As Skerker et al. were interested in residue pairs only between the DHp domain (four helix bundle) of the HK and its interacting RR, only these MI and MIc scores were considered when examining performance. In accordance with our evaluation method on the 40 test cases, all buried residues were eliminated, as were residues corresponding to columns that had one or more gaps or an entropy of 0. This leaves us with 46 DHp residues, nine of which are contacts, and 68 RR residues, amongst which 24 are contacts.

As we have no score cut-off for predictions we check the number of correct predictions among the top nine predictions for DHp. If there was no relationship between the MI scores and contact sites, then the number of correct predictions would follow a Binomial distribution with sample size nine and probability of success 9/46. Under this model we would expect 1.76 correct predictions.

For RR there are 24 contact residues. We check the number of correct predictions among the top 24 predictions. If there was no relationship between the MI scores and contact sites, then the number of correct predictions would follow a Binomial distribution with sample size 24 and probability of success 24/68. Under this model we would expect 8.47 correct predictions.

The results are recorded in Table4. The p-value is the probability of seeing a number this large or larger under the corresponding Binomial model. None of the p-values are below 5%. Therefore at the 5% level there is no statistical evidence to reject the null hypothesis that in this case study random guess does as well as MI and MIc.
Table 4

Performance of MI and MIc on a histidine kinase (HK) - response regulator (RR) complex

Case study


DHp: 9 contacts out of 46

RR: 24 contacts out of 68


Contacts among top 9


Contacts among top 24












MI and MIc are run on the HK-RR MSA provided by Hamer et al.[12]. Each surface, ungapped DHp residue column, having a column entropy greater than 0, is assigned the maximum score it achieves when paired with the RR residue columns. These DHp residues are then ranked according to score and the number of true contact residues amongst the top nine scores are recorded for MI and MIc respectively. The same steps are applied to residues in the RR and the number of true contact residues in the top 24 scores are counted for MI and MIc. The p-value refers to the probability of seeing a number this large or larger under the corresponding Binomial model.


MIc is the best current MI domain-domain contact predictor. The performance of MIc on our domain-domain test cases is not as good as its intra-protein contact prediction[25]. Its predictive capabilities are also not as high as i-Patch[12], a non-MI-based domain-domain contact predictor, but unlike this algorithm, MIc relies solely on sequence information in an MSA. Our 3D and reduced alphabet variants of MI did not improve prediction, but illustrate the delicate trade-off between signal to noise in the use of MI for domain-domain contact prediction.


In order to perform the tasks in our methods, code was written in perl, MATLAB and Python, and is available upon request.


For this domain-domain MI investigation we use proteins that have two domains, rather than protein complexes, and treat each domain as a separate protein. In this manner we can be sure that accurate “protein-protein” pairings are used in the MSA. The MSAs are taken from[12], available athttp://​www.​stats.​ox.​ac.​uk/​research/​bioinfo/​resources, which in turn are based on datasets in[18, 3739]. The proteins for which each MSA was constructed has a known pdb structure[40] of X-ray resolution 2.5Å or better, and well annotated domain boundaries[12]. This structure is henceforth referred to as the “reference structure” and is used to identify surface, buried, contact and non-contact residue columns within the MSA. The MSA was generated by using the structural protein as a BLAST query[41, 42] against the NCBI-NR database[43]. The homologs identified were made non-redundant at the 90% level using Cd-hit[44]. The final alignment was generated using MUSCLE[45] and MaxAlign[46].

Amongst the set of 67 protein cases available from[12], proteins that contain a single domain that interacts with more than one other domain in the set are disregarded. We chose to omit these proteins as domains interacting with multiple domains may have undergone correlated mutations not pertaining to the pair of domains being presently considered. We thus lose 15 of the 67 cases. In order to aid statistical analysis of the results we select only those domain pairs that have at least 20 contact and 20 non-contact residues on each domain, and the corresponding MSA columns of these residues must be ungapped and have an entropy greater than 0. Therefore a further 9 test cases are lost. Similarly, we filter out 1 test case that has less than 20 surface and buried residues respectively. These factors, along with eliminating 2 cases that have poorly annotated secondary structures in their reference pdb structure file[40], leave us with 40 inter-domain MSAs (Table5; using[12, 47]). The 80 single domains in this dataset range in size from 60 to 376 residues.
Table 5

The dataset








1 82

83 173




67 270

271 317




1 188

189 268




19 152

153 314




1 51

52 227




18 216

261 350




1 158

159 200




2 169

170 319




1 147

148 484




7 148

149 319




1 162

163 246




1 255

256 452




1 156

157 293




1 127

128 208




88 227

228 357




3 176

177 346




1 246

247 337




1 376

377 484




16 108

109 253




3 95

96 195




16 97

102 189




1 188

189 235




10 300

301 406




1 71

72 180




3 76

77 188




1 247

261 480




17 67

68 207




8 86

87 216




1 172

178 433




3 136

137 368




2 56

57 250




20 142

143 296




201 284

285 495




23 268

269 328




1 220

248 319




1 126

127 436




1 154

155 333




1 201

202 335




1 82

90 217




1 135

136 316



The “protein” column contains a list of pdb identifiers[40]. D1 and D2 columns denote the start and end pdb residues of domains 1 and 2, respectively. For all pdbs listed, the start and end residues are located in chain A of the structure, except for pdb 1W98 where the mentioned domains are in chain B, and pdb 8TLN in chain E. The “sequences” column indicates the number of sequences present in the multiple sequence alignment (MSA). The final column states the distribution of sequences in each MSA taken from the various species’ domains: eukaryotes (E); archea (A); bacteria (B); viruses (V); unclassified (U); and not found (N), i.e. those sequences that could not be found in the NCBI Taxonomy Database. This dataset was taken from Hamer et al.[12].

Within the 40 inter-domain MSAs there are non-standard amino acid entries, such as B, Z, X, * and ?. As there is no established method of processing these sequencing uncertainties, we choose to treat them as a gap, while the Brown and Brown pipeline code processes them as additional amino acids[11].

Identifying the surface versus buried residue pairs

For each reference structure protein in the dataset, we calculate the solvent accessibility of the residues using JOY[48]; each domain is treated as a separate entity. In the reference structure, residues that are >7% accessible to a 1.4Å radius water molecule are denoted as “surface” residues[48]. Those that do not meet this criterion are termed “buried.” This information about a residue is then annotated to the entire MSA column to which it belongs.

Employing this criterion on our 40 test cases, along with eliminating residue columns that have an entropy of 0 or contain a gap, leaves us with 5483 surface residues and 2364 buried residues. These numbers decline to 5362 and 2174 respectively when employing the 40 reduced alphabet MSAs, as the reduced alphabet MSAs have a greater number of columns with 0 entropy (Equation (1)). The ratios of surface to buried residues in the reduced and non-reduced alphabet sets are 2.5 and 2.3 respectively. The number of 0 entropy columns in the reduced and non-reduced alphabet set are 668 and 326 respectively, while the number of columns with one or more gaps in both sets are 3479.

Identifying the contact versus non-contact residues

Residues within the binding interface of a pair of interacting domains are labelled as “contact” residues (Figure3). There is not one particular accepted definition for contact residues[26, 4951]. Here we classified a residue in the representative protein structure as a “contact” residue if:

  1. 1.It

    is on the surface of the individual domain.

  2. 2.It

    is < 4.5Å from a residue in the other domain[50].

  3. 3.The
    solvent accessibility of the residue is different depending on whether the domain is viewed as a separate structural entity or whether the domain is in complex.
    Figure 3

    Contact residues in a pair of interacting domains. Test case 1J5X.pdb[40]. The two structurally defined domains are depicted in orange (residue 2 to 169) and green (residue 170 to 319) respectively. In the magnified frame, residues in red denote contact residues. Dotted lines and corresponding numbers indicate the Ångström distance between a pair of atoms in the connected residues.


If not all of the above criteria are met, a residue is denoted as a “non-contact” residue.

Using these criteria, and once again ignoring residue columns having an entropy of 0 or a gap, leaves us with 1342 contact and 4141 non-contact surface residues, over all 40 test cases. These numbers decline to 1306 and 4056 respectively when employing the reduced alphabet on the 40 test cases. The ratio of contact to non-contact residues is 0.32 in both the reduced and non-reduced alphabet sets.

Calculating the Shannon Entropy

The entropy H unstandardised (J) of each column J in an MSA is calculated by Equation (1) with log denoting log e here,
H unstandardised ( J ) = i = 1 n P ( J = j ) log P ( J = j ) .

We use log e in our MI calculations so that we may compare our results with other MI investigations[23, 27]. In this equation J is a column in the MSA with probabilities P(J = j) for the discrete set of n amino acids {1,…,n}. When P(J = j) = 0 then we set P(J = j)log P(J = j) = 0. The entropy is maximal when all j are equally likely to occur, i.e. P(J = j) = 1/n and H unstandardised ( J ) = 1 n log 1 n = log n[13].

In order to compare the entropies from different MSAs we standardise the entropy score as follows:
H ( J ) = H unstandardised ( J ) H ¯ unstandardised σ H unstandardised ,

where H unstandardised (J) is the entropy of column J in the MSA, and H ¯ unstandardised and σ H unstandardised are the average entropy and estimated standard deviation, respectively, over all columns in the MSA combined. Our calculated entropies range from 0.0 to 2.8, while the standardised entropies vary from -4.2 to 3.0.

Calculating MI

The joint entropy of two columns J and K is defined as:
H unstandardised ( J ; K ) = j = 1 n k = 1 m P ( J = j , K = k ) log P ( J = j , K = k ) ,

where column J has n different residues, and column K has m different residues.

The general MI formula is:
MI ( J ; K ) unstandardised = H unstandardised ( J ) + H unstandardised ( K ) H unstandardised ( J ; K ) .

The MI is maximal when residues in columns J and K always covary, i.e. P(J = K) = 1 making the MI = j = 1 n P ( J = j ) log P ( J = j ) The maximum MI that can be observed for protein sequences, which have 20 varying residues, is log 20  2.9957[13].

Unstandardised MI values of 0 are omitted from any further analysis. The reason for this explained in the section titled “MI scores of 0.” The average MI and estimated standard deviation of the MI of all contact and non-contact pairs in the protein are then calculated. A “standardised MI score” is calculated as
MI ( J ; K ) = M I unstandardised ( J ; K ) MI ¯ unstandardised σ M I unstandardised ,

where M I unstandardised (J;K) is the MI of columns J and K in the MSA, and MI ¯ unstandardised and σ M I unstandardised are the average MI and estimated standard deviation respectively, over all interacting domains’ column pairs in the MSA, excluding pairs with an MI value of 0, involving a 0 entropy residue column or a gapped residue column.

Our calculated M I unstandardised scores vary from 0.0 to 1.6, while the standardised MI scores range from -3.2 to 3.7.

Calculating MIp

Dunn et al. proposed a variant of MI that aims to correct for background (random and phylogenetic) noise of each pair of columns under consideration, MIp[9] . This MI correction is denoted by the equation
MI p unstandardised ( J ; K ) = M I unstandardised ( J ; K ) APC ( J ; K ) ,
where M I unstandardised (J;K) is calculated as denoted in Equation (4). As previously, pairs involving a 0 entropy residue column or a gapped residue column, or having an MI score of 0 are ignored. APC(J;K), the average product correction, is a modification term for columns J and K in the MSA, evaluated as follows:
APC ( J ; K ) = MI ¯ unstandardised ( J ) MI ¯ unstandardised ( K ) MI ¯ unstandardised ,

where MI ¯ unstandardised ( J ) is the average mutual information for column J, MI ¯ unstandardised ( K ) is the average mutual information for column K, and MI ¯ unstandardised is the overall average mutual information.

As done previously for MI, MIp scores are also standardised (Equation (8)),
MIp ( J ; K ) = MI p unstandardised ( J ; K ) MIp ¯ unstandardised σ MI p unstandardised ,

where MI p unstandardised (J;K) is the MIp of columns J and K in the MSA and MIp ¯ unstandardised and σ MI p unstandardised are the average MIp and estimated standard deviation respectively, over all calculated column pairs in the protein.

Our MIp scores vary from 0.0 to 0.4, while the standardised MIp scores range from -3.1 to 7.0.

Calculating MIc

Lee and Kim designed normalising measures that aim to reduce phylogenetic noise in MI scores[25]. They begin with the coevolutionary pattern similarity score (CPS) that measures the similarity between the MI score patterns of the two residues being considered. It is denoted as follows,
CPS ( J ; K ) = 1 n 2 L J ; K M I unstandardised ( J ; L ) M I unstandardised ( K ; L ) .
Here M I unstandardised (J;L) is the MI score of the columns J and L, which is calculated as described in Equation (4). The number of columns in the MSA are denoted by n. Since the CPS is the product of two MI scores, it is then normalised by the square root of the mean of all CPS scores.
NCPS ( J ; K ) = CPS ( J ; K ) 1 n ( n 1 ) J , K CPS ( J ; K ) .
To adapt the NCPS for domain-domain prediction we consider only those CPS scores that refer to domain-domain column pairs, i.e. one column from each protein, and adjust n in Equation (10) accordingly. Once again MI values of 0 are ignored, as are 0 entropy and gapped residue columns. This NCPS score is then subtracted from the corresponding MI pair score to yield Lee and Kim’s noise reduced MI variant, MIc.
MI c unstandardised ( J ; K ) = M I unstandardised ( J ; K ) NCPS ( J ; K ) .
MIc scores for each protein are standardised in a manner similar to MI and MIp (Equations 5 and 8), so that MIc values from different proteins can be compared.
MIc ( J ; K ) = MI c unstandardised ( J ; K ) MIc ¯ unstandardised σ MI c unstandardised ,

where MI c unstandardised (J;K) is the MIc of columns J and K in the MSA and MIc ¯ unstandardised and σ MI c unstandardised are the average MIc and estimated standard deviation respectively, over all column pairs being considered in the protein.

The MIc scores calculated on our 40 test cases range from -0.02 to 0.1, while the standardised scores range from -2.4 to 7.7.

3-dimensional (3D) MI and MIp

MI[27] and MIp[9] were adapted to consider triangles of residues;
MI 3 D unstandardised ( J ; K ; L ) = j = 1 n k = 1 m l = 1 s P ( J = j , K = k , L = l ) × log P ( J = j , K = k , L = l ) P ( J = j ) P ( K = k ) P ( L = l ) ,

where MSA column J from domain 1 has n different residues, column K from domain 2 has m different residues, and column L from domain 2 has s different residues. Residues in the representative protein structure, corresponding to columns K and L, should be < 4.5Å from each other in order to be considered as being on the same patch in the domain.

MIp3D is defined as
MIp 3 D unstandardised ( J ; K ; L ) = MI 3 D unstandardised ( J ; K ; L ) APC 3 D ( J ; K ; L ) .
In this equation MI 3D unstandardised (J;K;L) is calculated as denoted in Equation (13) and APC 3D(J;K;L) is calculated as
APC 3 D ( J ; K ; L ) = MI 3 D ¯ unstandardised ( J ) MI 3 D ¯ unstandardised ( K ) MI 3 D ¯ unstandardised ( L ) MI 3 D ¯ unstandardised ,

where MI 3 D ¯ unstandardised ( J ) is the average 3D mutual information for column J, MI 3 D ¯ unstandardised ( K ) is the average 3D mutual information for column K, MI 3 D ¯ unstandardised ( L ) is the average 3D mutual information for column L, and MI 3 D ¯ unstandardised is the overall average 3D mutual information.

In order to compare the 3D mutual information scores between test cases, MI3D and MIp3D scores were standardised in a manner similar to those described in Equations (5), (8) and (12), respectively. Once again MI3D values of 0 were ignored, as were 0 entropy columns and columns containing one or more gaps.

Reduced Alphabet (RA) MI scores

We grouped the 20 amino acids into the same seven physiochemical categories successfully employed by Hamer et al. in their domain-domain contact predictor, i-Patch[12]. These seven categories include: Small (S,G,A,P), Hydrophobic (V,M,I,L,C), Negatively charged (D,E), Aromatic (F,Y,W), Polar (Q,T,N), Favoured Positively-charged (R,H), and Disfavoured Positively-charged (K). These physiochemical groups are abbreviated to S, H, N, A, P, F and D respectively. Hamer et al. introduced Favoured and Disfavoured categories because Lysine (K) was found to be rare in protein/domain interfaces (propensity 0.66), while Arginine (R) and Histidine (H) were far more common (propensities of 1.05 and 1.11, respectively)[12].

We replaced the amino acid alphabets in each MSA by their corresponding category abbreviation and recalculated MI, MIp, MIc, MI3D and MIp3D as described above. The five new MI variant scores are referred to as MIRA, MIpRA, MIcRA, MI3DRA and MIp3DRA.

We choose to employ this particular set of seven physiochemical categories as it was successfully used by i-Patch[12] in domain-domain contact prediction. We do not expect another grouping to dramatically improve the predictive capabilities of MI and its variants further.

P-ROC curves

For classification, each residue in all 40 test cases was assigned the maximum MI score that its residue column achieved with any other residue column in its MSA. When the average score of each residue was assigned instead, the performance of the MI variants decreased significantly, consequently the maximum score was employed.

When there is a disproportionate number of positive versus negative cases, P-ROC (Precision Recall Operating Characteristic) curves[52] provide an alternative to ROC (Receiver Operating Characteristic) curves[53] when attempting to evaluate the performance of a classifier. In our 40 test case dataset, contact residues constitute only 24.5% of all residues, thus the P-ROC will be more informative than the ROC curve. To calculate precision and recall the percentiles of the scores are used as cut-offs, where
precision = TP TP + FP ,
recall = TP TP + FN .

TP in these equations denote the number of true positives, FP denotes the number of false positives and FN symbolises the number of false negatives.

Each P-ROC plot contains a flat horizontal line (green) at TP total scores This line denotes the probability of randomly discriminating positive versus negative cases. For example, in Figure2 the solid green line is at 0.245 because there are 1342 “contact” scores out of 5483 total surface scores in the non-reduced alphabet test set. In this figure the dashed green line is at 0.244 because there are 1306 “contact” scores out of 5362 total surface scores in the reduced alphabet test set. Similarly, in Additional file1: Figure S1 the solid green line is at 0.699 as there are 5483 “surface” scores out of 7847 total scores in the non-reduced alphabet test set, while the dashed green line is at 0.712 for there are 5362 “surface” scores out of 7536 total scores in the non-reduced alphabet test set.

Sub-sampling to test stability of MI scores

To test the stability of the 10 MI variant scores under minor changes in the MSA, for each test case 70% of sequences in the MSA were randomly selected and all 10 MI scores recalculated and 10 respective P-ROC curves were plotted. This sub-sampling and calculation process was repeated 100 times per test case for every MI variant. Then the average and standard error of the precision values for the 100 P-ROC curves were calculated for each MI variant. The precision values at 20% recall for each of the MI variants are listed in Table2.

This sub-alignment creation and MI recalculation process was only carried out on those 24 test cases that had ≥200 sequences to ensure that a minimum of 125 sequences were retained in each sub-alignment, the suggested minimum number of sequences required to reduce the stochastic noise in the MSA[8].

MI scores of 0

Pairing any MSA column with a fully conserved column, i.e. a column with an entropy of 0, results in a joint entropy equivalent to the entropy of the non-fully conserved column and subsequently an MI score of 0 for that pair. Since conserved columns do not give any indication of correlated mutations, MI scores involving these columns are ignored. This is standard procedure; for example,[6]. The relationship between percent of columns in an MSA with an entropy of 0 and percent MI scores of 0 computed can be observed in Figure4. This approximately linear relationship further affirms the direct influence a column with an entropy of 0 has on the MI score of pairs involving that column.
Figure 4

Effect of entropies of 0 on MI scores. The percent of columns in an MSA that have an entropy of 0 is plotted against the percent of all domain-domain residue pairs in the corresponding complex that have an MI value of 0. Only those columns in the MSA that correspond to a residue in the reference structure are used. Columns that have one or more gaps are ignored. Each point on the plot represents a single case study in our domain-domain dataset.

Author’s contributions

MG participated in design of the study, wrote the code used for analysis and drafted the manuscript. RH assisted with coding, participated in design of the study and edited the manuscript. GR supervised this research, directed design of this study and the statistical analysis employed, and edited the manuscript. CMD conceived this piece of work, supervised the research, directed design of the study and edited the manuscript. The final manuscript was read and approved by all authors.



The authors would like to thank the members of the Oxford Protein Informatics Group for helpful discussion about this work and the Clarendon Fund Scholarship for making this research possible.

Authors’ Affiliations

Department of Statistics, University of Oxford


  1. Korber BT, Farber RM, Wolpert DH, Lapedes AS: Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Nat Acad Sci. 1993, 90 (15): 7176-7180. 10.1073/pnas.90.15.7176.PubMedPubMed CentralView Article
  2. Clarke ND: Covariation of residues in the homeodomain sequence family. Protein Sci. 1995, 4 (11): 2269-2278. 10.1002/pro.5560041104.PubMedPubMed CentralView Article
  3. Lockless SW, Ranganathan R: Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999, 286 (5438): 295-299. 10.1126/science.286.5438.295.PubMedView Article
  4. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW: Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol. 2000, 17: 164-178. 10.1093/oxfordjournals.molbev.a026229.PubMedView Article
  5. Wollenberg KR, Atchley WR: Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Nat Acad Sci. 2000, 97 (7): 3288-3291. 10.1073/pnas.97.7.3288.PubMedPubMed CentralView Article
  6. Tillier ERM, Lui TWH: Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics. 2003, 19 (6): 750-755. 10.1093/bioinformatics/btg072.PubMedView Article
  7. Bradford JR, Westhead DR: Improved prediction of protein–protein binding sites using a support vector machines approach. Bioinformatics. 2005, 21 (8): 1487-1494. 10.1093/bioinformatics/bti242.PubMedView Article
  8. Martin LC, Gloor GB, Dunn SD, Wahl LM: Using information theory to search for co-evolving residues in proteins. Bioinformatics. 2005, 21 (22): 4116-4124. 10.1093/bioinformatics/bti671.PubMedView Article
  9. Dunn SD, Wahl LM, Gloor GB: Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics. 2008, 24 (3): 333-340. 10.1093/bioinformatics/btm604.PubMedView Article
  10. Little DY, Chen L: Identification of coevolving residues and coevolution potentials emphasizing structure, bond formation and catalytic coordination in protein evolution. PLoS ONE. 2009, 4 (3): e4762-10.1371/journal.pone.0004762.PubMedPubMed CentralView Article
  11. Brown CA, Brown KS: Validation of coevolving residue algorithms via pipeline sensitivity analysis: ELSC and OMES and ZNMI, Oh My!. PLoS ONE. 2010, 5 (6): e10779-10.1371/journal.pone.0010779.PubMedPubMed CentralView Article
  12. Hamer R, Luo Q, Armitage JP, Reinert G, Deane CM: i-Patch: interprotein contact prediction using local network information. Proteins. 2010, 78 (13): 2781-2797. 10.1002/prot.22792.PubMedView Article
  13. Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge University PressView Article
  14. Fitch WM, Markowitz E: An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem Genet. 1970, 4 (5): 579-593. 10.1007/BF00486096.PubMedView Article
  15. Poon A, Chao L: The rate of compensatory mutation in the DNA bacteriophage phiX174. Genetics. 2005, 170 (3): 989-999. 10.1534/genetics.104.039438.PubMedPubMed CentralView Article
  16. Yanofsky C, Horn V, Thorpe D: Protein structure relationships revealed by mutational analysis. Science. 1964, 146 (3651): 1593-1594. 10.1126/science.146.3651.1593.PubMedView Article
  17. Göbel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins. 1994, 18 (4): 309-317. 10.1002/prot.340180402.PubMedView Article
  18. Pazos F: Correlated mutations contain information about protein-protein interaction. J Mol Biol. 1997, 271 (4): 511-523. 10.1006/jmbi.1997.1198.PubMedView Article
  19. Xu Y, Tillier ERM: Regional covariation and its application for predicting protein contact patches. Proteins. 2010, 78 (3): 548-558.PubMed
  20. Zhang QC, Petrey D, Norel R, Honig BH: Protein interface conservation across structure space. Proc Nat Acad Sci. 2010, 107 (24): 10896-10901. 10.1073/pnas.1005894107.PubMedPubMed CentralView Article
  21. Davis FP: Proteome-wide prediction of overlapping small molecule and protein binding sites using structure. Mol BioSystems. 2011, 7 (2): 545-557. 10.1039/c0mb00200c.View Article
  22. Horner DS, Pirovano W, Pesole G: Correlated substitution analysis and the prediction of amino acid structural contacts. Briefings in Bioinf. 2008, 9: 46-56.View Article
  23. Fodor AA, Aldrich RW: Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins: Struct, Funct, Bioinf. 2004, 56 (2): 211-221. 10.1002/prot.20098.View Article
  24. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996, 257 (2): 342-358. 10.1006/jmbi.1996.0167.PubMedView Article
  25. Lee BC, Kim D: A new method for revealing correlated mutations under the structural and functional constraints in proteins. Bioinformatics. 2009, 25 (19): 2506-2513. 10.1093/bioinformatics/btp455.PubMedView Article
  26. Halperin I, Wolfson H, Nussinov R: Correlated mutations: advances and limitations. a study on fusion proteins and on the cohesin-dockerin families. Proteins. 2006, 63 (4): 832-845. 10.1002/prot.20933.PubMedView Article
  27. Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, Laub MT: Rewiring the specificity of two-component signal transduction systems. Cell. 2008, 133 (6): 1043-1054. 10.1016/j.cell.2008.04.040.PubMedPubMed CentralView Article
  28. Pagel P, Wong P, Frishman D: A domain interaction map based on phylogenetic profiling. J Mol Biol. 2004, 344 (5): 1331-1346. 10.1016/j.jmb.2004.10.019.PubMedView Article
  29. Mika S, Rost B: Protein–protein interactions more conserved within species than across species. PLoS Comput Biol. 2006, 2 (7): e79+-PubMedPubMed CentralView Article
  30. Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL: Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1992, 1 (2): 216-226.PubMedPubMed CentralView Article
  31. Goldman N, Thorne JL, Jones DT: Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics. 1998, 149: 445-458.PubMedPubMed Central
  32. Bustamante CD, Townsend JP, Hartl DL: Solvent accessibility and purifying selection within proteins of Escherichia coli and Salmonella enterica. Mol Biol Evol. 2000, 17 (2): 301-308. 10.1093/oxfordjournals.molbev.a026310.PubMedView Article
  33. Lin YS, Hsu WL, Hwang JK, Li WH: Proportion of solvent-exposed amino acids in a protein and rate of protein evolution. Mol Biol Evol. 2007, 24 (4): 1005-1011. 10.1093/molbev/msm019.PubMedView Article
  34. Kass I, Horovitz A: Mapping pathways of allosteric communication in GroEL by analysis of correlated mutations. Proteins. 2002, 48 (4): 611-617. 10.1002/prot.10180.PubMedView Article
  35. Halabi N, Rivoire O, Leibler S, Ranganathan R: Protein sectors: evolutionary units of three-dimensional structure. Cell. 2009, 138 (4): 774-786. 10.1016/j.cell.2009.07.038.PubMedPubMed CentralView Article
  36. Madaoui H, Guerois R: Coevolution at protein complex interfaces can be detected by the complementarity trace with important impact for predictive docking. Proc Nat Acad Sci. 2008, 105 (22): 7708-7713. 10.1073/pnas.0707032105.PubMedPubMed CentralView Article
  37. Holm L, Sander C: Parser for protein folding units. Proteins. 1994, 19 (3): 256-268. 10.1002/prot.340190309.PubMedView Article
  38. Siddiqui AS, Barton GJ: Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci. 1995, 4 (5): 872-884.PubMedPubMed CentralView Article
  39. Sowdhamini R, Blundell TL: An automatic method involving cluster analysis of secondary structures for the identification of domains in proteins. Protein Sci. 1995, 4 (3): 506-520.PubMedPubMed CentralView Article
  40. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.PubMedPubMed CentralView Article
  41. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.PubMedView Article
  42. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView Article
  43. NCBI-NR Database.http://​www.​ncbi.​nlm.​nih.​gov,
  44. Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.PubMedView Article
  45. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.PubMedPubMed CentralView Article
  46. Gouveia-Oliveira R, Sackett PW, Pedersen AG: MaxAlign: maximizing usable data in an alignment. BMC Bioinformatics. 2007, 8: 312-10.1186/1471-2105-8-312.PubMedPubMed CentralView Article
  47. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Mizrachi I, Ostell J, Panchenko A, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, et al: Database resources of the national center for biotechnology information. Nucleic Acids Res. 2010, 38 (Database issue): D5-D16.PubMedPubMed CentralView Article
  48. Mizuguchi K, Deane CM, Blundell TL, Johnson MS, Overington JP: JOY: protein sequence-structure representation and analysis. Bioinformatics. 1998, 14 (7): 617-623. 10.1093/bioinformatics/14.7.617.PubMedView Article
  49. Horton N, Lewis M: Calculation of the free energy of association for protein complexes. Protein Sci. 1992, 1: 169-181.PubMedPubMed CentralView Article
  50. Carugo O, Argos P: Protein-protein crystal-packing contacts. Protein Sci. 1997, 6 (10): 2261-2263.PubMedPubMed CentralView Article
  51. Camacho CJ, Weng Z, Vajda S, DeLisi C: Free energy landscapes of encounter complexes in protein-protein association. Biophys J. 1999, 76 (3): 1166-1178. 10.1016/S0006-3495(99)77281-4.PubMedPubMed CentralView Article
  52. Buckland M, Gey F: The relationship between recall and precision. J Am Society for Inf Sci. 1994, 45: 12-19. 10.1002/(SICI)1097-4571(199401)45:1<12::AID-ASI2>3.0.CO;2-L.View Article
  53. Fawcett T: An introduction to ROC analysis. Pattern Recognit Lett. 2006, 27 (8): 861-874. 10.1016/j.patrec.2005.10.010.View Article
  54. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000, 16 (5): 412-424. 10.1093/bioinformatics/16.5.412.PubMedView Article


© Gomes et al.; licensee BioMed Central Ltd. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://​creativecommons.​org/​licenses/​by/​2.​0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.