PRED_PPI: a server for predicting protein-protein interactions based on sequence data with probability assignment
© Li et al; licensee BioMed Central Ltd. 2010
Received: 23 March 2010
Accepted: 26 May 2010
Published: 26 May 2010
Protein-protein interactions (PPIs) are crucial for almost all cellular processes, including metabolic cycles, DNA transcription and replication, and signaling cascades. Given the importance of PPIs, several methods have been developed to detect them. Since the experimental methods are time-consuming and expensive, developing computational methods for effectively identifying PPIs is of great practical significance.
Most previous methods were developed for predicting PPIs in only one species and do not provide probability estimates. In this work, a relatively comprehensive prediction system based on a support vector machine (SVM) was developed for predicting PPIs in five organisms: humans, yeast, Drosophila, Escherichia coli, and Caenorhabditis elegans. The predictor outputs a probability with each prediction, which can be used to assess the confidence of each SVM prediction. Using a probability of 0.5 as the threshold for assigning class labels, the method had an average accuracy for detecting protein interactions of 90.67% for humans, 88.99% for yeast, 90.09% for Drosophila, 92.73% for E. coli, and 97.51% for C. elegans. Moreover, among the correctly predicted pairs, more than 80% were predicted with a high probability of ≥0.8, indicating that this tool could predict novel PPIs with high confidence.
Based on this work, a web-based system, Pred_PPI, was constructed for predicting PPIs from the five organisms. Users can predict novel PPIs and obtain a probability value for each prediction using this tool. Pred_PPI is freely available at http://cic.scu.edu.cn/bioinformatics/predict_ppi/default.html.
Protein-protein interactions (PPIs) are essential for almost all cellular processes. Currently, the PPIs discovered by experimental methods are far from sufficient for examining complete PPI networks [1]. Consequently, computational tools for effectively identifying PPIs are increasingly important. Current computational methods can be classified into two main approaches. The first is based on genomic [2] or structural information of proteins [3, 4]. However, these methods cannot be applied when such prior information about the proteins is unavailable. The second approach is based on protein primary sequences [5–7].
In general, a PPI predictor should provide a probability estimate for each prediction in its output. However, most methods for PPI prediction were developed for only one particular species and do not include a probability estimate. The sequence-based method proposed by Guo et al. [7] performs well when applied to predicting PPIs of Saccharomyces cerevisiae. Therefore, we extended the method to additional organisms. PPI prediction models were constructed for humans, yeast, Drosophila, Escherichia coli, and Caenorhabditis elegans, with a probability assignment for each support vector machine (SVM) prediction. The web server Pred_PPI was developed for free use to predict novel PPIs with probability assignments.
Materials and methods
Interaction information for human proteins was obtained from the Human Protein Reference Database (HPRD), release 7_20070901 [8]. The PPI data for yeast, Drosophila, E. coli, and C. elegans were obtained from the Database of Interacting Proteins (DIP), version DIP_20070219 [9]. After removing protein pairs that contained a protein of fewer than 50 amino acids, 37027 PPIs remained in the dataset for humans, 5943 for yeast, 22975 for Drosophila, 6954 for E. coli, and 4030 for C. elegans. Noninteracting pairs were determined based on protein subcellular localization information, as described by Guo et al. [7]. Negative datasets were built so that the number of negative pairs equaled the number of positive pairs. For each organism, the entire dataset was partitioned into a training set and a test set (detailed description in Additional File 1). To minimize the dependence of the prediction model on the data, the sampling process was repeated five times, generating five training sets and five test sets. Each model was evaluated by averaging the prediction results of the five test sets.
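The dataset preparation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the 80% training fraction, and the random seed are assumptions made for the example (the actual partition is described in Additional File 1), and the split shown here is a plain random split of a balanced set.

```python
import random

MIN_LENGTH = 50  # pairs containing a shorter protein are removed (from the text)

def filter_pairs(pairs, lengths, min_length=MIN_LENGTH):
    """Keep only pairs in which both proteins have at least min_length residues."""
    return [(a, b) for a, b in pairs
            if lengths[a] >= min_length and lengths[b] >= min_length]

def balanced_splits(positives, negatives, train_fraction=0.8, repeats=5, seed=0):
    """Yield one (train, test) split of labelled pairs per repeat.

    train_fraction and seed are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))  # equal numbers of both classes
    for _ in range(repeats):
        pos = rng.sample(positives, n)
        neg = rng.sample(negatives, n)
        data = [(p, 1) for p in pos] + [(q, 0) for q in neg]
        rng.shuffle(data)
        cut = int(len(data) * train_fraction)
        yield data[:cut], data[cut:]
```

Each of the five (train, test) pairs would then be used to fit and evaluate one model, and the five test results averaged.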
Classification was implemented using libsvm 2.84 [10]. This software predicts both the class label and probability information. Details of the method for extending SVM to give probability estimates are in Wu et al. [11]. The radial basis function was chosen as the kernel function, and two parameters, the regularization parameter C and the kernel width parameter γ, were optimized using a grid search approach.
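As a rough illustration of the kernel and the parameter search, the sketch below defines the radial basis function kernel and a generic grid search over (C, γ). It is not the libsvm implementation: the helper names are hypothetical, the exponent ranges are assumed because the grid that was searched is not stated here, and the scoring function (for example, cross-validated accuracy of an SVM trained with the given parameters) is left to the caller.

```python
import math
from itertools import product

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

def log2_grid(start, stop, step):
    """Return powers of two with exponents start, start + step, ... up to stop."""
    return [2.0 ** e for e in range(start, stop + 1, step)]

def grid_search(score, c_values, gamma_values):
    """Return the (C, gamma) pair for which score(C, gamma) is highest."""
    return max(product(c_values, gamma_values), key=lambda cg: score(*cg))

# Hypothetical exponent ranges, chosen only for illustration.
C_GRID = log2_grid(-5, 15, 2)
GAMMA_GRID = log2_grid(-15, 3, 2)
```

Searching on an exponential grid keeps the number of candidate pairs small while still covering several orders of magnitude for both parameters.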
Results and Discussion
Prediction results of the test sets for five organisms with a probability threshold of 0.5 (accuracy, %).
A. Human PPI prediction: 90.67 ± 0.17
B. Yeast PPI prediction: 88.99 ± 0.75
C. Drosophila PPI prediction: 90.09 ± 8.39
D. E. coli PPI prediction: 92.73 ± 3.94
E. C. elegans PPI prediction: 97.51 ± 0.22
Finally, to further verify the general performance of this method, a test set of human PPIs was constructed. Recently published data were collected from HPRD release 8_20090706 by excluding PPIs already present in HPRD release 7_20070901. The test set contained 2201 PPIs, none of which were included in the training set. For predicting human PPIs, the Shen et al. method [6] achieved the highest accuracy, 83.9%. Therefore, we used this test set for an unbiased comparison of the method developed here with the Shen et al. method [6]. Comparison results are in Table S4 (Additional File 2). Using the default probability threshold of 0.5, 2106 PPIs were correctly predicted by our method, with a prediction accuracy of 93.59%. The Shen et al. method [6] correctly predicted 1479, with an accuracy of only 66.88%. Moreover, among the correctly predicted PPIs, 89.50% (1885 PPIs) had a high interaction probability of ≥0.9 by our method, while only 66.78% were predicted with ≥0.9 interaction probability by the Shen et al. method [6]. To avoid homology bias in the prediction result, all proteins in the test set were aligned with those in the training set using the BLASTCLUST program [13]. We removed protein pairs in the test set with ≥25% pairwise sequence identity to those in the training set. The remaining 1983 PPIs comprised an independent dataset. The prediction results of our method using this independent dataset are also in Table S4 (Additional File 2). The method still achieved a high accuracy of 93.09% for the independent dataset, and 90% of the correctly predicted PPIs had a ≥0.9 interaction probability. These results indicate that the newly developed method not only performs well in general but also gives high-confidence predictions.
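The two quantities reported in this evaluation, accuracy at a probability threshold of 0.5 and the share of correct predictions made with a probability of at least 0.9, can be computed as in the sketch below. The function name and the toy inputs are hypothetical. For non-interacting pairs, confidence is taken here as one minus the interaction probability, which is an assumption of this example; the test set above contains interacting pairs only, for which confidence is simply the interaction probability.

```python
def summarize(probabilities, labels, threshold=0.5, high_confidence=0.9):
    """Return (accuracy, fraction of correct predictions with high confidence).

    probabilities: predicted probability of interaction for each pair
    labels: true class labels (1 = interacting, 0 = non-interacting)
    """
    correct_conf = []
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == y:
            # Confidence is the probability assigned to the predicted class.
            correct_conf.append(p if predicted == 1 else 1.0 - p)
    accuracy = len(correct_conf) / len(labels)
    high = sum(1 for c in correct_conf if c >= high_confidence)
    high_fraction = high / len(correct_conf) if correct_conf else 0.0
    return accuracy, high_fraction

# Toy example: five pairs, three predicted correctly, two of those with >= 0.9.
accuracy, high_fraction = summarize([0.95, 0.7, 0.3, 0.05, 0.6], [1, 1, 1, 0, 0])
```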
Web server for PPI prediction
Acknowledgements
This work was funded by the National Natural Science Foundation of China (No. 20775052, 20905054) and the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20090181120058).
References
1. Han JD, Dupuy D, Bertin N, Cusick ME, Vidal M: Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol. 2005, 23: 839-844. 10.1038/nbt1116.
2. Juan D, Pazos F, Valencia A: High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proc Natl Acad Sci USA. 2008, 105: 934-939. 10.1073/pnas.0709671105.
3. Singhal M, Resat H: A domain-based approach to predict protein-protein interactions. BMC Bioinformatics. 2007, 8: 199. 10.1186/1471-2105-8-199.
4. Burger L, van Nimwegen E: Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol. 2008, 4: 165. 10.1038/msb4100203.
5. Chou KC, Cai YD: Predicting protein-protein interactions from sequences in a hybridization space. J Proteome Res. 2006, 5: 316-322. 10.1021/pr050331g.
6. Shen JW, Zhang J, Luo XM, Zhu WL, Yu KQ, Chen KX, Li YX, Jiang HL: Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci USA. 2007, 104: 4337-4341. 10.1073/pnas.0607879104.
7. Guo YZ, Yu LZ, Wen ZN, Li ML: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 2008, 36: 3025-3030. 10.1093/nar/gkn159.
8. Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi TK, Chandrika KN, Deshpande N, Suresh S: Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. 2004, 32: D497-D501. 10.1093/nar/gkh070.
9. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP: The database of interacting proteins. A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.
10. LIBSVM -- A Library for Support Vector Machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
11. Wu TF, Lin CJ, Weng RC: Probability estimates for multi-class classification by pairwise coupling. J Mach Learn Res. 2004, 5: 975-1005.
12. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P: STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005, 33: D433-437. 10.1093/nar/gki005.
13. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.