PCP-ML: Protein characterization package for machine learning
© Eickholt and Wang; licensee BioMed Central Ltd. 2014
Received: 27 January 2014
Accepted: 31 October 2014
Published: 18 November 2014
Machine Learning (ML) has a number of demonstrated applications in protein prediction tasks such as protein structure prediction. To speed further development of machine learning based tools and their release to the community, we have developed a package which characterizes several aspects of a protein commonly used for protein prediction tasks with machine learning.
A number of software libraries and modules exist for handling protein related data. The package we present in this work, PCP-ML, is unique in its small footprint and emphasis on machine learning. Its primary focus is on characterizing various aspects of a protein through sets of numerical data. The generated data can then be used with machine learning tools and/or techniques. PCP-ML is very flexible in how the generated data is formatted and as a result is compatible with a variety of existing machine learning packages. Given its small size, it can be directly packaged and distributed with community developed tools for protein prediction tasks.
Source code and example programs are available under a BSD license at http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/. The package is implemented in C++ and accessible as a Python module.
Machine Learning (ML) techniques have been successfully applied to a variety of protein related classification tasks. In particular, machine learning has proven quite useful in the area of protein structure prediction and resulted in the development of a number of tools and particular applications. These include the prediction of a protein’s secondary structure [1, 2], residue solvent accessibility , residue-residue contacts and contact maps [3, 4], residue order/disorder [5, 6], fold recognition  and protein model quality . These tools, while useful in their own right, also form part of larger protein structure tools and tertiary structure prediction pipelines (e.g., MULTICOM  and I-TASSER ).
In general, machine learning methods work on a feature space which characterizes an object or event. The machine learning methods attempt to learn some meaningful relation between elements in the feature space and/or a map between the feature space and classifications. For most protein prediction tasks, the primary feature space is the protein’s sequence and/or data directly derived thereof (e.g., sequence profile). As machine learning techniques are mathematical models, the sequence data (e.g., FASTA files, multiple sequence alignments, etc.) must be read in and then converted to a meaningful numerical format.
Major methods provided by each component of PCP-ML
Parsers and encoders
Feature writers and generators
As almost all protein structure prediction tasks start with the protein’s sequence and sequence profile, PCP-ML provides several methods to parse FASTA files, anchored multiple sequence alignments (MSAs), output from DSSP , and position specific scoring matrices (PSSMs) from PSI-BLAST . Many higher level prediction tasks also make use of predicted secondary structure and predicted solvent accessibility. Therefore, we have included parsers for common output formats for these types of predictions. In particular, PCP-ML can read files generated from SSPro  and PSIPRED .
Here, we note that we do not include a parser for PDB files (i.e., a common format used by the Protien Data Bank  for protein structure). Our rationale for not including a PDB parser is that for most prediction tasks, the structural information that would be contained in a PDB file is not available and hence a PDB parser is not needed for the production of end-user protein structure prediction tools. Interested readers may find a PDB parser included with Biopython  or ESBTL .
Characterizers and encoders
The input into machine learning methods is numerical and as a result it is necessary to encode data such as secondary structure (SS), solvent accessibility (SA) and amino acid (AA) type. One approach to this end is to convert each SS, SA and AA type to vectors of length 3, 2 and 20, respectively. In each vector, all of the values are 0 except for one value which depending on its position in the vector signifies the type (e.g. 100 represents a helix while 001 encodes for a coil). This type of encoding is often referred to as hot encoding, or orthogonal encoding , and allows for a numerical conversion without arbitrarily imposing an ordering on the encoding. PCP-ML contains three methods for hot encoding.
There are a number of ways to characterize a protein’s sequence and the PCP-ML package includes many of these. Perhaps the most obvious is to represent amino acid residues by numerical values stemming from statistical studies on experimentally determined structures. Included in PCP-ML are pair-wise contact potentials , beta sheet pairing potentials , and hydrophobicity . We also included the Atchley factors for each amino acid . These factors represent each amino acid type in a five dimensional space in which similar amino acids are grouped together and the proximity of any two amino acids is a measure of their similarity.
Proteins can also be characterized by their content. PCP-ML contains methods which calculate the percent content of a protein by secondary structure type (i.e., helix, beta sheet, or coil/loop), solvent accessibility (i.e., buried or exposed) or amino acid residue type. This information is a way to characterize a protein globally (i.e., irrespective of residue index). This approach can also be applied at the residue level using an anchored MSA or PSSM. Using either an MSA or PSSM, it is possible to calculate the relative frequency of each type of AA at a position in the sequence as well as the amount of information contained at that position.
Description of each Characterizer contained in PCP-ML
Name of characterizer
Brief description of functionality provided
Characterizes five major aspects of an amino acid with real number values. The values were obtained via a statistical analysis of amino acids when looking at polarity, secondary structure, molecular size , amino acid composition and charge. These values were reported in .
Characterizes contact potential between two residues. These contact potentials come from a statistical analysis performed on contacts in protein interfaces. They were reported in .
Characterizes the contact potential for two residues in two beta sheets. These values come from a study of contact potentials of residues in cross strand pairings in beta sheets. They were reported in .
Determine the percentage of each secondary structure (SS) type in a string representing the secondary structure of the entire protein.
Determine the percentage of solvent accessibility from a string representing the solvent accessibility of the entire protein.
Determine the percentage of each amino acid in a protein sequence.
Characterizes the hydrophobicity of a residue. These values come from a study on hydrophobicity and helical propensity in .
Calculates the Pearson correlation coefficient for the elements of two feature vectors.
Calculates the cosine between two feature vectors.
Calculates the nth ordered mean for the Amino Acid, Secondary Structure or Solvent Accessibility string.
Calculates the Shannon entropy for a vector of probabilities
The input format for standard machine learning packages (e.g., SVMlight, NNrank, etc.) varies but typically consists of a text file in which each line represents a training or classification example. Some packages require the features to be numbered as well. PCP-ML provides feature writers which can print out features (optionally with number) and/or save them to a file. This functionality allows users to use PCP-ML to create stand-alone feature generation programs that they can package with standard machine learning programs or tie feature generation directly into their tools. Note that it is difficult to accommodate file formats for all machine learning packages. The feature writers we developed and included allow a user to print the features along with feature numbers and/or the labels/targets themselves. The targets and feature numbers can be easily modified via the parameters passed to the feature writing functions.
PCP-ML is a software package to characterize proteins for machine learning applications in protein structure prediction as well as more general protein related prediction tasks. It provides a number of functions that allow for rapid prototyping and testing of methods and easy deployment of developed tools. The package can be used to create feature generation programs compatible with popular machine learning tools or compiled into stand-alone applications. As an open source project, it is freely available to the community and can be modified and extended as needed.
Availability and requirements
Project name: PCP-ML
Project home page: http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/
Operating System(s): Linux, Mac OS X
Programming Language: C++, Python
Other requirements: C++ compiler
Any restrictions to use by non-academics: None
PCP-ML is written in C++ and available in both source code and a Python module. These are available at http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/. At the site, users can also find examples, a tutorial, access additional documentation and learn about porting the package to other languages such as Perl or Octave.
This work was supported in part by a start-up grant from Central Michigan University to JE and a NSF Mississippi EPSCoR seed grant GM006278 to ZW.
- Cheng J, Randall AZ, Sweredoski MJ, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 2005, 33: W72-W76. 10.1093/nar/gki396.PubMedPubMed CentralView ArticleGoogle Scholar
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202. 10.1006/jmbi.1999.3091.PubMedView ArticleGoogle Scholar
- Di Lena P, Nagata K, Baldi P: Deep architectures for protein contact map prediction. Bioinform Oxf Engl. 2012, 28: 2449-2457. 10.1093/bioinformatics/bts475.View ArticleGoogle Scholar
- Eickholt J, Cheng J: Predicting protein residue-residue contacts using deep networks and boosting. Bioinform Oxf Engl. 2012, 28: 3066-3072. 10.1093/bioinformatics/bts598.View ArticleGoogle Scholar
- Walsh I, Martin AJM, Di Domenico T, Tosatto SCE: ESpritz: accurate and fast prediction of protein disorder. Bioinform Oxf Engl. 2012, 28: 503-509. 10.1093/bioinformatics/btr682.View ArticleGoogle Scholar
- Eickholt J, Cheng J: DNdisorder: predicting protein disorder using boosting and deep networks. BMC Bioinform. 2013, 14: 88-10.1186/1471-2105-14-88.View ArticleGoogle Scholar
- Cheng J, Baldi P: A machine learning information retrieval approach to protein fold recognition. Bioinformatics. 2006, 22: 1456-1463. 10.1093/bioinformatics/btl102.PubMedView ArticleGoogle Scholar
- Wang Z, Eickholt J, Cheng J: APOLLO: a quality assessment service for single and multiple protein models. Bioinformatics. 2011, 27: 1715-1716. 10.1093/bioinformatics/btr268.PubMedPubMed CentralView ArticleGoogle Scholar
- Li J, Deng X, Eickholt J, Cheng J: Designing and benchmarking the MULTICOM protein structure prediction system. BMC Struct Biol. 2013, 13: 2-10.1186/1472-6807-13-2.PubMedPubMed CentralView ArticleGoogle Scholar
- Xu D, Zhang J, Roy A, Zhang Y: Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement. Proteins. 2011, 79 (Suppl 10): 147-160.PubMedPubMed CentralView ArticleGoogle Scholar
- Dutheil J, Gaillard S, Bazin E, Glémin S, Ranwez V, Galtier N, Belkhir K: Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinform. 2006, 7: 188-10.1186/1471-2105-7-188.View ArticleGoogle Scholar
- Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, Hoon MJL D: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009, 25: 1422-1423. 10.1093/bioinformatics/btp163.PubMedPubMed CentralView ArticleGoogle Scholar
- Döring A, Weese D, Rausch T, Reinert K: SeqAn An efficient, generic C++ library for sequence analysis. BMC Bioinform. 2008, 9: 11-10.1186/1471-2105-9-11.View ArticleGoogle Scholar
- Joachims T: Advances in Kernel Methods. Edited by: Schölkopf B, Burges CJC, Smola AJ. 1999, Cambridge, MA, USA: MIT Press, 169-184.Google Scholar
- Cheng J, Wang Z, Pollastri G: A neural network approach to ordinal regression. IEEE Int. Jt. Conf. Neural Networks 2008 IJCNN 2008 IEEE World Congr. Comput. Intell. 2008, 1279-1284.View ArticleGoogle Scholar
- Kabsch W, Sander C: Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22: 2577-2637. 10.1002/bip.360221211.PubMedView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView ArticleGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.PubMedPubMed CentralView ArticleGoogle Scholar
- Loriot S, Cazals F, Bernauer J: ESBTL: efficient PDB parser and data structure for the structural and geometric analysis of biological macromolecules. Bioinformatics. 2010, 26: 1127-1128. 10.1093/bioinformatics/btq083.PubMedView ArticleGoogle Scholar
- Baldi P, Brunak S: Bioinformatics: The Machine Language Approach. 2001, Cambridge: MIT PressGoogle Scholar
- Glaser F, Steinberg DM, Vakser IA, Ben-Tal N: Residue frequencies and pairing preferences at protein-protein interfaces. Proteins. 2001, 43: 89-102. 10.1002/1097-0134(20010501)43:2<89::AID-PROT1021>3.0.CO;2-H.PubMedView ArticleGoogle Scholar
- Zhu H, Braun W: Sequence specificity, statistical potentials, and three-dimensional structure prediction with self-correcting distance geometry calculations of beta-sheet formation in proteins. Protein Sci Publ Protein Soc. 1999, 8: 326-342.View ArticleGoogle Scholar
- Monera OD, Sereda TJ, Zhou NE, Kay CM, Hodges RS: Relationship of sidechain hydrophobicity and α-helical propensity on the stability of the single-stranded amphipathic α-helix. J Pept Sci. 1995, 1: 319-329. 10.1002/psc.310010507.PubMedView ArticleGoogle Scholar
- Atchley WR, Zhao J, Fernandes AD, Drüke T: Solving the protein sequence metric problem. Proc Natl Acad Sci U S A. 2005, 102: 6395-6400. 10.1073/pnas.0408677102.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.