- Project Note
- Open Access
Progress in the PRIDE technique for rapidly comparing protein three-dimensional structures
© Kirillova and Carugo; licensee BioMed Central Ltd. 2008
- Received: 25 February 2008
- Accepted: 11 July 2008
- Published: 11 July 2008
Accurate and fast tools for comparing protein three-dimensional structures are necessary to scan and analyze large data sets.
The method described here is not only very fast but also reasonably precise, as shown by using the CATH database as a test set. Its speed stems from the fact that the protein structure is represented by vectors that monitor the distribution of the inter-residue distances within the protein core, and the structure of which is optimized with the Freedman-Diaconis rule.
The similarity score is based on a χ2 test, the probability density function of which can be accurately estimated.
- Secondary Structural Element
- Protein Structure Alignment
- Structure Comparison Method
- CATH Database
- Structural Similarity Score
Although numerous methods for comparing protein three-dimensional (3D) structures have been designed, we still lack a unique, commonly accepted procedure to measure the structural diversity between proteins. In particular, the structures of distantly related proteins must be represented in a way that allows their comparison; the 3D structure representations used in modern algorithms are described in the reviews [2, 3]. The most accurate protein structure comparison methods produce protein structure alignments that are computationally intensive. Slower techniques may be preferable to analyze and classify sufficiently small data sets. However, the time criterion is crucial for integrated surveys of large databases, like the Protein Data Bank or the domain collections CATH and SCOP. This problem is very similar to that encountered a few years ago with macromolecular sequence databases, which was solved by the development of tools like FASTA, BLAST or PSI-BLAST that allow one to effectively scan enormous databases like UniProt, which presently contains several million entries. Although protein 3D structure databases are still much smaller, several representations of protein structure suitable for rapid comparison without alignment have been proposed [9–13]. One of the fast and automatic techniques for protein structural comparison is PRIDE. In this method the protein structure is represented via a series of distributions of inter-atomic distances, allowing the use of a rapid comparison procedure without alignment.
In the present communication, some improvements of the original PRIDE technology are presented. They make it more accurate than the original version without decreasing its speed. The classification ability of the method was tested on the CATH database.
The PRIDE methodology
In the original PRIDE version, a protein structure is defined by the distributions of the distances between Cαi and Cα(i+n) atoms, where n, which ranges from 3 to 30, is the number of Cα atoms separating them along the backbone. The comparison between two protein 3D structures is reduced to the comparison between distributions of inter-residue distances. This is performed by chi-square contingency table analysis, which estimates whether two distributions represent the same overall population and allows one to compute a probability of identity P, ranging from 0 to 1. Since 28 pairs of histograms are compared, 28 P values are obtained and then averaged to give the overall PRobability of IDEntity (PRIDE) between the two protein 3D structures. Such a similarity score can range, by definition, from 0 to 1, the latter value indicating the identity between the two protein structures. In the next sections, four modifications introduced into this computational procedure will be described.
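The representation used by the original PRIDE version can be illustrated with a short sketch. It is not the authors' implementation: the function names `ca_distance_sets` and `pride_score` are hypothetical, the input is assumed to be a list of Cα coordinates in chain order, and the P values are assumed to come from a separate chi-square test.

```python
import math

def ca_distance_sets(ca_coords, n_min=3, n_max=30):
    """Collect CA(i)-CA(i+n) distances for each sequence separation n.

    ca_coords: list of (x, y, z) tuples in chain order (assumed input format).
    Returns a dict mapping n to the list of distances in Angstrom.
    """
    sets = {}
    for n in range(n_min, n_max + 1):
        sets[n] = [
            math.dist(ca_coords[i], ca_coords[i + n])
            for i in range(len(ca_coords) - n)
        ]
    return sets

def pride_score(p_values):
    """Average the per-separation probabilities of identity into one score."""
    return sum(p_values) / len(p_values)
```

With n running from 3 to 30 the dictionary holds 28 distance sets, one per histogram pair compared in the original method.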
Amount of structural information
The maximal value of n, which was equal to 30 in the old PRIDE version, is now selected as a function of the protein size. Obviously, the histograms, in which inter-residue distances are binned, must contain a sufficiently high number of observations to be compared via any statistical tool. The number of observations in the histograms increases with the length of the protein and decreases with n. Therefore, histograms were generated for all n values from 3 up to nmax, where nmax is the value for which there are only 20 Cαi-Cα(i+n) distances. Clearly, if n > nmax, the histograms would contain fewer than 20 observations, and they were thus ignored. Consequently, the number of histograms differs for proteins of different length in the modified PRIDE version. In the comparison of two domains, represented by series of Cαi-Cα(i+n) histograms, with 3 ≤ n ≤ nmax1 for the first domain and 3 ≤ n ≤ nmax2 for the second domain, the maximal value of n (nmax) was defined as
nmax = min(nmax1, nmax2)
Moreover, only distances between residues belonging to helices and/or strands were taken into account in the modified PRIDE version, in order to increase the computational speed of the method. The STRIDE package, based on the detection of hydrogen-bond patterns and backbone torsions, was used for secondary structure assignment.
Optimization of the dimension of the histogram intervals
Building a regular histogram from continuous data demands a careful specification of the number of bins. In the old version of PRIDE, each bin width was arbitrarily set to 0.5 Å, and adjacent bins were merged together so that at least 5% of the observations were included in each bin. Here a more rigorous approach was followed. Firstly, inter-residue distances were binned in the histograms with a fixed bin width of 0.1 Å, a value close to the average expected uncertainty of protein atomic coordinates obtained with crystallographic methods. The bin widths were then changed automatically to their optimal value BS by using the Freedman-Diaconis rule
$BS = 2 \cdot \mathrm{iqr}(x) \cdot k^{-1/3}$
where k is the number of observations in the sample x, and iqr(x) is the interquartile range of the data of sample x, that is, the range between the third and first quartiles. The iqr is expected to include about half of the data. The optimal BS values are computed for a query protein structure, and then they are used to change the histogram bins for all domains in the scanned database. New optimal BS values must be recomputed for a new query. Although this might seem rather complicated and time-consuming, we verified that once the histograms for the entire database are pre-computed and stored with very small bins of 0.1 Å, all of them can be re-shaped to the optimal BS very rapidly (see the paragraph "Computational speed" below).
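A minimal sketch of this two-step binning is given below. The function names are hypothetical, the quartiles are taken with the default method of Python's `statistics.quantiles` (the quartile convention used by the authors is not stated here), and re-shaping is assumed to merge a whole number of consecutive 0.1 Å bins.

```python
import statistics

def freedman_diaconis_width(distances):
    """Bin width BS = 2 * iqr(x) * k**(-1/3) for a sample of k distances."""
    q1, _, q3 = statistics.quantiles(distances, n=4)
    iqr = q3 - q1
    return 2.0 * iqr * len(distances) ** (-1.0 / 3.0)

def rebin(fine_counts, bs, fine_width=0.1):
    """Merge consecutive fine bins into coarser bins of roughly width bs.

    fine_counts: counts stored with the fixed fine bin width.
    The group size is rounded to a whole number of fine bins (an assumption).
    """
    group = max(1, round(bs / fine_width))
    return [
        sum(fine_counts[i:i + group])
        for i in range(0, len(fine_counts), group)
    ]
```

In a database scan, `freedman_diaconis_width` would be evaluated once per histogram of the query, and `rebin` would then be applied with that width to the stored fine-grained histograms of every database entry.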
χ2 ranges from 0 to positive infinity. A large value of χ2 indicates that the null hypothesis is rather unlikely and that the two proteins are considerably different; χ2 can thus be used as a statistical measure of proximity between two protein 3D structures. Conversely, two identical protein 3D models are associated with a χ2 value equal to 0.
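The behaviour of χ2 described here can be checked with a small example. The sketch below computes the standard Pearson statistic for a 2 × B contingency table built from two histograms that share the same bins; it is an assumption that this matches the authors' formulation in every detail, and the function name is hypothetical.

```python
def chi_square_statistic(counts_a, counts_b):
    """Pearson chi-square statistic for two histograms with identical bins.

    Both histograms must be non-empty and have the same number of bins.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must share the same bins")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    chi2 = 0.0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # a bin empty in both histograms contributes nothing
        expected_a = total_a * column / grand
        expected_b = total_b * column / grand
        chi2 += (a - expected_a) ** 2 / expected_a
        chi2 += (b - expected_b) ** 2 / expected_b
    return chi2
```

Two identical histograms give a value of 0, while histograms with no overlapping populated bins give a large value that grows with the number of observations.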
Table 1. The content of the datasets and the query lists used for PRIDE testing.
Columns: number of domains in the dataset; number of histograms used for the domain structure representation (10 – 30 for the dataset of small domains); number of domains in the query list.
The new structure comparison method was benchmarked against the CATH v3.0.0 database, which is a hierarchical classification of protein domains according to the class C (prevalence of secondary structural types), architecture A (the number, type, and reciprocal orientation of the secondary structural elements), topology T (the topological connection of the secondary structural elements) and homologous superfamily H (a common evolutionary origin supported either by significant sequence similarity or significant structural and functional similarity). Two datasets were created (Table 1), one with domains large enough to be represented by at least 30 distributions of Cαi-Cα(i+n) distances, and the other with smaller domains, for which 10 < nmax < 30. Domains containing more than one polypeptide chain were disregarded since, by definition, PRIDE cannot handle them.
A non-redundant series of CATH entries was randomly selected from different superfamilies to be used as queries, ensuring that all three principal classes C of the database are equally represented (Table 1). Some were large domains (nmax > 30) and others small domains (10 < nmax < 30). About half of them were considered to be "easy" queries, in the sense that they belong to a CATH fold cluster containing at least 50 domains, and the others were "difficult" queries that belong to small CATH fold groups having no more than 3 domains.
The performance of the new PRIDE version can be examined by computing and analyzing ROC curves. The P value, which is a similarity score, is used to calculate the ROC curves in the present study. A threshold similarity is consecutively decreased, in decrements of 0.01, over the entire range of possible P values, from 1 to 0. At each step, each of the queries (Table 1) was compared to all the entries of the databases (Table 1). As a consequence, 4,335,602 comparisons were performed for the dataset of large protein domains and 207,354 comparisons for the dataset of small protein domains.
Each comparison can be classified into one of four categories, according to the CATH classification of the two domains and their P value. It can be i) a true positive (TP), if the similarity between the query and the entry is higher than the threshold value and the query and the entry belong to the same CATH fold; ii) a false positive (FP), if the similarity between the query and the entry is higher than the threshold value although they have different CATH classifications; iii) a false negative (FN), if the entry and the query are in the same fold cluster although their estimated similarity is lower than the threshold value; iv) a true negative (TN), if the similarity is estimated to be smaller than the threshold value and the query and the entry are actually classified into different CATH fold groups. On the basis of these definitions it is possible to compute, for each threshold value, the sensitivity and the specificity
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
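The threshold sweep and the two ratios above can be sketched as follows. The function name `roc_points` is hypothetical, the inputs are assumed to be a list of P values and a parallel list of booleans telling whether query and entry share the same CATH fold, and a comparison is counted as positive only when P is strictly higher than the threshold.

```python
def roc_points(scores, same_fold, step=0.01):
    """Sweep the threshold from 1 down to 0 and return (t, sensitivity, specificity)."""
    points = []
    n_steps = round(1 / step)
    for s in range(n_steps, -1, -1):
        t = s * step
        tp = fp = fn = tn = 0
        for p, same in zip(scores, same_fold):
            if p > t:  # predicted similar
                if same:
                    tp += 1
                else:
                    fp += 1
            else:  # predicted different
                if same:
                    fn += 1
                else:
                    tn += 1
        # guard against empty classes (an assumption; the source does not say)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        points.append((t, sensitivity, specificity))
    return points
```

Plotting sensitivity against 1 - specificity for the returned points gives the ROC curve; with a step of 0.01 the sweep produces 101 points.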
This work was supported by the BIN-II network of the GEN-AU Austrian project.
- Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol. 2006, 16 (3): 393-398.
- Carugo O: Rapid methods for comparing protein structures and scanning structure databases. Curr Bioinformatics. 2006, 1: 75-83.
- Carugo O: Recent progress in measuring structural similarity between proteins. Curr Protein Pept Sci. 2007, 8: 219-241.
- Aung Z, Tan KL: Rapid retrieval of protein structures from databases. Drug Discov Today. 2007, in press.
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85: 2444-2448.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res. 1997, 25: 3389-3402.
- Leinonen R, Diez FG, Binns D, Fleischmann W, Lopez R, Apweiler R: UniProt archive. Bioinformatics. 2004, 20: 3226-3227.
- Carugo O, Pongor S: Protein fold similarity estimated by a probabilistic approach based on C(alpha)-C(alpha) distance comparison. J Mol Biol. 2002, 315 (4): 887-898.
- Rogen P, Fain B: Automatic classification of protein structure by using Gauss integrals. Proc Natl Acad Sci USA. 2003, 100: 119-124.
- Bostick DL, Shen M, Vaisman: A simple topological representation of protein structure: implications for new, fast, and robust structural classification. Proteins. 2004, 56 (3): 487-501.
- Zotenko E, Dogan RI, Wilbur WJ, O'Leary DP, Przytycka TM: Structural footprinting in protein structure comparison: the impact of structural fragments. BMC Struct Biol. 2007, 7: 53.
- Choi IG, Kwon J, Kim SH: Local feature frequency profile: a method to measure structural similarity in proteins. Proc Natl Acad Sci USA. 2004, 101: 3797-3802.
- Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins. 1995, 23: 566-579.
- Cruickshank DWJ: Coordinate uncertainty. International Tables for Crystallography. Edited by: Rossmann MG, Arnold E. 2001, Dordrecht, Kluwer Academic Publisher, F: 403-418.
- Freedman D, Diaconis P: On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields. 1981, 57: 453-476.
- Dowdy S, Wearden S, Chilko D: Statistics for research. 2004, Hoboken, John Wiley & Sons.
- Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ: ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification. BMC Bioinformatics. 2006, 7: 206.
- Novotny M, Madsen D, Kleywegt GJ: Evaluation of protein fold comparison servers. Proteins. 2004, 54: 260-270.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchical classification of protein domain structures. Structure. 1997, 5: 1093-1108.
- Website of Department of Biomolecular Structural Chemistry: [http://www.univie.ac.at/biochem/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.