- Technical Note
- Open Access
Searching the protein structure database for ligand-binding site similarities using CPASS v.2
© Powers et al; licensee BioMed Central Ltd. 2011
- Received: 4 October 2010
- Accepted: 26 January 2011
- Published: 26 January 2011
A recent analysis of protein sequences deposited in the NCBI RefSeq database indicates that ~8.5 million protein sequences are encoded in prokaryotic and eukaryotic genomes, where ~30% are explicitly annotated as "hypothetical" or "uncharacterized" proteins. Our Comparison of Protein Active-Site Structures (CPASS v.2) database and software compares the sequence and structural characteristics of experimentally determined ligand binding sites to infer a functional relationship in the absence of global sequence or structure similarity. CPASS is an important component of our Functional Annotation Screening Technology by NMR (FAST-NMR) protocol and has been successfully applied to aid the annotation of a number of proteins of unknown function.
We report a major upgrade to our CPASS software and database that significantly improves its broad utility. CPASS v.2 is designed with a layered architecture to increase flexibility and portability that also enables job distribution over the Open Science Grid (OSG) to increase speed. Similarly, the CPASS interface was enhanced to provide more user flexibility in submitting a CPASS query. CPASS v.2 now allows for both automatic and manual definition of ligand-binding sites and permits pair-wise, one versus all, one versus list, or list versus list comparisons. Solvent accessible surface area, ligand root-mean square difference, and Cβ distances have been incorporated into the CPASS similarity function to improve the quality of the results. The CPASS database has also been updated.
CPASS v.2 is more than an order of magnitude faster than the original implementation, and allows for multiple simultaneous job submissions. Similarly, the CPASS database of ligand-defined binding sites has increased in size by ~38%, dramatically increasing the likelihood of a positive search result. The modification to the CPASS similarity function is effective in reducing CPASS similarity scores for false positives by ~30%, while leaving true positives unaffected. Importantly, receiver operating characteristics (ROC) curves demonstrate the high correlation between CPASS similarity scores and an accurate functional assignment. As indicated by distribution curves, scores ≥ 30% imply a functional similarity. Software URL: http://cpass.unl.edu.
- True Positive
- Protein Data Bank
- Receiver Operating Characteristic Curve
- Enzyme Commission
- Ligand Binding Site
The Comparison of Protein Active-Site Structures (CPASS) is an integral component of our FAST-NMR methodology [2, 3] to annotate proteins of unknown function. CPASS is based on the premise that ligand-binding sites or functional epitopes are more evolutionarily stable relative to the remainder of the protein [4, 5]. Thus, a protein of unknown function is annotated by identifying proteins of known function that share similar ligand-binding sites. The FAST-NMR and CPASS methodology is well-suited to situations where global sequence or structure similarity has failed to assign a function. CPASS has contributed to a functional hypothesis for the Staphylococcus aureus protein SAV1430, Pseudomonas aeruginosa protein PA1324, Pyrococcus horikoshii OT3 protein PH1320, and human protein Q13206. The basic CPASS approach has been used to provide an annotation to the Bacillus subtilis protein YndB. Also, CPASS was used to identify a functional relationship between the bacterial type III secretion system and eukaryotic apoptosis.
CPASS compares the sequence and structural characteristics between experimental ligand-defined active-sites or functional epitopes to identify a functional relationship. This differs from other bioinformatic tools such as eF-seek, PINTS [12, 13], ProFunc, and many others that attempt to predict the location of ligand-binding sites based on structural features such as spatially conserved residues, surface pockets, or other physicochemical properties. Programs such as @TOME-2, 3DLigandSite, and firestar predict ligand binding sites or functional similarity through the global alignment of protein structures, where reference structures contain bound ligands. Conversely, ProteMiner-SSM, Query3d, and SiteBase are similar in concept to CPASS, where only ligand-binding site substructures are used as a database query. CPASS uses the entire binding site defined from a direct interaction with a ligand, where any amino-acid that is within 6 Å of the bound ligand comprises the ligand-defined binding site. Thus, CPASS uses a comprehensive database composed of every distinct ligand-binding site present in the RCSB Protein Data Bank (PDB). The presence of a different ligand, a global sequence similarity less than 90%, or an active site similarity less than 80% correlates with a unique binding site in the CPASS database. As a result, a CPASS search is extremely exhaustive, but time consuming. Conversely, other programs that attempt to predict the location of a ligand-binding site typically use reduced definitions of known ligand binding sites, such as a triad of highly conserved residues. These approaches are optimized for speed, but generally identify numerous ambiguous ligand binding sites.
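The 6 Å rule described above can be illustrated with a minimal sketch. It is not the CPASS code: the function name, the tuple layout, and the choice to test every protein atom against every ligand atom are assumptions made for illustration, since the text does not state which atoms CPASS measures from.

```python
import math

CUTOFF = 6.0  # Å, the distance threshold stated in the text


def binding_site_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Return residues with at least one atom within `cutoff` of any ligand atom.

    protein_atoms: iterable of (chain, resnum, resname, (x, y, z))
    ligand_atoms:  iterable of (x, y, z)
    The any-atom criterion is an assumption, not taken from the source.
    """
    ligand_atoms = list(ligand_atoms)
    site = set()
    for chain, resnum, resname, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            site.add((chain, resnum, resname))
    return sorted(site)


# Toy coordinates: two residues near the ligand, one far away.
protein = [
    ("A", 10, "LYS", (0.0, 0.0, 0.0)),
    ("A", 10, "LYS", (1.0, 0.0, 0.0)),
    ("A", 11, "ASP", (5.5, 0.0, 3.0)),
    ("A", 50, "GLY", (20.0, 0.0, 0.0)),
]
ligand = [(3.0, 0.0, 0.0)]
site = binding_site_residues(protein, ligand)
```

With these toy coordinates, `site` holds the LYS and ASP residues and excludes the distant GLY.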
We report here a major upgrade to our CPASS software and database that significantly improves the broad utility of CPASS. The enhancements include more than an order of magnitude reduction in the time required to completely search the CPASS database, an approximate 38% increase in the size of the CPASS database of ligand binding sites, the incorporation of additional terms in our active-site similarity function that further differentiates true positives from false positives, and improvements in the CPASS user interface.
Prior CPASS implementation
As described in earlier work, the CPASS suite previously utilized a 16-node Beowulf Linux cluster to both store the database and perform the computations. The various components of the suite (user interface (UI), preprocessing, computation, database, post processing) were tightly integrated, as well as non-portable. Additionally, the design was such that only one comparison sweep could be performed at a time, with no mechanism to queue jobs. This model served the purpose during the initial development, but had several inherent limitations for a larger user base.
The non-portable nature of the code (e.g. hard-coded file paths) meant that CPASS was not scalable beyond the original 16-node cluster. The single-user nature and lengthy computation time for a full comparison (~1 day) resulted in severely limited computational throughput. The tight integration of the components also limited flexibility for modification; a change in one area could require altering other components. Finally, there were no mechanisms for fault-tolerance in place. For example, a failure of one compute node, or in one layer of the application stack, could require the entire comparison to be restarted. Thus, there were several well-defined areas in which improvements could be made.
Current CPASS implementation
The UI layer runs on a single server, and is responsible for the user-facing web portal, pre-processing of the user-supplied data, and display of the results. When a user submits a CPASS comparison through the portal, the UI layer performs the pre-processing and logs the job details. Periodically, the logs are scanned for pending jobs, which are then sent to the Computation layer and logged as active. The Computation layer is monitored for job status, and upon completion, post-processing and logging are performed. The UI layer parses the log files, as well as directly querying the Computation layer, on demand from the user. If the job is complete, the results may be displayed. In the prior CPASS implementation, each individual comparison generated results as static html documents. Result pages are now generated dynamically from plain-text files produced at the Computation layer. Additionally, the current architecture allows multiple concurrent users. Upon submission, each comparison is assigned a unique working directory where all related data is kept. This allows multiple users to submit an arbitrary number of jobs, which are then queued at the Computation layer in the order they are received.
The Computation layer is responsible for executing the core CPASS functionality. A full CPASS sweep is broken into a cluster of several hundred independent jobs prior to submission to the Condor batch system [24, 25], which is used in conjunction with glideinWMS. Condor is a specialized workload management system designed for high-throughput computing. It is responsible for job queuing, scheduling, prioritization, resource monitoring, and resource management. The glideinWMS mechanism simplifies utilization of the hundreds of independent sites that form the Open Science Grid (OSG). The combination of the two provides a scalable architecture for opportunistic use of Grid compute resources.
A Condor instance is run on the same server the UI layer resides on. The cluster of jobs is submitted to the local Condor instance, which is then distributed using glideinWMS to available Grid compute nodes for execution. Each job is responsible for comparison against a subset of the complete database. Upon completion of each job, the results are transferred back to its local working directory to be displayed by the UI layer. Utilizing the Condor scheduler also provides tolerance against job failure. As each individual job in the cluster is independent, any number may fail at any time without affecting the rest. Condor will detect the failure, and reschedule the job for execution at a different location. With this implementation jobs are run at disparate sites, thus a shared file system is no longer present. Consequently, the relevant database files must be distributed to the remote worker nodes, necessitating the addition of the Data layer.
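The division of a full sweep into independent jobs, each comparing the query against a subset of the database, can be sketched as follows. This is only an illustration of the partitioning idea: the function name and the even-split strategy are assumptions, and the actual submission, scheduling, and rescheduling are handled by Condor and glideinWMS rather than by code like this.

```python
def partition_database(site_ids, n_jobs):
    """Split a list of binding-site IDs into nearly equal, non-overlapping chunks.

    Each chunk stands for the database subset handled by one independent job.
    The even split is an illustrative assumption, not the CPASS scheme.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(len(site_ids), n_jobs)
    chunks, start = [], 0
    for k in range(n_jobs):
        size = base + (1 if k < extra else 0)  # first `extra` jobs take one more
        if size:
            chunks.append(site_ids[start:start + size])
        start += size
    return chunks


ids = list(range(10))
chunks = partition_database(ids, 3)
```

Because the chunks do not overlap and together cover the whole list, any single job can fail and be rerun elsewhere without affecting the results of the others.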
The Data layer's role is to host the CPASS database, and serve the required files to individual jobs on demand. Given that the database itself consists of a large number of relatively small (approximately 1 MB) files, http-based distribution was chosen. The Data layer consists of a Linux Virtual Server (LVS) instance, using the Apache HTTP Server for distribution. The LVS is a scalable, high availability (HA) server composed of a cluster of real servers with a Linux-based load balancer. Currently this consists of the required two load balancers (a primary and a backup), and two Apache web servers. As CPASS usage grows, this will place increasing demands on the Data layer to deliver files. The scalable nature of an LVS means that additional web servers may be added transparently to cope with increased demand. Additionally, the HA feature is such that one load balancer, and all but one web server, may fail without interruption to the overall LVS. The Computation layer can continue to operate, although possibly at a reduced level of performance depending on demand.
The current CPASS implementation addresses several core issues identified previously, with significant improvements in flexibility, scalability, and fault-tolerance. Separation of the architecture into distinct layers allows for ease of development. Portability and scalability ensures that as demand grows, additional resources may be utilized more easily. The addition of fault-tolerance in all levels aims to improve the user experience. Most importantly, the ability for high computational throughput will greatly enhance the value of CPASS as an analysis tool.
CPASS similarity function
In the CPASS similarity function, active site a contains n residues and is compared to active site b from the CPASS database, which contains m residues; p_{i,j} is the BLOSUM62 probability for amino-acid replacement of residue i from active site a with residue j from active site b; ΔRMSD_{i,j} is a corrected root-mean square difference in the Cα coordinate positions between residues i and j; and d_min/d_i is the ratio of the shortest distance to the ligand among all amino-acids in the active site to the current amino-acid's shortest distance to the ligand.
The solvent accessible surface area for each residue in each structure is calculated using the program NACCESS . Specifically, the relative all atom solvent accessible surface area is used. All heteroatoms in the PDB are ignored in the NACCESS calculation, so the solvent accessible surface area corresponds to a ligand-free structure. For structures that contain bound peptides or nucleotides, the peptides or nucleotides were removed from the PDB prior to the NACCESS calculation.
The RMSD between the two ligands is calculated using the shortest distance from each non-hydrogen atom of the smaller ligand (the one with fewer heavy atoms) to the other ligand. It is common for the two ligands in the comparison to be distinct chemical entities with different numbers of atoms. Therefore, each atom in the smaller structure is used to calculate the RMSD, while all the "extra" atoms in the larger structure are ignored. Effectively, the RMSD is measured from the smaller ligand to an aligned substructure of the larger ligand. The calculated RMSD is then reduced by 0.5 Å to provide a non-penalty region that accommodates experimental error.
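The ligand RMSD described above can be sketched in a few lines. This is a hedged illustration rather than the CPASS implementation: the function name is hypothetical, the inputs are assumed to be heavy-atom coordinates already placed in the common frame of the aligned binding sites, and clamping the corrected value at zero is an assumption about how the 0.5 Å non-penalty region behaves.

```python
import math


def ligand_rmsd(lig_a, lig_b, tolerance=0.5):
    """Nearest-atom RMSD from the smaller ligand to the larger one.

    lig_a, lig_b: lists of (x, y, z) heavy-atom coordinates, assumed to be
    in the same frame after binding-site alignment.
    tolerance: the 0.5 Å reduction stated in the text; flooring the result
    at zero is an assumption.
    """
    small, large = (lig_a, lig_b) if len(lig_a) <= len(lig_b) else (lig_b, lig_a)
    if not small:
        raise ValueError("both ligands need at least one atom")
    # For each atom of the smaller ligand, take the closest atom of the larger;
    # unmatched "extra" atoms of the larger ligand are ignored.
    squared = [min(math.dist(a, b) for b in large) ** 2 for a in small]
    rmsd = math.sqrt(sum(squared) / len(squared))
    return max(0.0, rmsd - tolerance)


small_ligand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
large_ligand = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (10.0, 0.0, 0.0)]
corrected = ligand_rmsd(small_ligand, large_ligand)
```

In this toy case each atom of the smaller ligand lies 2 Å from its nearest neighbor, so the raw RMSD is 2 Å and the corrected value is 1.5 Å; the distant third atom of the larger ligand does not contribute.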
CPASS database update
When CPASS was originally developed, a total of ~34,000 X-ray and NMR structures were available from the RCSB PDB. This led to a CPASS database composed of ~26,000 unique ligand-defined binding sites. A ligand is broadly defined as any small molecular-weight organic compound (co-factors, drugs, metabolites, substrates, etc.) or small peptide, DNA or RNA strand consisting of thirteen or fewer residues. A unique ligand-defined binding site implies that two binding sites that share the same ligand have less than 90% global sequence similarity or less than 80% sequence similarity in the ligand binding site. Common buffers, detergents, salts and other small ligands are removed from the CPASS database. Since the original release, the RCSB PDB has grown significantly and contains ~68,000 X-ray and NMR structures as of September 2010. This has led to a corresponding increase in the CPASS database, which now comprises ~36,000 unique ligand-defined binding sites. The resulting increase in the CPASS database improves the coverage of functional space and increases the likelihood that a match will be found between a functionally uncharacterized protein and the CPASS database.
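The uniqueness rule and the reported database growth can be checked with a short sketch. The function name and the percentage inputs are hypothetical; how CPASS actually computes the two similarity values is not described here, so they are simply taken as given numbers.

```python
def is_redundant(same_ligand, global_similarity, site_similarity):
    """Return True when a candidate site duplicates an existing database entry.

    Following the rule in the text, a site is unique if the ligand differs,
    if global sequence similarity is below 90%, or if binding-site sequence
    similarity is below 80%. Similarities are percentages (0-100).
    """
    return same_ligand and global_similarity >= 90.0 and site_similarity >= 80.0


# Growth of the database from ~26,000 to ~36,000 unique binding sites.
old_size, new_size = 26_000, 36_000
growth_percent = 100.0 * (new_size - old_size) / old_size
```

The growth works out to roughly 38%, consistent with the figure quoted for the database update.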
CPASS user interface enhancements
Furthermore, the original version of CPASS limited the ligand-defined binding site comparisons to an experimental protein-ligand co-structure uploaded by the user, where CPASS extracted the ligand-defined binding site based on the presence of a ligand in the uploaded structure. CPASS v.2 allows the flexibility of a manually defined ligand binding site, when an experimental protein-ligand co-structure is not available. The user simply provides a standard text file listing the residues in the uploaded protein structure that correspond to the predicted ligand binding site.
Evaluation of CPASS performance
Six different proteins were evaluated using CPASS: glycine hydroxymethyltransferase (PDB: 1kkp, E.C. 2.1.2.1); aspartate transaminase (PDB: 1yaa, E.C. 2.6.1.1); pyruvate kinase (PDB: 3hqo, E.C. 2.7.1.40); phosphoenolpyruvate carboxykinase (PDB: 1xkv, E.C. 4.1.1.49); glutamine-tRNA ligase (PDB: 1gtr, E.C. 6.1.1.18); and biotin carboxylase (PDB: 1dv2, E.C. 6.3.4.14). The ATP or pyridoxal-5'-phosphate (PLP) ligand binding site from each protein structure was compared against the entire CPASS database of ~36,000 ligand-defined binding sites. Each of these query proteins was submitted using three different CPASS search parameters. A default CPASS search utilizes a ligand-defined binding site from an experimental NMR or X-ray co-structure with the additions to the similarity function of the ligand RMSD, solvent accessible surface area, and the Cβ position within the distance calculation. The two other searches either excluded the ligand RMSD, Cβ in the RMSD calculation, and solvent accessible surface area from the similarity function or used a manually-defined ligand binding site.
Three different methods were used to define what constitutes a functionally similar active site (true positive). The first method only used proteins that were assigned to the same Enzyme Commission (E.C.) classification (i.e., all four E.C. numbers are identical), where the ligand-defined binding sites contained either the same ligand or a very similar ligand. The second method simply used a broader definition of E.C. similarity (i.e., only the first three E.C. numbers are required to be identical). The third method used a very broad definition of functional homology by defining all active sites in the database that bind the same ligand as being functionally similar. ROC curves were generated using the three different definitions of a true positive. The true positive rates were plotted against false positive rates over the full range of CPASS similarity scores using the different CPASS search parameters and the different definitions of true positives. Similarly, distribution curves plot the fraction of negatives and the fraction of positives at each CPASS similarity score using a bin size of 10. The fraction simply corresponds to the number of positives or negatives per bin relative to the total number of positives or negatives. The area under each curve is 1.
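The two evaluation plots described above can be reproduced in outline with plain Python. This is a generic sketch of ROC and binned-distribution calculations, not the analysis scripts used for the paper; the function names, the 0 to 100 score range, and the toy data are assumptions.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    A hit counts as a predicted positive when its score is >= the threshold;
    thresholds sweep every distinct score from high to low.
    labels: 1 for a true positive (e.g. same E.C. number), 0 otherwise.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        points.append((fp / n_neg, tp / n_pos))
    return points


def score_distribution(scores, bin_size=10, max_score=100):
    """Fraction of scores per bin; the fractions sum to 1."""
    n_bins = max_score // bin_size
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s // bin_size), n_bins - 1)] += 1  # top score joins last bin
    return [c / len(scores) for c in counts]


scores = [90, 80, 35, 30, 10, 5]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
positives = score_distribution([s for s, l in zip(scores, labels) if l])
```

Plotting `curve` gives the ROC curve, while plotting `positives` (and the matching list for negatives) against the bin edges gives the distribution curves, each normalized to its own class total.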
Improvement in CPASS search speed
A notable limitation in the original implementation of CPASS was the significant time required to complete a search against the entire CPASS database. On average, a single comparison took ~40 s, requiring ~24 hrs to complete a search on our 16-node Beowulf Linux cluster. The search time also increased proportionally with the growth in the RCSB PDB and the resulting CPASS database. This necessitated strict control over user access to prevent overwhelming our laboratory computer resources. In the recent upgrade, the CPASS code has been optimized, reducing a single comparison to ~7 s, a greater than 5-fold improvement. CPASS has also been further modified to take advantage of resources available on the Open Science Grid. As a result, the CPASS calculation time has been reduced to less than an hour (including set-up time), more than an order of magnitude improvement. Importantly, this significant reduction in the CPASS search time enabled us to remove any user restrictions to routine access to CPASS. Furthermore, the dramatic improvement in speed is expected to greatly increase the wide-spread utilization of CPASS. CPASS is freely accessible to academic users through our web-site (http://cpass.unl.edu).
CPASS similarity function
The overall philosophy behind the development of the CPASS database and program is the application of experimental ligand-defined binding sites to infer a functional annotation when global sequence and structure similarity is inconclusive. In this manner, CPASS attempts to quantify the structural and sequence similarity between two ligand defined binding sites by spatially overlaying similar residue types. CPASS was primarily designed to compare experimental ligand-defined binding sites. Unfortunately, a protein-ligand co-structure is not always available, but in some cases the identity of the ligand-binding site may be inferred from other sources, such as site-directed mutagenesis, NMR chemical shift perturbations, bioinformatics, or computer modeling.
The new version of CPASS allows for the manual identification of the ligand binding site in addition to the typical extraction of the ligand-binding site from an uploaded protein-ligand PDB file. The manual definition of the ligand-binding site simply requires uploading a standard text file to CPASS. The text file should list the three letter amino acid abbreviation for each residue in the binding site, and the corresponding residue number and chain identifier. The information should exactly match the corresponding residue identifiers in the protein PDB file that is also uploaded to CPASS. The use of a manually-defined ligand-binding site also requires a subtle change in the CPASS similarity function (see eqn. 2), since the structure does not contain a bound ligand. First, the aligned ligand RMSD penalty function is disabled. Second, the ratio of the shortest distance to the ligand (d_min) among all amino-acids in the active site compared to each amino-acid's shortest distance to the ligand (d_i) requires a new reference point since the ligand is not present. The ligand reference point is simply replaced by the center-of-mass for the manually defined binding site. This scaling factor simply reduces the contribution of residues at the 6 Å edge for inclusion in the ligand binding site definition. It diminishes the impact of small structural variations that may result in either the inclusion or exclusion of residues at the 6 Å limit that would correlate to an unjustified large difference in the CPASS similarity score.
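A manually defined binding site and its center-of-mass scaling can be sketched as follows. The exact file layout expected by CPASS is not specified in the text, so the whitespace-separated "residue name, residue number, chain" line format below is an assumption, as are the function names and the use of an unweighted centroid of one representative coordinate per residue as the center-of-mass.

```python
import math


def parse_site_file(text):
    """Parse lines such as 'LYS 45 A' into (resname, resnum, chain) tuples.

    The three-column layout is a hypothetical format chosen for illustration.
    Blank lines and lines starting with '#' are skipped.
    """
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        resname, resnum, chain = line.split()
        residues.append((resname.upper(), int(resnum), chain))
    return residues


def distance_weights(residue_coords):
    """Return d_min/d_i for each residue, measured from the site centroid.

    residue_coords: one (x, y, z) per residue (e.g. its Cα); the unweighted
    centroid stands in for the center-of-mass and is an assumption.
    """
    n = len(residue_coords)
    center = tuple(sum(c[k] for c in residue_coords) / n for k in range(3))
    distances = [math.dist(c, center) for c in residue_coords]
    d_min = min(distances)
    # A residue sitting exactly on the centroid gets full weight.
    return [d_min / d if d > 0 else 1.0 for d in distances]


site = parse_site_file("LYS 45 A\nasp 102 A\n\n# comment\n")
weights = distance_weights([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 4.0, 0.0)])
```

In the toy example the two residues closest to the centroid receive a weight of 1, while the residue farther out is scaled down to 0.625, mirroring how residues near the edge of the site contribute less to the score.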
CPASS uses a distance-weighted BLOSUM62 scoring function (see eqn. 2) to align and rank ligand-defined binding sites. The alignment ignores sequential connectivity and primarily focuses on the relative spatial orientations of the residues that comprise each binding site. Importantly, the identity or conformation of the ligand is not used in this alignment process. To further improve the ability of the CPASS similarity functions to eliminate dissimilar ligand-binding sites, three additions to the CPASS scoring function have been implemented.
CPASS uses the bound ligands to define the binding sites or functional epitopes, but the ligands are not used in the alignment process. This provides an additional mechanism to evaluate and rank the aligned ligand-binding sites, since the same transformation that was applied to align the binding sites is equally applied to each ligand. Thus, a binding site alignment that also results in a close alignment between the two bound ligands, especially if similar functional groups overlap, increases the likelihood that the two aligned proteins share a common function. The ligand alignment function (ΔRMSD_lig) was empirically designed to gradually apply an increasing penalty as the RMSD between the two ligands increases, reaching a value of zero when the ligands are separated by > 4.5 Å (Figure 2A). The ligand alignment function does not provide a penalty when the aligned ligands are within 0.5 Å. This compensates for the typical experimental error encountered when comparing similar protein structures. A conservative function is applied since the ligand alignment is a global parameter that simply scales the CPASS similarity score. A large penalty based on a poor ligand alignment effectively defines the two ligand-binding sites as dissimilar regardless of how well the ligand-binding sites are aligned. CPASS also provides an option to exclude the ligand alignment function from the overall similarity score.
The original CPASS spatial alignment of ligand binding sites was based on Cα distances. This clearly captures the backbone orientations, which is the primary structural factor that determines ligand-binding site similarities, but it does ignore subtle and potentially important differences in side chain orientations. This issue was reduced by also including Cβ distances in the per residue distance alignment (ΔRMSD_{i,j}) calculation in the new version of CPASS, requiring a corresponding upgrade to the CPASS database file structure. Of course, only Cα distances are used for alignments involving a glycine. The per-residue distance alignment function (Figure 2B) does not provide a penalty when the aligned residues are within 1 Å. This compensates for the typical experimental error encountered when comparing similar protein structures. But, the function decays rapidly as the RMSD increases beyond 1 Å, where a residue's alignment makes an insignificant contribution to the overall similarity score when the RMSD is greater than 2.5 Å. A relatively harsh per-residue penalty is warranted since the overall similarity score is based on the sum of all the aligned residues. Basically, two ligand binding sites that share an average per residue RMSD of greater than 2.5 Å are not very similar. As a comparison, consider the fact that highly similar protein structures have a global RMSD of less than 2.5 Å.
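The tolerance zones described for the per-residue and ligand alignment functions can be illustrated with two placeholder curves. The actual empirical functions are shown only in Figure 2 and are not given in the text, so the Gaussian decay and the linear ramp below are assumptions chosen to match the stated end points (no penalty within 1 Å or 0.5 Å, a small contribution beyond 2.5 Å, and zero beyond 4.5 Å). The function names are hypothetical.

```python
import math


def per_residue_weight(rmsd, tolerance=1.0):
    """Weight of one aligned residue pair as a function of its RMSD (Å).

    No penalty inside the 1 Å tolerance; beyond it, an assumed Gaussian
    decay of the corrected distance stands in for the empirical curve.
    """
    corrected = max(0.0, rmsd - tolerance)
    return math.exp(-corrected ** 2)


def ligand_scale(rmsd, no_penalty=0.5, cutoff=4.5):
    """Global scale factor from the aligned-ligand RMSD (Å).

    1.0 within 0.5 Å and 0.0 at or beyond 4.5 Å, as stated in the text;
    the linear ramp in between is an assumed placeholder shape.
    """
    if rmsd <= no_penalty:
        return 1.0
    if rmsd >= cutoff:
        return 0.0
    return (cutoff - rmsd) / (cutoff - no_penalty)


near = per_residue_weight(0.8)   # inside the tolerance zone
far = per_residue_weight(2.5)    # at the 2.5 Å limit
mid_scale = ligand_scale(2.5)    # halfway along the assumed ramp
```

Under these assumed shapes a residue pair aligned within 1 Å keeps its full weight, a pair at 2.5 Å retains only about a tenth of it, and a ligand RMSD of 2.5 Å halves the overall score.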
The per-residue solvent accessible surface area (SASA) captures a distinct physical descriptor that is unique from both the residue identity and the distances between aligned residues and bound ligands. Presumably, the overall characteristics of functionally related ligand binding sites, including SASA, should be preserved. Specifically, a shallow ligand binding cleft on the protein's surface is distinct from a ligand binding pocket formed at the interface of two proteins or domains, or from a deep-binding pocket where a majority of the binding site residues are buried below the protein's surface. Dissimilarity in SASA would further discriminate between these ligand binding sites even if there is a serendipitous spatial overlap in a sub-set of residues. The SASA function (Figure 2C) was empirically designed to emphasize the penalty for large SASA differences (≥60) that primarily distinguishes between surface accessible and buried residues. Similar to the ligand alignment function, the user also has the option to exclude the SASA difference in the CPASS similarity score.
Evaluation of CPASS performance
The definition of a true positive is essential to the evaluation of CPASS performance. Ideally, a measurement of functional similarity would provide the necessary framework to define a true positive, but functional homology is still extremely challenging to quantitate. There are several methods for functional classifications based on sequence similarity (COG, eggNOG, OMA), structure similarity (CATH, SCOP), or annotations (Gene Ontology, GO). Unfortunately, there are significant errors associated with each approach. GO terms are generally reliable and are the current "gold standard", but the annotations are often incomplete and overly generic. Functional clustering using sequence similarity may be too coarse (COG), which results in the inclusion of paralogs [39, 44]; or too fine (eggNOG, OMA), which results in multiple clusters with the same function. Of course, there are numerous examples of proteins that share the same function, but exhibit minimal sequence similarity [46, 47]. Alternatively, functional divergence increases significantly as sequence identity drops below 50%. Similar issues arise with structure similarities; there are proteins that exhibit the same function but have different structures, as well as the reverse [49–51]. Thus, the Enzyme Commission (E.C.) number was the best approach to define the functional similarity between the six query proteins and proteins within the CPASS database. E.C. numbers classify proteins based on enzyme-catalyzed reactions, which provides a generally reliable, but limited, mechanism to infer homologous functions.
The impact on CPASS performance by the addition of ligand RMSD, Cβ in the RMSD calculation, and SASA to the similarity function was also evaluated. Similarly, the manual definition of a ligand binding site was compared to the experimental definition of a ligand binding site from an NMR or X-ray structure. The default CPASS approach uses the ligand of a protein-ligand co-structure to define the ligand binding site. As previously discussed, the default definition of a ligand-defined binding site utilizes a scaling function in the CPASS similarity score, where residues further from the ligand contribute less to the overall score (see eqn. 2). Since a manually defined ligand binding site lacks a ligand, the scaling factor uses distances from the center-of-mass for the manually defined binding site. A ROC curve analysis comparing the ligand-defined and manually-defined binding sites for a CPASS calculation with aspartate transaminase shows no significant difference in CPASS performance (data not shown). A similar result was obtained when comparing a CPASS calculation with or without the inclusion of Cβ in the RMSD calculation, ligand RMSD and SASA in the CPASS similarity function. Again, similar results were obtained for all six query proteins.
Similarly, the lack of an apparent improvement in the ROC curves by the inclusion of the ligand RMSD, the Cβ in the RMSD calculation, and SASA in the CPASS similarity function is not surprising given the nearly ideal performance of CPASS seen in figure 7A,B. Instead, the new CPASS similarity function was primarily expected to reduce the similarity score for negatives, while leaving positive scores unaffected. Effectively, the improvements to the CPASS similarity function were anticipated to enhance the differentiation between positives and negatives. A representative distribution of CPASS similarity scores comparing CPASS v.1 and CPASS v.2 is shown in figure 8B. As expected, the distribution of positive scores is basically unchanged. Similar ligand-defined binding sites are expected to have essentially identical side-chain orientations, per residue solvent accessible surface areas, and ligand conformations. Conversely, the CPASS similarity scores decrease for negatives because of an apparent deviation in these structural parameters. This is further illustrated in figure 8C by the sequential decrease in the fraction of negatives with scores above 20% as the new structural features are incrementally added to the CPASS similarity function. A threshold of 20-30% in the CPASS similarity score is typically used to identify potential functional homologs. Thus, the new CPASS similarity function is more efficient at eliminating false positives near this threshold. This is potentially very critical for the analysis of uncharacterized proteins, where a higher confidence in identifying a functional homolog is achieved even with a modest CPASS similarity score (≥ 30%).
The overall goal of the CPASS database and software is to identify similar experimentally-determined ligand binding sites through an exhaustive pair-wise search of the RCSB PDB. CPASS optimizes the spatial orientation of similar amino-acids between two ligand-defined binding sites and ranks the alignment using a collection of sequence and structural empirical functions. We report a series of significant upgrades in CPASS v.2 that includes a dramatic improvement in speed, an expansion in the CPASS database of ligand defined binding sites, and modifications to the CPASS similarity scoring function and user interface.
This work was supported by the National Institute of Allergy and Infectious Diseases [grant number R21AI081154] as well as by a grant from the Nebraska Tobacco Settlement Biomedical Research Development Funds to RP. The research was performed in facilities renovated with support from the National Institutes of Health [grant number RR015468-01]. This work was completed utilizing the Holland Computing Center of the University of Nebraska.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.