A comprehensive resource for integrating and displaying protein post-translational modifications

Background: Protein post-translational modification (PTM) plays an essential role in cellular control mechanisms that adjust protein physical and chemical properties, folding, conformation, stability and activity, thus also altering protein function.
Findings: dbPTM (version 1.0), which was developed previously, aimed at a comprehensive collection of protein post-translational modifications. In this updated version (dbPTM2.0), we developed the PTM database towards an expert system for protein post-translational modifications. The database comprehensively collects experimental and predicted protein PTM sites. In addition, dbPTM2.0 was extended to a knowledge base comprising the modified sites, solvent accessibility of substrates, protein secondary and tertiary structures, protein domains, protein intrinsic disorder regions, and protein variations. Moreover, this work compiles a benchmark for constructing evaluation datasets for computational studies that identify PTM sites, such as phosphorylated, glycosylated, acetylated and methylated sites.
Conclusion: The current release not only provides sequence-based information but also annotates structure-based information for protein post-translational modifications. The interface is also designed to facilitate access to the resource. This database is now freely accessible at .


Background
Protein post-translational modification (PTM) plays a critical role in cellular control mechanisms, including phosphorylation for signal transduction, attachment of fatty acids for membrane anchoring and association, glycosylation for changing protein half-life, targeting substrates, and promoting cell-cell and cell-matrix interactions, and acetylation and methylation of histones for gene regulation [1]. Several databases collecting information about protein modifications have been established on the basis of high-throughput mass spectrometry in proteomics. UniProtKB/Swiss-Prot [2] collects extensive protein modification information with annotations and structures. Phospho.ELM [3], PhosphoSite [4] and Phosphorylation Site Database [5] were developed to accumulate experimentally verified phosphorylation sites. PHOSIDA [6] integrates thousands of high-confidence in vivo phosphorylation sites identified by mass spectrometry-based proteomics in various species. Phospho 3D [7] is a database of 3D structures of phosphorylation sites, which stores information retrieved from the Phospho.ELM database and is enriched with structural information and annotations at the residue level. O-GLYCBASE [8] is a database of glycoproteins, most of which include experimentally verified O-linked glycosylation sites. UbiProt [9] stores experimentally identified ubiquitylated proteins and ubiquitylation sites, which are implicated in protein degradation through an intracellular ATP-dependent proteolytic system. Moreover, the RESID protein modification database is a comprehensive collection of annotations and structures for protein modifications and cross-links, including pre-, co-, and post-translational modifications [10]. dbPTM [11] was developed previously to integrate several databases and accumulate known protein modifications, as well as putative protein modifications predicted by a series of accurate computational tools [12,13].
This updated version of dbPTM was enhanced to become a knowledge base for protein post-translational modifications, comprising a variety of new features including the modified sites, solvent accessibility of substrates, protein secondary and tertiary structures, protein domains and protein variations. We also collected literature related to PTMs, protein conservation and the specificity of substrate sites. For protein phosphorylation in particular, the site-specific interactions between catalytic kinases and substrates are provided. Furthermore, a variety of prediction tools have been developed for more than ten PTM types [14], such as phosphorylation, glycosylation, acetylation, methylation, sulfation and sumoylation. This work constructed a benchmark data set for computational studies of protein post-translational modification. The benchmark data set can provide a standard for measuring the performance of prediction tools that have been presented for identifying post-translational modification sites of proteins. The web interface of dbPTM was also redesigned and enhanced to facilitate access to the resource.
Figure 1. The system architecture of the knowledge base for protein post-translational modification. It comprises three major components: integration of external experimental PTM databases, learning and prediction of 20 types of PTM, and annotations of PTM knowledge (more details in the text).

Data construction and content
As shown in Figure 1, the system architecture of the dbPTM2.0 database comprises three major components: the integration of external PTM databases, the computational identification of PTMs, and the structural and functional annotation of PTMs. We integrated five PTM databases, namely UniProtKB/Swiss-Prot (release 55.0) [1], Phospho.ELM (version 7.0) [15], O-GLYCBASE (version 6.0) [8], UbiProt (version 1.0) [9] and PHOSIDA (version 1.0) [6], to obtain experimental protein modifications. The description and data statistics of these databases are briefly given in Table S1 (see Additional file 1 - Table S1). Additionally, the Human Protein Reference Database (HPRD) [16], which compiles information relevant to the functions and PTMs of human proteins in health and disease, was also integrated.
In the computational identification of PTMs, the KinasePhos-like method [11-13,17] was applied to identify 20 types of PTM, each of which has at least 30 experimentally verified PTM sites. The detailed processing flow of the KinasePhos-like method is displayed in Figure S1 (see Additional file 1 - Figure S1). The learned models were evaluated using k-fold cross-validation. Table S2 (see Additional file 1 - Table S2) lists the predictive performance of these models. To reduce the number of false positive predictions, the predictive parameters were set to maximize predictive specificity.
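The trade-off between specificity and sensitivity mentioned above can be illustrated with a minimal sketch. The source does not state how the predictive parameters were actually tuned, so the score lists, the function names and the `min_specificity` parameter below are all assumptions made for illustration: the sketch simply scans candidate score cut-offs and returns the lowest one that reaches a requested specificity.

```python
def sensitivity_specificity(scores_pos, scores_neg, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called PTM sites."""
    true_pos = sum(s >= threshold for s in scores_pos)
    true_neg = sum(s < threshold for s in scores_neg)
    return true_pos / len(scores_pos), true_neg / len(scores_neg)


def pick_threshold(scores_pos, scores_neg, min_specificity=0.95):
    """Hypothetical tuning rule: lowest observed score that reaches min_specificity.

    Specificity never decreases as the threshold rises, so the first threshold
    that qualifies also keeps sensitivity as high as possible.
    Returns (threshold, sensitivity, specificity), or None if no score qualifies.
    """
    for threshold in sorted(set(scores_pos) | set(scores_neg)):
        sens, spec = sensitivity_specificity(scores_pos, scores_neg, threshold)
        if spec >= min_specificity:
            return threshold, sens, spec
    return None


# Toy scores (made up): model scores for known sites and for non-sites.
positives = [0.9, 0.8, 0.7, 0.4]
negatives = [0.1, 0.2, 0.3, 0.6]
print(pick_threshold(positives, negatives, min_specificity=1.0))
```

Raising `min_specificity` removes false positives at the cost of missing some true sites, which is the compromise the paragraph above describes.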
The statistics of the experimental and putative PTM sites in this integrated PTM database are given in Table 1.
After removing the redundant PTM sites among the six databases, there are a total of 45,833 experimental PTM sites in this updated version. All experimental PTM sites are further categorized by PTM type. For instance, there are 31,363 experimental phosphorylation sites and 2,080 experimental acetylation sites in the database. In addition to the experimental PTM sites, UniProtKB/Swiss-Prot provides putative PTM sites annotated by sequence similarity or evolutionary potential. Moreover, KinasePhos-like methods [11-13,17] were adopted to construct profile hidden Markov models (HMMs) for twenty types of PTM. These models were applied to identify potential PTM sites in protein sequences obtained from UniProtKB/Swiss-Prot. As given in Table 1, 2,560,047 sites for all PTM types were identified. The structural and functional annotations of protein modifications were obtained from UniProtKB/Swiss-Prot [18], InterPro [19], Protein Data Bank [20] and RESID [10] (see Additional file 1 - Table S3).
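As a rough illustration of the redundancy removal and per-type counting described above, the following sketch merges records from several source databases. It assumes, purely for illustration, that two records are redundant when they share the same protein accession, residue position and PTM type; the record layout and the accession `PROT1` are hypothetical and not taken from dbPTM.

```python
from collections import defaultdict


def merge_sites(records):
    """Collapse records describing the same site into one entry.

    Each record is (source_db, accession, position, ptm_type).
    Returns {(accession, position, ptm_type): set of source databases}.
    """
    merged = defaultdict(set)
    for source_db, accession, position, ptm_type in records:
        merged[(accession, position, ptm_type)].add(source_db)
    return dict(merged)


def count_by_type(merged):
    """Count non-redundant sites per PTM type."""
    counts = defaultdict(int)
    for _accession, _position, ptm_type in merged:
        counts[ptm_type] += 1
    return dict(counts)


# Made-up records: the same phosphoserine reported by two databases.
records = [
    ("Swiss-Prot", "PROT1", 15, "phosphorylation"),
    ("Phospho.ELM", "PROT1", 15, "phosphorylation"),
    ("Swiss-Prot", "PROT1", 40, "acetylation"),
]
merged = merge_sites(records)
print(count_by_type(merged))
```

Keeping the set of contributing databases per site, rather than discarding duplicates outright, preserves the provenance of each merged entry.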

Utility and major improvements
In order to provide more effective information about protein modifications in this updated version, we extended dbPTM to a knowledge base containing structural properties of PTM sites, PTM-related literature, evolutionary conservation of PTM sites, subcellular localization of modified proteins and the benchmark set for computational studies. [...] to 431 PTM types. To detect potential PTM sites in UniProtKB/Swiss-Prot proteins without any PTM annotations, the KinasePhos-like method was applied to 20 PTM types. For protein phosphorylation in particular, more than 60 kinase-specific prediction models were constructed and applied to identify phosphorylation sites together with their catalytic kinases.

Structural properties of PTM sites
To facilitate the investigation of structural characteristics surrounding the PTM sites, protein tertiary structures obtained from the Protein Data Bank [20] are graphically presented using the Jmol program. For proteins with tertiary structures (5% of UniProtKB/Swiss-Prot proteins), structural properties such as solvent accessibility and secondary structure of residues were calculated by DSSP [21]. For proteins without tertiary structures, the solvent accessibility and secondary structure of residues were predicted by RVP-net [22] and PSIPRED [23], respectively. Intrinsic disorder regions were predicted using Disopred2 [24].

Annotation of catalytic kinases of protein phosphorylation sites
In addition to the experimental annotations of catalytic kinases of protein phosphorylation, we applied the KinasePhos-like prediction method [11-13,17] for identifying 20 types of PTM. Figure 2 gives an example in which the experimental phosphorylation site S892 of IRS1 was predicted to be catalyzed by the protein kinases MAPK and CDK, with a preference for proline at positions -2 and +1 surrounding the phosphorylation site (position 0). In addition, Y896 is predicted to be catalyzed by the kinase IGF1R, a result consistent with a previous investigation [26]. Moreover, S892 is a protein variation site, which was mapped to a non-synonymous single nucleotide polymorphism (SNP) based on the annotation obtained from dbSNP [27].
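The positional convention used above (the modified residue at position 0, flanking residues at negative and positive offsets) can be made concrete with a small sketch. The helper names and the toy sequence are hypothetical; the sequence is not the real IRS1 sequence, and the proline check only mirrors the preference mentioned in the text rather than any actual KinasePhos model.

```python
def site_window(sequence, position, flank=4, pad="-"):
    """Return the residues from -flank to +flank around a 1-based site position.

    Windows running past either end of the sequence are padded with `pad`.
    """
    idx = position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError("position lies outside the sequence")
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    return left.rjust(flank, pad) + sequence[idx] + right.ljust(flank, pad)


def residue_at(window, offset):
    """Residue at a relative offset, where offset 0 is the modified residue."""
    flank = len(window) // 2
    return window[flank + offset]


# Toy sequence (made up) with a serine at position 5.
window = site_window("AAPLSPKK", position=5)
has_proline_motif = residue_at(window, -2) == "P" and residue_at(window, +1) == "P"
print(window, has_proline_motif)
```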

Evolutionary conservation of PTM sites
To determine whether a PTM site is conserved among orthologous protein sequences, we integrated the database of Clusters of Orthologous Groups (COGs) [28], which collects 4873 COGs in 66 unicellular genomes and 4852 clusters of eukaryotic orthologous groups (KOGs) in 7 eukaryotic genomes. The ClustalW [29] program was adopted to align the multiple protein sequences in each cluster, and the aligned profile is provided in the resource. An experimentally verified acetyllysine located in a conserved protein region indicates an evolutionary influence, in which orthologous sites in other species could be involved in the same type of PTM (see Additional file 1 - Figure S2). Furthermore, as shown in the example in Figure 2, two experimentally verified phosphorylation sites are conserved.
Figure 2. A part of the result page on the web interface. An example of the graphical presentation of PTM sites and the structural characteristics of human protein IRS1.
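One simple way to check whether a modified residue is conserved in an alignment of orthologs is sketched below. It assumes the alignment is already available as equal-length gapped strings (for example, exported from ClustalW); the function names and the toy alignment are hypothetical, and the fraction of identical residues is only one possible conservation measure, not necessarily the one used in dbPTM.

```python
def alignment_column(aligned, ref_index, ref_position):
    """Map a 1-based ungapped position in the reference sequence to an alignment column."""
    seen = 0
    for column, char in enumerate(aligned[ref_index]):
        if char != "-":
            seen += 1
            if seen == ref_position:
                return column
    raise ValueError("position lies outside the reference sequence")


def site_conservation(aligned, ref_index, ref_position):
    """Return (reference residue, fraction of sequences sharing it in that column)."""
    column = alignment_column(aligned, ref_index, ref_position)
    ref_residue = aligned[ref_index][column]
    same = sum(seq[column] == ref_residue for seq in aligned)
    return ref_residue, same / len(aligned)


# Toy alignment (made up): the reference is the first row.
alignment = ["MK-SLK", "MKASLK", "MR-SLR"]
print(site_conservation(alignment, ref_index=0, ref_position=3))
```

Mapping from the ungapped protein position to the alignment column matters because PTM sites are annotated on the original sequence, whereas gaps shift the column indices.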

PTM benchmark data set for bioinformatics study
Owing to high-throughput mass spectrometry in proteomics, experimental substrate sequences of more than ten PTM types, such as phosphorylation, glycosylation, acetylation, methylation, sulfation and sumoylation, have been investigated and used for developing prediction tools [14]. To compare the predictive performance of these previously developed tools, a common evaluation standard is crucial. Therefore, we constructed a benchmark that comprises the experimental substrate sequences for each PTM type.
The process used to compile the evaluation sets is described in Figure S3 (see Additional file 1 - Figure S3) and is based on criteria developed by Chen et al. [30]. To remove redundancy, the protein sequences containing the same type of PTM site are grouped by BLASTCLUST [31] with a threshold of 30% identity. If the identity of two protein sequences is greater than 30%, the fragment sequences of the substrates are re-aligned by BL2SEQ. If the fragment sequences of two substrates at the same location are identical, only one of the substrates is included in the benchmark data set. As a result, twenty PTM types containing more than 30 experimental sites were compiled in the benchmark data set.
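The redundancy filter described above can be approximated with a short sketch. It assumes that cluster assignments have already been computed elsewhere (for instance by BLASTCLUST at 30% identity) and that each site carries a pre-extracted fragment; the dictionary keys and example values are hypothetical. The sketch replaces the BL2SEQ re-alignment step with a plain string comparison of fragments, so it is a simplification and not the authors' exact procedure.

```python
def nonredundant_sites(sites):
    """Keep one site per (cluster, fragment) pair, preserving input order.

    Each site is a dict with the keys 'cluster', 'accession', 'position'
    and 'fragment' (the residues surrounding the modified position).
    """
    seen = set()
    kept = []
    for site in sites:
        key = (site["cluster"], site["fragment"])
        if key in seen:
            continue  # identical fragment already kept for this cluster
        seen.add(key)
        kept.append(site)
    return kept


# Made-up sites: the first two share a cluster and an identical fragment.
sites = [
    {"cluster": 1, "accession": "PROT1", "position": 15, "fragment": "RRASLPG"},
    {"cluster": 1, "accession": "PROT2", "position": 17, "fragment": "RRASLPG"},
    {"cluster": 2, "accession": "PROT3", "position": 90, "fragment": "RRASLPG"},
]
print([s["accession"] for s in nonredundant_sites(sites)])
```

Identical fragments in different clusters are kept, since the filter in the text only applies to proteins whose sequence identity exceeds the clustering threshold.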
Figure 3. An illustrative example showing the catalytic specificity of acetyllysine.

Enhanced web interface
A user-friendly web interface is provided for simple searching, browsing, and downloading of protein PTM data. In addition to database queries by protein name, gene name, UniProtKB/Swiss-Prot ID or accession, it accepts protein sequences as input for similarity searches against UniProtKB/Swiss-Prot protein sequences (see Additional file 1 - Figure S4). To provide an overview of PTM types and their modified residues, a summary table is provided for browsing the information and annotations about the post-translational modification types, which refer to the UniProtKB/Swiss-Prot PTM list (http://www.expasy.org/cgi-bin/lists?ptmlist.txt) and RESID [10]. Figure 3 shows an example in which users can choose the acetylation of lysine (K) to obtain more detailed information, such as the position of the modified amino acid, the location of the modification in the protein sequence, the modified chemical formula, the mass difference, and the substrate site specificity, which is the preference of amino acids surrounding the modification sites. Furthermore, structural information, such as solvent accessibility and secondary structure surrounding the modified sites, is provided. All the experimental and putative PTM sites can be downloaded from the web interface.
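Substrate site specificity, described above as the preference of amino acids surrounding the modification sites, can be summarized as per-position residue frequencies. The sketch below is one plain way to compute such a table from equal-length windows centered on the modified residue; the function name and the toy windows are hypothetical, and the source does not specify how dbPTM derives its specificity display.

```python
from collections import Counter


def position_frequencies(windows):
    """Return {offset: {residue: frequency}} for windows centered on the modified residue.

    All windows must have the same odd length; offset 0 is the modified residue.
    """
    if not windows:
        raise ValueError("at least one window is required")
    length = len(windows[0])
    if length % 2 == 0 or any(len(w) != length for w in windows):
        raise ValueError("windows must share the same odd length")
    flank = length // 2
    table = {}
    for i in range(length):
        column = Counter(w[i] for w in windows)
        table[i - flank] = {aa: n / len(windows) for aa, n in column.items()}
    return table


# Toy windows (made up) around acetylated lysines.
freqs = position_frequencies(["GKA", "AKA", "GKS", "GKA"])
print(freqs[-1], freqs[0], freqs[1])
```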

Conclusion
The proposed server enables both wet-lab biologists and bioinformatics researchers to easily explore information about protein post-translational modifications. This study not only accumulates experimentally verified PTM sites with relevant literature references, but also computationally annotates twenty types of PTM sites in UniProtKB/Swiss-Prot proteins. As given in Table 2, the proposed knowledge base provides information on protein PTMs including sequence conservation, subcellular localization, substrate specificity, and the average solvent accessibility and secondary structure surrounding the modified site. Moreover, we constructed a PTM benchmark data set that can be adopted in computational studies to evaluate the predictive performance of tools for determining PTM sites. Previous investigations have indicated that many protein modifications give rise to binding domains for specific protein-protein interactions that regulate cellular behavior [32]. All the experimental and putative PTM sites are available for download from the web interface. Future work on dbPTM will integrate protein-protein interaction data.

Additional file 1
Supplementary figures (S1, S2, S3, and S4) and tables (S1, S2, and S3). The file provides 4 figures and 3 tables, which are described below. Figure S1. The detailed processing flow of KinasePhos-like methods. Figure S2. The multiple sequence alignment of orthologous conserved regions. Figure S3. The flowchart for removing data redundancy. Figure S4. Example of search web pages. Table S1. Data statistics of the integrated resources. Table S2. The parameters and predictive performance of the trained models with the best accuracy for each PTM type.