A comprehensive resource for integrating and displaying protein post-translational modifications
© Huang et al; licensee BioMed Central Ltd. 2009
Received: 18 November 2008
Accepted: 23 June 2009
Published: 23 June 2009
Protein Post-Translational Modification (PTM) plays an essential role in cellular control mechanisms that adjust protein physical and chemical properties, folding, conformation, stability and activity, thus also altering protein function.
dbPTM (version 1.0), which was developed previously, aimed at a comprehensive collection of protein post-translational modifications. In this updated version (dbPTM 2.0), we developed the PTM database towards an expert system of protein post-translational modifications. The database comprehensively collects experimental and predicted protein PTM sites. In addition, dbPTM 2.0 was extended to a knowledge base comprising the modified sites, solvent accessibility of substrates, protein secondary and tertiary structures, protein domains, protein intrinsic disorder regions, and protein variations. Moreover, this work compiles a benchmark for constructing evaluation datasets for computational studies that identify PTM sites, such as phosphorylated, glycosylated, acetylated and methylated sites.
The current release provides not only sequence-based information but also structure-based annotation for protein post-translational modifications. The interface is designed to facilitate access to the resource. The database is freely accessible at http://dbPTM.mbc.nctu.edu.tw/.
Protein Post-Translational Modification (PTM) plays a critical role in cellular control mechanisms, including phosphorylation for signal transduction, attachment of fatty acids for membrane anchoring and association, glycosylation for changing protein half-life, targeting substrates, and promoting cell-cell and cell-matrix interactions, and acetylation and methylation of histones for gene regulation. Several databases collecting information about protein modifications have been established through high-throughput mass spectrometry in proteomics. UniProtKB/Swiss-Prot collects a large amount of protein modification information with annotation and structure. Phospho.ELM, PhosphoSite and Phosphorylation Site Database were developed for accumulating experimentally verified phosphorylation sites. PHOSIDA integrates thousands of high-confidence in vivo phosphorylation sites identified by mass spectrometry-based proteomics in various species. Phospho3D is a database of 3D structures of phosphorylation sites, which stores information retrieved from the Phospho.ELM database and is enriched with structural information and annotations at the residue level. O-GLYCBASE is a database of glycoproteins, most of which include experimentally verified O-linked glycosylation sites. UbiProt stores experimentally identified ubiquitylated proteins and ubiquitylation sites, which are implicated in protein degradation through an intracellular ATP-dependent proteolytic system. Moreover, the RESID protein modification database is a comprehensive collection of annotations and structures for protein modifications and cross-links, including pre-, co-, and post-translational modifications.
dbPTM was developed previously to integrate several databases to accumulate known protein modifications, as well as putative protein modifications predicted by a series of accurate computational tools [12, 13]. This updated version of dbPTM was enhanced to become a knowledge base for protein post-translational modifications, which comprises a variety of new features including the modified sites, solvent accessibility of substrates, protein secondary and tertiary structures, protein domains and protein variations. We also collected literature related to PTMs, protein conservation and the specificity of substrate sites. For protein phosphorylation in particular, the site-specific interactions between catalytic kinases and substrates are provided. Furthermore, a variety of prediction tools have been developed for more than ten PTM types, such as phosphorylation, glycosylation, acetylation, methylation, sulfation and sumoylation. This work constructed a benchmark data set for computational studies of protein post-translational modification. The benchmark data set can provide a standard for measuring the performance of prediction tools that have been presented for identifying post-translational modification sites of proteins. The web interface of dbPTM was also redesigned and enhanced to facilitate access to the resource.
Data construction and content
For the computational identification of PTMs, a KinasePhos-like method [11–13, 17] was applied to identify 20 types of PTM, each of which contains at least 30 experimentally verified PTM sites. The detailed processing flow of the KinasePhos-like method is displayed in Figure S1 (see Additional file 1 – Figure S1). The learned models were evaluated using k-fold cross-validation. Table S2 (see Additional file 1 – Table S2) lists the predictive performance of these models. To reduce the number of false positive predictions, the predictive parameters were set to maximize predictive specificity.
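The evaluation idea described above (k-fold cross-validation, followed by a parameter choice that favours specificity) can be sketched as follows. This is a minimal illustration, not the KinasePhos-like pipeline itself: the function names, the score-threshold formulation and the toy scores are assumptions introduced here, and the HMM training step is left out entirely.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Split the indices 0..n_items-1 into k disjoint, roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def sensitivity_specificity(scores, labels, threshold):
    """Treat a site as predicted positive when its score is >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def threshold_for_specificity(scores, labels, min_specificity):
    """Return the lowest observed score that reaches min_specificity, or None.

    Specificity never decreases as the threshold rises, so the lowest
    qualifying threshold also retains the most sensitivity.
    """
    for threshold in sorted(set(scores)):
        _, specificity = sensitivity_specificity(scores, labels, threshold)
        if specificity >= min_specificity:
            return threshold
    return None


# Hypothetical model scores for five candidate sites (1 = verified site).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0]
chosen = threshold_for_specificity(toy_scores, toy_labels, min_specificity=1.0)
```

In practice each fold would be held out in turn while a model is trained on the remaining folds, and the threshold would be chosen from the pooled held-out scores.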
The statistics of experimental PTM sites and putative PTM sites in this study.
No. of experimental sites
No. of putative sites from UniProtKB/Swiss-Prot
No. of HMM-predicted sites in dbPTM
Serine, threonine, tyrosine, and histidine
Asparagine and lysine
Lysine, proline, serine, threonine, and tyrosine
N-terminal of some residues and side chain of lysine or cysteine
Generally at the C-terminal of a mature active peptide after oxidative cleavage of last glycine
Generally of asparagine, aspartate, proline or lysine
Generally of N-terminal phenylalanine, side chain of lysine, arginine, histidine, asparagine or glutamate, and C-terminal cysteine
Pyrrolidone Carboxylic Acid
N-terminal glutamine which has formed an internal cyclic lactam.
C-terminal asparagine, aspartate, and serine
Amidated asparagine and glutamine (needs to be followed by a G)
Serine, threonine, and tyrosine
Of the N-terminal methionine
O-8alpha-FAD tyrosine, Pros-8alpha-FAD histidine, S-8alpha-FAD cysteine, and Tele-8alpha-FAD histidine
Utility and major improvements
The enhanced features in this expanding PTM database (dbPTM 2.0).
Previous PTM database
UniProtKB/Swiss-Prot (release 46)
UniProtKB/Swiss-Prot (release 55)
Experimental PTM resource
UniProtKB/Swiss-Prot, Phospho.ELM, and O-GLYCBASE
UniProtKB/Swiss-Prot, Phospho.ELM, PHOSIDA, HPRD, O-GLYCBASE, and UbiProt
Computationally predicted PTMs
Phosphorylation, glycosylation, and sulfation
About 25 types of PTM (phosphorylation, glycosylation, sulfation, acetylation, methylation, sumoylation, hydroxylation, etc.)
Protein Data Bank (PDB)
Protein Data Bank (PDB)
RESID (373 PTM annotations)
RESID (431 PTM annotations)
Structural investigation of PTM sites
Solvent accessibility, secondary structure and intrinsic disorder region
Kinase family annotation
Swiss-Prot and Ensembl
Swiss-Prot and Ensembl
Site-specific PTM literature
Extracting the PTM-related literatures from UniProtKB/Swiss-Prot, Phospho.ELM, HPRD, O-GLYCBASE, and UbiProt
Amino acid frequency, solvent accessibility, secondary structure and disorder region surrounding modified sites
Evolutionary conservation of PTM sites
COG and ClustalW
PTM benchmark set for computational studies
Providing the benchmark for constructing PTM test set to compare the predictive performance of prediction tools
Relationship between PTM and subcellular localization
Analyzing the relationship between PTM and subcellular localization
PTM, solvent accessibility, secondary structure, protein variation, protein domain, and tertiary structure
PTM, solvent accessibility, secondary structure, protein variation, protein domain, tertiary structure, orthologous conserved regions, substrate site specificity and protein interaction network
Structural properties of PTM sites
To facilitate the investigation of structural characteristics surrounding the PTM sites, protein tertiary structures obtained from the Protein Data Bank are graphically presented using the Jmol program. For proteins with tertiary structures (5% of UniProtKB/Swiss-Prot proteins), structural properties such as the solvent accessibility and secondary structure of residues were calculated by DSSP. For proteins without tertiary structures, the solvent accessibility and secondary structure of residues were predicted by RVP-net and PSIPRED, respectively. The intrinsic disorder regions were predicted using Disopred2.
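Once per-residue values are available (from DSSP or from the predictors), summarizing the structural context of a modified site reduces to averaging over a sequence window. The sketch below assumes per-residue accessibility values in a list and a secondary-structure string with one character per residue; the function names, the default flank of six residues and the toy data are illustrative assumptions, not values taken from dbPTM.

```python
from collections import Counter


def window_around_site(values, site_position, flank=6):
    """Return the slice covering `flank` residues on each side of a 1-based site.

    The window is truncated at the protein termini rather than padded.
    """
    index = site_position - 1
    if not 0 <= index < len(values):
        raise ValueError("site_position is outside the sequence")
    start = max(0, index - flank)
    end = min(len(values), index + flank + 1)
    return values[start:end]


def mean_accessibility(accessibility, site_position, flank=6):
    """Average the per-residue solvent accessibility over the site window."""
    window = window_around_site(accessibility, site_position, flank)
    return sum(window) / len(window)


def secondary_structure_composition(ss_string, site_position, flank=6):
    """Fraction of each secondary-structure state within the site window."""
    window = window_around_site(ss_string, site_position, flank)
    counts = Counter(window)
    return {state: count / len(window) for state, count in counts.items()}


# Hypothetical per-residue data for a seven-residue protein.
toy_accessibility = [10, 20, 30, 40, 50, 60, 70]
toy_secondary_structure = "HHHEECC"
site_mean = mean_accessibility(toy_accessibility, site_position=3, flank=1)
site_states = secondary_structure_composition(toy_secondary_structure, 4, flank=1)
```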
Annotation of catalytic kinases of protein phosphorylation sites
In addition to the experimental annotations of catalytic kinases of protein phosphorylation, we applied the KinasePhos-like prediction method [11–13, 17] to identify 20 types of PTM. Figure 2 gives an example in which the experimental phosphorylation site S892 of IRS1 is predicted to be catalyzed by the protein kinases MAPK and CDK, with a preference for proline at positions -2 and +1 relative to the phosphorylation site (position 0). In addition, Y896 is predicted to be catalyzed by the kinase IGF1R, a result consistent with a previous investigation. Moreover, S892 is a protein variation site, which was mapped to a non-synonymous single nucleotide polymorphism (SNP) based on the annotation obtained from dbSNP.
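A positional preference such as proline at -2 and +1 is usually read off a table of residue frequencies at each position relative to the modified site. The following sketch computes such a table from site-centred windows. The sequences are invented for illustration and are not IRS1 fragments; the function names, the flank width and the padding character are likewise assumptions.

```python
from collections import Counter


def site_window(sequence, site_position, flank=4, pad="-"):
    """Return residues from -flank to +flank around a 1-based site.

    Positions beyond either terminus are filled with `pad`.
    """
    index = site_position - 1
    residues = []
    for offset in range(-flank, flank + 1):
        i = index + offset
        residues.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(residues)


def positional_frequencies(windows, flank=4):
    """Map each relative position to its residue frequencies across windows.

    Pad characters are counted like residues, so terminal sites lower the
    frequencies at the positions they cannot cover.
    """
    frequencies = {}
    for column, offset in enumerate(range(-flank, flank + 1)):
        counts = Counter(window[column] for window in windows)
        total = sum(counts.values())
        frequencies[offset] = {aa: n / total for aa, n in counts.items()}
    return frequencies


# Two hypothetical substrates, each with a serine site at position 5.
toy_windows = [
    site_window("AAPLSPKK", 5, flank=2),
    site_window("GGPASPRR", 5, flank=2),
]
toy_frequencies = positional_frequencies(toy_windows, flank=2)
```

With these toy inputs, `toy_frequencies[-2]` and `toy_frequencies[1]` are both dominated by proline, which is the kind of signal a sequence logo around a kinase-specific site would display.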
Evolutionary conservation of PTM sites
To determine whether a PTM site is conserved among orthologous protein sequences, we integrated the database of Clusters of Orthologous Groups (COGs), which collected 4873 COGs in 66 unicellular genomes and 4852 clusters of eukaryotic orthologous groups (KOGs) in 7 eukaryotic genomes. The ClustalW program was used to align the multiple protein sequences in each cluster, and the aligned profile is provided in the resource. An experimentally verified acetyllysine located in a conserved protein region indicates an evolutionary influence in which orthologous sites in other species could be involved in the same type of PTM (see Additional file 1 – Figure S2). Furthermore, in the example shown in Figure 2, two experimentally verified phosphorylation sites are conserved.
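Checking conservation in an aligned cluster requires mapping the site's position in the ungapped protein onto an alignment column, then comparing the residues found in that column. The sketch below assumes the alignment is already available as a dictionary of equal-length gapped strings (for example, parsed from ClustalW output); the identifiers, sequences and the simple identity-based score are illustrative assumptions rather than the measure used by dbPTM.

```python
def alignment_column(aligned_sequence, site_position, gap="-"):
    """Return the 0-based alignment column of a 1-based ungapped position."""
    seen = 0
    for column, char in enumerate(aligned_sequence):
        if char != gap:
            seen += 1
            if seen == site_position:
                return column
    raise ValueError("site_position exceeds the ungapped sequence length")


def site_conservation(alignment, reference_id, site_position, gap="-"):
    """Return the reference residue and the fraction of orthologs sharing it.

    A gap in an ortholog at the site column counts as a mismatch.
    """
    reference = alignment[reference_id]
    column = alignment_column(reference, site_position, gap)
    residue = reference[column]
    others = [seq for name, seq in alignment.items() if name != reference_id]
    if not others:
        return residue, 0.0
    matches = sum(1 for seq in others if seq[column] == residue)
    return residue, matches / len(others)


# Hypothetical three-sequence alignment; all rows have equal length.
toy_alignment = {
    "human": "MK-TAYK",
    "mouse": "MKRTAYK",
    "yeast": "M--TAFK",
}
tyrosine_result = site_conservation(toy_alignment, "human", site_position=5)
```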
PTM benchmark data set for bioinformatics study
Owing to high-throughput mass spectrometry in proteomics, the experimental substrate sequences of more than ten PTM types, such as phosphorylation, glycosylation, acetylation, methylation, sulfation and sumoylation, have been investigated and used for developing prediction tools. To understand the predictive performance of these previously developed tools, it is crucial to have a common standard for evaluating them. Therefore, we constructed a benchmark, which comprises the experimental substrate sequences for each PTM type.
The process of compiling the evaluation sets is described in Figure S3 (see Additional file 1 – Figure S3), based on criteria developed by Chen et al. To remove redundancy, the protein sequences containing the same type of PTM sites are grouped by BLASTCLUST with a threshold of 30% identity. If the identity of two protein sequences is greater than 30%, we re-aligned the fragment sequences of the substrates by BL2SEQ. If the fragment sequences of two substrates at the same location are identical, only one of the substrates was included in the benchmark data set. As a result, twenty PTM types containing more than 30 experimental sites were compiled in the benchmark data set.
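The redundancy filter can be illustrated with a simplified sketch. It assumes the clustering has already been done (each site carries the identifier of its sequence cluster) and replaces the BL2SEQ re-alignment with a plain string comparison of the site-centred fragments, so it is a stand-in for the described procedure rather than a reproduction of it. The `Site` record, the function names and the reading of the size cutoff as "at least 30 non-redundant sites" are assumptions.

```python
from collections import namedtuple

# cluster_id: identifier of the precomputed sequence cluster (hypothetical)
# fragment: residues surrounding the modified site
Site = namedtuple("Site", "cluster_id protein_id position fragment")


def remove_redundant_sites(sites):
    """Keep the first site for each (cluster, fragment) pair, in input order."""
    seen = set()
    kept = []
    for site in sites:
        key = (site.cluster_id, site.fragment)
        if key not in seen:
            seen.add(key)
            kept.append(site)
    return kept


def benchmark_by_type(sites_by_type, min_sites=30):
    """Filter each PTM type and keep types with enough non-redundant sites."""
    benchmark = {}
    for ptm_type, sites in sites_by_type.items():
        kept = remove_redundant_sites(sites)
        if len(kept) >= min_sites:
            benchmark[ptm_type] = kept
    return benchmark


# Hypothetical sites: P1 and P2 share a cluster and an identical fragment.
toy_sites = [
    Site("c1", "P1", 10, "AAASAAA"),
    Site("c1", "P2", 12, "AAASAAA"),
    Site("c1", "P3", 7, "GGGSGGG"),
    Site("c2", "P4", 5, "AAASAAA"),
]
toy_kept = remove_redundant_sites(toy_sites)
```

Here `toy_kept` retains P1, P3 and P4: P2 is dropped because it duplicates P1 within the same cluster, while P4 is kept because it belongs to a different cluster.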
Enhanced web interface
A user-friendly web interface is provided for simple searching, browsing, and downloading of protein PTM data. In addition to database queries by protein name, gene name, UniProtKB/Swiss-Prot ID or accession, it allows the input of protein sequences for similarity searches against UniProtKB/Swiss-Prot protein sequences (see Additional file 1 – Figure S4). To provide an overview of PTM types and their modified residues, a summary table is provided for browsing the information and annotations about the post-translational modification types, which refer to the UniProtKB/Swiss-Prot PTM list http://www.expasy.org/cgi-bin/lists?ptmlist.txt and RESID.
The proposed server enables both wet-lab biologists and bioinformatics researchers to easily explore information about protein post-translational modifications. This study not only accumulates the experimentally verified PTM sites with relevant literature references, but also computationally annotates twenty types of PTM sites against UniProtKB/Swiss-Prot proteins. As given in Table 2, the proposed knowledge base provides information on protein PTMs, including sequence conservation, subcellular localization and substrate specificity, and the average solvent accessibility and secondary structure surrounding the modified site. Moreover, we constructed a PTM benchmark data set that can be adopted in computational studies for evaluating the predictive performance of various tools for determining PTM sites. Previous investigations have indicated that many protein modifications cause binding domains for specific protein-protein interactions to regulate cellular behavior. All the experimental PTM sites and putative PTM sites are available for download from the web interface. Future work on dbPTM will integrate protein-protein interaction data.
Availability and requirements
Project name: dbPTM 2.0: A Knowledge Base for Protein Post-Translational Modifications
ASMD project home page: http://dbPTM.mbc.nctu.edu.tw/
Operating system(s): Platform-independent
Programming Language: PHP, Perl
Restrictions to use by non-academics: None
List of abbreviations
HMM: hidden Markov models
PDB: Protein Data Bank
SNP: single nucleotide polymorphism.
The authors would like to thank the National Science Council of the Republic of China for financially supporting this research under contract No. NSC 95-2311-B-009-004-MY3 and NSC 97-2627-B-009-007. Special thanks for financial support from the National Research Program for Genomic Medicine (NRPGM), Taiwan. This work was also partially supported by MOE ATU. Funding to pay the Open Access publication charges for this article was provided by the National Science Council of the Republic of China and MOE ATU.
- Farriol-Mathis N, Garavelli JS, Boeckmann B, Duvaud S, Gasteiger E, Gateau A, Veuthey AL, Bairoch A: Annotation of post-translational modifications in the Swiss-Prot knowledge base. Proteomics. 2004, 4 (6): 1537-1550. 10.1002/pmic.200300764.
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31 (1): 365-370. 10.1093/nar/gkg095.
- Diella F, Gould CM, Chica C, Via A, Gibson TJ: Phospho.ELM: a database of phosphorylation sites–update 2008. Nucleic Acids Res. 2008, 36 (Database issue): D240-244.
- Hornbeck PV, Chabra I, Kornhauser JM, Skrzypek E, Zhang B: PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation. Proteomics. 2004, 4 (6): 1551-1561. 10.1002/pmic.200300772.
- Wurgler-Murphy SM, King DM, Kennelly PJ: The Phosphorylation Site Database: A guide to the serine-, threonine-, and/or tyrosine-phosphorylated proteins in prokaryotic organisms. Proteomics. 2004, 4 (6): 1562-1570. 10.1002/pmic.200300711.
- Gnad F, Ren S, Cox J, Olsen JV, Macek B, Oroshi M, Mann M: PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol. 2007, 8 (11): R250. 10.1186/gb-2007-8-11-r250.
- Zanzoni A, Ausiello G, Via A, Gherardini PF, Helmer-Citterich M: Phospho3D: a database of three-dimensional structures of protein phosphorylation sites. Nucleic Acids Res. 2007, 35: D229-231. 10.1093/nar/gkl922.
- Gupta R, Birch H, Rapacki K, Brunak S, Hansen JE: O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins. Nucleic Acids Res. 1999, 27 (1): 370-372. 10.1093/nar/27.1.370.
- Chernorudskiy AL, Garcia A, Eremin EV, Shorina AS, Kondratieva EV, Gainullin MR: UbiProt: a database of ubiquitylated proteins. BMC Bioinformatics. 2007, 8: 126. 10.1186/1471-2105-8-126.
- Garavelli JS: The RESID Database of Protein Modifications as a resource and annotation tool. Proteomics. 2004, 4 (6): 1527-1533. 10.1002/pmic.200300777.
- Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006, 34: D622-627. 10.1093/nar/gkj083.
- Huang HD, Lee TY, Tzeng SW, Wu LC, Horng JT, Tsou AP, Huang KT: Incorporating hidden Markov models for identifying protein kinase-specific phosphorylation sites. J Comput Chem. 2005, 26 (10): 1032-1041. 10.1002/jcc.20235.
- Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res. 2005, 33: W226-229. 10.1093/nar/gki471.
- Zhou F, Xue Y, Yao X, Xu Y: A general user interface for prediction servers of proteins' post-translational modification sites. Nat Protoc. 2006, 1 (3): 1318-1321. 10.1038/nprot.2006.209.
- Diella F, Cameron S, Gemund C, Linding R, Via A, Kuster B, Sicheritz-Ponten T, Blom N, Gibson TJ: Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics. 2004, 5 (1): 79. 10.1186/1471-2105-5-79.
- Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, et al: Human protein reference database–2006 update. Nucleic Acids Res. 2006, 34: D411-414. 10.1093/nar/gkj141.
- Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, Chu CH, Huang HD, Ko MT, Hwang JK: KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 2007, 35: W588-594. 10.1093/nar/gkm322.
- Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A: The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004, 23 (5): 464-470. 10.1002/humu.20021.
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, et al: InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 2002, 3 (3): 225-235. 10.1093/bib/3.3.225.
- Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z, et al: The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res. 2005, 33: D233-237. 10.1093/nar/gki057.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
- Ahmad S, Gromiha MM, Sarai A: RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics. 2003, 19 (14): 1849-1851. 10.1093/bioinformatics/btg249.
- McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16 (4): 404-405. 10.1093/bioinformatics/16.4.404.
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004, 337 (3): 635-645. 10.1016/j.jmb.2004.02.002.
- Gustafson TA, He W, Craparo A, Schaub CD, O'Neill TJ: Phosphotyrosine-dependent interaction of SHC and insulin receptor substrate 1 with the NPEY motif of the insulin receptor via a novel non-SH2 domain. Mol Cell Biol. 1995, 15 (5): 2500-2508.
- Hers I, Bell CJ, Poole AW, Jiang D, Denton RM, Schaefer E, Tavare JM: Reciprocal feedback regulation of insulin receptor and insulin receptor substrate tyrosine phosphorylation by phosphoinositide 3-kinase in primary adipocytes. Biochem J. 2002, 368: 875-884. 10.1042/BJ20020903.
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41. 10.1186/1471-2105-4-41.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
- Chen H, Xue Y, Huang N, Yao X, Sun Z: MeMo: a web tool for prediction of protein methylation modifications. Nucleic Acids Res. 2006, 34: W249-253. 10.1093/nar/gkl233.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Seet BT, Dikic I, Zhou MM, Pawson T: Reading protein modifications with interaction domains. Nat Rev Mol Cell Biol. 2006, 7 (7): 473-483. 10.1038/nrm1960.