Experimental validation of predicted subcellular localizations of human proteins

Chaturvedi, Nagendra K; Mir, Riyaz A; Band, Vimla; Joshi, Shantaram S; Guda, Chittibabu

doi:10.1186/1756-0500-7-912

Short Report
Open access
Published: 15 December 2014

BMC Research Notes volume 7, Article number: 912 (2014)

Abstract

Background

Computational methods have been widely used for the prediction of protein subcellular localization. However, these predictions are rarely validated experimentally and as a result remain questionable. Therefore, experimental validation of the predicted localizations is needed to assess the accuracy of predictions so that such methods can be confidently used to annotate the proteins of unknown localization. Previously, we published a method called ngLOC that predicts the localization of proteins targeted to ten different subcellular organelles. In this short report, we describe the accuracy of these predictions using experimental validations.

Findings

We experimentally validated the predicted subcellular localizations of 114 human proteins corresponding to nine different organelles in normal breast and breast cancer cell lines using live cell imaging/confocal microscopy. Target genes were cloned into expression vectors as GFP fusions and cotransfected with an RFP-tagged organelle-specific marker gene into normal breast epithelial and breast cancer cell lines. The subcellular localization of each target protein was confirmed by colocalization with the co-expressed organelle-specific protein marker. About 82.5% of the predicted subcellular localizations coincided with the experimentally validated localizations. Agreement was highest for the endoplasmic reticulum proteins and lowest for the cytoplasmic location. With the cytoplasmic location excluded, the average prediction accuracy increased to 90.4%. In addition, no difference in protein subcellular localization was observed between normal and cancer breast cell lines.

Conclusions

The experimentally validated accuracy of the ngLOC method, with (82.5%) or without (90.4%) the cytoplasmic location, is close to the previously reported prediction accuracy of 89%. These results demonstrate that the ngLOC method can be very useful for large-scale annotation of proteins whose subcellular localization is unknown.

Findings

Background

Subcellular localization of proteins to specific compartments is fundamental to the structural organization and functioning of all living cells. Proteins that are localized to unintended organelles have been implicated in the development of many human diseases; therefore, knowledge of the protein subcellular localization can benefit target identification in the drug discovery process [1].

Protein subcellular localization is an important attribute of protein function; thus, its prediction aids genome annotation in high-throughput studies. Numerous computational methods have been used for the prediction of protein subcellular localization [2]. Some of these are limited to predicting only a small number of organelles in the cell [3, 4], while others lack a balance between sensitivity and specificity [5, 6]. Previously, we developed ngLOC, an n-gram based Bayesian method that can predict a wide range of subcellular locations, including multiple localizations of proteins [7, 8]. This method makes its predictions solely from protein sequence information, without the need for any extraneous information; ngLOC is therefore well suited to proteome-wide prediction of subcellular localizations.

The ngLOC method predicts subcellular locations at a high overall accuracy of 89%, and the accuracy is even higher (93-96%) for organelles with smaller proteomes such as lysosomes, peroxisomes and Golgi [7], which are typically difficult to predict because sufficiently large datasets are lacking. Although computational predictions provide a wealth of information on the subcellular localization of proteins, these predictions remain questionable unless they are validated by experimental methods. In the present study, we report experimental validations of ngLOC-predicted subcellular localizations of human proteins. Our results corroborated the predictions; thus, the ngLOC method can be used for proteome-wide annotation of protein localizations.

Materials and methods

Reagents and materials

Restriction enzymes and DH5α-competent cells were purchased from New England Biolabs (MA, USA). TRIzol™, the transfection reagent Lipofectamine™ 2000, and red fluorescent-tagged subcellular markers including Mitotracker™ Red FM, Lysotracker™ Red, ER-Tracker™ Red, BODIPY® TR ceramide, Hoechst 33342 and Alexa Fluor® 594 WGA were obtained from Invitrogen (CA, USA). A cDNA synthesis and ligation kit was purchased from Promega (WI, USA). Primers for the PCR amplification of all cloned genes were obtained from Integrated DNA Technologies Inc. (Coralville, IA). A 2X PCR amplification kit was purchased from Applied Biological Materials Inc. (Richmond, Canada). Plasmid and DNA gel extraction kits were obtained from Qiagen Inc. (Valencia, CA). Fluorodish 35-mm petri dishes for live cell imaging were purchased from World Precision Instruments (Sarasota, FL). All plasticware for mammalian cell culture was purchased from Corning Costar Corp. (NY, USA).

Plasmids and constructs

The pEGFP-N1 vector was kindly provided by Dr. Hamid Band (UNMC). Ten GFP-tagged full-length human gene constructs (ngLOC predicted), namely SACM1L, ST13, TUBAL3, USMG5, DECR2, AMY2B, UXS1, LGMN, NR2F1 and NAPB, were obtained from Origene Technologies Inc. (Rockville, MD). Six RFP-tagged, subcellular location-specific human gene constructs (positive markers), namely endoplasmic reticulum specific ETS, Golgi specific TGOLN, peroxisome specific PXPM2, mitochondria specific PDHA1, plasma membrane specific LCK and cytoskeleton specific β-ACTIN, were also purchased from Origene Technologies Inc. (Rockville, MD).

Isolation of RNA and cDNA preparation

Total RNA was extracted from HEK-293T cells using the TRIzol™ method according to the manufacturer’s instructions. RNA quantity and purity were determined by UV spectrophotometry and by electrophoresis on a 2% agarose gel. Two micrograms of RNA were then reverse transcribed using random hexamer primers and the SuperScript RT enzyme according to the manufacturer’s instructions (Invitrogen, CA).

PCR amplification and gene cloning

PCR amplification was performed with the 2X PCR master mix kit containing Taq DNA polymerase, using 30–35 cycles according to the manufacturer’s protocols. For amplification, two sets of primers carrying the appropriate restriction enzyme sites were used against the full-length ORF of each human gene. The primers used for gene cloning in this study are listed in Additional file 1. Each PCR-amplified gene product was separated on a 1% agarose gel in 1X TAE buffer (pH 8.0) and visualized by ethidium bromide staining. The PCR-amplified gene products were purified using a gel extraction kit; the purified products were then double digested with restriction enzymes using one of the combinations NheI/XhoI, NheI/HindIII or BglII/BamHI. Following restriction digestion, the full-length genes were cloned into the pEGFP-N1 vector using a LigaFast ligation kit and transformed into the E. coli DH5α strain. Positive clones were screened and confirmed by appropriate restriction digestion.

Cell lines and culture conditions

The normal breast epithelial cell lines MCF-10A and MCF-12F were obtained from the American Type Culture Collection (Rockville, MD). These cell lines were maintained in D-media as described previously [9]. The breast cancer epithelial cell lines MCF-7 and MDA-MB-231 were kindly provided by Dr. Vimla Band (UNMC). These cells were cultured in α-MEM media supplemented with 10% FBS (Invitrogen, CA), 2 mM glutamine (Invitrogen, CA), 50 μg/ml gentamicin (Invitrogen, CA), 1x sodium pyruvate (Invitrogen, CA), 1x MEM non-essential amino acids (Invitrogen, CA), 1x HEPES (Invitrogen, CA) and 1 μg/ml insulin (Sigma). The cultures were maintained at 37°C in a humidified incubator with an atmosphere of 5% CO2 and 95% air. All cultures were passaged twice a week and maintained at a concentration no greater than 1 × 10⁶/ml.

Transient transfections and confocal microscopy

Normal breast (MCF-10A, MCF-12F) and breast cancer (MCF-7, MDA-MB-231) epithelial cells were seeded on 35-mm fluorodish petri dishes to reach approximately 50-70% confluence in their respective medium. The next day, cells were transiently co-transfected with 1 μg of the GFP-tagged predicted target gene and the subcellular location-specific RFP-tagged marker gene (endoplasmic reticulum specific ETS, Golgi specific TGOLN, peroxisome specific PXPM2, mitochondria specific PDHA1, plasma membrane specific LCK and cytoskeleton specific β-ACTIN) for each of the localizations, using Lipofectamine in serum-free MEM medium. After 6–8 hours of transfection incubation, cells were supplemented with their respective complete media and incubated for another 12 hours to allow protein expression. Following protein expression, the subcellular distribution and co-localization of proteins were assessed under the confocal microscope. Alternatively, other red fluorescent subcellular location-specific markers (dyes) were used with live cells to validate each localization. A predicted localization was considered validated when co-localization produced a yellow color upon merging the images of the target protein and the specific subcellular marker. The nuclear stain Hoechst-33342 (1 μg/ml) was added to live cells to visualize the nucleus. Fluorescence images of live cells were recorded with a Zeiss LSM 710 confocal microscope (Jena, Germany) with a 40X objective lens. Images were captured and analyzed with LSM software (Jena, Germany) and processed using standard software programs.
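The "yellow upon merging" criterion can be illustrated with a minimal sketch. In the study the overlap was judged visually on confocal images; the function names, the pixel representation and the intensity threshold of 128 below are illustrative assumptions, not part of the published protocol.

```python
def merge_channels(green, red):
    """Overlay two same-sized grayscale images (lists of rows, values 0-255).

    Returns rows of (R, G, B) pixels. A pixel that is bright in both the
    GFP (green) and RFP (red) channel appears yellow in the merged image.
    """
    return [
        [(r, g, 0) for g, r in zip(g_row, r_row)]
        for g_row, r_row in zip(green, red)
    ]


def yellow_fraction(green, red, threshold=128):
    """Fraction of GFP-positive pixels that are also RFP-positive.

    The threshold is a hypothetical cutoff chosen only for illustration.
    """
    gfp_pixels = 0
    overlapping = 0
    for g_row, r_row in zip(green, red):
        for g, r in zip(g_row, r_row):
            if g >= threshold:
                gfp_pixels += 1
                if r >= threshold:
                    overlapping += 1
    return overlapping / gfp_pixels if gfp_pixels else 0.0


# Toy 2x2 images: one pixel is bright in both channels.
green = [[200, 0], [180, 10]]
red = [[210, 0], [20, 220]]

merged = merge_channels(green, red)
fraction = yellow_fraction(green, red)
```

Here one of the two GFP-positive pixels overlaps with the marker, so `fraction` is 0.5. A high fraction would correspond to the largely yellow merged images described above; where to draw the line is a judgment the sketch does not attempt to settle.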

Results and discussion

The research strategy used for experimental validation of ngLOC-predicted protein subcellular localizations is described in Figure 1. cDNA was synthesized from HEK-293T cells; using this cDNA, the genes of 105 target proteins of human origin were PCR amplified and then cloned into a GFP expression vector (pEGFP-N1) with GFP at the N-terminus as a fusion gene. Using the ngLOC method, 114 target proteins with predicted subcellular localization (105 locally cloned and nine commercially obtained) were selected for this validation study (Additional file 2). GFP-expressing fusion genes, along with the corresponding location-specific RFP-tagged protein markers, were transiently co-expressed following transfection into two normal breast (MCF-10A, MCF-12F) and two breast cancer (MCF-7, MDA-MB-231) cell lines; their subcellular localization was then determined using live cell imaging/confocal microscopy. In the present study, nine different subcellular compartments were selected for validating the predicted subcellular localization of proteins. The images in Figure 2 show representative validated localizations for predicted proteins in each compartment. The localizations for each compartment (except for nucleus and cytoplasm) were determined by observing the colocalization of GFP- and RFP-tagged proteins, which produced a yellow color upon merging the images. For the nucleus and cytoplasm, we used a nuclear (Hoechst) stain to validate the protein subcellular localization in either location (Figure 2).

Table 1 lists the prediction for each gene tested, along with the outcome of the validation experiment. Similarly, Figure 3 shows the number of proteins tested and the number successfully validated. Our live cell imaging results showed that overall about 82.5% (94 out of 114) of the proteins validated in this study agreed with the ngLOC-predicted localizations; these results were consistent in all four cell lines tested. However, with the exclusion of the cytoplasm location, which shows the lowest accuracy (45%), the average prediction rate increases to 90.4% (85 out of 94).

The ngLOC method outputs its predictions in ranked order, using the associated confidence score (probability) for each location. The top two locations can be predicted within a close confidence range, suggesting that either or both of the predictions can be true. It is known that a number of proteins are localized to multiple organelles in eukaryotic cells [7]. To test the accuracy of the second choice, we also validated the second predictions of 30 proteins, which included 17 proteins (Set I) whose first-choice predictions were proven wrong and 13 proteins (Set II) whose first-choice predictions were accurate in the above experiments. From Set I, 10 proteins showed a homogeneous distribution in cells, suggesting their localization in both the cytoplasm and the nucleus (Table 2). For seven of these 10 proteins, the top two ngLOC predictions were cytoplasm or nucleus, which supports our finding that these proteins are localized in both nucleus and cytoplasm. For the other 7 proteins in Set I, the second-choice predictions were validated as correct for only 2 proteins (Table 2). Validation results on Set II showed that about 46% (6 out of 13) of the proteins tested also agreed with the second prediction (Table 2), indicating that these proteins are dual localized. With the inclusion of the second-prediction validations, we have experimentally validated the subcellular localization of 144 ngLOC predictions.
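The percentages reported above can be rechecked from the stated counts. The following is a small arithmetic sketch; the cytoplasm counts (20 tested, 9 correct) are not quoted in the text but are derived here by subtraction, and the variable names are illustrative.

```python
# Counts quoted in the text.
TOTAL_TESTED = 114
TOTAL_CORRECT = 94
NON_CYT_TESTED = 94    # denominator reported once cytoplasm is excluded
NON_CYT_CORRECT = 85
SET_I = 17             # first choice wrong, second choice tested
SET_II = 13            # first choice right, second choice tested
SET_II_SECOND_CORRECT = 6


def percent(correct, tested, digits=1):
    """Accuracy as a percentage, rounded to the given number of digits."""
    return round(100.0 * correct / tested, digits)


overall = percent(TOTAL_CORRECT, TOTAL_TESTED)
without_cytoplasm = percent(NON_CYT_CORRECT, NON_CYT_TESTED)

# Derived by subtraction (an inference, not a quoted figure).
cyt_tested = TOTAL_TESTED - NON_CYT_TESTED
cyt_correct = TOTAL_CORRECT - NON_CYT_CORRECT
cytoplasm = percent(cyt_correct, cyt_tested)

set_ii_dual = percent(SET_II_SECOND_CORRECT, SET_II)
total_validations = TOTAL_TESTED + SET_I + SET_II

print(overall, without_cytoplasm, cytoplasm, set_ii_dual, total_validations)
```

The derived cytoplasm figure of 9 out of 20 matches the 45% accuracy stated for that location, and the total of 114 first-choice plus 30 second-choice validations gives the 144 predictions mentioned above.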

Table 1 Experimental validation of ngLOC-predicted protein subcellular localization

Table 2 Experimental validation of ngLOC second-choice predictions of protein subcellular localization

We also examined the correlation between the confidence score (CS) and prediction accuracy for ngLOC predictions. The CS is expressed as a percentage and ranges from zero to 100. The ngLOC method uses a minimum CS of 20 to make predictions [7]; however, we chose only a small subset of predicted proteins for validation. The CS of the validated proteins ranged from 20 to 73 in this study. We divided the validated proteins into two groups, a low CS group (CS <46) and a high CS group (CS >46), where a CS of 46 is the midpoint of the CS range of the validated proteins. Our validation results showed that 88% (50 out of 57) of the low CS group proteins were predicted accurately, compared with 77% (44 out of 57) of the high CS group proteins. While these results are counter-intuitive, the high CS group contains a number of proteins predicted to be localized to the cytoplasm, which has the highest false positive rate. Without counting the cytoplasmic proteins, the accuracies would be 92% for the low CS group and 89% for the high CS group. These results indicate that there is no significant correlation between the CS and prediction accuracy. We presume that the lack of correlation is due to the unbalanced selection of validated proteins from a narrow range of confidence scores (see Additional file 3: Table S1), which in turn is due to feasibility (project costs limiting the sample size) and technical (PCR amplification of longer genes) issues that limited our ability to select proteins from a wider CS range for validation.

Despite the lower CS range of the predictions for ER (33-65%), lysosomal (38-50%) and peroxisomal (23-39%) proteins, the validation accuracy is 100% at these locations. Similarly, plasma membrane, cytoskeletal, mitochondrial, Golgi and nuclear proteins recorded about 85% accuracy (Figure 3). Conversely, cytoplasmic proteins scored the lowest, with only 45% prediction accuracy. The high number of false positives in this location can be attributed to the fact that the cytoplasm, being the default location for protein synthesis, lacks specific targeting signals, which makes it difficult to predict. Another reason could be the dual- or multi-localization of about one-third of cytoplasmic proteins to other locations [7], which makes it difficult for machine learning methods to discriminate cytoplasmic proteins from those of other locations.

Overall, the experimental validations in this study show that the ngLOC method can predict the subcellular localization of proteins at an accuracy of 82.5%, compared with the reported accuracy of 89% [7]. However, with the exclusion of the low-performing cytoplasmic location (45%), the average accuracy rate rose to 90.4% (85 out of 94). As shown in Figure 3, the accuracy is especially notable for the locations with smaller proteomes (ER, Golgi, lysosome and peroxisome), which are typically difficult to predict by machine learning methods. These results demonstrate the robustness and accuracy of the ngLOC method and its applicability to annotating the unknown subcellular localizations of the proteomes of eukaryotic species.
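The split into low and high confidence score groups can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function name is hypothetical, the per-protein records are placeholders rebuilt from the reported group counts (the real CS values are in the supplementary tables), and assigning a CS exactly equal to the midpoint to the high group is an assumption, since the text only defines CS <46 and CS >46.

```python
def accuracy_by_cs_group(records, midpoint=46):
    """Group (confidence_score, was_correct) pairs and report accuracy.

    Returns {"low": (correct, tested, percent), "high": (...)}.
    A score equal to the midpoint is placed in the high group (assumption).
    """
    counts = {"low": [0, 0], "high": [0, 0]}  # [correct, tested]
    for score, was_correct in records:
        group = "low" if score < midpoint else "high"
        counts[group][1] += 1
        if was_correct:
            counts[group][0] += 1
    return {
        group: (correct, tested, round(100.0 * correct / tested) if tested else None)
        for group, (correct, tested) in counts.items()
    }


# Placeholder records matching the reported counts: 50 of 57 correct in the
# low CS group and 44 of 57 correct in the high CS group. The scores 30 and
# 60 are arbitrary stand-ins for "below" and "above" the midpoint.
records = (
    [(30, True)] * 50 + [(30, False)] * 7
    + [(60, True)] * 44 + [(60, False)] * 13
)

summary = accuracy_by_cs_group(records)
print(summary)
```

With these placeholder records the summary reproduces the 88% and 77% accuracies quoted for the two groups. Repeating the calculation after dropping the cytoplasmic predictions would require the per-protein data, which this sketch does not include.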

Conclusion

This study experimentally validates and reports the accuracy of a computational method called ngLOC that predicts the subcellular localization of protein sequences in eukaryotic cells. We validated 114 human proteins that were predicted to be localized to nine distinct subcellular locations in eukaryotic cells. The overall validation accuracy of the ngLOC method is 82.5%, and it improves to 90.4% when the cytoplasmic location is excluded, compared with the overall prediction accuracy of 89%. Thus, this validation study demonstrates that ngLOC can be reliably used (with the exception of the cytoplasmic location) to annotate the subcellular localization of proteins and affirms the utility of this method in large-scale annotation of newly sequenced proteomes.

Abbreviations

GFP: Green fluorescent protein
RFP: Red fluorescent protein
PLA: Plasma membrane
CSK: Cytoskeleton
CYT: Cytoplasm
END: Endoplasmic reticulum
GOL: Golgi complex
MIT: Mitochondria
LYS: Lysosome
POX: Peroxisome
NUC: Nucleus
UNMC: University of Nebraska Medical Center

References

1. Donnes P, Hoglund A: Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics. 2004, 2: 209-215.
2. Sprenger J, Fink JL, Teasdale RD: Evaluation and comparison of mammalian subcellular localization prediction methods. BMC Bioinformatics. 2006, 7 (Suppl 5): S3. 10.1186/1471-2105-7-S5-S3.
3. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000, 300: 1005-1016. 10.1006/jmbi.2000.3903.
4. Guda C, Fahy E, Subramaniam S: MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics. 2004, 20: 1785-1794. 10.1093/bioinformatics/bth171.
5. Nakai K, Kanehisa M: A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics. 1992, 14: 897-911. 10.1016/S0888-7543(05)80111-9.
6. Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19: 1656-1663. 10.1093/bioinformatics/btg222.
7. King BR, Guda C: ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes. Genome Biol. 2007, 8: R68. 10.1186/gb-2007-8-5-r68.
8. King BR, Vural S, Pandey S, Barteau A, Guda C: ngLOC: software and web server for predicting protein subcellular localization in prokaryotes and eukaryotes. BMC Research Notes. 2012, 5: 351. 10.1186/1756-0500-5-351.
9. Band V, Zajchowski D, Kulesa V, Sager R: Human papilloma virus DNAs immortalize normal human mammary epithelial cells and reduce their growth factor requirements. Proc Natl Acad Sci U S A. 1990, 87: 463-467. 10.1073/pnas.87.1.463.

Acknowledgements

This research is fully supported by National Institutes of Health [1R01GM086533-01A1 to CG]. The authors thank the confocal core and the bioinformatics and systems biology core at UNMC for their help in this study. The authors also thank Dr. Hamid Band (UNMC) for providing the pEGFP-N1 vector for our cloning experiments, and Mrs. Megan Brown for proofreading the manuscript.

Author information

Authors and Affiliations

Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, 985870 Nebraska Medical Center, Omaha, NE, 68198, USA
Nagendra K Chaturvedi, Riyaz A Mir, Vimla Band, Shantaram S Joshi & Chittibabu Guda
Fred and Pamela Buffet Cancer Center, Omaha, USA
Vimla Band & Chittibabu Guda
Eppley Institute for Cancer Research, Omaha, USA
Chittibabu Guda
Bioinformatics and Systems Biology Core, University of Nebraska Medical Center, Omaha, NE, 68198-5805, USA
Chittibabu Guda

Corresponding author

Correspondence to Chittibabu Guda.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

NKC and CG conceived and designed the research study. NKC, RAM, and CG performed the experiments. NKC and CG analyzed the data, wrote the paper, and revised the paper critically. RAM, VB and SSJ revised the paper and contributed reagents and materials. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Primer list with the restriction sites used for gene cloning. (DOCX 32 KB)

Additional file 2: Predictions by ngLOC method for proteins without a known subcellular localization. (XLSX 88 KB)

Additional file 3: Table S1: Statistics showing the spread and range of confidence scores (CS) in the predicted and validated proteins in each subcellular location. (DOCX 60 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Chaturvedi, N.K., Mir, R.A., Band, V. et al. Experimental validation of predicted subcellular localizations of human proteins. BMC Res Notes 7, 912 (2014). https://doi.org/10.1186/1756-0500-7-912

Received: 01 October 2014
Accepted: 10 December 2014
Published: 15 December 2014
DOI: https://doi.org/10.1186/1756-0500-7-912
