A program to identify prognostic and predictive gene signatures
© Chorlton et al.; licensee BioMed Central Ltd. 2014
Received: 9 June 2014
Accepted: 24 July 2014
Published: 18 August 2014
The advent of high-throughput technologies to profile human tumors has generated unprecedented insight into our molecular understanding of cancer. However, analysis of such high dimensional data is challenging and requires significant expertise which is not routinely available to many cancer researchers.
To overcome this limitation, we developed a freely accessible and user-friendly Program to Identify Molecular Signatures (PIMS). Such signatures offer insight into cancer biology and serve as clinical tools for identifying potential biomarkers that might accurately stratify patients into different risk or treatment groups. We evaluated the performance of PIMS by identifying and testing predictive and prognostic gene signatures for breast cancer, using multiple breast tumor microarray cohorts representing hundreds of patients. PIMS-identified signatures classified patients into high and low risk groups with performance at least similar to that of other commonly used gene signature selection techniques.
Our program is contained entirely within a Microsoft Excel file and therefore requires no installation of any additional programs or training. Hence, PIMS provides an accessible tool for cancer researchers to identify predictive and prognostic gene signatures to advance their research.
Oncologists face the challenging task of predicting which patients are most likely to benefit from various treatment modalities, while avoiding overtreatment of patients who are unlikely to benefit from aggressive therapy. For example, in breast cancer the traditional parameters used by pathologists to determine patient prognosis include age, tumor size, and various histopathological measurements such as clinical grade and hormone receptor status [1, 2]. More recently, the development of gene expression profiling technologies such as microarrays and quantitative RT-PCR has led to the use of molecular signatures as an additional means of providing prognostic information for breast cancer patients [3–15]. Indeed, multigene predictors, which are also commonly called gene signatures, are already being used clinically in some instances, such as the MammaPrint® and OncotypeDX™ tests. Apart from breast cancer, gene signatures have also been applied to other cancer types to determine patient prognosis and other clinical parameters of interest [16, 17]. Additionally, examination of the transcripts that comprise gene signatures can reveal biological processes that underlie clinical phenomena, and potentially uncover new therapeutic avenues. Hence, gene signatures provide an important tool to advance clinical as well as basic cancer research. However, identifying predictive or prognostic gene signatures requires specialized software and bioinformatics training, which ultimately hampers their adoption where such infrastructure or skills are lacking.
We hypothesized that an Excel program that identifies predictive and prognostic gene signatures, and does not require the installation or use of any other software packages, would increase the accessibility of this type of research. To this end, we adapted and improved an algorithm we previously published into a freely accessible and user-friendly Excel program: Program to Identify Molecular Signatures (PIMS). Here, we demonstrate its use to identify prognostic gene signatures, which stratify breast cancer patients into high and low risk groups, as well as predictive gene signatures, which stratify breast cancer patients into chemotherapy responsive and non-responsive groups. These findings suggest that our program is robust and can be used to develop predictive and prognostic gene signatures for user-defined contexts. Hence, we conclude that PIMS provides an accessible tool for cancer researchers to identify predictive and prognostic gene signatures to advance their research aims.
Microarray and clinical data
All data were de-identified and obtained from publicly available sources through the Gene Expression Omnibus. We downloaded the following datasets, together with their associated clinical data: GSE2034 (n = 286), GSE7390 (n = 198), GSE25055 (n = 310), GSE25065 (n = 198), and GSE14333 (n = 290). All datasets were normalized using RMA on the public GenePattern server (http://genepattern.broadinstitute.org).
Summary of training and validation cohorts used for prognostic signature. Survival at 10 yrs; total arrays: 484.
Summary of training and validation cohorts used for predictive signature. RCB0/I vs RCBII/III; total arrays: 508.
Feature selection algorithm
We significantly improved a previously published feature selection algorithm by adding leave-one-out cross-validation as well as an improved means of calculating signature scores, to produce software capable of identifying prognostic and predictive gene signatures. Initially, gene expression for all patients is standardized across each probe set, such that the mean and standard deviation of each probe set are set to 0 and 1, respectively. Gene expression is then binned into the categories high, typical, and low based on the 95% confidence interval of expression for a given gene. High expression indicates that the expression of a gene in a patient exceeds the upper bound of the 95% confidence interval of expression for that gene among all patients, and low expression indicates that it falls below the lower bound of that interval. Expression within the 95% confidence interval is considered typical. A predictive score (initially set at 0) for each probe set/gene is then calculated in the following way (Additional file 1: Figure S1):
- Patients who had the event and have high expression of a gene increase the predictive score of that gene by 1.
- Patients who had the event and have low expression of a gene decrease the predictive score of that gene by 1.
- Patients who did not have the event and have high expression of a gene decrease the predictive score of that gene by 1.
- Patients who did not have the event and have low expression of a gene increase the predictive score of that gene by 1.
- Typical expression of a gene in any patient does not change its predictive score.
In this fashion, high absolute predictive gene scores may be achieved by either high or low expression of a given gene being related to patient outcome. Finally, we rank the genes by predictive score and select the most predictive genes. The magnitude of the difference in mean gene expression between the high and low risk groups is used as a tie-breaker. Consequently, the expression of probe sets that receive the highest scores is associated with high risk tumors (those that recur within 10 years), and the expression of probe sets that receive the lowest scores is associated with low risk tumors (those that do not recur within 10 years). To estimate the performance of a given signature in an unbiased fashion, and to reduce over-fitting, we added the capacity for PIMS to perform leave-one-out cross-validation. Screenshots of this process as well as detailed instructions can be found in Additional file 2 (PIMS user guide).
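The scoring and ranking steps described above can be sketched in Python as follows. This is a minimal illustration rather than the PIMS implementation, which is written in Visual Basic for Applications. The function names, the 1.96 cutoff (which reads the 95% confidence interval as the mean ± 1.96 standard deviations of the standardized values), and the choice to rank by absolute score are all assumptions.

```python
from statistics import mean, pstdev

Z_CUTOFF = 1.96  # assumption: "95% confidence interval" read as mean +/- 1.96 SD


def standardize(values):
    """Scale one probe set to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def predictive_score(z_values, events):
    """Apply the +1/-1 rules to one probe set across all patients."""
    score = 0
    for z, had_event in zip(z_values, events):
        if z > Z_CUTOFF:        # high expression
            score += 1 if had_event else -1
        elif z < -Z_CUTOFF:     # low expression
            score += -1 if had_event else 1
        # typical expression leaves the score unchanged
    return score


def mean_difference(z_values, events):
    """Tie-breaker: gap in mean expression between event and non-event patients."""
    with_event = [z for z, e in zip(z_values, events) if e]
    without_event = [z for z, e in zip(z_values, events) if not e]
    if not with_event or not without_event:
        return 0.0
    return abs(mean(with_event) - mean(without_event))


def rank_probes(expression, events):
    """Return (probe, score) pairs, most predictive first.

    Assumption: "most predictive" means largest absolute score, with ties
    broken by the larger difference in mean expression.
    """
    rows = []
    for probe, values in expression.items():
        z = standardize(values)
        rows.append((probe, predictive_score(z, events), mean_difference(z, events)))
    rows.sort(key=lambda r: (abs(r[1]), r[2]), reverse=True)
    return [(probe, score) for probe, score, _ in rows]
```

Here `expression` is assumed to map each probe set to a list of expression values ordered the same way as `events`, a list of booleans marking which patients had the event.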
In the signature score calculation, x is the transformed expression, n is the number of probe sets, P is the set of probes with reported positive correlation to the target probe set, and N is the set of probes with reported negative correlation to the target probe set [13, 15].
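The definitions above are consistent with a signature score of the form (sum of x over P minus sum of x over N) divided by n, although that exact expression is an assumption made here for illustration and is not taken from the text. Under that assumption, and taking n to be the total number of probe sets in P and N, a per-patient score and a median cut-point could be sketched as below. The function names are hypothetical.

```python
from statistics import median


def signature_score(x, positive, negative):
    """Assumed form: (sum of x over P minus sum of x over N) divided by n."""
    n = len(positive) + len(negative)
    if n == 0:
        raise ValueError("signature must contain at least one probe set")
    total = sum(x[p] for p in positive) - sum(x[p] for p in negative)
    return total / n


def split_at_median(scores):
    """Label each patient 'high' or 'low' using a median cut-point."""
    cut = median(scores.values())
    return {pid: ("high" if s > cut else "low") for pid, s in scores.items()}
```

In this sketch `x` maps probe set identifiers to one patient's transformed expression values, and `scores` maps patient identifiers to their signature scores.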
The program is contained entirely within an Excel file, therefore requiring no installation. All that is required to operate our program is Excel 2007 or later. Additionally, our program is freely accessible and is included as a supplementary file, which accompanies this manuscript. The code for our program is written in Visual Basic for Applications and is easily accessible from within Excel.
Prediction Analysis of Microarrays (PAM)
PAM was installed and used in R according to the available manual.
Binary Regression (BR)
For the prognostic validation, we calculated the hazard ratio (HR), logrank p-value (median cut-point), area under the ROC curve (AUC), and specificity at 80% sensitivity, to determine the significance of the difference in survival between predicted good and poor survival groups. For the predictive validation, we calculated the odds ratio (OR) and Fisher’s exact test to assess performance. Survival analysis and all associated statistical tests were performed using IBM SPSS Statistics and R.
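Several of the validation metrics named above can be computed without specialized packages. The following sketch uses only Python's standard library and hypothetical function names to show an odds ratio, a two-sided Fisher's exact test, and specificity at a target sensitivity. It is not the SPSS or R code used in this study, and the convention of summing all tables no more probable than the observed one is one common definition of the two-sided test.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    if b == 0 or c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


def specificity_at_sensitivity(scores, events, target=0.80):
    """Specificity at the highest cut-point whose sensitivity reaches the target."""
    positives = [s for s, e in zip(scores, events) if e]
    negatives = [s for s, e in zip(scores, events) if not e]
    for cut in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= cut for s in positives) / len(positives)
        if sensitivity >= target:
            return sum(s < cut for s in negatives) / len(negatives)
    return 0.0
```

For a predictive signature, the 2x2 table would cross predicted group (responsive or non-responsive) with observed response. Patients scoring at or above the cut-point are treated as predicted positives.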
Identification of a prognostic gene signature
Pathway analysis of the prognostic genes selected by PIMS demonstrated that these genes were enriched in several biological processes previously linked to breast cancer patient outcome (Additional file 3: Table S1). These included regulation of adherens junctions as well as nuclear regulation of SMAD2/3 signaling, which occurs downstream of TGFβ signaling. Given the previously reported linkage between adherens junctions, TGFβ signaling, and breast cancer patient prognosis, these results support the capacity of PIMS to select prognostic genes.
Comparison with other models
Comparison of PIMS, PAM, and binary regression identified prognostic signatures. Metrics reported for each method: log-rank test p-value, specificity at 80% sensitivity, and Cox regression p-value.
Comparison of PIMS with randomly generated signatures
Identification of a predictive gene signature
Comparison of PIMS, PAM, and binary regression identified predictive signatures, GSE25065 (validation cohort):
0.70, p = 0.001; Fisher’s exact test p = 0.002
0.72, p = 0.0002; Fisher’s exact test p = 0.002
0.71, p = 0.0004; Fisher’s exact test p = 0.002
Hence, these data suggest that PIMS has the capacity to identify predictive gene signatures. Taken with our previous data, we conclude that PIMS provides a robust means of identifying predictive and prognostic gene signatures in breast cancer.
PIMS identifies prognostic gene signatures in additional tumor types
To confirm that the utility of PIMS was not limited to breast cancers, we tested its capacity to identify prognostic signatures for risk stratification of colon cancer patients. Briefly, we obtained publicly available gene expression profiling data for which clinical follow-up data were also available (GSE14333). We randomly divided this cohort into equally sized training and validation cohorts, and implemented PIMS to identify a 12-feature signature that stratified training patients into good and poor outcome groups. Application of this 12-gene signature to the validation cohort stratified these patients into high and low risk groups (Additional file 4: Figure S2, HR: 1.3, p = 0.0004, log-rank test). Taken together, these data demonstrate the capacity of PIMS to identify prognostic signatures in colon cancers. Overall, we conclude that PIMS provides a robust and reproducible method to identify prognostic and predictive gene signatures.
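The random division of a cohort into equally sized training and validation halves could be done along the following lines. This is a sketch with a hypothetical function name, and the fixed seed is an assumption added so that the split can be reproduced. The text does not state how the randomization was performed.

```python
import random


def split_cohort(patient_ids, seed=0):
    """Shuffle patient identifiers and cut the list into two halves."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]
```

With a cohort of 290 patients, as in GSE14333, this would yield two groups of 145.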
Here, we report a freely accessible and user-friendly program to identify predictive and prognostic gene signatures. An important characteristic of our program is that it is contained entirely within a single Microsoft Excel file. Because Excel is widely used and widely available, implementing our program is straightforward. By contrast, the vast majority of current feature selection techniques rely on clustering and classification algorithms that require installation of advanced statistical software packages as well as a significant time investment in training with that software.
A comparison of PIMS with PAM and binary regression suggested that PIMS-identified signatures performed with accuracy comparable to that of other commonly used feature selection techniques. It is noteworthy that for each signature, regardless of its method of derivation, the predictive accuracy diminished between the training and validation groups, suggesting that over-fitting occurred during training. This is a common property of feature selection algorithms [10, 26]. To confirm that PIMS-selected signatures were robustly associated with the defined clinical variables, we also generated 10,000 randomly selected signatures and compared their predictive capacity with that of the PIMS-selected signature. In this case, the performance of the PIMS signature was within the 99th percentile of the randomly generated signatures, supporting the robustness of PIMS-selected signatures.
Whereas the experiments presented here focused on identifying prognostic and predictive gene signatures for breast cancer from microarray data, PIMS would also be appropriate for identifying such signatures in other cancer types (lung, ovarian, colon, etc.). Indeed, this notion is supported by our demonstration that PIMS can similarly identify prognostic gene signatures in colon cancer patients. Moreover, PIMS would readily function on other data formats as well, such as RNAseq data, or even copy number array data. Accordingly, we suggest that PIMS is broadly applicable to the most commonly used high-throughput techniques for profiling tumors.
We have built upon our previously published feature selection algorithm and packaged it into a freely accessible, user-friendly Excel file. Our data suggest that PIMS identifies gene signatures that are robustly associated with user-defined clinical variables. Hence, PIMS represents a broadly applicable method to generate prognostic and predictive gene signatures that we expect will be highly useful to the research community.
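The comparison against random signatures can be illustrated with a short sketch. The function below is hypothetical: it assumes a caller-supplied `evaluate` function that returns a performance value for a signature (higher meaning better), and it reports the percentage of random signatures that perform no better than the observed one. The performance metric used for this comparison is not specified here, so none is assumed.

```python
import random


def random_signature_percentile(observed, probes, size, evaluate,
                                n_random=10000, seed=0):
    """Percentage of random signatures that the observed performance beats or ties."""
    rng = random.Random(seed)
    beaten = 0
    for _ in range(n_random):
        # Draw a random signature of the same size, without replacement.
        signature = rng.sample(probes, size)
        if evaluate(signature) <= observed:
            beaten += 1
    return 100.0 * beaten / n_random
```

A returned value of 99 or above would correspond to a signature that performs within the 99th percentile of random signatures of the same size.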
The authors wish to acknowledge funding from the Canadian Breast Cancer Foundation and the Canadian Stem Cell Network that supported the research described herein. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Hayes DF, Trock B, Harris AL: Assessing the clinical impact of prognostic factors: when is ‘statistically significant’ clinically useful? Breast Cancer Res Treat. 1998, 52: 305-319. 10.1023/A:1006197805041.
- American Society for Clinical Oncology: 1997 update of recommendations for the use of tumor markers in breast and colorectal cancer. Adopted on November 7, 1997 by the American Society of Clinical Oncology. J Clin Oncol. 1998, 16: 793-795.
- Kim C, Paik S: Gene-expression-based prognostic assays for breast cancer. Nat Rev Clin Oncol. 2010, 7: 340-347. 10.1038/nrclinonc.2010.61.
- Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner F, Walker M, Watson D, Park T, Hiller W, Fisher E, Wickerham D, Bryant J, Wolmark N: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004, 351: 2817-2826. 10.1056/NEJMoa041588.
- Chang HY, Nuyten DS, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, van’t Veer LJ, Bartelink H, van de Rijn M, Brown PO, van de Vijver MJ: Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci U S A. 2005, 102: 3738-3743. 10.1073/pnas.0409462102.
- Van de Vijver MJ, He YD, Van’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002, 347: 1999-2009. 10.1056/NEJMoa021967.
- van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006, 98: 262-272. 10.1093/jnci/djj052.
- Sotiriou C, Pusztai L: Gene-expression signatures in breast cancer. N Engl J Med. 2009, 360: 790-800. 10.1056/NEJMra0801289.
- Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G: A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics. 2008, 24: 2200-2208. 10.1093/bioinformatics/btn374.
- Wang Y, Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365: 671-679. 10.1016/S0140-6736(05)17947-1.
- Hallett RM, Dvorkin A, Gabardo CM, Hassell JA: An algorithm to discover gene signatures with predictive potential. J Exp Clin Cancer Res. 2010, 29: 120. 10.1186/1756-9966-29-120.
- Hallett RM, Dvorkin-Gheva A, Anita B, Hassell JA: A gene signature for predicting outcome in patients with basal-like breast cancer. Sci Reports. 2012, 2: 227.
- Hallett RM, Hassell JA: E2F1 and KIAA0191 expression predicts breast cancer patient survival. BMC Res Notes. 2011, 4: 95. 10.1186/1756-0500-4-95.
- Hallett RM, Pond G, Hassell JA: A target based approach identifies genomic predictors of breast cancer patient response to chemotherapy. BMC Med Genomics. 2012, 5: 16. 10.1186/1755-8794-5-16.
- Subramanian J, Simon R: Gene expression–based prognostic signatures in lung cancer: ready for clinical use? J Natl Cancer Inst. 2010, 102: 464-474. 10.1093/jnci/djq025.
- Wouters BJ, Löwenberg B, Delwel R: A decade of genome-wide gene expression profiling in acute myeloid leukemia: flashback and prospects. Blood. 2009, 113: 291-298.
- Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JG, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res. 2007, 13: 3207-3214. 10.1158/1078-0432.CCR-06-2765.
- Hatzis C, Pusztai L, Valero V, Booser DJ, Esserman L, Lluch A, Vidaurre T, Holmes F, Souchon E, Wang H, Martin M, Cotrina J, Gomez H, Hubbard R, Chacon JI, Ferrer-Lozano J, Dyer R, Buxton M, Gong Y, Wu Y, Ibrahim N, Andreopoulou E, Ueno NT, Hunt K, Yang W, Nazario A, DeMichele A, O’Shaughnessy J, Hortobagyi GN, Symmans WF: A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA. 2011, 305: 1873-1881. 10.1001/jama.2011.593.
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.
- Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002, 99: 6567-6572. 10.1073/pnas.082099299.
- West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci U S A. 2001, 98: 11462-11467. 10.1073/pnas.201162998.
- Venet D, Dumont JE, Detours V: Most random gene expression signatures are significantly associated with breast cancer outcome. Plos Comput Biol. 2011, 7: e1002240. 10.1371/journal.pcbi.1002240.
- Starmans MHW, Fung G, Steck H, Wouters BG, Lambin P: A simple but highly effective approach to evaluate the prognostic performance of gene expression signatures. Plos One. 2011, 6: e28320. 10.1371/journal.pone.0028320.
- Symmans WF, Peintinger F, Hatzis C, Rajan R, Kuerer H, Valero V, Assad L, Poniecka A, Hennessy B, Green M, Buzdar AU, Singletary SE, Hortobagyi GN, Pusztai L: Measurement of residual breast cancer burden to predict survival after neoadjuvant chemotherapy. J Clin Oncol. 2007, 25: 4414-4422. 10.1200/JCO.2007.10.6823.
- Koscielny S: Why most gene expression signatures of tumors have not been useful in the clinic. Sci Transl Med. 2010, 2: 14ps2.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.