These results demonstrate that pathways related to cancer can be readily generated from sets of genes selected at random from the gene list of a standard Affymetrix microarray chip. Empirical testing reveals that pathways databases can amplify the misinformation resulting from false discovery by generating plausible mechanisms to support the results. Beyond the realistic pathways themselves, support from literature associated with cancer enhances the potential for propagation of misinformation. As knowledge in the domain of pathways increases, more genes are assigned to networks and the probability of generating a network increases. At the same time, the number of publications relating genes to cancer is increasing, so the probability of finding a paper on cancer that includes a gene listed as part of the “discovered” pathway is also increasing over time. The old adage that the more you know, the less you realize that you know can be modified to say that the more we know, the more likely we are to be misinformed by our errors.
It has been pointed out that pathway modeling is one of the most active areas of data analysis for high-throughput data[12]. Overfitting is a problem in statistical analysis of high-throughput data because there are frequently fewer test subjects than measurements. In addition to this imbalance, the complex algorithms used in bioinformatics can adapt to random noise in data just as they do to actual patterns[13]. Although techniques for filtering noise, determining significance, and accounting for large amounts of data have proven useful, even the smallest p-values used in studies can produce substantial chance error simply because the genome is so large[14].
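The scale of this chance error can be illustrated with simple arithmetic. The sketch below is not part of the analysis in this study; the function names and the example figures (20,000 tests, a threshold of 0.001) are hypothetical, and it assumes that every test is independent and that no gene is truly associated with the outcome.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of tests passing the threshold by chance alone.

    Assumes every null hypothesis is true (no real signal anywhere).
    """
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Chance that at least one test passes the threshold by chance.

    Assumes the tests are independent, which real expression data
    only approximates.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    num_tests = 20_000
    alpha = 0.001
    print("expected chance hits:", expected_false_positives(num_tests, alpha))
    print("P(at least one):", prob_at_least_one_false_positive(num_tests, alpha))
```

Under these assumptions, the expected number of chance hits grows linearly with the number of measurements, so a threshold that looks strict for a single test can still admit enough genes to seed a pathway search.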
This means that false gene discovery, combined with the expansion of information in pathways databases and literature search engines, can lead to the propagation of misinformation. In other words, if the input information is incorrect, there may be a relatively high chance of obtaining output that appears to support it. Although we have used Metacore’s on-line tool to demonstrate the problems that can arise with pathways searches, we expect this problem to apply to other pathways databases as well. For example, Ingenuity Path Designer Graphical Representation, like Metacore, uses literature sources to generate edges between members of a pathway. Similarly, Ariadne generates pathways from a database of relationships (ResNet Explore). In general, the probability of detecting a network or pathway increases with the size of the knowledge base of interactions in the databases used to generate it. This clearly presents a danger to research, especially to the field of personalized medicine as applied to cancer. While high-throughput technologies have opened the door to novel discoveries for personalized medicine, they can also yield apparent discoveries with plausible mechanisms that are easily generated even if the genes of interest are randomly selected. Misleading hypotheses generated from analyses of high-throughput data are likely to be amplified by both pathway databases and reference libraries as investigators struggle to find significant results amongst all of this information.
Even though further text mining tools[15] and systematic use of keywords[16] might help filter unrelated retrievals, the problem of reviewing the literature in detail remains daunting considering the number of articles retrieved with the Boolean statements submitted. Because cancer uses normal pathways even though it dysregulates their activity (making them overactive or underactive), many genes are expected to have an association with cancer even if they are not causative. Therefore, future work needs to focus on minimizing misleading results and enriching for significant ones. Better methods in the initial analysis to reduce false positive results, such as randomizing classes and repeating the data analysis, also help to reduce false pathway information. In addition, hypothesized pathways can be supported with some statistical analyses. For example, if “k” nodes of pathway “A” are in the input file to the pathways database, how frequently would any combination of “k” nodes from pathway “A” be randomly selected from the complete data set (e.g. the probe set for the experiment)? The percentage of a pathway’s membership that is selected in the initial data analysis as input is also important. The higher the percentage of a pathway’s nodes that are selected as statistically meaningful in the data analysis that generates the input nodes, the more likely it is that the pathway is operating in the data. For example, if two of ten total nodes for pathway “A” are selected in the data analysis, the likelihood that the pathway is meaningful is less than if all ten of the nodes were selected.
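The overlap question posed above can be framed as a hypergeometric tail probability: the chance that a random gene list of the same size would contain at least “k” members of pathway “A”. The following is a minimal sketch, not the analysis used in this study. The function names and the example figures (a 20,000-probe set and a 100-gene input list) are hypothetical, and the sketch assumes genes are drawn without replacement and with equal probability from the probe set. Using the upper tail ("at least k") rather than exactly k is also an assumption.

```python
from math import comb


def pathway_overlap_pvalue(universe_size: int, pathway_size: int,
                           input_size: int, overlap: int) -> float:
    """P(at least `overlap` pathway members in a random input list).

    universe_size: genes that could have been selected (e.g. the probe set)
    pathway_size:  nodes in the pathway that are present in the universe
    input_size:    genes submitted to the pathways database
    overlap:       submitted genes that fall in the pathway ("k")
    """
    if not 0 <= pathway_size <= universe_size:
        raise ValueError("pathway_size must lie between 0 and universe_size")
    if not 0 <= input_size <= universe_size:
        raise ValueError("input_size must lie between 0 and universe_size")

    max_overlap = min(pathway_size, input_size)
    total = comb(universe_size, input_size)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, input_size - i)
        for i in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / total


def pathway_coverage(overlap: int, pathway_size: int) -> float:
    """Fraction of the pathway's nodes that appear in the input list."""
    return overlap / pathway_size


if __name__ == "__main__":
    # Hypothetical figures: 20,000 probes, a 10-node pathway, 100 input genes.
    for k in (2, 10):
        p = pathway_overlap_pvalue(20_000, 10, 100, k)
        print(f"k={k}: coverage={pathway_coverage(k, 10):.0%}, p={p:.3g}")
```

A small tail probability indicates only that the overlap is unlikely under random selection. It does not show that the pathway is causal, and it does not correct for testing many pathways at once.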
Certainly, experiments are required to validate a hypothesized pathway, and all pathway information should be treated with caution as a hypothesis until it has been validated. Labeling a gene or pathway as definitively causal will require external validation in a laboratory setting, such as up- or down-regulation within engineered cell lines. In short, the finding of a network that integrates gene discovery into an acceptable hypothesis and relevant disease-related literature should not, by itself, be considered strong supporting information.