VAN: an R package for identifying biologically perturbed networks via differential variability analysis
BMC Research Notes volume 6, Article number: 430 (2013)
Large-scale molecular interaction networks are dynamic in nature and are of special interest in the analysis of complex diseases, which are characterized by network-level perturbations rather than changes in individual genes/proteins. The methods developed for the identification of differentially expressed genes or gene sets are not suitable for network-level analyses. Consequently, bioinformatics approaches that enable a joint analysis of high-throughput transcriptomics datasets and large-scale molecular interaction networks for identifying perturbed networks are gaining popularity. Typically, these approaches require the sequential application of multiple bioinformatics techniques – ID mapping, network analysis, and network visualization. Here, we present the Variability Analysis in Networks (VAN) software package: a collection of R functions to streamline this bioinformatics analysis.
VAN determines whether there are network-level perturbations across biological states of interest. It first identifies hubs (densely connected proteins/microRNAs) in a network and then uses them to extract network modules (each comprising a hub and all its interaction partners). The function identifySignificantHubs identifies dysregulated modules (i.e. modules with changes in expression correlation between a hub and its interaction partners) using a single expression and network dataset. The function summarizeHubData identifies dysregulated modules based on a meta-analysis of multiple expression and/or network datasets. VAN also converts protein identifiers present in a MITAB-formatted interaction network to gene identifiers (UniProt identifier to Entrez identifier or gene symbol, using the function generatePpiMap) and generates microRNA-gene interaction networks using the TargetScan and Microcosm databases (generateMicroRnaMap). The function obtainCancerInfo is used to identify hubs (corresponding to significantly perturbed modules) that are already causally associated with cancer(s) in the Cancer Gene Census database. Additionally, VAN supports the visualization of changes to network modules in R and Cytoscape (visualizeNetwork and obtainPairSubset, respectively). We demonstrate the utility of VAN using gene expression data from metastatic melanoma and a protein-protein interaction network from the Human Protein Reference Database.
Our package provides a comprehensive and user-friendly platform for the integrative analysis of -omics data to identify disease-associated network modules. This bioinformatics approach focuses on explaining phenotype with a 'network type' and, in particular, on how regulation changes among different states of interest. It is relevant to many questions, including those related to network perturbations across developmental timelines.
Network-based approaches for analyzing -omics data are necessitated by the daunting intricacies of biomolecular systems and have the potential to quantifiably model large-scale functional networks. These approaches can be broadly divided into two categories: generating novel networks by analyzing -omics data [2–4] and using pre-defined networks to analyze -omics data (e.g., [5–7]). Our paper focuses on the latter approach and describes a method for the identification of network modules (i.e. a hub and all its interaction partners) that are perturbed across biological conditions. Hubs are of special relevance to medicine and disease because they are both the source of network robustness to failure and its weakness. Moreover, proteins with 6 to 38 interaction partners are frequently observed among existing cancer therapeutic targets. To better understand how protein networks may act to control biological responses, the ongoing development of tools to analyze relationships between network structure and function is important.
A network-based analysis of transcriptomics data differs from the more conventional analysis of transcriptomics data. The latter relies on methods that were developed for the identification of differentially expressed genes or gene sets. These methods evaluate changes in the expression levels of genes across biological states rather than changes in the strength of gene-gene correlations across biological states. Given that a disease is often a consequence of localized or large-scale perturbations in the strength of molecular interactions rather than changes in individual genes, there is a need for methods that focus on gene-gene correlations. Such methods examine whether, for a given network module, the average gene expression correlation between a hub (i.e. a densely connected node) and its interaction partners is the same in two biological conditions. This approach was initially proposed by Han and colleagues in yeast to measure the correlation (tightness) within a network module and was subsequently applied by Taylor et al. to analyse breast cancers in a study that also examined the prognostic utility of such modules. The current approach by Taylor et al. has two main limitations. First, it does not account for variability in the results owing to differences in the expression and/or network datasets used. Second, the test statistic is limited to the analysis of two biological conditions of interest. In addition to these issues, there is a growing need for user-friendly software, suitable for a broad community of end-users, for the implementation of network-centric analyses.
In this paper, we introduce a data analysis pipeline for the identification of dysregulated network modules using one or more transcriptomic datasets and molecular interaction networks. If multiple datasets or networks are provided as input to our pipeline, then we provide an end-user the option of integrating the results using meta-analysis approaches. We illustrate the benefit of our pipeline using a publicly available melanoma dataset and protein-protein interaction (PPI) dataset and identify hubs of potential relevance to melanoma biology.
Variability Analysis in Networks (VAN) provides a suite of tools for testing and visualizing the dysregulation of modules in molecular interaction networks (Figure 1; see also Additional file 1, VAN User Guide).
Significant network modules
VAN enables an integrative analysis of (a) gene expression data with PPI network data or (b) gene and microRNA expression data with microRNA-gene interaction network data using the function identifySignificantHubs. Firstly, VAN identifies the hubs for which the number of interaction partners present in both the network and expression data is greater than a user-defined threshold value; e.g., a hub may be defined as a gene/microRNA with at least five interaction partners. Each hub, along with its interaction partners, represents a network module. Secondly, VAN calculates the correlation of gene expression between the hub and each of its interaction partners in every biological state of interest. Thirdly, it generates the statistic for testing the null hypothesis that the average correlation is the same in all the biological states. For two conditions, the test statistic is similar to one defined previously; for multiple conditions, it is an F-statistic. Finally, VAN estimates the p-value for the test statistic using a permutation test, and the user can specify the number of permutations to be performed (refer to Additional file 1, VAN User Guide Section 11: Measures of association, for a detailed description of the test statistics and permutation tests). A small p-value provides evidence that a network module is dysregulated.
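The four steps above can be sketched compactly. The following is a minimal Python illustration of the general technique, not VAN's R implementation: the function names, the data layout (one list of expression values per gene, one condition label per sample) and the choice of the absolute difference in average Pearson correlation as the two-condition statistic are all assumptions made for illustration. The statistics VAN actually uses are documented in the User Guide.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # assumes neither sequence is constant


def find_modules(pairs, expr, min_partners=5):
    """Step 1: hubs with at least `min_partners` interaction partners that
    occur in both the network (`pairs`) and the expression data (`expr`)."""
    partners = {}
    for a, b in pairs:
        if a != b and a in expr and b in expr:
            partners.setdefault(a, set()).add(b)
            partners.setdefault(b, set()).add(a)
    return {h: sorted(p) for h, p in partners.items() if len(p) >= min_partners}


def mean_correlation(hub, partners, expr, idx):
    """Step 2: average hub-partner correlation over the samples in `idx`."""
    hub_vals = [expr[hub][i] for i in idx]
    cors = [pearson(hub_vals, [expr[g][i] for i in idx]) for g in partners]
    return sum(cors) / len(cors)


def module_statistic(hub, partners, expr, labels):
    """Step 3 (two conditions): absolute difference in average correlation."""
    first, second = sorted(set(labels))
    idx1 = [i for i, lab in enumerate(labels) if lab == first]
    idx2 = [i for i, lab in enumerate(labels) if lab == second]
    return abs(mean_correlation(hub, partners, expr, idx1)
               - mean_correlation(hub, partners, expr, idx2))


def permutation_p_value(hub, partners, expr, labels, n_perm=200, seed=1):
    """Step 4: p-value from shuffling the condition labels across samples."""
    rng = random.Random(seed)
    observed = module_statistic(hub, partners, expr, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if module_statistic(hub, partners, expr, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 8 samples in state "A" followed by 8 in state "B".
labels = ["A"] * 8 + ["B"] * 8
hub_profile = [1, 2, 3, 4, 5, 6, 7, 8]
symmetric = [[1, 2, 3, 4, 4, 3, 2, 1], [4, 3, 2, 1, 1, 2, 3, 4],
             [1, 3, 2, 4, 4, 2, 3, 1], [2, 1, 4, 3, 3, 4, 1, 2],
             [3, 1, 2, 5, 5, 2, 1, 3]]
expr = {"HUB": hub_profile + hub_profile}
for j, pattern in enumerate(symmetric, start=1):
    # Partners track the hub linearly in "A" and are uncorrelated with it in "B".
    expr[f"P{j}"] = [j * h + 1 for h in hub_profile] + pattern
pairs = [("HUB", f"P{j}") for j in range(1, 6)]

modules = find_modules(pairs, expr, min_partners=5)
observed, p_value = permutation_p_value("HUB", modules["HUB"], expr, labels)
```

In this toy example the observed statistic equals 1 by construction (perfect correlation in state "A", zero average correlation in state "B"). Adding 1 to both the numerator and the denominator keeps the estimated p-value above zero, so the smallest attainable value is 1/(n_perm + 1).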
Meta-analysis of network modules
VAN also enables the identification of the subset of modules for which dysregulation (in terms of gene expression correlation with interaction partners) is reproducible across independent cohorts and/or interaction networks (Additional file 1, VAN User Guide Section 5: Meta-analysis of multiple datasets, an example). There are many publicly available PPI and microRNA-gene interaction networks of varying quality and coverage. As such, VAN explicitly refrains from imposing a specific network on users. Given that there is an extensive and well-documented lack of overlap in the interaction information among network databases [14–18], the set of dysregulated modules is likely to vary from one network dataset to another, even for a single gene expression dataset. This necessitates a meta-analysis of the results obtained using multiple datasets to identify the candidate modules for downstream analysis and/or validation. VAN provides the function summarizeHubData for meta-analysis, and currently this function supports two methods: Fisher's combined test and RankProd. In Fisher's combined test, the overall p-value for a network module being dysregulated is computed using the test statistic $X^2 = -2\sum_{i=1}^{N} \ln p_i$, which is compared against a chi-squared distribution with $2N$ degrees of freedom. Here, $N$ denotes the number of transcriptomics dataset and interaction network combinations and $p_i$ denotes the p-value (for the module being dysregulated) obtained using the $i$th combination. In contrast, RankProd computes the overall p-value using the rank of the network module in each of the $N$ combinations; a network module that is consistently ranked high will have a low overall p-value.
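Fisher's combined test can be illustrated with a short, self-contained Python sketch (this is not the code behind summarizeHubData; the function name is hypothetical). It computes the standard statistic, minus twice the sum of the log p-values, and uses the closed-form upper tail of the chi-squared distribution for even degrees of freedom, so no statistics library is required.

```python
import math


def fisher_combined_p(p_values):
    """Combine p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-squared distribution
    with 2N degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has the closed form
        P(chi2_{2N} > x) = exp(-x/2) * sum_{k=0}^{N-1} (x/2)^k / k!
    Returns the pair (X, combined p-value).
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    n = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half_x / k  # (x/2)^k / k!, built up incrementally
        total += term
    return 2.0 * half_x, math.exp(-half_x) * total


# Hypothetical example: one module tested in two dataset/network combinations.
statistic, combined_p = fisher_combined_p([0.05, 0.05])
```

Fisher's method assumes that the combined p-values are independent, which is worth keeping in mind when several combinations reuse the same expression dataset with different networks.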
An integrative analysis of transcriptomics and PPI data requires the two types of data to map to a common identifier (Additional file 1, Section 6: Generating microRNA-target or protein-protein interaction interactome and Section 8: Conversion of gene symbols to Entrez IDs). In practice, many PPI networks are based on UniProt identifiers whereas transcriptomics data are based on Entrez identifiers or NCBI gene symbols. For ease of use, the VAN package automatically maps the various identifiers to one another. VAN provides two functions, generatePpiMap and generateMicroRnaMap, for creating the input interaction network data. The former function transforms PPI data available in MITAB-lite format (e.g., data downloaded from the protein interaction source iRefWeb), which contains the UniProt identifiers of interacting proteins, into hub-interactor pairs such that the pairs correspond to Entrez identifiers or gene symbols. The latter function generates microRNA-target gene pairs using the TargetScan [21, 22] or Microcosm [23, 24] databases.
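The identifier conversion described above amounts to a lookup followed by de-duplication. The Python sketch below shows the idea under stated assumptions: the function name, the three-entry lookup table, and the decisions to drop unmapped proteins and self-interactions and to emit each interaction in both directions are illustrative choices, not a description of what generatePpiMap does internally.

```python
def map_pairs(uniprot_pairs, id_map):
    """Translate UniProt interaction pairs into gene-level hub-interactor pairs.

    Pairs with an unmapped protein, self-interactions, and duplicates (after
    mapping) are dropped. Each interaction is emitted in both directions so
    that either gene can serve as the hub.
    """
    seen = set()
    for a, b in uniprot_pairs:
        gene_a, gene_b = id_map.get(a), id_map.get(b)
        if gene_a is None or gene_b is None or gene_a == gene_b:
            continue
        seen.add((gene_a, gene_b))
        seen.add((gene_b, gene_a))
    return sorted(seen)


# Illustrative inputs: a small UniProt-to-gene-symbol table and four
# interactions, one of which involves an accession missing from the table.
id_map = {"P04637": "TP53", "Q00987": "MDM2", "P38398": "BRCA1"}
uniprot_pairs = [("P04637", "Q00987"), ("Q00987", "P04637"),
                 ("P04637", "P38398"), ("P04637", "UNMAPPED")]
gene_pairs = map_pairs(uniprot_pairs, id_map)
```

Reporting how many pairs were dropped because an identifier could not be mapped is a useful addition in practice, since identifier loss directly changes which genes qualify as hubs.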
Network interpretation is greatly aided by visualization. Therefore, VAN provides the function visualizeNetwork for plotting the strength of correlation between a hub and each of its interactors via color-coded undirected edges (Figure 2). This tool is available for an analysis involving two conditions (Additional file 1, VAN User Guide Section 4: Option 1: Visualization in R).
For a global visualization of networks of interest, VAN also provides an output file that is directly importable into Cytoscape, a popular network visualization tool. To aid with network comprehension in Cytoscape, we also provide an example layout and a 'color-blind safe' edge palette (created with the aid of ColorBrewer 2.0 and Color Universal Design). The edge palette is supplied as a 'Vizmap property file' (ExampleVisualStyle.props) and can be directly imported into Cytoscape, as described in the VAN User Guide (Additional file 1). This tool is applicable to the visualization of two or more conditions of interest (Additional file 1, VAN User Guide Section 4: Option 2: Visualization in Cytoscape).
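Cytoscape can import networks from delimited text tables in which each row names a source node, a target node and optional edge attributes. As a rough illustration of the kind of file such an export could produce, the Python sketch below writes one row per hub-interactor pair with one correlation column per biological state. The column names, function name and values are hypothetical and do not reproduce the actual layout of VAN's output file.

```python
import csv
import io


def edge_table(correlations, states):
    """Return a tab-delimited edge table: one row per hub-interactor pair,
    one correlation column per biological state."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["hub", "interactor"] + [f"cor_{s}" for s in states])
    for (hub, partner), by_state in sorted(correlations.items()):
        writer.writerow([hub, partner] + [f"{by_state[s]:.3f}" for s in states])
    return buffer.getvalue()


# Hypothetical correlations for one hub in two outcome groups.
correlations = {
    ("HUB1", "GENE_A"): {"good": 0.82, "poor": 0.10},
    ("HUB1", "GENE_B"): {"good": -0.35, "poor": 0.41},
}
table = edge_table(correlations, ["good", "poor"])
```

Keeping one column per state, rather than a single difference column, lets the same table drive a separate edge-color mapping for each condition.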
Extended analysis of cancer datasets
The function obtainCancerInfo is used to map the hubs from significantly perturbed network modules to the Cancer Gene Census, an externally curated catalogue of genes that are already causally associated with cancer(s). Section 6 of the VAN user guide (Combining output data with known cancer annotation) provides additional details of this aspect of the software, which is also illustrated in the example below.
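Conceptually, this annotation step is a filter on the module p-values followed by a membership check against the census gene list. A minimal Python sketch follows; the function name, the p-values and the three-gene census subset are hypothetical stand-ins for illustration, and the real Cancer Gene Census must be obtained from its maintainers.

```python
def annotate_hubs(hub_p_values, census_genes, alpha=0.05):
    """Return (hub, p-value, in_census) for hubs with p < alpha,
    ordered by p-value and then by hub name."""
    rows = [(hub, p, hub in census_genes)
            for hub, p in hub_p_values.items() if p < alpha]
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Hypothetical module p-values and a small stand-in for the census gene list.
hub_p_values = {"VHL": 0.004, "GENE_X": 0.020, "RET": 0.031, "GENE_Y": 0.400}
census_genes = {"VHL", "RET", "CCND2"}
annotated = annotate_hubs(hub_p_values, census_genes)
census_hits = [hub for hub, _, in_census in annotated if in_census]
```

Matching on gene symbols, as done here, is sensitive to synonyms and outdated symbols; matching on a stable identifier such as the Entrez ID avoids that problem when both sources provide one.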
An example PPI network analysis
We used VAN to identify potentially dysregulated modules in relation to patient clinical outcome in disseminated melanoma. For this purpose, we analyzed a publicly available metastatic melanoma gene expression dataset in the context of a PPI network from the Human Protein Reference Database (HPRD, Release 9, April 13, 2010). The gene expression data corresponded to 45 patients who were split into two groups based on survival time (greater than four years and less than one year). The HPRD PPI network was manually filtered to include only direct, physical PPIs where data were strictly taken only from normal human tissues, denoted as in vivo (vv). The VAN-based analysis comprised multiple steps. Firstly, we identified hubs and their interaction partners that were present in both the expression and network datasets. Secondly, we evaluated the modules (hubs with at least five interaction partners) for potential dysregulation. Of the 1649 modules evaluated, 81 were potentially dysregulated (p-value < 0.05). The resulting network was visualized using two separate platforms, R (Figure 2) and Cytoscape (7) (Figure 3), both of which are facilitated by VAN (see Methods). Finally, we searched the Cancer Gene Census database to determine whether one or more of the 81 hubs (associated with the dysregulated modules) have previously been causally implicated in cancer(s). We observed that 10 of the 81 hubs (CCND2, CCND3, FANCA, FANCD2, GATA2, KIF5B, LMO2, RET, VHL and WAS; Table 1) have at least one such relationship. Thus, the combination of prior knowledge about mutations with global gene expression data and PPI networks can generate a biologically meaningful context, in this case by pointing to where further mutation discovery could be focused.
In addition to the analysis performed herein, further examples of each of VAN's functions are provided in Additional file 1 as part of the VAN User Guide. Section 2 of the user guide describes a number of example gene expression, interactome and VAN output datasets (also refer to Section 7 of the user guide for input data formats). Section 3 of the user guide contains the R code for: 1) analyzing gene expression data (comprising two conditions) with a PPI dataset; 2) analyzing gene expression and microRNA expression data (comprising two conditions) with a microRNA-target interactome; and 3) analyzing gene expression data (comprising more than two conditions) with a PPI dataset or microRNA-target interactome. Section 5 of the user guide contains example code for meta-analysis.
Integrative -omics approaches are increasingly popular and will become standard practice in the analysis of complex diseases. Our open source R software package, VAN, provides one possible application of this paradigm. By integrating -omics data with network and mutation data, VAN has the potential to identify network modules (and hubs) of biological relevance to complex human diseases. Although the resulting models of network module dysregulation are largely explanatory/descriptive rather than mechanistic, they do have the potential to highlight dysfunctional pathways, network-centric candidate biomarkers, and/or therapeutic target networks [1, 30]. Given that VAN enables the testing of modules for dysregulation based on two or more conditions, it is also suitable for the examination of changes across developmental timelines.
Availability and requirements
Package name: VAN.
Package repository: SourceForge.
URL for downloading the package: https://sourceforge.net/p/variabilityanalysisinnetworks/wiki/Home/
Operating system(s): Platform independent.
Programming language: R.
Other requirements: R version 2.15.1 or higher.
License: GNU Lesser GPL.
Any restrictions to use by non-academics: None.
The SourceForge project repository for VAN contains:
User guide: A step-by-step guide for installing and executing the R package. The user guide also contains example code for data analysis and visualization.
Package functions: A pdf file containing an exhaustive list of all the functions (along with their input and output parameters) available for data analysis and visualization.
R package: The R packages are available for execution on Microsoft Windows©, UNIX, and Mac OS X.
Example dataset: A collection of input and output files for executing the data analysis and visualization examples provided in the user guide.
Source code: VAN’s source code.
VAN: Variability Analysis in Networks.
Ideker T, Krogan NJ: Differential network biology. Mol Syst Biol. 2012, 8: 1-9.
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007, 5: 54-66. 10.1371/journal.pbio.0050054.
Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006, 7: S7.
Yeung KY, Dombek KM, Lo K, Mittler JE, Zhu J, Schadt EE, Bumgarner RE, Raftery AE: Construction of regulatory networks using expression time-series data of a genotyped population. Proc Natl Acad Sci USA. 2011, 108: 19436-19441. 10.1073/pnas.1116442108.
Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL: Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol. 2009, 27: 199-204. 10.1038/nbt.1522.
Davis MJ, Shin CJ, Jing N, Ragan MA: Rewiring the dynamic interactome. Mol Biosyst. 2012, 8: 2054-2066. 10.1039/c2mb25050k.
Yao C, Li H, Zhou C, Zhang L, Zou J, Guo Z: Multi-level reproducibility of signature hubs in human interactome for breast cancer metastasis. BMC Syst Biol. 2010, 4: 151. 10.1186/1752-0509-4-151.
Taylor IW, Wrana JL: Protein interaction networks in medicine and disease. Proteomics. 2012, 12: 1706-1716. 10.1002/pmic.201100594.
Hase T, Tanaka H, Suzuki Y, Nakagawa S, Kitano H: Structure of protein interaction networks and their implications on drug design. PLoS Comput Biol. 2009, 5: e1000550. 10.1371/journal.pcbi.1000550.
Efron B, Tibshirani R: On testing the significance of sets of genes. Ann Appl Stat. 2007, 1: 107-129. 10.1214/07-AOAS101.
Barabasi A-L, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011, 12: 56-68. 10.1038/nrg2918.
Han J-DJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM, Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004, 430: 88-93. 10.1038/nature02555.
Schramm S-J, Jayaswal V, Goel A, Li SS, Yang YH, Mann GJ, Wilkins MR: Molecular interaction networks for the analysis of human disease: utility, limitations, and considerations. Proteomics. 2013, Accepted
Mathivanan S, Periaswamy B, Gandhi T, Kandasamy K, Suresh S, Mohmood R, Ramachandra Y, Pandey A: An evaluation of human protein-protein interaction data in the public domain. BMC Bioinformatics. 2006, 7: S19.
De Las Rivas J, Fontanillo C: Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010, 6: e1000807. 10.1371/journal.pcbi.1000807.
Koh GCKW, Porras P, Aranda B, Hermjakob H, Orchard SE: Analyzing protein–protein interaction networks. J Proteome Res. 2012, 11: 2014-2031. 10.1021/pr201211w.
Kirouac D, Saez-Rodriguez J, Swantek J, Burke J, Lauffenburger D, Sorger P: Creating and analyzing pathway and protein interaction compendia for modelling signal transduction networks. BMC Syst Biol. 2012, 6: 29. 10.1186/1752-0509-6-29.
Janjić V, Pržulj N: Biological function through network topology: a survey of the human diseasome. Brief Funct Genomics. 2012, 11: 522-532. 10.1093/bfgp/els037.
Hong FX, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006, 22: 2825-2827. 10.1093/bioinformatics/btl476.
Turner B, Razick S, Turinsky AL, Vlasblom J, Crowdy EK, Cho E, Morrison K, Donaldson IM, Wodak SJ: iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence. Database. 2010, 2010: 1-15.
Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP: MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell. 2007, 27: 91-105. 10.1016/j.molcel.2007.06.017.
Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are MicroRNA targets. Cell. 2005, 120: 15-20. 10.1016/j.cell.2004.12.035.
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006, 34: D140-D144. 10.1093/nar/gkj112.
Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008, 36: D154-D158. 10.1093/nar/gkn221.
Fung DCY, Li SS, Goel A, Hong S-H, Wilkins MR: Visualization of the interactome: what are we looking at?. Proteomics. 2012, 12: 1669-1686. 10.1002/pmic.201100454.
Smoot ME, Ono K, Ruscheinski J, Wang P-L, Ideker T: Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011, 27: 431-432. 10.1093/bioinformatics/btq675.
Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer. 2004, 4: 177-183. 10.1038/nrc1299.
Mann GJ, Pupo GM, Campain AE, Carter CD, Schramm S-J, Pianova S, Gerega SK, De Silva C, Lai K, Wilmott JS, et al: BRAF mutation, NRAS mutation, and the absence of an immune-related expressed gene profile predict poor outcome in patients with stage III melanoma. J Invest Dermatol. 2013, 133: 509-517. 10.1038/jid.2012.283.
Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al: Human protein reference database—2009 update. Nucleic Acids Res. 2009, 37: D767-D772. 10.1093/nar/gkn892.
Liu ZP, Wang Y, Zhang XS, Chen L: Network-based analysis of complex diseases. Syst Biol, IET. 2012, 6: 22-33. 10.1049/iet-syb.2010.0052.
We thank Drs David C. Y. Fung and Chi Nam Ignatius Pang for their invaluable assistance in redacting the PPI networks. We thank Natalie Twine and Professor Moustapha Kassem for their generous provision of time and data to perform independent tests of our software package.
This work was supported by the Australian Research Council Postdoctoral Fellowship (grant # DP1095320 to VJ); The University of Sydney Postgraduate Award and Sydney Medical School Research Top-up Scholarship (to SJS); Australian Research Council Future Fellowship (grant # FT0991918 to YHY); National Health and Medical Research Council of Australia (program grant # 633004 to GJM); Cancer Institute of New South Wales (translational program grant # 10TPG/1/02 to GJM); Education Investment Fund Super Science Scheme (to MRW); and New South Wales State Government Science Leveraging Fund (to MRW).
The authors declare that they have no competing interests.
Conception, design, programming and testing of VAN (VJ/YHY); Major contributions to the design and testing of VAN (VJ/SJS); Writing the manuscript (VJ/SJS/MW); Writing the User Guide (VJ/SJS); Figure preparation and development (VJ/SJS/YHY/MW); Administrative tasks (VJ/SJS); Supervision of data analysis (YHY); Overall scientific direction of the project and final approval of the manuscript (YHY/MW/GJM).
Jayaswal, V., Schramm, SJ., Mann, G.J. et al. VAN: an R package for identifying biologically perturbed networks via differential variability analysis. BMC Res Notes 6, 430 (2013). https://doi.org/10.1186/1756-0500-6-430