VAN: an R package for identifying biologically perturbed networks via differential variability analysis

Background Large-scale molecular interaction networks are dynamic in nature and are of special interest in the analysis of complex diseases, which are characterized by network-level perturbations rather than changes in individual genes/proteins. The methods developed for the identification of differentially expressed genes or gene sets are not suitable for network-level analyses. Consequently, bioinformatics approaches that enable a joint analysis of high-throughput transcriptomics datasets and large-scale molecular interaction networks for identifying perturbed networks are gaining popularity. Typically, these approaches require the sequential application of multiple bioinformatics techniques – ID mapping, network analysis, and network visualization. Here, we present the Variability Analysis in Networks (VAN) software package: a collection of R functions to streamline this bioinformatics analysis. Findings VAN determines whether there are network-level perturbations across biological states of interest. It first identifies hubs (densely connected proteins/microRNAs) in a network and then uses them to extract network modules (comprising of a hub and all its interaction partners). The function identifySignificantHubs identifies dysregulated modules (i.e. modules with changes in expression correlation between a hub and its interaction partners) using a single expression and network dataset. The function summarizeHubData identifies dysregulated modules based on a meta-analysis of multiple expression and/or network datasets. VAN also converts protein identifiers present in a MITAB-formatted interaction network to gene identifiers (UniProt identifier to Entrez identifier or gene symbol using the function generatePpiMap) and generates microRNA-gene interaction networks using TargetScan and Microcosm databases (generateMicroRnaMap). The function obtainCancerInfo is used to identify hubs (corresponding to significantly perturbed modules) that are already causally associated with cancer(s) in the Cancer Gene Census database. Additionally, VAN supports the visualization of changes to network modules in R and Cytoscape (visualizeNetwork and obtainPairSubset, respectively). We demonstrate the utility of VAN using a gene expression data from metastatic melanoma and a protein-protein interaction network from the Human Protein Reference Database. Conclusions Our package provides a comprehensive and user-friendly platform for the integrative analysis of -omics data to identify disease-associated network modules. This bioinformatics approach, which is essentially focused on the question of explaining phenotype with a 'network type’ and in particular, how regulation is changing among different states of interest, is relevant to many questions including those related to network perturbations across developmental timelines.


Section 3: Example analysis
For all the analyses, we assume that example dataset (Section 2) is saved in the folder C:/My_Packages. We also assume that the VAN package has been loaded and the current working directory has been set appropriately as shown below -Load the VAN package and set the working directory At the R command prompt type setwd("C:/My_Packages") ## The above command is for Windows users ## Unix and Mac users should set the working directory using the appropriate file path syntax library(VAN) The library command loads VAN and some additional packages (refer Section 1: Installation Steps) into memory. If these packages were developed using a version of R that is different from the version installed on the user's computer, warning messages may be generated. However, these warning messages will not affect the execution of the program.

Combining gene expression and microRNA expression data with a microRNAtarget interactome -Multiple conditions
If multiple conditions are to be evaluated for a combination of gene and microRNA expression data, then the exprFile input parameter should contain two file names, as shown in Example 2.
Note: In the above examples we set randomizeCount=10 for quick execution. We note that this parameter should be set to 1000 during an actual analysis. For an explanation of the output files, refer Section 9.

Section 4: Network Visualization using R and Cytoscape -An example
We provide two options for visualizing the changes in associations between a protein/microRNA hub and its interactors. Typically, the input file for data visualization will correspond to the output correlation file generated by the function identifySignificantHubs (Section 3). For example, if the user was interested in visualizing the hub-interactor correlations corresponding to the analysis performed in Example 1 of Section 3, then the input file will be Test_Output_PPI_Cor.txt. However, to illustrate our data visualization function, we consider input files that are already available in the Example dataset (Section 2).

Option 2: Visualization in Cytoscape
For visualization in Cytoscape, we can use (i) the output file containing the hubinteractor pairs or (ii) a subset of this file. The advantage of using the subset is that it allows us to focus on only the significant hubs. To obtain the subset, at the R command prompt type obtainPairSubset(filePrefix="Gene_Output_1" , useAdjustedProb=FALSE , probThresh=0.05) The above command generates an output file "Gene_Output_1_Cor_Signif.txt". This file contains the hub-interactor pairs for only those hubs which have an unadjusted p-value less than 0.05 We provide an example layout file and a 'color-blind safe' edge palette (created with the aid of http:www.//colorbrewer2.org/ and http://jfly.iam.u-tokyo.ac.jp/color/) for visualization using Cytoscape. For an analysis involving two conditions (e.g., StateA and StateB), the steps to be followed are: 1. Download and install Cytoscape from the website http://www.cytoscape.org/download.html 2. From the Cytoscape Desktop menu bar select "File", "Import", "Network from Table  (Text/MS Excel)" 3. At the "Import Network and Edge Attributes from Table" window: a. Select the input file: this will be the VAN output file Gene_Output_1_Cor_Signif.txt b. At the "Advanced" options, check "Show Text File Import Options" c. At the "Text File Import Options, uncheck the "Delimiter" "Space" d. At "Attribute Names" check "Transfer first line as attribute names" e. At the "Interaction Definition" options, select "Column 1" as the Source Interaction and Column 2 as the Target Interaction f. At the "Preview" click on Columns 3 to 5 to activate import of these data 4. From the menu bar select "Layout", "Cytoscape Layouts", "Circular Layout" 5. From the menu bar select "File", "Import", "Vizmap Property File)" and select the input file: "ExampleVisualStyle.props" 6. At the Cytoscape Desktop Control Panel, click on the VizMapper TM tab and select "example visual style" as the Current Visual Style 7. At the Visual Mapping Browser, select "StateA" in the drop down box at the "Edge Color" field. Next, select "StateB" to visualize how the network co-expression organisation varies as a function of the second condition (StateB) compared with the first (StateA).
For multiple conditions, the above procedure is followed with additional Columns activated at the data import step (refer to 3f, above). This should enable multiple States to be available for viewing as described at 7, above.

Section 5: Meta-analysis of multiple datasets -An example
To combine the results obtained using multiple datasets, we have implemented two meta-analysis methods -Fisher's combined test and RankProd [3]. Since both methods assume independence of datasets, the output p-values should be interpreted with caution if the same expression dataset was combined with multiple protein or microRNA interactomes.
Typically, the input files will correspond to the output files generated by the function identifySignificantHubs for different interaction networks or expression datasets. However, to illustrate the meta-analysis feature of our package, we use the output files provided in the Example dataset (Section 2). At the R command prompt type inputFileVect = c("Gene_Output_1.txt", "Gene_Output_2.txt", "Gene_Output_3.txt") ## Fisher's combined test summarizeHubData(fileNames=inputFileVect , outFile="Summary_Mann_Fisher.txt" , metaAnalysis="Fisher") ## RankProd ## This is the RankProd implementation provided in the R package MADAM (Version ## 1.1) summarizeHubData(fileNames=inputFileVect , outFile="Summary_Mann_RP.txt" , metaAnalysis="RankProd" , rankProdItr=10) In the above example we set rankProdItr=10 for quick execution. We note that this parameter should be set to 1000 during an actual analysis and for an explanation of the output files, refer Section 9. To illustrate this, we use the example file "MiTabLite_Example.txt" described in Section 2. At the R command prompt type generatePpiMap("MiTabLite_Example.txt", "Test_PPI") b. MicroRNA-target interactions: To generate an input microRNA interactome file corresponding to TargetScan [4], at the R command prompt type generateMicroRnaMap("Targetscan", "TS_Mirnome") Similarly, to generate the mapping corresponding to MicroCosm [5], at the R command prompt type generateMicroRnaMap("Microcosm", "MC_Mirnome") In each of the three instances, two output files are generated -one with suffix "Entrez" and the other with suffix "Symb", e.g. Test_PPI_Entrez.txt and Test_PPI_Symb.txt The former contains the hub-interactor pairs as Entrez IDs and the latter as gene symbols. During the generation of the interactome files, some error files may also be generated (refer Section 8).
The function generateMicroRnaMap relies on the mapping information available in the package VANData. This package is updated periodically to correspond to the most recent releases of TargetScan and MicroCosm databases.
As an alternative to the example input interactomes, a user may provide a query network of his own e.g., based on a set of genes of interest. We note that considerations for network selection are discussed in detail in [6], which also provides several selected examples of interactomes and sub-networks generated by other groups using a multitude of methods including hypothesis-driven and meta-mining approaches e.g., [7][8][9][10][11][12][13][14][15].

Section 7: Understanding input data and parameters
Input file formats a. Expression data: VAN does not provide any functions for preprocessing user expression data, such as data normalization to remove systematic variation, or other potential steps including data transformation and/or filtering. These functions are widely available elsewhere and users should provide, as input to VAN, an appropriately normalized and filtered data.
The input expression data file should be a tab-separated text file. The first column should correspond to gene/microRNA names and the remaining columns should correspond to expression values ( Figure 1). Therefore, if there are N samples, the number of columns in the input file should be N + 1. Each gene or microRNA name should occur exactly once in the tab-separated file and the gene names could be provided as Entrez IDs or gene symbols. The file should also contain labels for grouping the N samples. For example, if the N samples correspond to two physiological states, say A and B, then an appropriate label should be assigned to each of the N samples. The keyword LABEL_END is used to separate the labels from the actual expression values and allows the user to provide, in a single expression data file, multiple ways of grouping the samples (Figure 2). For example, let us assume that the N samples can be grouped based on disease status (say StateA and StateB) and mutation (say Wt and Mut) and we are interested in identifying the enriched hubs in both cases. Instead of generating two gene expression data files, one with labels corresponding to disease status and the other with labels corresponding to mutation status, we can provide the two sets of labels in the same expression data file. Figure 1: An example input expression data file. Each column represents a sample and has an associated label (StateA or StateB in the present case). The row that contains the keyword LABEL_END should not have any other value. This can be achieved in two ways based on the software used to generate the expression data file -(i) if the data file was generated in MS-Excel, then the remaining cells in the row corresponding to LABEL_END should be left empty, (ii) if the data was generated using a text editor, then sample labels Blank cell the carriage return (i.e. the key labeled "Enter" on the keyboard) should be pressed immediately after LABEL_END. Figure 2: An example input expression data file with two sets of labels b. Interactome data: The interactome data file is tab-separated and has two columnsthe first column corresponds to hubs and the second column to interactors. The first row of the data file is assumed to be the header and is ignored while reading the data. For protein interactomes, the names of hubs and their interactors can be provided as Entrez IDs or gene symbols but not as a combination of the two. For microRNA interactomes, the names of interactors can be provided as Entrez IDs or gene symbols.

Input parameters
We assume that a network module comprises two types of genes -hub and interactor. The hub gene is connected to all the interactor genes and every interactor gene is connected to only the hub gene. The main function for identifying the enriched network modules is identifySignificantHubs. This function has 14 input parameters and a detailed description of the parameters is provided in the PDF file VAN_Package_Functions.pdf. Of the 14 input parameters, only four have to be explicitly specified by the user -exprFile, labelIndex, mapFile, and outFile. The default values for the remaining 10 parameters need to be changed only under certain conditions and below we describe those conditions.

Parameter
Condition where the default value should be changed hubSize = 5 By default, only those network modules that have at least 5 interactors in the expression data set are considered for downstream analysis. This value can be changed to consider more dense or sparse modules. randomizeCount = 1000 By default 1000 permutations are performed to determine the p-value. The user can select a higher number of permutations, however, it should be noted that an increase in the number of permutations will increase the execution time. The number of permutations can also be lowered but is not recommended. adjustMethod = "BH" By default, the p-values for modules are adjusted using the Benjamimi-Hochberg (or false discovery rate) adjustment method [16], as implemented in R. Alternatively, the bonferroni adjustment may be used. assocType = "TCC" If the number of conditions is two, then the association between a hub-interactor pair can be measured in two ways (refer Section 10). The default option is "TCC" [17] and the other available option is "PCC" (short for Pearson's correlation coefficient) If the number of conditions is more than two, then this parameter must be set to "FSTAT". labelVect = NULL By default, this value is set to NULL and implies that all the N samples in the expression data are used for measuring association.
Sometimes one may be interested in evaluating only a subset of conditions present in the expression data. For example, the N samples in the expression data may correspond to Stage 1, Stage 2, and Stage 3 of cancer and one may be interested in only evaluating the samples that correspond to stages 1 and 2. Rather than creating a new expression data file with just two stages, we can use the existing expression data file and provide (as input) the vector of only those conditions that need to be evaluated. exprDataType = "SYMB" By default, the gene names in the expression data file are assumed to correspond to gene symbols. However, if the gene names correspond to Entrez IDs, this parameter should be changed to "ENTREZ". It should be noted that if two expression data files are provided (one for gene expression and another for microRNAs), then both should contain the same type of gene names i.e. IDs should not be mixed. ppiDataType = "SYMB" By default, the gene names in the interactome data file are assumed to be gene symbols. If that is not the case, this parameter should be set to "ENTREZ". outputDataType = "SYMB" By default, the output files save the hub and interactors as gene symbols. However, the user can choose to save the two as Entrez IDs by setting this parameter to "ENTREZ".

Species = "Human"
Currently only human is supported.

inputCores = 4
This denotes the number of microprocessor cores that are available for executing the code in parallel. The number is decreased automatically if fewer than four cores are available. However, if more than four cores are available, then this number can be increased explicitly.
We note that during the conversion of gene symbols to Entrez Ids, some error files may be generated as described in the next section.

Section 8: Conversion of gene symbols to Entrez IDs
Expression data If the expression data and interactome data contain gene labels in different formats, i.e. one corresponds to Entrez IDs and the other to gene symbols, then the gene symbols are mapped to Entrez IDs. For the mapping process, we utilize the Bioconductor [18] annotation files and, in some instances, the gene symbol to Entrez ID mapping is unavailable. The list of gene symbols that could not be mapped to Entrez IDs is saved in the following filesa. Error_Expr.txt: This file contains the gene symbols that were present in the expression data but could not be mapped to Entrez IDs.
b. Error_PPI_Hubs.txt: This file contains the hubs for which the gene symbols could not be mapped to Entrez IDs.
c. Error_PPI_Int.txt: This file contains the interactors for which the gene symbols could not be mapped to Entrez IDs

Interactome data
The functions generatePpiMap and generateMicroRnaMap are used to generate the interactome data files. These functions return the hub-interactor pairs in two formatsone corresponding to Entrez IDs and the other to gene symbols. As mentioned earlier, in some instances, the Entrez ID to gene symbol mapping is unavailable and the Entrez IDs that could not be mapped to gene symbols are saved in the following filesa. Error_Mirnome_Generation.txt: This file contains the microRNA-gene pairs for which the Entrez IDs could not be mapped to gene symbols b. Error_PPI_Generation.txt: This file contains the gene-gene pairs (hub-interactor pairs) for which the Entrez IDs (for at least one of the genes in the pair) could not be mapped to gene symbols.

Section 9: Understanding output data
All the output data files are tab-separated and can be viewed using a text editor or MS Excel.

Enriched Modules
The function identifySignificantHubs is used to evaluate the network modules (i.e. hubs and their associated interaction partners) for a given combination of expression and interactome data. This function generates two output files -one containing the p-values and the second containing the hub-interactor associations in each of the conditions. While the name of the first file is explicitly provided as an input parameter by the user, the name of the second file is obtained by adding the suffix "_Cor" to the first file.
To visualize the changes in association between a hub protein/microRNA and its interactors using our R function, the second file ("_Cor") is used. We note that only two conditions can be visualized simultaneously using our R function (refer Section 4). Therefore, if the output file corresponds to multiple conditions, then the two conditions to be plotted have to be explicitly specified by the user (refer Section 4, Option 1). To visualise global changes among significantly enriched protein/microRNA hubs and their interactors, using the Cytoscape [2] program, the second file ("_Cor") is filtered (refer Section 4, Option 2). Both the two-and multiple condition files are suitable for upload and visualisation using Cytoscape.

Meta-analysis
The function summarizeHubData is used to perform meta-analysis and aggregate the results obtained using multiple datasets. The meta-analysis output file contains all the modules that were tested for enrichment in at least one of the datasets.
Fisher's combined test: If the meta-analysis method is set to "Fisher", then for a given module, the Fisher's combined test p-value is calculated using only those datasets in which the module was present. The combined p-value is obtained using the unadjusted p-values (for the module) in the individual datasets. This combined p-value is saved as the last column of the meta-analysis output file. We note that the combined p-value provided in the output file is not adjusted for multiple comparisons.
RankProd: If the meta-analysis method is set to "RankProd", then the ranks of the modules in the individual datasets are used to obtain aggregate ranks. If a module was not evaluated in a dataset, its rank is set to K, where K denotes the total number of modules. Also, if in a given dataset, multiple modules have the same p-value, then they are assigned the same rank. The unadjusted and FDR-adjusted RankProd p-values are saved as the last two columns in the meta-analysis output file.

Section 10: Combining output data with known cancer annotation
In case of cancer-related datasets, to facilitate the biological interpretability of the enriched hubs, we provide a function obtainCancerInfo. This function maps the hubs (corresponding to enriched modules) to the catalogue of genes already causally associated with cancer(s), provided the catalogue file is provided as an input parameter in Excel format. For example, to determine whether hubs (associated with enriched modules) in the output file Gene_Output_1.txt (Section 2) correspond to known cancer genes, at the R command prompt type obtainCancerInfo(hubFile = "Gene_Output_1.txt" , cancerAnnotationFile = "Cancer_Gene_Census.xls" , outFile = "Hub_CIC_Info.txt") The output file Hub_CIC_Info.txt contains hubs (with unadjusted p-value < 0.05) that map to known cancer genes.
Unlike the microRNA interactomes which are updated regularly in the VANData package, the Cancer_Gene_Census file is for example purposes only and is not updated. Therefore, we encourage the user to download the most up-to-date version of the annotation file from the website http://www.sanger.ac.uk/genetics/CGP/Census/ prior to executing this R function.

Section 11: Measures of association
Inside our R package, every row of the expression data is median-centered and its variance is set to one prior to the calculation of the association measure. Before we describe the various association measures implemented in our package, we introduce some notation. We denote a hub (microRNA or gene) as u and its set of interactors as R such that r ∈ R is an individual element (or interactor) in the set R. We denote the number of biological states by N and the choice of N determines the association measure.

Number of conditions is two
If N = 2, then the user can select Taylor's correlation coefficient (TCC) [17] or Pearson's correlation coefficient (PCC) as the association measure. ρ denote the correlation coefficient in B1 and B2, respectively, for the u-r pair. Also, u X and r X denote the average expression value for u and r, respectively, over all samples, i.e. over We test the null hypothesis that the average change in association between a hub and its interactors (i.e. ρ ) is not stronger than that by chance. For this purpose, we randomly assign the samples to B1 and B2 and recalculate ρ . We denote the recalculated ρ as * ρ and perform the random assignment of samples Nperm times.
We estimate the p-value as the proportion of times * ρ is at least as large as ρ . If the p-value is less than a user-defined threshold value, say 0.05, then we consider the change in association between a hub and its interactors to be statistically significant.
2. PCC: The PCC differs from TCC in the estimation of r B1 ρ and r B 2 ρ . Unlike Equations (1) and (2), for PCC

Number of conditions is greater than two
For every biological state, we calculate the PCC between a hub u and its interactors r ∈ R. Thus we obtain a matrix M with R N rows (corresponding to the number of interactors) and N columns (corresponding to the number of biological states). We use the matrix M to calculate the F-statistic [19]. We denote the observed F-statistic as f obs and test the null hypothesis that f obs value is not larger than that obtained by chance. For this purpose we perform a permutation test similar to that described earlier for TCC.