Modularity detection in protein-protein interaction networks
 Tejaswini Narayanan^{1},
 Merril Gersten^{2},
 Shankar Subramaniam^{3} and
 Ananth Grama^{4}
DOI: 10.1186/1756-0500-4-569
© Narayanan et al; licensee BioMed Central Ltd. 2011
Received: 31 August 2011
Accepted: 29 December 2011
Published: 29 December 2011
Abstract
Background
Many recent studies have investigated modularity in biological networks and its role in the functional and structural characterization of constituent biomolecules. A technique that has shown considerable promise in modularity detection is the Newman and Girvan (NG) algorithm, which relies on the betweenness of an edge: the number of shortest paths between pairs of vertices in the network that traverse that edge. The edge with the highest betweenness is iteratively eliminated from the network, with the betweenness of the remaining edges recalculated in every iteration. This generates a complete dendrogram, from which modules are extracted by applying a quality metric called modularity, denoted by Q. This exhaustive computation can be prohibitively expensive for large networks such as protein-protein interaction networks. In this paper, we present a novel optimization to the modularity detection algorithm: an efficient termination criterion, based on a target edge betweenness value, at which the process of iterative edge removal may be terminated.
Results
We validate the robustness of our approach by applying our algorithm to real-world protein-protein interaction networks of Yeast, C. elegans and Drosophila, and demonstrate that our algorithm consistently achieves significant computational gains, in terms of reduced runtime, when compared to the NG algorithm. Furthermore, our algorithm produces modules comparable to those from the NG algorithm, both qualitatively and quantitatively. We illustrate this using comparison metrics such as module distribution, module membership cardinality, modularity Q, and the Jaccard Similarity Coefficient.
Conclusions
We have presented an optimized approach for efficient modularity detection in networks. The intuition driving our approach is the extraction of holistic measures of centrality from graphs, which are representative of the inherent modular structure of the underlying network, and the application of those measures to efficiently guide the modularity detection process. We have empirically evaluated our approach in the specific context of real-world large-scale biological networks, and have demonstrated significant savings in computational time while maintaining comparable quality of detected modules.
Background
The problem of modularity detection in networks has received considerable attention in recent literature [1–5]. Specifically, in the context of biological networks, identification of modules enables functional annotation of constituent biomolecules and the discovery of targets for therapeutic intervention and screening, among other applications. More generally, modular decomposition provides us with a higher-level understanding of the organization of networks and also serves as the basis for other network analysis tasks, such as hierarchical alignment, modular evolution, and orthology.
There are three primary approaches to modularity detection: (i) top-down (or divisive) techniques, in which a series of network partitions hierarchically decompose a network into modules, (ii) bottom-up (or agglomerative) techniques, in which modules are constructed by adding elements to an initial seed, and (iii) force-directed methods, in which suitably designed parameters drive nodes belonging to the same module to spatially proximate regions of space. There have also been investigations focused on relating various classes of methods [6].
Newman and Girvan algorithm
One such divisive technique of interest is the Newman and Girvan (NG) algorithm [1], which uses the notion of edge-betweenness, a metric that has received considerable recent research interest in the domain of modularity detection. Edge-betweenness is typically computed as the number of (pairwise) shortest paths that traverse an edge in a network. This notion, which was first introduced by Anthonisse [7], can be used to compute modules by repeatedly identifying and eliminating the edge with the highest betweenness. Note that since the elimination of a single edge (especially one with high betweenness) may cause significant perturbations to the shortest paths, the edge-betweenness of the remaining edges must be recomputed after each edge elimination.
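As an illustration of the edge-betweenness computation described above, the following is a minimal sketch for an undirected, unweighted network, written in the style of Brandes' algorithm. The function name `edge_betweenness` and the edge-list input format are assumptions made for this example and are not taken from the original implementation. Where several shortest paths connect a pair of vertices, each path contributes an equal fraction.

```python
from collections import defaultdict, deque


def edge_betweenness(edges):
    """Edge betweenness for an undirected, unweighted graph.

    `edges` is an iterable of (u, v) pairs. Returns a dict mapping
    frozenset({u, v}) to the (fractional) number of shortest paths
    that traverse that edge.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search: count shortest paths and record predecessors.
        dist = {source: 0}
        sigma = defaultdict(float)
        sigma[source] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path shares from the farthest vertices back to the source.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every vertex pair is counted from both of its endpoints, so halve.
    return {edge: value / 2.0 for edge, value in score.items()}


# Path a-b-c-d: the middle edge carries the most shortest paths.
path_scores = edge_betweenness([("a", "b"), ("b", "c"), ("c", "d")])
print(path_scores[frozenset(("b", "c"))])  # 4.0
```

On the path a-b-c-d, the outer edges each carry 3 shortest paths and the middle edge carries 4, so the middle edge would be the first one removed.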
The quality of a candidate partition is measured by the modularity Q [1]:
$$Q = \sum_{i} \left( e_{ii} - a_{i}^{2} \right) = \mathrm{Tr}(e) - \lVert e^{2} \rVert$$
where e is a k × k symmetric matrix whose element e_{ ij } is the fraction of all edges in the network that link vertices in module i to vertices in module j; k is the number of modules in the network;
Tr(e) = ∑_{ i } e_{ ii } is the trace of e, which represents the fraction of edges in the network that connect vertices in the same module;
a_{ i } = ∑_{ j } e_{ ij } are the row (or column) sums, which represent the fraction of edges that connect to vertices in module i;
||x|| denotes the sum of the elements of a matrix x.
We observe that, in a network in which edges fall between vertices without regard for the modules they belong to, e_{ ij } = a_{ i }a_{ j }.
The Q value measures the fraction of the edges that connect vertices within the same module minus the expected value of the same quantity in the network. If the number of intramodular edges is no better than random, we get Q = 0. Values approaching Q = 1, which is the maximum, indicate strong modular structure [1]. In practice, Q values for such networks with strong modular structure typically fall in the range from about 0.3 to 0.7. The modular decomposition of the network (from the dendrogram) with maximum Q value is considered to be the best split by the NG algorithm.
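The definition of Q can be made concrete with a short sketch that evaluates Q = ∑_{ i } (e_{ ii } − a_{ i }^{2}) for a given assignment of vertices to modules. The function name `modularity` and the `module_of` mapping are hypothetical names chosen for this example. Each edge between two different modules is split equally between e_{ ij } and e_{ ji }, so that e stays symmetric and its elements sum to 1.

```python
from collections import defaultdict


def modularity(edges, module_of):
    """Modularity Q of an undirected network for a given module assignment.

    `edges` is a list of (u, v) pairs and `module_of` maps each vertex to
    a module label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a network with no edges")
    e_ii = defaultdict(float)  # fraction of edges inside module i
    a_i = defaultdict(float)   # fraction of edge ends attached to module i
    for u, v in edges:
        mu, mv = module_of[u], module_of[v]
        if mu == mv:
            e_ii[mu] += 1.0 / m
        a_i[mu] += 0.5 / m
        a_i[mv] += 0.5 / m
    return sum(e_ii[i] - a_i[i] ** 2 for i in a_i)


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(round(modularity(edges, split), 4))  # 0.3571
```

For this toy network, placing each triangle in its own module gives Q = 5/14, whereas placing all six vertices in a single module gives Q = 0.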
While the computation of modules using the NG algorithm has been shown to perform well in terms of quality of modules, its computational cost can be significant (particularly for large networks such as biological networks). This cost, in part, stems from repeated edge-betweenness computations. Furthermore, refining the output dendrogram down to individual nodes is typically unnecessary from an application standpoint, often uninformative, and computationally expensive. Finally, the dendrogram requires additional post-processing to identify suitable modules based on quality measures associated with the modules. Computing the quality of each module corresponding to every node in the dendrogram is itself expensive. A stopping criterion that identifies a near-optimal point at which the process of iterative edge removal may be terminated would significantly reduce the time and space complexity of the NG algorithm.
The problem of terminating divisive clustering is an important one, especially when the clustering method is itself expensive. A number of other approaches have been proposed, including the use of p-values of clusters as termination criteria [8]. However, each of these methods assumes models for the underlying data, or specific properties for the quality measures applied to modules. For example, the divisive partitioning technique of Koyuturk et al. [8] stops the partitioning process when the p-value of a module is lower than a user-specified threshold. This does not guarantee that the modules with optimal p-values are found. Similarly, for datasets for which precise models are not available, estimating the number of clusters is difficult. Neither class of techniques is directly applicable to divisive partitioning based on the NG algorithm.
In this paper, we experimentally derive an optimized termination criterion for the NG algorithm (which we call the target edge-betweenness), based on initial values of edge-betweenness computed over the input network. In particular, we define the target edge-betweenness to be the geometric mean of the edge-betweenness values of all edges in the input network (and hence refer to our algorithm as the Gmean algorithm in the discussion below). A detailed description of our algorithm is included in the Methods section.
Results and discussion
There are two computational problems with the NG algorithm:
1. The iterative removal of edges (preceded by recalculation of edge-betweenness in every iteration) is performed until all the edges are removed, leading to a time complexity of O(ne^{2}) for a network of n vertices and e edges (using Brandes' algorithm, assuming connected networks as inputs). This computation becomes prohibitively expensive in the context of large biological networks.
2. The modularity Q is calculated for every partition of a network in the dendrogram. This is necessary for determining optimal splits.
The Gmean algorithm directly addresses these overheads in two fundamental ways: it terminates the process before all edges are removed, thus significantly reducing the first overhead. Since the termination criterion is computed just once (at the start of the algorithm), and does not rely on repeated Q value computations, we eliminate the second overhead altogether.
Furthermore, we demonstrate that our algorithm results in modules with Q values comparable to the maximum Q value from the NG algorithm, thus maintaining the quality of the identified modules while significantly reducing runtime. We also use the Jaccard Similarity Coefficient (a measure of similarity between two sample sets) to show that the resulting modules from both approaches are similar.
Comparison of computational efficiency
We observe significant and consistent savings in computational cost with our proposed optimization (for the networks in our biological test bed under consideration). Figure 1 presents a comparison of the execution times for the NG and Gmean algorithms.
Comparison of module size and distribution
Comparison of modularity
Comparison of Jaccard similarity coefficient
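The Jaccard Index of two sample sets A and B is the size of their intersection divided by the size of their union, J (A,B) = |A ∩ B| / |A ∪ B|. It is 1 if the two sample sets are identical, and 0 if they have no overlap at all.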
We use this metric to show the similarity of the modules produced as output by the NG and the Gmean algorithms. Specifically, we consider the modules produced by the algorithms as sample sets constituted by vertices and calculate the Jaccard Indices J (A,B) for all pairs of modules A and B (one from the output of each algorithm). The % similarity λ between the two outputs is then:
$$\lambda = \frac{\sum J(A,B)}{\sum J(A,B)^{*}} \times 100\%$$
where J (A,B) is the Jaccard Index for the modules A and B, one from the output of each algorithm;
J (A,B)* is the ideal Jaccard Index for the modules A and B, one from the output of each algorithm (note that J (A,B)* = 1, corresponding to a perfect match, when the two modules A and B are identical);
Σ is the summation over all pairs of modules, one from the output of each algorithm.
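The similarity measure can be sketched in a few lines. The function names `jaccard` and `percent_similarity` are hypothetical, and the pairing of modules from the two algorithms is assumed to be supplied by the caller, since the text does not specify how pairs are formed. With J (A,B)* = 1 for every pair, the denominator reduces to the number of pairs. Treating two empty sets as identical is a convention adopted only for this example.

```python
def jaccard(a, b):
    """Jaccard Index |A ∩ B| / |A ∪ B| of two collections of vertices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention for this sketch: two empty sets are identical
    return len(a & b) / len(union)


def percent_similarity(pairs):
    """Percentage similarity over (A, B) module pairs, one from each algorithm.

    The ideal index J(A,B)* is 1 for every pair, so the ideal sum equals
    the number of pairs.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair of modules is required")
    total = sum(jaccard(a, b) for a, b in pairs)
    ideal = float(len(pairs))
    return 100.0 * total / ideal


pairs = [({1, 2}, {1, 2}), ({1, 2, 3}, {2, 3, 4})]
print(percent_similarity(pairs))  # 75.0
```

In the toy example, the first pair matches perfectly (J = 1) and the second shares two of four vertices (J = 0.5), giving a similarity of 75%. The same ratio applied to a sum of 4.5472 over 5 pairs gives 90.94%.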
Summary of % similarity for biological networks considered

| | C. elegans | Yeast | Drosophila |
|---|---|---|---|
| Σ J(A,B) | 4.5472 | 47.973 | 40.5089 |
| Σ J(A,B)* | 5 | 48 | 46 |
| λ | 90.94% | 99.94% | 88.06% |
Conclusions
In this paper, we have proposed a novel termination criterion for efficient modularity detection in networks. The intuition driving our approach is the extraction of holistic measures of centrality from graphs, which are representative of inherent modular structure, and the application of those measures to efficiently guide the modularity detection process. We have empirically evaluated our approach against existing techniques for modularity detection in the context of biological networks, and have demonstrated significant savings in computational time while maintaining comparable quality of detected modules.
Methods
Existing NG method
In the NG algorithm, the edge-betweenness is computed for each edge in the network under consideration. The edge with the maximum edge-betweenness is identified and eliminated, followed by a recalculation of the edge-betweenness values of all the remaining edges in the resultant network. This process is repeated until no edges remain, thus generating a complete dendrogram, which is then traversed to identify the partition with the best modularity value Q.
Proposed Gmean method
In the Gmean algorithm, the iterative edge removal of the NG method is terminated using a target edge-betweenness, which is set to G (e), where G (e) is the geometric mean (gmean) of the edge-betweenness values of all edges in the input network. Validation on real networks shows that this choice serves as a robust and high-quality termination criterion. Specifically, as stated in the Results section, this choice produces a set of modules comparable in quality and quantity to those produced by the NG algorithm. We show this for a number of biological networks of interest. All biological network data used for the experimental study are from publicly available data sources [9, 10].
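A minimal sketch of how such a termination criterion might be applied is given below. The names `geometric_mean` and `gmean_edge_removal` are hypothetical, and the edge-betweenness computation is passed in as a function `betweenness_fn` rather than implemented here. The exact stopping rule is an assumption: the sketch stops as soon as the highest current edge-betweenness is no greater than the target, which is computed once on the input network.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty collection of positive values")
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))


def gmean_edge_removal(edges, betweenness_fn):
    """Remove highest-betweenness edges until the target is reached.

    `betweenness_fn` maps a list of edges to a dict {edge: betweenness}.
    Returns (remaining_edges, target). The modules are the connected
    components of the remaining edges.
    """
    remaining = list(edges)
    if not remaining:
        raise ValueError("the input network has no edges")
    scores = betweenness_fn(remaining)
    # Target edge-betweenness: computed once, from the input network only.
    target = geometric_mean(scores.values())
    while remaining:
        top_edge = max(scores, key=scores.get)
        # Assumed stopping rule: highest current value has reached the target.
        if scores[top_edge] <= target:
            break
        remaining.remove(top_edge)
        if remaining:
            # Betweenness must be recomputed after every removal.
            scores = betweenness_fn(remaining)
    return remaining, target
```

Because the target is fixed before any edge is removed, no Q values need to be evaluated during the loop, which reflects the second saving described in the Results section.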
List of abbreviations
 C. elegans: Caenorhabditis elegans
 gmean: Geometric mean
Declarations
Acknowledgements
We acknowledge NSF grant awards Science and Technology Center Grant 0939370, DBI 0835541 and DBI 0641037 which supported this work.
Authors’ Affiliations
References
 1. Newman MEJ, Girvan M: Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2004, 69 (2 Pt 2): 026113.
 2. Bader G, Hogue C: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 2. 10.1186/1471-2105-4-2.
 3. Dunn R, Dudbridge F, Sanderson CM: The use of edge-betweenness clustering to investigate biological function in protein interaction networks. BMC Bioinformatics. 2005, 6: 39. 10.1186/1471-2105-6-39.
 4. Rives A, Galitski T: Modular organization of cellular networks. PNAS. 2003, 100: 1128-1133. 10.1073/pnas.0237338100.
 5. Sharan R, Ideker T, Kelley B, Shamir R, Karp RM: Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J Comput Biol. 2005, 12 (6): 835-846. 10.1089/cmb.2005.12.835.
 6. Quigley A, Eades P: FADE: Graph Drawing, Clustering, and Visual Abstraction. Springer-Verlag. 2001.
 7. Anthonisse JM: The Rush in a Directed Graph. Technical Report BN 9/71. Stichting Mathematisch Centrum, Amsterdam. 1971.
 8. Koyuturk M, Grama A, Szpankowski W: Pairwise local alignment of protein interaction networks guided by models of evolution. Proceedings of ACM RECOMB. 2005, 48-65.
 9. Duch J, Arenas A: Community identification using extremal optimization. Phys Rev E Stat Nonlin Soft Matter Phys. 2005, 72 (2 Pt 2): 027104.
 10. The Biogrid. [http://thebiogrid.org/]
 11. Yang Q, Lonardi S: A parallel edge-betweenness clustering tool for Protein-Protein Interaction networks. Int J Data Min Bioinform. 2007, 1 (3): 241-247. 10.1504/IJDMB.2007.011611.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.