While most clustering procedures use a pairwise dissimilarity (distance) measure as input, we present a clustering procedure that can accommodate a multi-point dissimilarity measure d(i1, i2, ..., i
P
) where P > 1 is the number of points and the indices i
k
= 1, ..., n run over the n objects. Since we are mainly interested in a network application, we will refer to the objects as nodes and the corresponding measure as multi-node dissimilarity.
A multi-node (P-point) dissimilarity measure d(i1, i2, ..., i
P
) is defined to satisfy the following properties:
-
i)
it takes on non-negative values, i.e.
ii) it equals 0 when all indices are equal, i.e.,
iii) it is symmetric with respect to index permutations, i.e.
Note that this definition reduces to that of a pairwise dissimilarity when P = 2.
Several approaches can be used to define a multi-node dissimilarity measure. For example, a pairwise dissimilarity measure d(i1, i2) gives rise to multi-node dissimilarity measure by averaging all pairwise dissimilarities, e.g. a 3 node measure can be defined as follows
The proposed module affinity search technique (MAST) can be used for any pairwise or multi-node dissimilarity measure. But our particular focus is the multi-node topological overlap measure (reviewed below) since it was found useful in network neighborhood analysis [1]. The topological overlap measure is defined for weighted and unweighted networks that can be represented by a (weighted) adjacency matrix A = [a(ij)], i.e. a symmetric similarity matrix with entries between 0 and 1. In an unweighted network, a(ij) = 1 if nodes i and j are connected and 0 otherwise. In a weighted network, 0 ≤ a(ij) ≤ 1 encodes the pairwise connection strength. Examples of such networks include gene co-expression networks and physical interaction networks.
A simple approach for measuring the dissimilarity between nodes i and j is to define dissA(ij) = 1 - a(ij). However, this measure is not very robust with respect to erroneous adjacencies. Spurious or weak connections in the adjacency matrix may lead to 'noisy' networks and modules [2]. Therefore, we and others have explored the use of dissimilarity measures that are based on common interacting partners or on topological metrics [3–7]. In this article, we will define modules as sets of nodes that have high topological overlap.
Review of the pairwise topological overlap measure
The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to. Two nodes have high topological overlap if they are connected to roughly the same group of nodes in the network (i.e. they share the same neighborhood). In an unweighted network, the definition of the pairwise topological overlap measure in the supplementary material of [3] can be expressed in set notation as
where N(i1, i2) denotes the set of common neighbors shared by i1 and i2, N (i1, - i2) denotes the set of the neighbors of i1 excluding i2 and |·| denotes the number of elements (cardinality) of its argument. The Binomial coefficient
= 1 in the denominator of Eq. (2) is an upper bound of a(i1i2). One can show that 0 ≤ a(i1i2) ≤ 1 implies 0 ≤ t(i1i2) ≤ 1 [1, 8, 9], i.e. t(i1i2) can be considered a normalized measure of the number of shared direct neighbors. In the following, we will review how to extend this pairwise measure to multiple nodes.
Review of the multi-node topological overlap measure
The topological overlap measure presented in Eq. (2) is a pairwise similarity measure. While pairwise similarities are widely used in clustering procedures, we have shown that it can be advantageous to consider multi-node similarity measures [1]. There are several ways to extend this measure to a multi-node similarity. The first extension (referred to as average pairwise extension) is to simply average the pairwise measure across all pairwise index selections as described in Eq. (1). The second extension (referred to as stringent MTOM extension) keeps track of the numbers of neighbors that are truly shared among multiple nodes. For example, the MTOM involving 3 different nodes i1,i2,i3 is defined as
where N(i1, i2, i3) is the set of the neighbors shared by i1, i2 and i3 and N (i1, i2, - i3) is the set of the neighbors shared by i1 and i2 excluding i3. The Binomial coefficient
= 3 in the denominator of Eq. (3) is the upper bound of a(i1i2) + a(i1i3) + a(i2i3) and equals the number of connections that can be formed between i1, i2, and i3. One can easily prove 0 ≤ a(i, j) ≤ 1 implies that 0 ≤ t(i1i2i3) ≤ 1. The stringent MTOM extension is related to the average pairwise TOM extension (Eq. 1) but it puts particular weight onto neighbors that are truly shared between the nodes. Below, we briefly compare of the two multi-node extensions (and corresponding dissimilarity measures).
MTOM for weighted networks
To extend the MTOM measure to weighted networks, we express the set notation in terms of the entries of the adjacency matrix as follows
Note that these formulas remain mathematically well defined for a weighted adjacency matrix 0 ≤ a(i1i2) ≤ 1. Therefore, it is natural to use these algebraic formulas for defining the MTOM measure for a weighted networks. It is straightforward to extend the above equations for P = 3 nodes to more than 3 nodes, e.g. our MAST software can deal with any integer P > 1.
Since the topological overlap measure accounts for connections to shared neighbors, it is particularly useful in the context of sparse network, e.g. protein-protein interaction networks [3, 8, 10]. A comparison of the topological overlap measure to alternative measures (e.g. adjacency and correlation-based measures) can be found in [1, 7].
Comparison of MTOM based dissimilarity measures
The MTOM measures is a similarity measure that has an upper bound of 1. To turn it into a dissimilarity we subtract it from 1. For example, a two-node dissimilarity is given by
and a three node dissimilarity measures are given by
As outlined in Eq. (1), an alternative 3-node dissimilarity can be defined as follows
On real data, we find that the two 3 point dissimilarity measures (Eq. 5) and (Eq. 6) are highly correlated (r ≈ .9) across index triplets with distinct indices, i.e. when i1 ≠ i2 ≠ i3. But if 2 indices are equal and different from the third index, (e.g. when i1 = i2 ≠ i3 which entails dissTOM 2(i1, i2) = 0) then the average pairwise dissimilarity (Eq. 6) takes on substantially lower values than dissimilarity (Eq. 5). Across all (possibly non-distinct) index triplets, we find a weak correlation (r ≈ 0.3) between the two 3 point dissimilarity measures. In the following, we proceed with the non-pairwise dissimilarity measures (e.g. Eq. 5).
Review of network neighborhood analysis using MTOM
In a previous publication, we have shown that the multi-node topological overlap measure (MTOM) can perform better than a pairwise measure in the context of network neighborhood analysis [1]. Since network neighborhood analysis is an important step in our proposed MAST procedure for module detection, we will review it in the following. Neighborhood analysis aims to find a set of nodes (the neighborhood) that is similar to an initial 'seed' set of nodes. Intuitively speaking, a neighborhood is composed of nodes that are highly connected to the given seed set of nodes. Since many of our applications involve networks comprised of genes, we will use the words 'node' and 'gene' interchangeably in this article. Neighborhood analysis facilitates a guilt-by-association screening strategy for finding genes that interact with a given set of biologically interesting genes. Informally, we refer to the resulting neighborhood as MTOM neighborhood. The MTOM-based neighborhood analysis requires as input an initial seed neighborhood composed of S0 ≥ 1 node(s) and the requested final size of the neighborhood S
t
= S0 + S, where S is the number of nodes that will be added to the initial neighborhood. For each node outside of the initial (seed) neighborhood, the MTOM software computes the MTOM value with the initial neighborhood. Next the S nodes with the highest MTOM values are selected and added to the initial seed neighborhood.
Computational Speed
In the following, we report computation times for carrying out MTOM neighborhood analysis on a computer with the following specifications: Intel Pentium 4 CPU 2.40 GHz 2.39 GHz, 1.00 G of RAM. To search for 10 neighbors of a gene, MTOM required 10 seconds in a network comprised of 2000 nodes, 2 minutes in a network of 5000 genes, 12 minutes in a network comprised of 10 k genes, and over 1 hour in a network comprised of 150 k genes.
Finding modules (clusters) in a network
The detection of modules in a network is an important task in systems biology since genes and their protein products carry out cellular processes in the context of functional modules. Here we focus on module identification methods that are based on using a node dissimilarity measure in conjunction with a clustering method. We define modules as clusters of densely interconnected nodes. Although our method can be easily be adapted to any (dis-)similarity measure, our software implementation uses an extension of the topological overlap measure, which has been used in many network applications [1, 3, 7, 8, 11, 12]. Numerous clustering methods exist for pairwise dissimilarity measures. We review hierarchical clustering and partitioning around medoids in more detail since we compare them to the proposed MAST procedure.
Review of hierarchical clustering with the pairwise topological overlap measure
Before we describe the details of the MAST procedure, we method we first review a widely used module detection method which defines modules as branches of a hierarchical clustering tree. The standard approach for using the pairwise topological overlap measure for module detection is to use it as input of a hierarchical clustering procedure which results in a cluster tree (dendrogram) [13]. Next clusters (modules) are defined as branches of the cluster tree. Toward this end, one can use the R package dynamicTreeCut which implements several branch cutting methods [14]. Hierarchical clustering procedure has led to biologically meaningful modules in several applications [3, 8, 10, 11, 15–19], but it has one major limitation: it can only accommodate a pairwise dissimilarity measure. In contrast, our MAST procedure can accommodate a multi-node measure.
Review of partitioning around medoid clustering
Partitioning around medoids (PAM, [13]), also known as k-medoid clustering, is a widely used clustering method, which generalizes k-means clustering. Compared to the k-means approach that is based on a Euclidean distance measure, PAM accepts any pairwise dissimilarity measure. The number k of clusters (partitions) is a user defined parameter. The PAM-algorithm iteratively constructs representative objects (medoids) for each of the k clusters. After finding a set of k initial medoids, corresponding k clusters are constructed by assigning each observation to the nearest medoid. After a preliminary clusters assignment, new medoids are determined by minimizing the within cluster dissimilarity measure. The procedure is iterated until convergence to a (local) minimum.
Similar to hierarchical clustering PAM can only accept a pairwise dissimilarity measure. Another limitation of PAM is that it does not allow for un-assigned (un-clustered) nodes. In many gene co-expression network applications, there are thousands of background genes that should not be forced into a module [8, 14].
Review of CAST
The Cluster Affinity Search Technique (CAST) [20] has been found to be an attractive clustering procedure for module detection in gene expression data. CAST is a sequential procedure that defines clusters one at a time. Since we have network applications in mind, we will refer to the clusters as modules. After detecting a module, CAST removes the corresponding nodes from consideration and initializes the next module. Module detection proceeds by adding and removing nodes based on a similarity measure between the nodes and the module members. A node is added if its similarity to the module exceeds a user-defined threshold. At each iteration the 'loosest' node will be dropped if its similarity to the module falls below the threshold. Note that the threshold is constant for the whole clustering procedure, which is why we refer to it as global threshold.