Parallel mutual information estimation for inferring gene regulatory networks on GPUs
 Haixiang Shi^{1}Email author,
 Bertil Schmidt^{1},
 Weiguo Liu^{1} and
 Wolfgang MüllerWittig^{1}
DOI: 10.1186/175605004189
© Shi et al; licensee BioMed Central Ltd. 2011
Received: 1 March 2011
Accepted: 15 June 2011
Published: 15 June 2011
Abstract
Background
Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced Bspline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity.
Results
We present a new approach to accelerate the Bspline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDAMI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multithreaded implementation on a quadcore CPU for large microarray datasets. We have used the results obtained by CUDAMI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDAMI produces GRNs of higher quality in less time.
Conclusions
CUDAMI is publicly available opensource software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multithreaded implementation by fully exploiting the compute capability of commonly used CUDAenabled lowcost GPUs.
Background
Mutual information (MI) is used to measure the mutual dependence between two random variables in information theory. As an information theoretic approach, it has been used in various areas including physics [1], image processing [2, 3], speech recognition [4], and bioinformatics [5, 6]. An advantage of MI compared to many other similarity measures (such as Pearson correlation), is its capability to detect nonlinear correlations between the two variables [7]. In [1], a recursive method for calculating MI is presented and used on dynamical systems and chaotic data. An overview about MI used in medical imaging applications is presented in [2]. An MI application to find features for audiovisual speech recognition tasks is proposed in [4]. Zhou et al. [5] used MI for gene clustering to determine gene regulations. In [6], MI is used to measure nonlinear relationships between the expressions of two genes. Because of its inherent algorithmic complexity, the reverse engineering or inference of gene regulatory networks (GRNs) from geneexpression profile (microarray) data remains a big challenge in system biology [7–9]. Approaches such as relevance networks [10], informationtheoretic methods [11], and Bayesian networks [12, 13] have been used to infer GRNs. However, due to their high computational complexities, these methods are very timeconsuming and cannot be used to process large datasets. Daub et al. [7] proposed a Bspline function based MI estimation algorithm which can achieve accuracy comparable to kernelbased MI approaches. Since the Bspline based approach has a lower computational complexity, it is widely used in practice. However, it is still highly timeconsuming for big gene expression datasets. Since availability and size of such datasets is growing rapidly, finding fast solutions is of high importance to research in this area.
In this paper, we present a new approach to accelerate Bspline function based MI estimation using the CUDA programming model. We take advantage of shared memory for fast I/O operations to gain efficiency. We further use double precision floating point arithmetic as well as an efficient partitioning scheme to overcome the GPU device memory limitation for big datasets. We evaluate our implementation for a number of microarray datasets. It achieves speedups up to 82 on an Nvidia Tesla C2050 GPU compared to the publicly available multithreaded implementation by Daub et al. [7] running on an Intel i7 quadcore CPU. We use our MI results to infer GRNs by replacing the timeconsuming MI calculation in ARACNE [14]. The results show that the runtime needed for inferring GRNs is much shorter and quality of the resulting network is better by using our MI values compared to the original ARACNE.
The paper presented by Wilson et al. [15] is similar to the approach presented in this paper since it also uses CUDAenabled GPUs to accelerate Bspline based MI estimation. However, it only implements the weighting matrix computation on the GPU. Weighting matrix calculation is only one step in the Bspline based MI estimation algorithm. The remaining steps are performed on the CPU, thus limits the possible speedup. The solution presented in this paper overcomes this limitation by accelerating all steps of the Bspline based MI estimation using multiple CUDA kernels. Our experiments therefore show significant higher speedups than the ones reported in [15].
Bspline Based Mutual Information Estimator
Definition of Mutual Information
From Eqn. (3), we can see that the MI is zero when X and Y are independent
Estimating Mutual Information for Continuous Data
After estimating the probabilities using Eqns. (4) and (6), we can compute MI using Eqns. (1) to (3). In the simple binning method mentioned above, each continuous value is assigned to exactly one bin. Values near to the border of a bin can be shifted to neighboring bins by small fluctuations. Thus, the MI result is strongly affected by noise. In order to overcome this shortcoming of the simple binning method, Daub et al. [7] introduced the Bspline approach. In Daub's approach, each measurement can be assigned to multiple bins with weights given by Bspline functions.
Bspline Functions
where i ∈ [1, R] and z ∈ [0, R  k + 1] is the domain interval of the Bspline function. Note that continuous values should be normalized to fit into the domain interval. Assume we have M continuous data points in the dataset, x_{1}, ..., x_{ M } , the normalization procedure is as follows.
1. Find the minimum and maximum value x_{ min } and x_{ max } among all M data points.
where R is the number of bins used in the Bspline algorithm and k is the spline order of the Bspline function.
Sequential MI estimator
Our parallel CUDA algorithm (described in Methods) extensively uses the concept of the weighting matrix of each given random variable, which is defined as follows.
Definition Weighting Matrix (WM): Consider a random variable X = {x_{1}, x_{2}, ..., x_{ M } }, R bins {a_{1}, ..., a_{ R } }, and the Bspline function order k. The weighting matrix for X is an M × R matrix denoted as W (X), where W (X)_{ i }_{,}_{ j } contains weighting coefficient of value x_{ i } in bin a_{ j } ; i.e. W (X)_{ i }_{,}_{ j } = B_{ j }_{,}_{ k } (z_{ i } ) where z_{ i } is the Bspline domain normalized value of x_{ i } for each 1 ≤ i ≤ M and 1 ≤ j ≤ R.
Based on Eqns. (1) to (8), we can now outline the sequential Bspline based MI estimation algorithm. It consists of two parts: WM(X, R, k) and Single _MI (X, Y, R, k), which are described in Algorithm 1 and Algorithm 2.
Algorithm 1: W M(X, R, k)
Input: Random variable X = {x_{1}, ..., x_{ M } }, number of bins R, BSpline order k
Output: W (X)
foreach i, 1 ≤ i ≤ M do
Calculate the normalized variable z_{ i } , (i = 1, ..., M) using Eqn. (9);
foreach j, 1 ≤ j ≤ R do
Calculate the weighting coefficient B_{ j }_{,}_{ k } (z_{ i } ) using Eqn. (7) and (8) with the normalized value
z_{ i } ;
end
end
Output W (X)_{ i }_{,}_{ j } = B_{ j }_{,}_{ k } (z_{ i } )
Algorithm 2: Single_M I (X, Y, R, k)
Input: Random variable measurements X = {x_{1}, ..., x_{ M } } and Y = {y_{1}, ...,y_{ M } }, number of bins R, BSpline order k
Output: M I (X, Y )
Call W M (X, R, k) to get W (X);
Call W M (Y, R, k) to get W (Y);
foreach j, 1 ≤ j ≤ R do
Calculate the probability of each bin for X and Y ;
end
Calculate the self entropy H (X) and H (Y);
foreach k, 1 ≤ k ≤ R do
foreach l, 1 ≤ l ≤ R do
Calculate joint probabilities;
end
end
Calculate the joint entropy;
Calculate the mutual information using Eqn. (3);
The Single_MI algorithm shows how the MI for one pair of random variables is calculated. Practical applications of MI usually have a large number of random variables as input, where the mutual information of each pair of variables needs to be computed. For example, in this paper we are interested in the analysis of gene expression data, where the input consists of N genes Ω = {X_{1}, ..., X_{ N } }, where N is typically a few thousands. For each gene X_{ i } we have M gene expression measurements; i.e. X_{ i } = {x_{ i }_{1}, ..., x_{ iM } }. We then want to calculate M I (X_{ i } , X_{ j } ) for all 1≤ i ≤ N  1, i < j ≤ N. The resulting matrix of pairwise MI values can be used as input to a subsequent clustering algorithm. The algorithm to calculate all pairwise MI values is outlined in Algorithm 3.
Algorithm 3: Pairwise _MI (Ω, R, k)
Input: N Random Variables Ω = {X_{1}, ..., X_{ N } } consisting of M measurements each; i.e., X_{ i } = {x_{ i }_{1}, ..., x_{ iM } } for all 1 ≤ i ≤ N; number of bins: R; BSpline order: k.
Output: M I (X_{ i } , X_{ j } ) for all 1 ≤ i ≤ N  1, i < j ≤ N
foreach i, 1 ≤ i ≤ N do
W M (X_{ i } , R, k);
end
foreach i, 1 ≤ i ≤ N  1 do
foreach j, i ≤ j ≤ N do
Call Single _MI (X_{ i } , X_{ j } , R, k) to calculate MI for this gene pair using Algorithm 2;
end
end
Complexity Analysis
The most time consuming step of Algorithm 1 is the inner forloop. Assuming that the evaluation of a Bspline function call takes O(k) time, the time complexity of this step is O(M × R × k). We further need O (M × R) space to store the output weighting matrix. Since W M (X_{ i } , R, k) is called N times in Algorithm 3, this leads to a time complexity of O (N × M × R × k) and space complexity of O (N × M × R) for the first forloop. The nested forloop of Algorithm 3 calls Algorithm 2 O(N^{2}) times. The time consuming part of Algorithm 2 is determined by the nested forloop which has time complexity O (M × R^{2}). Thus, the overall time complexity of Algorithm 3 is O (N^{2} × M × R^{2}). Note that k and R are usually significantly smaller than N and M.
CUDA Programming Model
As an extension of C/C++, CUDA (Compute Unified Device Architecture) is used to write scalable multithreaded programs for CUDAenabled GPUs [16]. CUDA programs can be executed on GPUs with NVIDIA's Tesla unified computing architecture. Examples of CUDA enabled GPUs ranging from GeForce 8/9/200, Tesla 800/1000, C1060, C2050, to Quadro FX 3000/4000/5000 series.
CUDA programs contain a sequential part, called the kernel program. The kernel is written in conventional scalar Ccode. It represents the operations to be performed by a single thread and is invoked as a set of concurrently executing threads. These threads are organized in a hierarchy consisting of socalled thread blocks and grids. A thread block is a set of concurrent threads and a grid is a set of independent thread blocks. Each thread has an associated unique ID. Similar to MPI processes, CUDA provides each thread access to its unique ID through corresponding variables. The total size of a grid (dimGrid) and a thread block (dimBlock) is explicitly specified in the kernel functioncall:
Kernel <<< dimGrid, dimBlock >>> (parameter list);
The hierarchical organization into blocks and grids has implications for thread communication and synchronization. Although threads located in different blocks cannot communicate or synchronize directly, threads within a thread block can communicate through a perblock shared memory (PBSM) and may synchronize using barriers. The Tesla architecture supports CUDA applications using a scalable processor array. The array consists of a number of streaming multiprocessors (SM). In the latest Fermi architecture [17], each SM contains 32 scalar streaming processor (SP) cores, which share a PBSM of size up to 48 KB. All threads of a thread block are executed concurrently on a single SM. The SM executes threads in small groups of 32, called warps, in singleinstruction multiplethread (SIMT) fashion. Thus, parallel performance is generally penalized by datadependent conditional branches and improved if all threads in a warp follow the same execution path. There are two types of parallelism supported by CUDA. First, a large number of threads can run in parallel independently. We call this type of parallelism the coarsegrained. Second, multiple threads within each thread blocks can cooperate on some memory spaces (such as the shared memory) simultaneously. For instance, shared memory I/O request made of n addresses can be serviced in a single clock cycle at the same time. Threads in the same thread block can cooperate together by efficiently sharing data and synchronizing their execution to coordinate memory access with other threads. We call this type of parallelism finegrained.
Previous work on using CUDA for computational biology focused on sequence alignment [18–21], and molecular dynamics simulations [22]. In this paper we present the parallel Bspline function based MI estimation which can help to infer GRNs using CUDA. So far, parallel MI estimation using Bsplines has been limited to multithreading on multicore CPUs [7] and MPI on distributed memory clusters [8, 23]. The results presented in this paper indicate that the CUDA approach can provide higher performance at reasonable hardware cost.
Methods
Parallel MI Estimation using CUDA
Algorithm 4: CUDAbased MI estimation algorithm
Input: N genes, each with M experiments.
Output: Pairwise MI values.
/*Host programs executed on CPU*/
Initialize parameters controlling MI estimation;
Load gene expression data into GPU device memory and launch the kernels;
/*Kernel program executed on GPU*/;
Compute the WM for each gene (Kernel 1);
Check the data integrity of the input data (Kernel 2);
Compute the self entropy for each gene (Kernel 3);
Compute the joint entropy and MI value for each gene pair (Kernel 4);
/*Host programs executed on CPU*/
Read back results to CPU and output;
Parallel Computation of WM
In practice, there are R + k  1 threads working in Kernel 1 in the finegrained concurrent way to calculate each row of the WM. Thus there are totally Q × (R + k  1) threads to compute the WM for each gene. It should be noted that the computation of the WM for each gene may span into different thread blocks because the maximum threads for each GPU block is limited and the size of the WM is varying with the size of the experiment M, the bin number R and the spline order k.
After all WMs are computed, Kernel 2 will be invoked to check the data integrity. Kernel 3 then follows the first forloop of Algorithm 2 to compute the self entropy for each gene. WMs and the self entropy calculated in Kernel 1 and 3 will be passed to Kernel 4 to compute the MI values for all gene pairs.
Parallel Computation of MI
Algorithm 5: CUDA Kernel 4
Input: WMs for each gene.
Output: Pairwise MI values.
foreach i, ≤ 1 ≤ i ≤ N  1 do
foreach j, i ≤ j ≤ N do
Assign (i, j)th pair of WM to one thread block;
R × R threads in each thread block work in parallel following the two forloops in Algorithm 2 to compute the joint WM by doing tiled matrix multiplication for the current WM pair;
One thread in this block is used to compute the joint entropy and MI;
end
end
In practice, a cyclic procedure is used in the kernel program to implement the tiled matrix multiplications. Each thread block first loads the current tile pair from global memory to shared memory. Then each thread computes one element of the tiled multiplication matrix. In the subsequent iteration, the next tile pair is loaded and multiplied. The tiled multiplication matrix will then be updated. This procedure will continue until all tile pairs are computed. The final tiled multiplication matrix is the joint WM we want to compute. At last, we calculate the MI value for the current gene pair. Algorithm 5 shows the pseudocode of our CUDA implementation of the parallel algorithm for Kernel 4.
Partitioning
Algorithm 6: CUDAbased MI estimation with partition
Input: N genes, each with M experiments.
Output: Pairwise MI values.
/*Host program executed on CPU*/
Initialize parameters controlling MI estimation;
Partition gene expression data into P groups ;
foreach i, 1 ≤ i ≤ P do
/*Kernel program executed on each thread*/
Load ith gene data group into the GPU device memory and compute WM for each gene using Algorithm 1;
Write the WMs to CPU RAM;
end
Partition WMs into Q groups;
foreach i, 1 ≤ i ≤ Q do
/*Kernel program executed on each thread*/
Load ith WM group into the GPU device memory and compute MI values using Kernel 2, Kernel 3 and Kernel 4;
Read MI values for current WM group back to CPU;
Write MI values to les;
end
The space complexity for storing all WMs is O(N × M × R). Using double precision floating point numbers this translates to 8 × N × M × R bytes of memory. The GPU global memory is not sufficient to load all WMs for parallel computation in the kernel for large datasets. Therefore, a method is required to partition the pairwise MI computation into a number of steps, where each step requires only a subset of WMs. By partitioning both the gene expression data and WMs into small groups, we can process large datasets using limited GPU device memory. Our partitioning method is illustrated in Algorithm 6.
The method shown in Algorithm 6 divides the gene expression data and WMs into smaller groups, which can be stored within the GPU global memory. Note that, the memory complexity of O(N × M × R) is dominated by the computational complexity of O(N^{2} × M × R^{2}). Therefore, the required data transfer time can be completely hidden by the computation time.
Results and Discussion
We have implemented the double precision CUDAMI using CUDA Toolkit 3.0 and evaluated it on the following CUDAenabled hardware:
Nvidia Tesla C2050: 1.15 GHz engine clock speed, 14 multiprocessors, 3 GB GDDR5 device memory, 48 KB shared memory/multiprocessor.
Tests have been conducted with this card installed in a PC with an Intel QuadCore i7920 2.66 GHz CPU, 12 GB RAM running Linux Fedora 10.
Datasets used for performance evaluation.
Dataset ID  Number of Genes  Number of Experiments 

nne2048_911  2048  911 
nne4096_911  4096  911 
nasc2048_2996  2048  2996 
nasc4096_2996  4096  2996 
Yeast_6221_80  6221  80 
Yeast_6307_215  6307  215 
S2000_1000  2000  1000 
S4000_1000  4000  1000 
S8000_1000  8000  1000 
S4000_3000  4000  3000 
S4000_4000  4000  4000 
S4000_5000  4000  5000 
S10000_2000  10000  2000 
S10000_3000  10000  3000 
S10000_4000  10000  4000 
Comparison of runtime (in seconds) between multithreaded MIBE (4 threads) and CUDAMI
ID  MIBE  CUDAMI  Speedup 

nne2048_911  220.87  14.78  14.9 
nne4096_911  867.35  45.17  19.2 
nasc2048_2996  2174.65  44.89  48.4 
nasc4096_2996  8645.28  153.46  56.3 
Yeast_6221_80  246.01  30  8.2 
Yeast_6307_215  490.46  42.18  11.6 
S2000_1000  229.52  14.58  15.7 
S4000_1000  904.86  50.89  17.8 
S8000_1000  3635.11  192.05  18.9 
S4000_3000  5960.87  148.09  40.3 
S4000_4000  7993.99  199.07  40.2 
S4000 5000  13790.19  256.76  53.7 
S10000 _2000  34315.69  520.12  66.0 
S10000 _3000  51705.18  781.73  66.1 
S10000 _4000  68799.74  838.74  82.0 
Comparison of runtimes (in seconds) for processing the nne 4096_911 dataset with various parameters.
Bin Number  Spline Order  MIBE (4 threads)  CUDAMI  Speedup 

10  3  867.35  45.17  19.2 
10  4  867.24  46.18  18.8 
10  5  868.3  46.54  18.7 
15  3  1926.67  60.02  32.1 
15  4  1947.45  60.83  32.0 
15  5  1949.65  61.77  31.6 
20  3  3468.07  104.45  33.2 
20  4  3657.64  105.40  34.7 
20  5  3658.83  105.61  34.6 
1. The speedup does not change significantly with a larger value of spline order.
2. The speedup improves significantly with a larger number of bins.
The reason for Observation 1 is that the spline order k does not have much impact on the overall runtime of the MI estimation algorithm. On the contrary because of the quadratic item R^{2} in the Big O notion, the value of R influences the performance greatly. From Algorithm 5 we can see that in CUDAMI totally R × R threads are used. This means a larger number of threads are used with a larger number of bins. Therefore CUDAMI can work more efficiently with larger number of bins R, which explains Observation 2. We also have compared the runtime performance of CUDAMI to the MPI based TINGe software [23] running with MPI installed on an Intel QuadCore i7920 and eight MPI processes for the datasets shown in Table 1. For all tested datasets except the nasc 2048_2996 dataset, MIBE is able to outperform TINGe on the same hardware. Thus, we have decided only to include a runtime comparison between CUDAMI and MIBE in this paper.
GFlops and GPU runtime (in Sec.) of Kernel 4 for dataset nne 4096_911 with fixed spline order and variable bin number.
Bin Number  Runtime  Percentage  GFlops 

10  13.24  61.39%  115.41 
15  22.96  71.9%  149.74 
20  36.93  79.38%  165.56 
Comparison of ARACNE, TINGe and CARACNE using synthetic networks.
Time (Sec.)  Specificity  Sensitivity  Precision  TP  TN  FP  FN  

m = 500  ARACNE  188.28  0.995  0.255  0.373  98  30575  165  287 
TINGe  3.74  0.995  0.257  0.404  99  30594  146  286  
CARACNE  1.07  0.996  0.291  0.448  112  30602  138  273  
m = 900  ARACNE  637.56  0.994  0.27  0.378  104  30569  171  281 
TINGe  3.9  0.995  0.265  0.394  102  30583  157  283  
CARACNE  1.3  0.996  0.294  0.454  113  30604  136  272  
m = 1200  ARACNE  1312.3  0.995  0.283  0.394  109  30572  168  276 
TINGe  4.86  0.995  0.273  0.4  105  30581  159  280  
CARACNE  1.46  0.995  0.296  0.44  114  30595  145  271 
The results for inferring GRNs using ARACNE, TINGe, and CARACNE are shown in Table 5. From Table 5 we can see that CARACNE achieves the best performance in terms of both runtime and quality of inferred GRNs.
Conclusions
In this paper we have proposed a CUDAbased parallel algorithm  CUDAMI for accelerating MI estimation using the Bspline function. In order to exploit the GPU's capabilities to accelerate MI estimation, we have used the fast shared memory, finegrained parallelism, and partitioning to implement our algorithm. Our implementation achieves speedups up to 82 compared to the multithreaded MIBE on a modern Intel quadcore. This result indicates that CUDAenabled architectures are a highly efficient hardware platform for this type computation. We also have used the output of CUDAMI to infer GRNs. Our experiments show that compared to ARACNE and TINGe, CUDAMI can achieve better performance in terms of both runtime and inferred GRNs' quality for synthetic datasets.
Availability and requirements

Project name: CUDAMI

Project home page: https://sites.google.com/site/liuweiguohome/cudami

Operating System: Linux

Programming language: CUDA and C

Other requirements: CUDA SDK and Toolkits 3.0 or higher, CUDAenabled GPU with at least 3 G memory.

License: none
Declarations
Authors’ Affiliations
References
 Fraser AM, Swinney HL: Independent coordinates for strange attractors from mutual information. Physical Review A. 1986, 33: 23182321.View ArticleGoogle Scholar
 Pluim JPW, Maintz JBA, Viergever MA: Mutualinformationbased registration of medical images: a survey. IEEE Transactions on Medical Imaging. 2003, 22: 9861004. 10.1109/TMI.2003.815867.PubMedView ArticleGoogle Scholar
 Tebmann M, Eisenacher C, Enders F, Stamminger M, Hastreiter P: GPU accelerated normalized mutual information and Bspline transformation. Eurographics Workshop on Visual Computing for Biomedicine. 2008, Eurographics Association, 117124.Google Scholar
 Arsic I, Thiran JP: Mutual information eigenlips for audiovisual speech recognition. Proc 14th Eur Signal Processing Conf (EUSIPCO). 2006Google Scholar
 Zhou X, Wang X, Dougherty ER: Construction of genomic networks using mutualinformation clustering and reversiblejump MarkovchainMonteCarlo predictor design. Signal Processing. 2003, 83: 745761. 10.1016/S01651684(02)004693.View ArticleGoogle Scholar
 Zhou X, Wang X, Dougherty ER, Russ D, Suh E: Gene Clustering Based on Clusterwide Mutual Information. Journal of Computational Biology. 2004, 11: 147161. 10.1089/106652704773416939.PubMedView ArticleGoogle Scholar
 Daub CO, Steuer R, Selbig J, Kloska S: Estimating mutual information using Bspline functionsan improved similarity measure for analysing gene expression data. 2004, 5:Google Scholar
 Zola J, Aluru M, Aluru S: Parallel information theory based construction of gene regulatory networks. Hipc. 2008, 336349.Google Scholar
 Butte AJ, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pacific Symposium on Biocomputing. 2000, 415426.Google Scholar
 Schäfer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks. Bioinformatics. 2005, 21 (6): 754764. 10.1093/bioinformatics/bti062.PubMedView ArticleGoogle Scholar
 D'Haeseleer P, Wen X, Fuhrman S, Somogyi R: Mining the gene expression matrix: Inferring gene relationshops from large scale gene expression data. Second International Workshop on Information Processing in Cells and Tissues. 1998, 203212.View ArticleGoogle Scholar
 Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. Journal of Computational Biology. 2000, 7: 601620. 10.1089/106652700750050961.PubMedView ArticleGoogle Scholar
 Chen X, Chen M, Ning K: BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network. Bioinformatics Application Note. 2006, 22: 29522954.View ArticleGoogle Scholar
 Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, Califano A: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics. 2006, 7 (S7):
 Wilson J, Dai M, Jakupovic E, Watson S, Meng F: Supercomputing with toys: harnessing the power of NVIDIA 8800GTX and playstation 3 for bioinformatics problems. Comput Syst Bioinformatics Conf. 2007, 387390.View ArticleGoogle Scholar
 Lindholm E, Nickolls J, Oberman S, Montrym J: NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro. 2008, 28: 4052.View ArticleGoogle Scholar
 Nvidia: NvidiaFermiArchitecture. [http://www.nvidia.com/object/fermi_architecture.html]
 Manavski SA, Valle G: CUDA compatible GPU cards as efficient hardware accelerators for SmithWaterman sequence alignment. BMC Bioinformatics. 2008, 9:Google Scholar
 Schatz MC, Trapnell C, Delcher AL, Varshney A: Highthroughput sequence alignment using graphics processing units. BMC Bioinformatics. 2007, 8 (474):
 Liu Y, Maskell DL, Schmidt B: CUDASW++: optimizing SmithWaterman sequence database searches for CUDAenabled graphics processing units. BMC Research Notes. 2009, 2 (73):
 Liu Y, Schmidt B, Maskell DL: CUDASW++2.0: enhanced SmithWaterman protein database search on CUDAenabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Research Notes. 2010, 3 (93):
 Liu W, Schmidt B, Voss G, MüllerWittig W: Accelerating Molecular Dynamics simulations using Graphics Processing Units with CUDA. Computer Physics Communications. 2008, 179: 634641. 10.1016/j.cpc.2008.05.008.View ArticleGoogle Scholar
 Zola J, Aluru M, Sarje A, Aluru S: Parallel Information Theory Based Construction of Genomewide Gene Regulatory Networks. IEEE Transactions on Parallel and Distributed Systems. 2010, 21: 17211733.View ArticleGoogle Scholar
 CUDA N: NVIDIA CUDA C Programming Guide Version 3.1.1. 2010Google Scholar
 den Bulcke TV, Leemput KV, Naudts B, van Remortel P, Ma H, Verschoren A, Moor BD, Marchal K: SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics. 2006, 7 (43):
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.