Fast reconstruction of 3D volumes from 2D CT projection data with GPUs
© Leeser et al.; licensee BioMed Central Ltd. 2014
Received: 3 March 2014
Accepted: 18 August 2014
Published: 30 August 2014
Skip to main content
© Leeser et al.; licensee BioMed Central Ltd. 2014
Received: 3 March 2014
Accepted: 18 August 2014
Published: 30 August 2014
Biomedical image reconstruction applications require producing high fidelity images in or close to real-time. We have implemented reconstruction of three dimensional conebeam computed tomography(CBCT) with two dimensional projections. The algorithm takes slices of the target, weights and filters them to backproject the data, then creates the final 3D volume. We have implemented the algorithm using several hardware and software approaches and taken advantage of different types of parallelism in modern processors. The two hardware platforms used are a Central Processing Unit (CPU) and a heterogeneous system with a combination of CPU and GPU. On the CPU we implement serial MATLAB, parallel MATLAB, C and parallel C with OpenMP extensions. These codes are compared against the heterogeneous versions written in CUDA-C and OpenCL.
Our results show that GPUs are particularly well suited to accelerating CBCT. Relative performance was evaluated on a mathematical phantom as well as on mouse data. Speedups of up to 200x are observed by using an AMD GPU compared to a parallel version in C with OpenMP constructs.
In this paper, we have implemented the Feldkamp-Davis-Kress algorithm, compatible with Fessler’s image reconstruction toolbox and tested it on different hardware platforms including CPU and a combination of CPU and GPU. Both NVIDIA and AMD GPUs have been used for performance evaluation. GPUs provide significant speedup over the parallel CPU version.
CT imaging is one of the most used diagnostic methods in interventional and minimally invasive surgeries. As the importance of the access to medical imagery before or during surgical procedures increases, the computational need for CT imaging becomes more demanding and challenging. It requires producing high fidelity images in or close to real-time to avoid interruptions during the treatment of patients. Conebeam CT is used to acquire knowledge of parts of the human body to obtain a clear image during/before performing a procedure. Today, most conebeam CT scanners use the Feldkamp Davis Kress algorithm as the standard reconstruction method. The method takes a slice of the target, weights the projection data and then filters the weighted data before backprojecting and creating the final three dimensional image. The last step, backprojection, is the most computationally intensive with a complexity of O(N4) in the spatial domain and it is the bottleneck. Researchers have used different architectures to accelerate this process including Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs). However, the expensive nature of these boards along with the steep learning curve necessary to program these devices often limit their use. Graphics Processing Units (GPUs) offer an alternative approach for accelerating computationally intensive jobs. Algorithms such as CT image reconstruction with intensive computation and massive data parallelism are particularly well suited for GPUs.
A popular image reconstruction toolbox, provided by Fessler, consists of a collection of open source algorithms for image reconstruction written in MATLAB. We have implemented the FDK algorithm from this toolbox using several different methods including single threaded code written in C, parallel code written in C with OpenMP constructs, parallel code in MATLAB using the parallel computing toolbox (PCT) and GPU codes written in CUDA-C and OpenCL. The purpose of this study is to explore the performance of these implementations on different architectures. These codes are run on two types of architectures including CPU and a combination of CPU and GPU. We have tested our implementations on both NVIDIA and AMD GPUs using both a mathematical phantom and mouse scan data.
The main contributions of this paper are:
Our implementations are compatible with Fessler’s image reconstruction toolbox, a popular toolbox of open source algorithms for reconstruction of images written in MATLAB. We use the same input files and same general approach as Fessler in our implementations.
Our implementations are tested on two types of hardware platforms: CPU and a combination of CPU and GPU. The performance has been evaluated using GPUs from two different vendors: NVIDIA and AMD.
The performances of two complete GPU implementations of the same approach are compared, in CUDA-C and OpenCL, to serial and multithreaded C and MATLAB implementations.
Our NVIDIA CUDA code is compatible with NVIDIA’s CUDA compiler, while other open source software is not. Our OpenCL implementation is optimized and complete.
Our code is available open source.
This section describes the FDK method along with a brief introduction to GPU computing and recent advances in the GPU computing model. We also discuss related work that aims at accelerating FDK using GPU, CPU, or other heterogeneous architectures.
The FDK method, published by Feldkamp, Davis and Kramp in 1984, introduced a method to reconstruct a 3D volume from multiple 2D projections. Here a scanner along with a 2D detector takes a full rotation around the patient or object of interest to capture the data. In this process, called conebeam scanning, the trajectory of the source is circular and each horizontal row of detector values is ramp filtered and considered as a two dimensional object. These filtered projections are then backprojected along the original rays. During the process of acquiring scanned data, the X-ray source moves in a circular orbital path, which has a radius r. The detector plane stands perpendicular to the rotational axis of the source and moves with it. It produces a set of projections P1,P2,…,P K at K discrete positions of the source with uniform angular spacing. Sometimes there are mechanical limitations that preclude a full rotation from being completed.
For many algorithms with massive parallelism, GPUs provided higher peak performance than CPUs. Initially GPUs were designed for processing graphics applications and games, but they have been increasingly used for scientific computing and biomedical applications such as Smith-Waterman alignment algorithm, protein folding, DNA sequencing, statistical phylogenetics, molecular dynamics, diffuse optical tomography and biological systems simulation[7–13].
GPUs have many parallel cores that run simultaneously and each core can run multiple threads. CT reconstruction has inherent features that can be parallelized. The sequential parts can be run on the CPU and the computationally intensive parallel parts can be accelerated on the GPU.
We have implemented FDK using two GPGPU languages: OpenCL and CUDA-C. While CUDA-C runs only on NVIDIA hardware, OpenCL is platform independent and runs on several hardware architectures including AMD, Intel, and NVIDIA. NVIDIA provides optimized libraries along with CUDA-C, which often results in better performance. Both CUDA-C and OpenCL support heterogeneous computing with separate host and device code. Both languages require minimal extensions to C/C++ programs. The accuracy of the results is of paramount importance in biomedical applications. We have shown that the results provided by GPU may have better precision over serial CPU code for floating point values.
There are several areas to explore to make the reconstruction faster. The first is to use a different algorithm. Authors have used this approach to obtain around 40 times speed up of reconstruction over traditional filtered backprojection[15, 16]. However, the quality of the reconstruction has been questioned. Another area is to explore different parallel techniques and architectures. The intrinsic parallel nature of the algorithm makes it amenable to hardware acceleration for real-time processing. A popular hardware platform for parallel processing is to use GPUs. Attempts to use GPU hardware to accelerate CT algorithms date back to the early 90s when texture mapping hardware was used for 3D reconstruction. Later Mueller and Xu used a GPU to accelerate backprojection by using accelerated graphics components[3, 19]. Zhao et al. introduced an idea to allow larger datasets to fit in GPU memory. Noel et al. used device memory to transfer all images and calculate intensity of a voxel. However accessing this memory can have long latency, so to avoid it, Knaup et al. divided the total data into chunks to fit in shared memory. Mueller et al. divided the processing by doing convolution on the CPU and backprojection on the GPU to reconstruct faster. The most similar work to ours is the Reconstruction Toolkit (RTK), based on the Insight Toolkit. Our approach is completely stand alone and does not require ITK or any other packages to operate. It makes use of the same inputs as those used by Fessler. Our CUDA-C code is compatible with nvcc, the NVIDIA C Compiler while that from RTK is not. Our OpenCL implementation is as optimized as the CUDA-C version and in fact produces superior results.
This paper presents a more complete and consistent set of experiments and results than our previously published work. All experiments in this paper are done on the same hardware for better comparison; the hardware is described in Section ‘Experimental results’. The OpenCL version is more advanced than in our previous publication and the best OpenCL implementation of backprojection available. The software described is now available for download.
We have several implementation of backprojection: 1) the MATLAB code originally writtend by Fessler et al., 2) a version of Fessler’s code parallelized with MATLAB Parallel Computing Toolbox (PCT) 3) a serial implementation written in C, 4) the C implementation parallelized with OpenMP constructs, 5) a version that uses a combination of CPU and GPU written in CUDA-C that compiles with the NVIDIA compiler, nvcc, and 6) a version that uses a combination of CPU and GPU written in OpenCL.
We have implemented the FDK method in a basic processing chain in a pipelined fashion. The steps in the pipeline are: 1) load projection data, 2) ramp filter the weighted data and 3) backproject it to the final volume. Note that the structure of our code follows that of Fessler’s implementation. The input and output formats are also the same.
For the GPU implementations, different kernels are launched for different stages. Although the kernel calls are issued in a non-blocking manner, they are executed in series as each step needs to complete before the next can begin. In the filtering stage, different pixels for the same projection can be simultaneously filtered as there is no dependency between pixels. The filtering stage uses a Fast Fourier Transform (FFT). In the CUDA code we used the CUFFT available from NVIDIA while for OpenCL we use Apple’s FFT package. The final step is to calculate voxel-based backprojection. Here each voxel is calculated in parallel by performing a matrix-vector product for each voxel in order to determine the corresponding projection value (see Equation 1). After all projections have been processed and mapped to the appropriate voxel, the final reconstructed volume is transferred to host memory. As memory transfer from host to device is expensive, transferring all the data to the GPU before the start of computation and transferring back the result after the final volume is reconstructed saves data transfer cycles. We use asynchronous data transfers to overlap data transfer with computation.
The input and output sizes of the mathematical phantom are both 1MB. For the mouse scan, the sizes of the input and output projections are 542 MB and 768 MB respectively. Note that the code and therefore the run time only depend on the size of the data, not the content.
Performance of different implementations (in seconds)
Speedup over MATLAB
Speedup over C
C + OpenMP (4 threads)
C + OpenMP
We have implemented the FDK algorithm, compatible with Fessler’s image reconstruction toolbox and tested on two different architectures: CPU and a combination of CPU and GPU. Both NVIDIA and AMD GPUs have been used for performance evaluation. The performance of two GPU implementations in CUDA-C and OpenCL have been compared to MATLAB, Multithreaded MATLAB, and serial and multi-threaded C. The OpenCL implementation on the AMD card yields the largest speed up of 200x over multi-threaded C and three orders of magnitude over the original MATLAB code.
In the future, we will continue to improve our approach. After parallelizing backprojection, the new bottleneck is weighted filtering. We plan to investigate improved performance for the filtering stage. In addition, for the GPU implementations, only a subset of the number of launch configurations for kernels have been tested so far. The number of threads have been arbitrarily chosen. These issues will be investigated with auto-tuning. The data sizes that have been tested so far can be accommodated in the GPU memory, but for larger data sizes, streaming needs to be added to the current implementation. We plan to do so in future versions of the open source code.
Project name: Accelerating 3D CBCT with GPU
Project home page: http://sourceforge.net/projects/acceleratecbct/
Operating system(s): Linux
Programming language: C with OpenMP, CUDA, OpenCL.
Other requirements: CUDA compiler installed
Any restrictions to use by non-academics: Only those imposed already by the license.
All materials are available online. The source codes as well as input data phantom are released into the public domain. The documentation for the software pipeline is also included. This is available as Open Source software under the General Public License (GPL) version 2.0. as a part of the open source software.
Compute unified device architecture
Advanced micro devices, Inc.
Graphics processing unit
Parallel computing toolbox.
This work was supported in part by the National Science Foundation Engineering Research Centers Innovations Program, Biomedical Imaging Acceleration Testbench (Award Number EEC-0946463), by National Science Foundation grant CCF-1218075 and by gifts from Mathworks and NVIDIA.
We thank Dr. Nicholas Moore for his helpful input, and Drs. Ralph Weissleder and Sarit Sekhar Agasthi, Massachusetts General Hospital for providing the mouse scan data.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.