Fast reconstruction of 3D volumes from 2D CT projection data with GPUs
 Miriam Leeser^{1}Email author,
 Saoni Mukherjee^{1} and
 James Brock^{2}
DOI: 10.1186/175605007582
© Leeser et al.; licensee BioMed Central Ltd. 2014
Received: 3 March 2014
Accepted: 18 August 2014
Published: 30 August 2014
Abstract
Background
Biomedical image reconstruction applications require producing high fidelity images in or close to realtime. We have implemented reconstruction of three dimensional conebeam computed tomography(CBCT) with two dimensional projections. The algorithm takes slices of the target, weights and filters them to backproject the data, then creates the final 3D volume. We have implemented the algorithm using several hardware and software approaches and taken advantage of different types of parallelism in modern processors. The two hardware platforms used are a Central Processing Unit (CPU) and a heterogeneous system with a combination of CPU and GPU. On the CPU we implement serial MATLAB, parallel MATLAB, C and parallel C with OpenMP extensions. These codes are compared against the heterogeneous versions written in CUDAC and OpenCL.
Findings
Our results show that GPUs are particularly well suited to accelerating CBCT. Relative performance was evaluated on a mathematical phantom as well as on mouse data. Speedups of up to 200x are observed by using an AMD GPU compared to a parallel version in C with OpenMP constructs.
Conclusions
In this paper, we have implemented the FeldkampDavisKress algorithm, compatible with Fessler’s image reconstruction toolbox and tested it on different hardware platforms including CPU and a combination of CPU and GPU. Both NVIDIA and AMD GPUs have been used for performance evaluation. GPUs provide significant speedup over the parallel CPU version.
Keywords
Computed tomography Graphics processing unit Conebeam reconstruction CUDA OpenCLFindings
Introduction
CT imaging is one of the most used diagnostic methods in interventional and minimally invasive surgeries[1]. As the importance of the access to medical imagery before or during surgical procedures increases, the computational need for CT imaging becomes more demanding and challenging. It requires producing high fidelity images in or close to realtime to avoid interruptions during the treatment of patients. Conebeam CT is used to acquire knowledge of parts of the human body to obtain a clear image during/before performing a procedure. Today, most conebeam CT scanners use the Feldkamp Davis Kress algorithm[2] as the standard reconstruction method. The method takes a slice of the target, weights the projection data and then filters the weighted data before backprojecting and creating the final three dimensional image. The last step, backprojection, is the most computationally intensive with a complexity of O(N^{4}) in the spatial domain and it is the bottleneck[3]. Researchers have used different architectures to accelerate this process including Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs). However, the expensive nature of these boards along with the steep learning curve necessary to program these devices often limit their use. Graphics Processing Units (GPUs) offer an alternative approach for accelerating computationally intensive jobs. Algorithms such as CT image reconstruction with intensive computation and massive data parallelism are particularly well suited for GPUs.
A popular image reconstruction toolbox, provided by Fessler[4], consists of a collection of open source algorithms for image reconstruction written in MATLAB. We have implemented the FDK algorithm from this toolbox using several different methods including single threaded code written in C, parallel code written in C with OpenMP constructs, parallel code in MATLAB using the parallel computing toolbox (PCT) and GPU codes written in CUDAC and OpenCL. The purpose of this study is to explore the performance of these implementations on different architectures. These codes are run on two types of architectures including CPU and a combination of CPU and GPU. We have tested our implementations on both NVIDIA and AMD GPUs using both a mathematical phantom and mouse scan data.
The main contributions of this paper are:

Our implementations are compatible with Fessler’s image reconstruction toolbox[4], a popular toolbox of open source algorithms for reconstruction of images written in MATLAB. We use the same input files and same general approach as Fessler in our implementations.

Our implementations are tested on two types of hardware platforms: CPU and a combination of CPU and GPU. The performance has been evaluated using GPUs from two different vendors: NVIDIA and AMD.

The performances of two complete GPU implementations of the same approach are compared, in CUDAC and OpenCL, to serial and multithreaded C and MATLAB implementations.

Our NVIDIA CUDA code is compatible with NVIDIA’s CUDA compiler, while other open source software is not. Our OpenCL implementation is optimized and complete.
Our code is available open source[5].
Background
This section describes the FDK method along with a brief introduction to GPU computing and recent advances in the GPU computing model. We also discuss related work that aims at accelerating FDK using GPU, CPU, or other heterogeneous architectures.
FDK method
The FDK method, published by Feldkamp, Davis and Kramp in 1984[2], introduced a method to reconstruct a 3D volume from multiple 2D projections. Here a scanner along with a 2D detector takes a full rotation around the patient or object of interest to capture the data. In this process, called conebeam scanning, the trajectory of the source is circular and each horizontal row of detector values is ramp filtered and considered as a two dimensional object. These filtered projections are then backprojected along the original rays. During the process of acquiring scanned data, the Xray source moves in a circular orbital path, which has a radius r. The detector plane stands perpendicular to the rotational axis of the source and moves with it. It produces a set of projections P_{1},P_{2},…,P_{ K } at K discrete positions of the source with uniform angular spacing. Sometimes there are mechanical limitations that preclude a full rotation from being completed.
GPU computing
For many algorithms with massive parallelism, GPUs provided higher peak performance than CPUs. Initially GPUs were designed for processing graphics applications and games, but they have been increasingly used for scientific computing and biomedical applications such as SmithWaterman alignment algorithm, protein folding, DNA sequencing, statistical phylogenetics, molecular dynamics, diffuse optical tomography and biological systems simulation[7–13].
GPUs have many parallel cores that run simultaneously and each core can run multiple threads. CT reconstruction has inherent features that can be parallelized. The sequential parts can be run on the CPU and the computationally intensive parallel parts can be accelerated on the GPU.
We have implemented FDK using two GPGPU languages: OpenCL and CUDAC. While CUDAC runs only on NVIDIA hardware, OpenCL is platform independent and runs on several hardware architectures including AMD, Intel, and NVIDIA. NVIDIA provides optimized libraries along with CUDAC, which often results in better performance. Both CUDAC and OpenCL support heterogeneous computing with separate host and device code. Both languages require minimal extensions to C/C++ programs. The accuracy of the results is of paramount importance in biomedical applications. We have shown that the results provided by GPU may have better precision over serial CPU code for floating point values[14].
Related work
There are several areas to explore to make the reconstruction faster. The first is to use a different algorithm. Authors have used this approach to obtain around 40 times speed up of reconstruction over traditional filtered backprojection[15, 16]. However, the quality of the reconstruction has been questioned[17]. Another area is to explore different parallel techniques and architectures. The intrinsic parallel nature of the algorithm makes it amenable to hardware acceleration for realtime processing. A popular hardware platform for parallel processing is to use GPUs. Attempts to use GPU hardware to accelerate CT algorithms date back to the early 90s when texture mapping hardware was used for 3D reconstruction[18]. Later Mueller and Xu used a GPU to accelerate backprojection by using accelerated graphics components[3, 19]. Zhao et al.[1] introduced an idea to allow larger datasets to fit in GPU memory. Noel et al.[20] used device memory to transfer all images and calculate intensity of a voxel. However accessing this memory can have long latency, so to avoid it, Knaup et al.[21] divided the total data into chunks to fit in shared memory. Mueller et al.[22] divided the processing by doing convolution on the CPU and backprojection on the GPU to reconstruct faster. The most similar work to ours is the Reconstruction Toolkit (RTK)[23], based on the Insight Toolkit[24]. Our approach is completely stand alone and does not require ITK or any other packages to operate. It makes use of the same inputs as those used by Fessler. Our CUDAC code is compatible with nvcc, the NVIDIA C Compiler while that from RTK is not. Our OpenCL implementation is as optimized as the CUDAC version and in fact produces superior results.
This paper presents a more complete and consistent set of experiments and results than our previously published work[26]. All experiments in this paper are done on the same hardware for better comparison; the hardware is described in Section ‘Experimental results’. The OpenCL version is more advanced than in our previous publication and the best OpenCL implementation of backprojection available. The software described is now available for download[5].
Implementations
We have several implementation of backprojection: 1) the MATLAB code originally writtend by Fessler et al.[4], 2) a version of Fessler’s code parallelized with MATLAB Parallel Computing Toolbox (PCT) 3) a serial implementation written in C, 4) the C implementation parallelized with OpenMP constructs[27], 5) a version that uses a combination of CPU and GPU written in CUDAC that compiles with the NVIDIA compiler, nvcc, and 6) a version that uses a combination of CPU and GPU written in OpenCL.
We have implemented the FDK method in a basic processing chain in a pipelined fashion. The steps in the pipeline are: 1) load projection data, 2) ramp filter the weighted data and 3) backproject it to the final volume. Note that the structure of our code follows that of Fessler’s implementation. The input and output formats are also the same.
For the GPU implementations, different kernels are launched for different stages. Although the kernel calls are issued in a nonblocking manner, they are executed in series as each step needs to complete before the next can begin. In the filtering stage, different pixels for the same projection can be simultaneously filtered as there is no dependency between pixels. The filtering stage uses a Fast Fourier Transform (FFT). In the CUDA code we used the CUFFT available from NVIDIA[28] while for OpenCL we use Apple’s FFT package. The final step is to calculate voxelbased backprojection. Here each voxel is calculated in parallel by performing a matrixvector product for each voxel in order to determine the corresponding projection value (see Equation 1). After all projections have been processed and mapped to the appropriate voxel, the final reconstructed volume is transferred to host memory. As memory transfer from host to device is expensive, transferring all the data to the GPU before the start of computation and transferring back the result after the final volume is reconstructed saves data transfer cycles. We use asynchronous data transfers to overlap data transfer with computation.
Experimental results
Hardware details
Processor  Clock  Number of  Cache  Memory 

speed  cores  size  size  
Intel Xeon  2.00 GHz  6  15 MB  32 GB 
E52620  
NVIDIA Tesla  1.15 GHz  448  768 KB  6 GB 
C2075  
AMD Radeon  925 MHz  2048  768 KB  3 GB 
HD 7970 
The input and output sizes of the mathematical phantom are both 1MB. For the mouse scan, the sizes of the input and output projections are 542 MB and 768 MB respectively. Note that the code and therefore the run time only depend on the size of the data, not the content.
Performance of different implementations (in seconds)
Dataset  Approach  Backprojection time  Total time  Speedup over MATLAB  Speedup over C 

Phantom  MATLAB  51.06  51.11  –  – 
Phantom  C  3.93  3.95  12.94  – 
Phantom  C + OpenMP (4 threads)  0.85  0.89  57.43  4.44 
Phantom  OpenCL (NVIDIA)  0.01  0.30  170.37  13.17 
Phantom  CUDA (NVIDIA)  0.01  0.30  170.37  13.17 
Phantom  OpenCL (AMD)  0.01  0.32  159.72  12.34 
Mouse scan  MATLAB  33760.40  33777.33  –  – 
Mouse scan  MATLAB PCT  22506.49  22513.90  1.5  – 
Mouse scan  C  18451.77  18462.60  1.83  – 
Mouse scan  C + OpenMP  5112.94  5615.65  6.01  3.29 
Mouse scan  OpenCL (NVIDIA)  49.44  60.45  558.76  305.42 
Mouse scan  CUDA (NVIDIA)  47.79  58.87  573.76  313.62 
Mouse scan  OpenCL(AMD)  16.01  28.02  1205.47  658.91 
Conclusions
We have implemented the FDK algorithm[2], compatible with Fessler’s image reconstruction toolbox[4] and tested on two different architectures: CPU and a combination of CPU and GPU. Both NVIDIA and AMD GPUs have been used for performance evaluation. The performance of two GPU implementations in CUDAC and OpenCL have been compared to MATLAB, Multithreaded MATLAB, and serial and multithreaded C. The OpenCL implementation on the AMD card yields the largest speed up of 200x over multithreaded C and three orders of magnitude over the original MATLAB code.
In the future, we will continue to improve our approach. After parallelizing backprojection, the new bottleneck is weighted filtering. We plan to investigate improved performance for the filtering stage. In addition, for the GPU implementations, only a subset of the number of launch configurations for kernels have been tested so far. The number of threads have been arbitrarily chosen. These issues will be investigated with autotuning. The data sizes that have been tested so far can be accommodated in the GPU memory, but for larger data sizes, streaming needs to be added to the current implementation. We plan to do so in future versions of the open source code.
Availability and requirements

Project name: Accelerating 3D CBCT with GPU

Project home page: http://sourceforge.net/projects/acceleratecbct/

Operating system(s): Linux

Programming language: C with OpenMP, CUDA, OpenCL.

Other requirements: CUDA compiler installed

License: GPL

Any restrictions to use by nonacademics: Only those imposed already by the license.
Availability of supporting data
All materials are available online. The source codes as well as input data phantom are released into the public domain. The documentation for the software pipeline is also included. This is available as Open Source software under the General Public License (GPL) version 2.0. as a part of the open source software[5].
Abbreviations
 CUDA:

Compute unified device architecture
 CT:

Computed tomography
 AMD:

Advanced micro devices, Inc.
 FDK:

Feldkamp–Davis–Kress
 GPU:

Graphics processing unit
 PCT:

Parallel computing toolbox.
Declarations
Acknowledgements
This work was supported in part by the National Science Foundation Engineering Research Centers Innovations Program, Biomedical Imaging Acceleration Testbench (Award Number EEC0946463), by National Science Foundation grant CCF1218075 and by gifts from Mathworks and NVIDIA.
We thank Dr. Nicholas Moore for his helpful input, and Drs. Ralph Weissleder and Sarit Sekhar Agasthi, Massachusetts General Hospital for providing the mouse scan data.
Authors’ Affiliations
References
 Zhao X, Hu Jj, Zhang P:Gpubased 3d cone beam ct image reconstruction for large data volume. Int J Biomed Imaging. 2009, 2009: 149079PubMedPubMed CentralView Article
 Feldkamp LA, Davis LC, Kress JW:Practical conebeam algorithm. J Opt Soc Am. 1984, 1: 612619. 10.1364/JOSAA.1.000612.View Article
 Mueller K, Xu F:Practical consideration for gpuaccelerated ct. IEEE Int Symp Biomed Imaging. 2006, 11: 1184
 Fessler J:Image reconstruction toolbox. [http://web.eecs.umich.edu/~fessler/irt/fessler.tgz],
 Cbct open source software. [http://www.coe.neu.edu/Research/rcl//projects/CBCT.php],
 Ino F, Yoshida S, Hagihara K:RGBA packing for fast cone beam reconstruction on the GPU. SPIE Medical Imaging. 2009, International Society for Optics and Photonics, 725858725858.
 de O Sandes EF, de Melo AC:Retrieving smithwaterman alignments with optimizations for megabase biological sequences using GPU. Parallel Distributed Syst IEEE Trans. 2013, 24 (5): 10091021.View Article
 Sukhwani B, Herbordt MC:Gpu acceleration of a production molecular docking code. Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units. 2009, New York: ACM, 1927.
 Liu CM, Wong T, Wu E, Luo R, Yiu SM, Li Y, Wang B, Yu C, Chu X, Zhao K, Li R, Lam TW:SOAP3: ultrafast gpubased parallel alignment tool for short reads. Bioinformatics. 2012, 28 (6): 878879. 10.1093/bioinformatics/bts061.PubMedView Article
 Ayres DL, Darling A, Zwickl DJ, Beerli P, Holder MT, Lewis PO, Huelsenbeck JP, Ronquist F, Swofford DL, Cummings MP, Rambaut A, Suchard MA:Beagle: An application programming interface and highperformance computing library for statistical phylogenetics. System Biol. 2012, 61 (1): 17010.1093/sysbio/syr100.View Article
 Jie L, Li K, Shi L, Liu R, Mei J:Accelerating solidification process simulation for largesized system of liquid metal atoms using GPU with CUDA. J Comput Phys. 2014, 257 Part A (0): 521535.View Article
 Valim N, Brock J, Leeser M, Niedre M:The effect of temporal impulse response on experimental reduction of photon scatter in timeresolved diffuse optical tomography. Phys Med Biol. 2013, 58 (2): 33510.1088/00319155/58/2/335.PubMedPubMed CentralView Article
 Okuyama T, Okita M, Abe T, Asai Y, Kitano H, Nomura T, Hagihara K:Accelerating ODEbased simulation of general and heterogeneous biophysical models using a GPU. Parallel Distributed Syst EEE Trans. 2014, 25 (8): 19661975.View Article
 Leeser M, Ramachandran J, Wahl T, Yablonski D:OpenCL floating point software on heterogeneous architectures–portable or not. Workshop on Numerical Software Verification (NSV). 2012, Available from [http://www.ccs.neu.edu/home/wahl/Research/FPAHeterogeneous/],
 Xiao S, Bresler Y, Munson Jr DC:Fast Feldkamp algorithm for conebeam computer tomography. Image Processing, International Conference on (ICIP), Volume 2. 2003, New York: IEEE, 819819.
 Basu S, Bresler Y:O(n2log2n) filtered backprojection reconstruction algorithm for tomography. IEEE Trans Image Process. 2000, 9: 10
 Rodet T, Noo F, Defrise M:The conebeam algorithm of feldkamp, davis, and kress preserves oblique line integrals. Med Phys. 2004, 31: 197210.1118/1.1759828.PubMedView Article
 Cabral B, Cam N, Foran J:Accelerated volume rendering and tomographic reconstruction using texture mapping hardware. Proceedings of the 1994 Symposium on Volume Visualization. 1994, New York: ACM, 9198.View Article
 Mueller K, Xu F:Realtime 3d computed tomographic reconstruction using commodity graphics hardware. Phys Med Biol. 2007, 52: 340510.1088/00319155/52/12/006.PubMedView Article
 Noël PB, Walczak AM, Xu J, Corso JJ, Hoffmann KR, Schafer S:GPUbased cone beam computed tomography. Comput Methods Prog Biomed. 2010, 98 (3): 271277. 10.1016/j.cmpb.2009.08.006.View Article
 Knaup M, Steckmann S, Kachelriess M:GPUbased parallelbeam and conebeam forwardand backprojection using CUDA. Nuclear Science Symposium Conference Record (NSS). 2008, New York: IEEE, 51535157.
 Mueller K, Xu F, Neophytou N:Why do commodity graphics hardware boards (GPUs) work so well for acceleration of computed tomography?. Electronic Imaging 2007. 2007, International Society for Optics and Photonics, 6498064980.
 Jomier J, Rit S, Oliva MV:Rtk: The reconstruction toolkit. [http://www.kitware.com/source/home/post/115],
 National Library of Medicine:Insight Segmentation and Registration Toolkit (ITK). [http://www.itk.org/],
 Noël PB, Walczak A, Hoffmann KR, Xu J, Corso JJ, Schafer S:Clinical evaluation of gpubased cone beam computed tomography. Proc. HighPerformance Comput Biomed Image Anal. 2008, [http://www.miccai.org/],
 Mukherjee S, Moore N, Brock J, Leeser M:CUDA and OpenCL implementations of 3D CT reconstruction for biomedical imaging. High Performance Extreme Computing (HPEC), Conference On. 2012, New York: IEEE, 16.
 OpenMP:OpenMP Standard Version 3.1. [http://www.openmp.org/mpdocuments/OpenMP3.1.pdf],
 NVIDIA:CUDA CUFFT Library. [http://docs.nvidia.com/cuda/cufft/index.html],
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.