2,333 research outputs found

Graphics processing unit accelerating compressed sensing photoacoustic computed tomography with total variation

Author: Bai Yuanyuan
Gao Mingjie
Liu Chengbo
Meng Jing
Si Guangtao
Wang Lihong V.
Publication venue: Optical Society of America
Publication date: 20/01/2020

Photoacoustic computed tomography with compressed sensing (CS-PACT) is a commonly used imaging strategy for sparse-sampling PACT. However, it is very time-consuming because of the iterative process involved in the image reconstruction. In this paper, we present a graphics processing unit (GPU)-based parallel computation framework for total-variation-based CS-PACT and adapt it to a custom-made PACT system. Specifically, five compute-intensive operators are extracted from the iterative algorithm and redesigned for parallel execution on a GPU. We achieved an image reconstruction speed 24–31 times faster than the CPU implementation. We performed in vivo experiments on human hands to verify the feasibility of our method.

Caltech Authors

Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Author: A Dziekonski
AV Knyazev
C Yang
G Ortega
JW Choi
M Knap
M Shao
P Maris
X Yang
Y Wang
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.

Crossref

eScholarship - University of California