Mixed-Precision Random Projection for RandNLA on Tensor Cores
Random projection can reduce the dimension of data while capturing its
structure, and is a fundamental tool in machine learning, signal processing,
and information retrieval, fields that deal with large amounts of data today.
RandNLA (Randomized Numerical Linear Algebra) leverages random projection to
reduce the computational complexity of low-rank decomposition of tensors and
solve least-squares problems. While the random projection itself is a simple
matrix multiplication, its asymptotic computational complexity is typically
higher than that of the other operations in a RandNLA algorithm, and various
studies have therefore proposed methods for reducing this cost. We
propose a fast mixed-precision random projection method on NVIDIA GPUs using
Tensor Cores for single-precision tensors. We exploit the fact that the random
matrix requires less precision, and develop a highly optimized matrix
multiplication between FP32 and FP16 matrices -- SHGEMM (Single and
Half-precision GEMM) -- on Tensor Cores, where the random matrix is stored in
FP16. Our method computes randomized SVD 1.28 times faster and random
projection higher-order SVD 1.75 times faster than baseline single-precision
implementations while maintaining accuracy.
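The core idea, storing only the random matrix in half precision while keeping the data and the accumulation in FP32, can be illustrated without Tensor Cores. The following NumPy sketch of a randomized SVD is an illustration of that idea, not the paper's SHGEMM kernel; the FP16 cast of the random matrix merely emulates its reduced storage precision.

```python
# Minimal sketch: randomized SVD with the random matrix stored in FP16.
# NumPy has no Tensor Core path; casting omega to FP16 and back only
# emulates storing the random matrix in half precision, while the data
# and the accumulation stay in FP32 (what SHGEMM does in hardware).
import numpy as np

def randomized_svd_fp16_omega(A, rank, oversample=8, seed=0):
    """Truncated SVD of an FP32 matrix A via a half-precision projection."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    k = rank + oversample
    # Gaussian random matrix stored in FP16: the paper's observation is
    # that the random matrix tolerates low precision.
    omega = rng.standard_normal((n, k)).astype(np.float16)
    Y = A @ omega.astype(np.float32)     # mixed-precision projection
    Q, _ = np.linalg.qr(Y)               # orthonormal basis of the range
    B = Q.T @ A                          # small (k x n) projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank, :]

rng = np.random.default_rng(1)
A = (rng.standard_normal((512, 16)) @ rng.standard_normal((16, 256))).astype(np.float32)
U, s, Vt = randomized_svd_fp16_omega(A, rank=16)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))  # ~FP32 rounding level
```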
Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library
NVIDIA Tensor Cores are mixed-precision matrix multiply-and-accumulate units
with a theoretical peak performance of more than 300 TFlop/s on the NVIDIA
A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel
functions. The most common way to use Tensor Cores is to supply the input
matrices from shared memory, which has higher bandwidth than global memory.
However, the Bytes-per-Flop (B/F) ratio that shared memory can deliver
relative to Tensor Core throughput is small because the performance of Tensor
Cores is so high. Thus, reducing the shared memory footprint is essential for
using Tensor Cores efficiently. In this paper, we analyze simple matrix-matrix
multiplication on Tensor Cores using the roofline model and find that shared
memory bandwidth can limit performance when using the WMMA API. To
alleviate this issue, we provide a WMMA API extension library to boost the
throughput of the computation. The library has two components. The first
allows flexible manipulation of the register arrays fed to Tensor Cores; our
evaluation shows that it reduces the shared memory footprint and speeds up
computation on Tensor Cores. The second is an API for SGEMM emulation on
Tensor Cores without additional shared memory usage. We have demonstrated that
a single-precision batched SGEMM emulation on Tensor Cores built with this
library achieves 54.2 TFlop/s on an A100 GPU, which exceeds the theoretical
peak performance of the FP32 SIMT cores while achieving the same level of
accuracy as cuBLAS. With the same amount of register usage, this throughput
cannot be reached without the shared memory footprint reduction that our
library performs.
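The roofline argument can be reproduced with back-of-envelope arithmetic. The short Python calculation below uses publicly quoted A100 figures as assumptions (312 TFlop/s FP16 Tensor Core peak, 128 bytes per clock per SM of shared memory load bandwidth) and estimates the throughput ceiling when every mma operation reloads both input fragments from shared memory.

```python
# Back-of-envelope roofline for WMMA fed from shared memory on an A100.
# The hardware figures are assumptions based on public A100 specs, not
# measurements from the paper.
TC_PEAK_TFLOPS = 312.0                      # FP16 Tensor Core peak, assumed
SMS, BYTES_PER_CLK, GHZ = 108, 128, 1.41    # assumed shared-memory load rate
shmem_bw_tbs = SMS * BYTES_PER_CLK * GHZ * 1e9 / 1e12   # ~19.5 TB/s

# One 16x16x16 mma: 2*M*N*K flops; naively, both FP16 input fragments
# are re-loaded from shared memory for every mma.
flops = 2 * 16 * 16 * 16
bytes_loaded = 2 * 16 * 16 * 2              # A and B fragments, 2 B each
intensity = flops / bytes_loaded            # flops per shared-memory byte

bound = shmem_bw_tbs * intensity            # attainable TFlop/s
print(f"shared-memory bound: {bound:.0f} TFlop/s vs {TC_PEAK_TFLOPS} peak")
# The bound sits far below the Tensor Core peak, which is why reducing
# the shared memory footprint (reusing fragments in registers) pays off.
```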
DGEMM on Integer Matrix Multiplication Unit
Deep learning hardware achieves high throughput and low power consumption by
reducing computing precision and specializing in matrix multiplication. For
machine learning inference, fixed-point value computation is commonplace, where
the input and output values and the model parameters are quantized. Thus, many
processors are now equipped with fast integer matrix multiplication units
(IMMU). It is of significant interest to find a way to harness these IMMUs to
improve the performance of HPC applications while maintaining accuracy. We
focus on the Ozaki scheme, which computes a high-precision matrix
multiplication by using lower-precision computing units, and show the
advantages and disadvantages of using an IMMU. Our experiments on integer
Tensor Cores show that we can compute double-precision matrix multiplication
faster than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor
Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a
quantum circuit simulation by up to 4.33 times while maintaining FP64 accuracy.
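As a rough illustration of the Ozaki scheme itself, the NumPy toy below splits each FP64 matrix into integer slices of a few mantissa bits, multiplies the slices exactly in int64 (standing in for an integer matrix multiplication unit), and sums the scaled partial products in FP64. It is a simplified per-matrix variant; the published scheme handles exponents per row and column, and the slice count and bit width here are illustrative choices.

```python
# Toy Ozaki scheme: a high-precision GEMM assembled from error-free
# integer products. With 5 slices of 13 bits, the slices carry 65 bits
# of mantissa, enough to recover FP64-level accuracy.
import numpy as np

def split_int(A, num_slices, bits):
    """Split A so that A ~ scale * sum_i S[i] * 2**(-bits*(i+1))."""
    scale = 2.0 ** np.ceil(np.log2(np.abs(A).max()))
    r = A / scale                      # normalized so |r| <= 1
    slices = []
    for _ in range(num_slices):
        r = r * 2.0 ** bits
        d = np.rint(r)
        slices.append(d.astype(np.int64))
        r = r - d                      # remainder feeds the next slice
    return slices, scale

def ozaki_matmul(A, B, num_slices=5, bits=13):
    Sa, sa = split_int(A, num_slices, bits)
    Sb, sb = split_int(B, num_slices, bits)
    C = np.zeros((A.shape[0], B.shape[1]))
    for i, da in enumerate(Sa):
        for j, db in enumerate(Sb):
            P = da @ db                # exact int64 product (IMMU stand-in)
            C += P.astype(np.float64) * 2.0 ** (-bits * (i + j + 2))
    return sa * sb * C

rng = np.random.default_rng(0)
A, B = rng.random((64, 64)), rng.random((64, 64))
print(np.abs(ozaki_matmul(A, B) - A @ B).max())   # ~1e-15: FP64 rounding level
```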
Apixaban for the treatment of saphenous vein graft thrombosis presenting as unstable angina: a case report
Background: Saphenous vein graft thrombosis can present as unstable angina. However, percutaneous coronary intervention for saphenous vein graft lesions carries a high risk of procedure-related slow flow. Here we present the use of the novel oral anticoagulant apixaban in the treatment of unstable angina with extensive saphenous vein graft thrombus, leading to considerable thrombus resolution and eliminating the need for percutaneous coronary intervention.
Case presentation: A 72-year-old man with 3-vessel coronary artery bypass graft surgery using a saphenous vein graft and a left internal mammary artery, performed 25 years earlier, presented at our hospital with recurrent chest tightness. Echocardiography showed regional hypokinesis of the posterolateral wall with moderate left ventricular dysfunction, which had not been previously documented. Coronary angiography showed obstruction of the saphenous vein graft with a large thrombus burden. The left internal mammary artery was patent, and the other native vessels were unchanged from 3 years earlier. He was diagnosed with unstable angina due to acute saphenous vein graft thrombosis. Instead of percutaneous coronary intervention, he was treated with apixaban 5 mg twice a day. Angiography 3 weeks after starting apixaban showed considerable resolution of the thrombus and reopening of the saphenous vein graft.
Conclusions: Apixaban could be a viable treatment option for acute saphenous vein graft thrombosis.
Quantum Circuit Simulation by SGEMM Emulation on Tensor Cores and Automatic Precision Selection
Quantum circuit simulation provides the foundation for the development of
quantum algorithms and the verification of quantum supremacy. Among the various
methods for quantum circuit simulation, tensor network contraction has been
increasing in popularity due to its ability to simulate a larger number of
qubits. During tensor contraction, the input tensors are reshaped into
matrices and multiplied by a GEMM operation; these GEMM operations can account
for up to 90% of the total calculation time. GEMM throughput can be improved
by using mixed-precision hardware such as Tensor Cores, but a straightforward
implementation yields insufficient fidelity for deep and large quantum
circuits. Prior work has demonstrated that compensated summation with careful
handling of the rounding mode can fully recover the FP32 precision of SGEMM even
when using TF32 or FP16 Tensor Cores. The exponent range is a critical issue
when applying such techniques to quantum circuit simulation. While TF32
supports almost the same exponent range as FP32, FP16 supports a much smaller
exponent range. In this work, we use the exponent range statistics of input
tensor elements to select which Tensor Cores we use for the GEMM. We evaluate
our method on Random Circuit Sampling (RCS), including Sycamore's quantum
circuit, and show that throughput is up to 1.86 times higher while maintaining
accuracy.
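The selection step itself is simple to sketch: gather exponent statistics of the input tensors (for example via frexp) and choose FP16 Tensor Cores only when the values sit safely inside FP16's exponent range, falling back to TF32 otherwise. In the Python sketch below, the min/max rule and the safety margin are illustrative assumptions, not the paper's exact criterion.

```python
# Sketch of exponent-based precision selection: prefer FP16 Tensor Cores
# (faster) when the inputs fit their narrow exponent range, otherwise
# fall back to TF32, whose exponent range matches FP32.
import numpy as np

FP16_EXP_MIN, FP16_EXP_MAX = -14, 15      # normalized FP16 exponent range

def select_tensor_core_mode(*tensors, margin=4):
    lo, hi = np.inf, -np.inf
    for t in tensors:
        v = t[t != 0]                     # exponent of zero is meaningless
        _, e = np.frexp(v)                # v = m * 2**e with 0.5 <= |m| < 1
        lo, hi = min(lo, e.min()), max(hi, e.max())
    # Illustrative rule: require headroom on both sides of the FP16 range.
    if lo - margin > FP16_EXP_MIN and hi + margin < FP16_EXP_MAX:
        return "FP16-TC"                  # higher throughput
    return "TF32-TC"                      # wider exponent range

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, (64, 64)).astype(np.float32)
b = (a * 1e-7).astype(np.float32)
print(select_tensor_core_mode(a, a))      # FP16-TC: exponents near zero
print(select_tensor_core_mode(a, b))      # TF32-TC: b underflows FP16
```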
CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
Approximate Nearest Neighbor Search (ANNS) plays a critical role in various
disciplines spanning data mining and artificial intelligence, from information
retrieval and computer vision to natural language processing and recommender
systems. Data volumes have soared in recent years and the computational cost of
an exhaustive exact nearest neighbor search is often prohibitive, necessitating
the adoption of approximate techniques. The balanced performance and recall of
graph-based approaches have more recently garnered significant attention in
ANNS algorithms, however, only a few studies have explored harnessing the power
of GPUs and multi-core processors despite the widespread use of massively
parallel and general-purpose computing. To bridge this gap, we introduce a
novel parallel computing hardware-based proximity graph and search algorithm.
By leveraging the high-performance capabilities of modern hardware, our
approach achieves remarkable efficiency gains. In particular, our method
surpasses existing CPU and GPU-based methods in constructing the proximity
graph, demonstrating higher throughput in both large- and small-batch searches
while maintaining comparable accuracy. In graph construction time, our method,
CAGRA, is 2.2~27x faster than HNSW, a state-of-the-art CPU implementation. In
large-batch query throughput in the 90% to 95% recall range, our method is
33~77x faster than HNSW and 3.8~8.8x faster than state-of-the-art GPU
implementations. For a single query, our method is 3.4~53x faster than HNSW at
95% recall.
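For intuition, the primitive underlying this family of methods is a greedy walk over a proximity graph toward the query. The NumPy sketch below runs a single-query traversal over a brute-force k-NN graph; it illustrates only the data structure, while CAGRA's contribution is constructing the graph and executing many such traversals in parallel on a GPU.

```python
# Illustrative greedy search on a proximity graph (single query, CPU).
# The graph is built by brute force here; graph-based ANNS systems such
# as HNSW or CAGRA construct it far more cleverly.
import numpy as np

def greedy_search(query, data, graph, start=0, max_steps=100):
    """Repeatedly hop to the neighbor closest to the query until stuck."""
    current = start
    best = np.sum((data[current] - query) ** 2)
    for _ in range(max_steps):
        nbrs = graph[current]
        d = np.sum((data[nbrs] - query) ** 2, axis=1)
        j = int(np.argmin(d))
        if d[j] >= best:                  # local minimum reached
            break
        current, best = int(nbrs[j]), d[j]
    return current, float(np.sqrt(best))

rng = np.random.default_rng(0)
data = rng.random((2000, 16), dtype=np.float32)
sq = (data ** 2).sum(axis=1)
dists = sq[:, None] + sq[None, :] - 2.0 * (data @ data.T)
graph = np.argsort(dists, axis=1)[:, 1:9]   # 8 nearest neighbors, skip self
query = rng.random(16, dtype=np.float32)
print(greedy_search(query, data, graph))
```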