Near Memory Acceleration on High Resolution Radio Astronomy Imaging
Modern radio telescopes like the Square Kilometre Array (SKA) will need to
process exabytes of radio-astronomical signals in real time to construct a
high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the
performance bottlenecks due to frequent memory accesses in a state-of-the-art
radio-astronomy imaging algorithm. In this paper, we use CPI breakdown
analysis on an IBM POWER9 to show that a sub-module performing a
two-dimensional fast Fourier transform (2D FFT) is memory bound. We then
present an NMC approach on FPGA for the 2D FFT that outperforms a CPU by up to
120x and performs comparably to a high-end GPU, while using less bandwidth and
memory.
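The 2D FFT at the heart of this pipeline is separable into 1D passes, and it is the strided column pass that dominates memory traffic. A minimal NumPy sketch of the row-column decomposition (illustrative only, not the paper's FPGA design):

```python
import numpy as np

def fft2_row_column(grid):
    # 2D FFT via row-column decomposition: 1D FFTs over rows,
    # then 1D FFTs over columns. The column pass touches memory
    # with large strides, one reason this kernel tends to be
    # memory bound rather than compute bound.
    out = np.fft.fft(grid, axis=1)   # row pass: contiguous access
    out = np.fft.fft(out, axis=0)    # column pass: strided access
    return out

rng = np.random.default_rng(0)
grid = rng.standard_normal((256, 256)) + 1j * rng.standard_normal((256, 256))
assert np.allclose(fft2_row_column(grid), np.fft.fft2(grid))
```

Because the transform is separable, the two passes can be applied in either order; near-memory designs exploit the fact that each pass streams whole rows or columns past the data.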
Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering
The resurgence of machine learning has increased the demand for
high-performance basic linear algebra subroutines (BLAS), which have long
depended on libraries to achieve peak performance on commodity hardware.
High-performance BLAS implementations rely on a layered approach that consists
of tiling and packing layers, for data (re)organization, and micro kernels that
perform the actual computations. The creation of high-performance micro kernels
requires significant development effort to write tailored assembly code for
each architecture. This hand-optimization task is complicated by the recent
introduction of matrix engines, such as IBM POWER10's MMA, Intel's AMX, and
Arm's ME, which
deliver high-performance matrix operations. This paper presents a compiler-only
alternative to the use of high-performance libraries by incorporating, to the
best of our knowledge and for the first time, the automatic generation of the
layered approach into LLVM, a production compiler. Modular design of the
algorithm, such as the use of LLVM's matrix-multiply intrinsic for a clear
interface between the tiling and packing layers and the micro kernel, makes it
easy to retarget the code generation to multiple accelerators. The use of
intrinsics enables a comprehensive performance study. In processors without
hardware matrix engines, the tiling and packing layers deliver performance up
to 22x faster (Intel, small matrices) and more than 6x faster (POWER9, large
matrices) than PLuTo, a widely used polyhedral optimizer. The performance also
approaches high-performance libraries and is only 34% slower than OpenBLAS and
on-par with Eigen for large matrices. With MMA in POWER10 this solution is, for
large matrices, over 2.6x faster than the vector-extension solution, matches
Eigen performance, and achieves up to 96% of BLAS peak performance.
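The layered structure the compiler generates can be sketched in plain Python; the tile size here is an illustrative assumption, and the `a @ b` call stands in for the hand-written or intrinsic-lowered micro kernel:

```python
import numpy as np

TILE = 64  # hypothetical tile size; real libraries tune it per cache level

def tiled_matmul(A, B):
    # Layered-approach sketch: tile the iteration space and "pack" each
    # tile into a contiguous buffer before the micro kernel runs, so the
    # innermost computation streams through cache-friendly memory.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                # packing layer: copy tiles into contiguous buffers
                a = np.ascontiguousarray(A[i:i+TILE, p:p+TILE])
                b = np.ascontiguousarray(B[p:p+TILE, j:j+TILE])
                # micro kernel: the only layer doing arithmetic; in the
                # compiler this lowers to a matrix-multiply intrinsic
                C[i:i+TILE, j:j+TILE] += a @ b
    return C
```

Keeping the micro kernel behind a clean interface is what makes retargeting to MMA, AMX, or ME a matter of swapping the lowering, not rewriting the tiling and packing.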
Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories
Modern computing systems are embracing hybrid memory comprising DRAM and
non-volatile memory (NVM) to combine the best properties of both memory
technologies, achieving low latency, high reliability, and high density. A
prominent characteristic of DRAM-NVM hybrid memory is that NVM access latency
is much higher than DRAM access latency; we call this inter-memory
asymmetry. We observe that parasitic components on a long bitline are a major
source of high latency in both DRAM and NVM, and a significant factor
contributing to high-voltage operations in NVM, which impact their reliability.
We propose an architectural change, where each long bitline in DRAM and NVM is
split into two segments by an isolation transistor. One segment can be accessed
with lower latency and operating voltage than the other. By introducing tiers,
we enable non-uniform accesses within each memory type (which we call
intra-memory asymmetry), leading to performance and reliability trade-offs in
DRAM-NVM hybrid memory. We extend an existing NVM-DRAM OS in three ways.
First, we
exploit both inter- and intra-memory asymmetries to allocate and migrate memory
pages between the tiers in DRAM and NVM. Second, we improve the OS's page
allocation decisions by predicting the access intensity of a newly-referenced
memory page in a program and placing it in a matching tier during its initial
allocation. This minimizes page migrations during program execution, lowering
the performance overhead. Third, we propose a solution to migrate pages between
the tiers of the same memory without transferring data over the memory channel,
minimizing channel occupancy and improving performance. Our overall approach,
which we call MNEME, to enable and exploit asymmetries in DRAM-NVM hybrid
tiered memory improves both performance and reliability for both single-core
and multi-programmed workloads.
Comment: 15 pages, 29 figures, accepted at the ACM SIGPLAN International
Symposium on Memory Management.
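The second contribution, initial placement by predicted access intensity, can be sketched as a toy policy; tier names and thresholds below are invented for illustration, and the paper's predictor and OS integration are far richer:

```python
# Hypothetical sketch of intensity-based tier selection in the spirit of
# the paper's OS extension; all names and threshold values are invented.
# Tiers are ordered from lowest to highest access latency.
TIERS = ["dram_fast", "dram_slow", "nvm_fast", "nvm_slow"]

def place_page(predicted_accesses_per_epoch):
    # Hotter pages go to lower-latency tiers at first allocation,
    # minimizing costly migrations later during program execution.
    if predicted_accesses_per_epoch >= 1000:
        return "dram_fast"   # fast segment of the split DRAM bitline
    if predicted_accesses_per_epoch >= 100:
        return "dram_slow"
    if predicted_accesses_per_epoch >= 10:
        return "nvm_fast"    # fast segment of the split NVM bitline
    return "nvm_slow"        # cold data tolerates the slowest tier
```

Getting the initial tier right is what avoids the migration traffic that the third contribution then further reduces by keeping intra-memory moves off the channel.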
Asynchronous Runtime with Distributed Manager for Task-based Programming Models
Parallel task-based programming models, like OpenMP, allow application
developers to easily create a parallel version of their sequential codes. The
standard OpenMP 4.0 introduced the possibility of describing a set of data
dependences per task that the runtime uses to order task execution. This
order is calculated using shared graphs, which are updated by all threads in
exclusive access using synchronization mechanisms (locks) to ensure the
dependence management correctness. The contention in the access to these
structures becomes critical in many-core systems because several threads may be
wasting computation resources while waiting for their turn.
This paper proposes an asynchronous management of the runtime structures,
like task dependence graphs, suitable for task-based programming model
runtimes. In such organization, the threads request actions to the runtime
instead of doing them directly. The requests are then handled by a distributed
runtime manager (DDAST) which does not require dedicated resources. Instead,
the manager uses the idle threads to modify the runtime structures. The paper
also presents an implementation, analysis and performance evaluation of such
runtime organization. The performance results show that the proposed
asynchronous organization achieves higher speedups than the original runtime
for different benchmarks and different many-core architectures.
Comment: 2020 Parallel Computing.
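The core idea, workers enqueue graph updates instead of locking the shared graph, while idle threads drain the queue, can be sketched as follows (names and structures are illustrative, not the DDAST implementation):

```python
import queue

# Illustrative sketch: worker threads enqueue dependence-graph updates;
# an otherwise idle thread applies them, so no worker blocks on a
# graph-wide lock while another thread holds it.
requests = queue.Queue()
dep_graph = {}  # task -> set of tasks it still waits on

def request_add_task(task, deps):
    # Called by worker threads: cheap enqueue, no lock on dep_graph.
    requests.put(("add", task, set(deps)))

def request_complete_task(task):
    requests.put(("done", task, None))

def drain_requests():
    # Called by whichever thread is currently idle, acting as manager.
    while not requests.empty():
        action, task, deps = requests.get()
        if action == "add":
            dep_graph[task] = deps
        elif action == "done":
            dep_graph.pop(task, None)
            for waiting in dep_graph.values():
                waiting.discard(task)  # successors may become ready

def ready_tasks():
    # Tasks with no unresolved predecessors are eligible to run.
    return [t for t, deps in dep_graph.items() if not deps]
```

The key property is that graph mutation happens only in `drain_requests`, so contention moves from the graph itself to a single concurrent queue.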
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive
kernels in genomic data analysis, accounting for more than 90% of the runtime
for key bioinformatics applications. This method is particularly expensive for
third-generation sequences due to the high computational cost of analyzing
sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact
pairwise algorithms for long alignments, the community primarily relies on
approximate algorithms that search only for high-quality alignments and stop
early when one is not found. In this work, we present the first GPU
optimization of the popular X-drop alignment algorithm, which we name LOGAN.
Results show that our high-performance multi-GPU implementation achieves up to
181.6 GCUPS and speed-ups up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100,
respectively, over the state-of-the-art software running on two IBM Power9
processors using 168 CPU threads, with equivalent accuracy. We also demonstrate
a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for
sequence alignment implemented in minimap2, a long-read mapping software. To
highlight the impact of our work on a real-world application, we couple LOGAN
with a many-to-many long-read alignment software called BELLA, and demonstrate
that our implementation improves the overall BELLA runtime by up to 10.6x.
Finally, we adapt the Roofline model for LOGAN and demonstrate that our
implementation is near-optimal on the NVIDIA Tesla V100s.
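The X-drop heuristic itself is simple to state: keep extending the alignment while the running score stays within X of the best score seen so far, and stop early otherwise. A minimal ungapped sketch of the idea (LOGAN implements the gapped, banded version on GPUs):

```python
def xdrop_extend(a, b, x=10, match=1, mismatch=-1):
    # Ungapped X-drop sketch: walk the two sequences in lockstep,
    # tracking the best score so far. Stop as soon as the running
    # score falls more than x below the best, which prunes
    # low-quality extensions without computing the full matrix.
    best = score = 0
    best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        if best - score > x:
            break  # drop exceeded x: no good alignment expected ahead
    return best, best_len
```

The early-exit condition is what makes X-drop attractive for long reads: runtime tracks the quality of the alignment rather than the quadratic size of the full dynamic-programming matrix.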