102 research outputs found
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some genomic data analysis problems require large-scale computational platforms
to meet both their memory and compute requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
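The hashing motif is easiest to see in k-mer counting, a building block of the profiling and assembly problems above. As a rough sketch (not the paper's implementation), the toy code below hashes each k-mer to a bucket standing in for an owning processor, the owner-computes pattern a distributed k-mer counter would use:

```python
from collections import Counter

def kmers(read: str, k: int):
    """Yield all length-k substrings (k-mers) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def partition_kmer_counts(reads, k, n_buckets):
    """Toy illustration of the hashing motif: each k-mer is assigned to a
    bucket (standing in for a processor) by hashing, so all occurrences of
    the same k-mer land on the same owner and can be merged locally."""
    buckets = [Counter() for _ in range(n_buckets)]
    for read in reads:
        for km in kmers(read, k):
            buckets[hash(km) % n_buckets][km] += 1
    return buckets

counts = partition_kmer_counts(["ACGTACGT", "CGTACG"], k=3, n_buckets=4)
total = sum((b for b in counts), Counter())  # merge buckets for inspection
```

In a real parallel setting the bucket index becomes the destination rank of an asynchronous update, which is exactly the irregular communication pattern the abstract describes.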
BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs
Recent studies have shown that Binary Graph Neural Networks (GNNs) are
promising for saving computations of GNNs through binarized tensors. Prior
work, however, has mainly focused on algorithm design and training techniques,
leaving open how to fully realize this performance potential on accelerator
hardware. This work redesigns the binary GNN inference backend from the
efficiency perspective. It fills the gap by proposing a series of abstractions
and techniques that map binary GNNs and their computations to best fit the nature
of bit manipulations on GPUs. Results on real-world graphs with GCNs,
GraphSAGE, and GraphSAINT show that the proposed techniques outperform
state-of-the-art binary GNN implementations by 8-22X with the same accuracy
maintained. BitGNN code is publicly available.
Comment: To appear in the International Conference on Supercomputing (ICS'23)
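The bit-manipulation angle can be illustrated with the standard XNOR/popcount trick that binarized networks rely on. The sketch below is a generic illustration (not BitGNN's GPU kernels): ±1 vectors are packed into machine words so a dot product collapses to one XOR and one population count.

```python
def pack_bits(signs):
    """Pack a ±1 vector into the bits of a Python int (bit set encodes +1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two ±1 vectors of length n from their bit-packed forms:
    matching bits contribute +1 and differing bits -1, so
    dot = n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
# Reference: 1*1 + (-1)*1 + 1*(-1) + 1*1 = 0
result = binary_dot(pack_bits(a), pack_bits(b), 4)
```

On a GPU the same idea runs over 32- or 64-bit lanes with hardware popcount, which is why fitting the computation to bit manipulations pays off.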
Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels
Energy optimization is an increasingly important aspect of today's high-performance computing applications. In particular, dynamic voltage and frequency scaling (DVFS) has become a widely adopted solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies manually to minimize energy consumption while maximizing performance. This article focuses on modeling the energy consumption and speedup of GPU applications under different frequency configurations. The task is not straightforward, because of the large set of possible configurations and because of the multi-objective nature of the problem, which minimizes energy consumption while maximizing performance. This article proposes a machine-learning-based method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. The method is based on two models that predict speedup and normalized energy over the default frequency configuration. These are later combined into a multi-objective approach that predicts a Pareto set of frequency configurations. Results show that our approach is very accurate at predicting extrema and the Pareto set, and finds frequency configurations that dominate the default configuration in either energy or performance.
Funding: DFG, 360291326, CELERITY: Innovative Modellierung für Skalierbare Verteilte Laufzeitsysteme
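The multi-objective step can be sketched independently of the learned models: given per-configuration predictions of normalized energy (lower is better) and speedup (higher is better), a Pareto filter keeps exactly the non-dominated configurations. The configuration names and numbers below are hypothetical placeholders, not measurements from the article:

```python
def pareto_set(configs):
    """Return the configurations not dominated by any other, where a config
    dominates another if it is no worse in both objectives (energy lower-or-
    equal, speedup higher-or-equal) and strictly better in at least one."""
    items = list(configs.items())
    front = []
    for name, (e, s) in items:
        dominated = any(
            e2 <= e and s2 >= s and (e2 < e or s2 > s)
            for n2, (e2, s2) in items if n2 != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (normalized_energy, speedup) predictions per frequency config.
preds = {
    "core1380_mem5001": (1.00, 1.00),  # default configuration
    "core1100_mem5001": (0.85, 0.95),
    "core900_mem5001":  (0.80, 0.70),
    "core1380_mem810":  (0.90, 0.60),  # dominated: worse in both objectives
}
front = pareto_set(preds)
```

Any configuration on the returned front is a defensible operating point; which one to pick depends on how the operator weighs energy against performance.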
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
General Matrix Multiplication (GEMM) is a crucial algorithm for various
applications such as machine learning and scientific computing, and an
efficient GEMM implementation is essential for the performance of these
systems. While researchers often strive for faster performance by using large
compute platforms, the increased scale of these systems can raise concerns
about hardware and software reliability. In this paper, we present a design for
a high-performance GEMM with algorithm-based fault tolerance for use on GPUs.
We describe fault-tolerant designs for GEMM at the thread, warp, and
threadblock levels, and also provide a baseline GEMM implementation that is
competitive with or faster than the state-of-the-art, proprietary cuBLAS GEMM.
We present a kernel fusion strategy to overlap and mitigate the memory latency
due to fault tolerance with the original GEMM computation. To support a wide
range of input matrix shapes and reduce development costs, we present a
template-based approach for automatic code generation for both fault-tolerant
and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA
Tesla T4 and A100 server GPUs. Experimental results demonstrate that our
baseline GEMM presents comparable or superior performance compared to the
closed-source cuBLAS. The fault-tolerant GEMM incurs only a minimal overhead
(8.89% on average) compared to cuBLAS, even with hundreds of errors injected
per minute. For irregularly shaped inputs, kernels produced by the code
generator show remarkable speedups for both fault-tolerant and
non-fault-tolerant GEMMs, outperforming cuBLAS.
Comment: 11 pages, 2023 International Conference on Supercomputing
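The checksum idea behind algorithm-based fault tolerance for GEMM is classical (it goes back to Huang and Abraham) and can be sketched in a few lines. This toy version only illustrates the encoding and the check; it has none of the paper's kernel fusion or thread/warp/threadblock-level designs:

```python
def matmul(A, B):
    """Plain triple-loop matrix multiply over nested lists."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def abft_gemm(A, B):
    """ABFT sketch: append a column-sum row to A and a row-sum column to B.
    The extended product then carries checksums of C = A @ B; comparing them
    against freshly computed sums of C detects errors."""
    Ac = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]
    Br = [row + [sum(row)] for row in B]
    Cf = matmul(Ac, Br)
    C = [row[:-1] for row in Cf[:-1]]
    row_ok = all(abs(Cf[-1][j] - sum(row[j] for row in C)) < 1e-9
                 for j in range(len(C[0])))
    col_ok = all(abs(Cf[i][-1] - sum(C[i])) < 1e-9 for i in range(len(C)))
    return C, row_ok and col_ok

C, ok = abft_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

A single corrupted element of C breaks exactly one row checksum and one column checksum, so the faulty element can be located at their intersection and corrected.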
Adapting Datacenter Capacity for Greener Datacenters and Grid
Cloud providers are adapting datacenter (DC) capacity to reduce carbon
emissions. With hyperscale datacenters exceeding 100 MW individually, and in
some grids exceeding 15% of power load, DC adaptation is large enough to harm
power grid dynamics, increasing carbon emissions and power prices or reducing
grid reliability.
To avoid such harm, we explore coordinating DC capacity changes with varying
scope in space and time. In space, the coordination scope spans a single
datacenter, a group
of datacenters, and datacenters with the grid. In time, scope ranges from
online to day-ahead. We also consider what DC and grid information is used
(e.g. real-time and day-ahead average carbon, power price, and compute
backlog). For example, in our proposed PlanShare scheme, each datacenter uses
day-ahead information to create a capacity plan and shares it, allowing global
grid optimization (over all loads, over entire day).
We evaluate DC carbon emissions reduction. Results show that local
coordination scope fails to reduce carbon emissions significantly (3.2%--5.4%
reduction). Expanding coordination scope to a set of datacenters improves
slightly (4.9%--7.3%). PlanShare, with grid-wide coordination and full-day
capacity planning, performs the best. PlanShare reduces DC emissions by
11.6%--12.6%, 1.56x--1.26x better than the best local, online approach's
results. PlanShare also achieves lower cost. We expect these advantages to
increase as renewable generation in power grids increases. Further, a known
full-day DC capacity plan provides a stable target for DC resource management.
Comment: Published at e-Energy '23: Proceedings of the 14th ACM International
Conference on Future Energy Systems
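In spirit, a day-ahead capacity plan is a small optimization: place deferrable compute into the greenest forecast hours subject to a per-hour capacity cap, then publish the resulting plan. The greedy sketch below, with invented inputs, is only meant to make that structure concrete; it is not the PlanShare algorithm:

```python
def day_ahead_plan(carbon_forecast, total_work, cap):
    """Toy day-ahead planner: given an hourly carbon-intensity forecast,
    greedily schedule a fixed amount of deferrable compute into the
    lowest-carbon hours, never exceeding `cap` per hour. The returned list
    is the hourly capacity plan a datacenter could share with the grid."""
    plan = [0.0] * len(carbon_forecast)
    greenest_first = sorted(range(len(carbon_forecast)),
                            key=lambda h: carbon_forecast[h])
    remaining = total_work
    for h in greenest_first:
        take = min(cap, remaining)
        plan[h] = take
        remaining -= take
        if remaining <= 0:
            break
    return plan

# Hypothetical 4-hour forecast (gCO2/kWh) and workload (MWh of compute).
plan = day_ahead_plan([500, 300, 100, 400], total_work=5, cap=3)
```

Because the plan is fixed a day ahead, the grid operator can treat it as a known load shape and optimize generation over all loads for the entire day, which is the coordination benefit the abstract attributes to PlanShare.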
Near Memory Acceleration on High Resolution Radio Astronomy Imaging
Modern radio telescopes like the Square Kilometre Array (SKA) will need to
process exabytes of radio-astronomical signals in real time to construct a
high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the
performance bottlenecks due to frequent memory accesses in a state-of-the-art
radio-astronomy imaging algorithm. In this paper, we show that a sub-module
performing a two-dimensional fast Fourier transform (2D FFT) is memory bound
using CPI breakdown analysis on IBM Power9. Then, we present an NMC approach on
FPGA for 2D FFT that outperforms a CPU by up to a factor of 120x and performs
comparably to a high-end GPU, while using less bandwidth and memory.
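The memory-bound character of the 2D FFT stage comes from its separability: a 2D transform is a set of 1-D transforms over rows (contiguous, cache-friendly) followed by 1-D transforms over columns (strided, bandwidth-hungry). The pure-Python sketch below shows only that structure, using a naive O(n²) DFT in place of a real FFT:

```python
import cmath

def dft1(x):
    """Naive 1-D DFT (O(n^2)); a stand-in for an optimized FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def dft2(grid):
    """2-D transform as 1-D passes over rows, then over columns. On a large
    grid the column pass strides across rows, and that strided access
    pattern is what makes the 2D FFT stage memory bound."""
    rows = [dft1(row) for row in grid]
    cols = [dft1(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

F = dft2([[1, 2], [3, 4]])
```

Near-memory hardware attacks exactly the second pass, where arithmetic per byte moved is low and the CPU spends its cycles waiting on memory.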
Simurgh: A Fully Decentralized and Secure NVMM User Space File System
The availability of non-volatile main memory (NVMM) has started a new era for storage systems, and NVMM-specific file systems can support the extremely high data and metadata rates required by many HPC and data-intensive applications. Scaling metadata performance within NVMM file systems is nevertheless often restricted by the Linux kernel storage stack, while simply moving metadata management to user space can compromise security or flexibility. This paper introduces Simurgh, a hardware-assisted user space file system with decentralized metadata management that allows secure metadata updates from within user space. Simurgh guarantees consistency, durability, and ordering of updates without sacrificing scalability. Security is enforced by only allowing NVMM access from protected user space functions, which can be implemented through two proposed instructions. Comparisons with other NVMM file systems show that Simurgh improves metadata performance by up to 18x and application performance by up to 89% compared to the second-fastest file system.
This work has been supported by the European Commission's BigStorage project H2020-MSCA-ITN2014-642963. It is also supported by the Big Data in Atmospheric Physics (BINARY) project, funded by the Carl Zeiss Foundation under Grant No. P2018-02-003.
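The consistency-and-ordering guarantee can be illustrated, very loosely and without any of Simurgh's hardware protection, by an append-only metadata log: each update is staged with a sequence number and only becomes visible once committed, so a crash mid-update never exposes a half-applied operation. Everything below is a generic sketch, not Simurgh's actual on-NVMM layout:

```python
class MetadataLog:
    """Toy append-only metadata log: an operation is staged, then committed.
    Recovery replays only committed entries in sequence order, so a crash
    between stage and commit simply drops the incomplete operation."""

    def __init__(self):
        self.entries = []  # each entry: [seq, op, committed]
        self.seq = 0

    def stage(self, op):
        self.seq += 1
        self.entries.append([self.seq, op, False])
        return self.seq

    def commit(self, seq):
        for entry in self.entries:
            if entry[0] == seq:
                entry[2] = True

    def recover(self):
        """Return the ops a restart would apply: committed ones, in order."""
        return [op for s, op, done in sorted(self.entries) if done]

log = MetadataLog()
s1 = log.stage("create /a")
log.commit(s1)
s2 = log.stage("rename /a /b")  # a crash here leaves this entry invisible
```

In Simurgh's setting the interesting part is doing such updates securely from user space, which the paper addresses with protected functions backed by two proposed CPU instructions rather than with software checks alone.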