TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems
Processing-in-memory (PIM) promises to alleviate the data movement bottleneck
in modern computing systems. However, current real-world PIM systems have the
inherent disadvantage that their hardware is more constrained than in
conventional processors (CPU, GPU), due to the difficulty and cost of building
processing elements near or inside the memory. As a result, general-purpose PIM
architectures support fairly limited instruction sets and struggle to execute
complex operations such as transcendental functions and other hard-to-calculate
operations (e.g., square root). These operations are particularly important for
some modern workloads, e.g., activation functions in machine learning
applications.
In order to provide support for transcendental (and other hard-to-calculate)
functions in general-purpose PIM systems, we present \emph{TransPimLib}, a
library that provides CORDIC-based and LUT-based methods for trigonometric
functions, hyperbolic functions, exponentiation, logarithm, square root, etc.
We develop an implementation of TransPimLib for the UPMEM PIM architecture and
perform a thorough evaluation of TransPimLib's methods in terms of performance
and accuracy, using microbenchmarks and three full workloads (Blackscholes,
Sigmoid, Softmax). We open-source all our code and datasets
at~\url{https://github.com/CMU-SAFARI/transpimlib}.
Comment: Our open-source software is available at
https://github.com/CMU-SAFARI/transpimlib
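The abstract does not detail how the CORDIC-based methods work; as a rough, hedged illustration of the general CORDIC technique the library builds on (not TransPimLib's actual code or API), the C sketch below computes sine and cosine using only shift-and-add-style iterations and a small table of arctangent constants. The iteration count, the use of double precision, and all names are illustrative assumptions.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative CORDIC rotation mode for angles in roughly [-pi/2, pi/2].
 * A fixed-point PIM implementation would replace the multiplications by
 * pow2 with shifts and read atan(2^-i) from a precomputed constant table;
 * everything here is a sketch, not TransPimLib's implementation. */
#define CORDIC_ITERS 24

void cordic_sincos(double theta, double *s, double *c)
{
    /* Gain-compensation factor K = prod_i 1/sqrt(1 + 2^-2i). */
    double K = 1.0;
    for (int i = 0; i < CORDIC_ITERS; i++)
        K *= 1.0 / sqrt(1.0 + pow(2.0, -2.0 * i));

    double x = K, y = 0.0, z = theta, pow2 = 1.0;
    for (int i = 0; i < CORDIC_ITERS; i++) {
        double d = (z >= 0.0) ? 1.0 : -1.0;   /* rotate toward z = 0 */
        double x_new = x - d * y * pow2;      /* pow2 = 2^-i: a shift in fixed point */
        double y_new = y + d * x * pow2;
        z -= d * atan(pow2);                  /* constant table entry in fixed point */
        x = x_new;
        y = y_new;
        pow2 *= 0.5;
    }
    *c = x;   /* ~= cos(theta) */
    *s = y;   /* ~= sin(theta) */
}

int main(void)
{
    double s, c;
    cordic_sincos(0.5, &s, &c);
    printf("sin(0.5) ~= %.6f (libm: %.6f)\n", s, sin(0.5));
    printf("cos(0.5) ~= %.6f (libm: %.6f)\n", c, cos(0.5));
    return 0;
}
```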
Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction
Long-latency load requests continue to limit the performance of
high-performance processors. To increase the latency tolerance of a processor,
architects have primarily relied on two key techniques: sophisticated data
prefetchers and large on-chip caches. In this work, we show that: 1) even a
sophisticated state-of-the-art prefetcher can only predict half of the off-chip
load requests on average across a wide range of workloads, and 2) due to the
increasing size and complexity of on-chip caches, a large fraction of the
latency of an off-chip load request is spent accessing the on-chip cache
hierarchy. The goal of this work is to accelerate off-chip load requests by
removing the on-chip cache access latency from their critical path. To this
end, we propose a new technique called Hermes, whose key idea is to: 1)
accurately predict which load requests might go off-chip, and 2) speculatively
fetch the data required by the predicted off-chip loads directly from the main
memory, while also concurrently accessing the cache hierarchy for such loads.
To enable Hermes, we develop a new lightweight, perceptron-based off-chip load
prediction technique that learns to identify off-chip load requests using
multiple program features (e.g., sequence of program counters). For every load
request, the predictor observes a set of program features to predict whether or
not the load would go off-chip. If the load is predicted to go off-chip, Hermes
issues a speculative request directly to the memory controller once the load's
physical address is generated. If the prediction is correct, the load
eventually misses the cache hierarchy and waits for the ongoing speculative
request to finish, thus hiding the on-chip cache hierarchy access latency from
the critical path of the off-chip load. Our evaluation shows that Hermes
significantly improves the performance of a state-of-the-art baseline. We
open-source Hermes.
Comment: To appear in the 55th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 2022
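As a rough sketch of how a perceptron-based off-chip predictor of this kind can be organized (the feature set, table sizes, thresholds, and training rule below are illustrative assumptions, not Hermes' actual design), each program feature indexes a small table of signed weights and the prediction is the sign of their sum:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a perceptron-style off-chip load predictor: each program
 * feature (e.g., a hashed PC sequence) indexes a table of signed
 * weights, and the load is predicted to go off-chip if the weight sum
 * crosses a threshold. Feature count, table sizes, and thresholds are
 * illustrative assumptions. */
#define NUM_FEATURES   4
#define TABLE_SIZE     1024
#define ACT_THRESHOLD  0     /* predict off-chip if sum >= this */
#define TRAIN_MARGIN   14    /* skip training confident, correct predictions */

static int8_t weights[NUM_FEATURES][TABLE_SIZE];

/* Returns 1 to issue a speculative request to the memory controller. */
int predict_offchip(const uint32_t feature_hash[NUM_FEATURES], int *sum_out)
{
    int sum = 0;
    for (int f = 0; f < NUM_FEATURES; f++)
        sum += weights[f][feature_hash[f] % TABLE_SIZE];
    *sum_out = sum;
    return sum >= ACT_THRESHOLD;
}

/* Called once the load resolves and its true destination is known. */
void train_offchip_predictor(const uint32_t feature_hash[NUM_FEATURES],
                             bool went_offchip, int sum)
{
    bool predicted = (sum >= ACT_THRESHOLD);
    /* Classic perceptron rule: update only on a misprediction or when
     * the confidence |sum| is still below the training margin. */
    if (predicted == went_offchip && (sum >= TRAIN_MARGIN || sum <= -TRAIN_MARGIN))
        return;
    for (int f = 0; f < NUM_FEATURES; f++) {
        int8_t *w = &weights[f][feature_hash[f] % TABLE_SIZE];
        if (went_offchip && *w < INT8_MAX)        (*w)++;
        else if (!went_offchip && *w > INT8_MIN)  (*w)--;
    }
}
```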
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
to improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
Comment: Our open source software is available at
https://github.com/CMU-SAFARI/DAMOV
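The abstract does not name the specific metrics behind the classification; purely as a hypothetical illustration of how a function-level profile might be bucketed into coarse data-movement categories, the sketch below uses two commonly reported indicators (last-level-cache MPKI and arithmetic intensity) with made-up thresholds. It is not DAMOV's actual methodology.

```c
#include <stdio.h>

/* Hypothetical bucketing of a profiled function into coarse
 * data-movement categories. The metrics (LLC MPKI, arithmetic
 * intensity) and thresholds are invented for illustration and are not
 * DAMOV's actual classification methodology. */
typedef struct {
    const char *name;
    double llc_mpki;       /* last-level-cache misses per kilo-instruction */
    double ops_per_byte;   /* arithmetic intensity */
} func_profile_t;

static const char *classify(const func_profile_t *f)
{
    if (f->llc_mpki > 10.0 && f->ops_per_byte < 1.0)
        return "data-movement-bound: likely to benefit from NDP";
    if (f->llc_mpki > 10.0)
        return "memory-sensitive: may benefit from prefetching/larger caches";
    return "compute-bound: the existing cache hierarchy likely suffices";
}

int main(void)
{
    func_profile_t example = { "spmv_kernel", 25.3, 0.25 };  /* made-up profile */
    printf("%s -> %s\n", example.name, classify(&example));
    return 0;
}
```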
GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping
Nanopore sequencing is a widely-used high-throughput genome sequencing
technology that can sequence long fragments of a genome into raw electrical
signals at low cost. Nanopore sequencing requires two computationally-costly
processing steps for accurate downstream genome analysis. The first step,
basecalling, translates the raw electrical signals into nucleotide bases (i.e.,
A, C, G, T). The second step, read mapping, finds the correct location of a
read in a reference genome. In existing genome analysis pipelines, basecalling
and read mapping are executed separately. We observe in this work that such
separate execution of the two most time-consuming steps inherently leads to (1)
significant data movement and (2) redundant computations on the data, slowing
down the genome analysis pipeline. This paper proposes GenPIP, an in-memory
genome analysis accelerator that tightly integrates basecalling and read
mapping. GenPIP improves the performance of the genome analysis pipeline with
two key mechanisms: (1) in-memory fine-grained collaborative execution of the
major genome analysis steps in parallel; (2) a new technique for
early-rejection of low-quality and unmapped reads to timely stop the execution
of genome analysis for such reads, reducing inefficient computation. Our
experiments show that, for the execution of the genome analysis pipeline,
GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with
negligible accuracy loss compared to the state-of-the-art software genome
analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design
that combines state-of-the-art in-memory basecalling and read mapping
accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
Comment: 17 pages, 13 figures
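As a hedged sketch of the early-rejection idea only (chunk granularity, quality cutoff, and rejection fraction are invented for illustration and are not GenPIP's parameters), the C fragment below stops processing a read once too many of its basecalled chunks fall below a quality threshold:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of an early-rejection check: as chunks of a read are
 * basecalled, stop wasting further basecalling/mapping work once too
 * many chunks are low quality. Cutoffs and chunk granularity are
 * invented for illustration, not GenPIP's parameters. */
#define QUAL_CUTOFF        7.0   /* hypothetical mean per-chunk quality score */
#define MAX_LOWQ_FRACTION  0.5   /* reject once half the chunks are low quality */
#define MIN_CHUNKS_SEEN    4     /* observe a few chunks before deciding */

typedef struct {
    size_t chunks_seen;
    size_t low_quality_chunks;
} read_state_t;

/* Returns true if the read should be rejected now. */
bool early_reject(read_state_t *st, double chunk_mean_quality)
{
    st->chunks_seen++;
    if (chunk_mean_quality < QUAL_CUTOFF)
        st->low_quality_chunks++;
    return st->chunks_seen >= MIN_CHUNKS_SEEN &&
           (double)st->low_quality_chunks / (double)st->chunks_seen
               > MAX_LOWQ_FRACTION;
}
```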
Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources
Address translation is a performance bottleneck in data-intensive workloads
due to large datasets and irregular access patterns that lead to frequent
high-latency page table walks (PTWs). PTWs can be reduced by using (i) large
hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both
solutions have significant drawbacks: increased access latency, power and area
(for hardware TLBs), and costly memory accesses, the need for large contiguous
memory blocks, and complex OS modifications (for software-managed TLBs). We
present Victima, a new software-transparent mechanism that drastically
increases the translation reach of the processor by leveraging the
underutilized resources of the cache hierarchy. The key idea of Victima is to
repurpose L2 cache blocks to store clusters of TLB entries, thereby providing
an additional low-latency and high-capacity component that backs up the
last-level TLB and thus reduces PTWs. Victima has two main components. First, a
PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on
the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache
replacement policy prioritizes keeping TLB entries in the cache hierarchy by
considering (i) the translation pressure (e.g., last-level TLB miss rate) and
(ii) the reuse characteristics of the TLB entries. Our evaluation results show
that in native (virtualized) execution environments Victima improves average
end-to-end application performance by 7.4% (28.7%) over the baseline four-level
radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art
software-managed TLB, across 11 diverse data-intensive workloads. Victima (i)
is effective in both native and virtualized environments, (ii) is completely
transparent to application and system software, and (iii) incurs very small
area and power overheads on a modern high-end CPU.
Comment: To appear in the 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 2023
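A minimal sketch of what a TLB-aware victim-selection policy could look like, assuming an 8-way L2 set, a per-block flag marking repurposed TLB-entry blocks, and a made-up translation-pressure threshold (none of which are Victima's exact parameters):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a TLB-aware cache victim-selection policy: under high
 * translation pressure, prefer evicting ordinary data blocks over L2
 * blocks repurposed to hold clusters of TLB entries. Set geometry and
 * the pressure threshold are illustrative assumptions. */
#define WAYS 8
#define TLB_PRESSURE_CUTOFF 0.10   /* hypothetical L2-TLB miss-rate cutoff */

typedef struct {
    bool    holds_tlb_entries;  /* block repurposed for a TLB-entry cluster */
    uint8_t lru_age;            /* higher = older */
} cache_block_t;

int pick_victim(const cache_block_t set[WAYS], double l2_tlb_miss_rate)
{
    int victim = -1;

    /* Under high translation pressure, first try to victimize only
     * ordinary data blocks. */
    if (l2_tlb_miss_rate > TLB_PRESSURE_CUTOFF) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].holds_tlb_entries &&
                (victim < 0 || set[w].lru_age > set[victim].lru_age))
                victim = w;
        if (victim >= 0)
            return victim;
    }

    /* Otherwise (or if every block holds TLB entries), fall back to LRU. */
    for (int w = 0; w < WAYS; w++)
        if (victim < 0 || set[w].lru_age > set[victim].lru_age)
            victim = w;
    return victim;
}
```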
TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering
Basecalling is an essential step in nanopore sequencing analysis where the
raw signals of nanopore sequencers are converted into nucleotide sequences,
i.e., reads. State-of-the-art basecallers employ complex deep learning models
to achieve high basecalling accuracy. This makes basecalling
computationally inefficient and memory-hungry, bottlenecking the entire genome
analysis pipeline. However, for many applications, the majority of reads do not
match the reference genome of interest (i.e., target reference) and thus are
discarded in later steps in the genomics pipeline, wasting the basecalling
computation. To overcome this issue, we propose TargetCall, the first fast and
widely-applicable pre-basecalling filter to eliminate the wasted computation in
basecalling. TargetCall's key idea is to discard reads that will not match the
target reference (i.e., off-target reads) prior to basecalling. TargetCall
consists of two main components: (1) LightCall, a lightweight neural network
basecaller that produces noisy reads; and (2) Similarity Check, which labels
each of these noisy reads as on-target or off-target by matching them to the
target reference. TargetCall filters out all off-target reads before
basecalling; and the highly-accurate but slow basecalling is performed only on
the raw signals whose noisy reads are labeled as on-target. Our thorough
experimental evaluations using both real and simulated data show that
TargetCall 1) improves the end-to-end basecalling performance of the
state-of-the-art basecaller by 3.31x while maintaining high (98.88%)
sensitivity in keeping on-target reads, 2) maintains high accuracy in
downstream analysis, 3) precisely filters out up to 94.71% of off-target reads,
and 4) achieves better performance, sensitivity, and generality compared to
prior works. We freely open-source TargetCall at
https://github.com/CMU-SAFARI/TargetCall.
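As a rough sketch of the filtering flow described above, with all function names (light_basecall, similar_to_target, accurate_basecall, downstream_analysis) being hypothetical placeholders rather than TargetCall's API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the pre-basecalling filtering flow. The types and the four
 * extern functions are hypothetical placeholders assumed to exist
 * elsewhere; they are not TargetCall's actual API. */
typedef struct { const float *samples; size_t n; } raw_signal_t;
typedef struct { char *bases; size_t len; } read_t;

extern read_t light_basecall(const raw_signal_t *sig);       /* fast, noisy   */
extern bool   similar_to_target(const read_t *noisy_read);   /* vs. reference */
extern read_t accurate_basecall(const raw_signal_t *sig);    /* slow, exact   */
extern void   downstream_analysis(const read_t *read);

void process_signals(const raw_signal_t *sigs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        read_t noisy = light_basecall(&sigs[i]);
        if (!similar_to_target(&noisy))
            continue;   /* off-target read: skip the costly basecaller entirely */
        read_t final_read = accurate_basecall(&sigs[i]);
        downstream_analysis(&final_read);
    }
}
```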
SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems
Reinforcement Learning (RL) trains agents to learn optimal behavior by
maximizing reward signals from experience datasets. However, RL training often
faces memory limitations that increase execution latency and prolong training
times. To overcome this, SwiftRL explores Processing-In-Memory (PIM)
architectures to accelerate RL workloads. We achieve near-linear performance
scaling by implementing RL algorithms such as Tabular Q-learning and SARSA on
UPMEM PIM systems and optimizing them for the underlying hardware. Our
experiments on OpenAI Gym environments using UPMEM hardware demonstrate
superior performance compared to CPU and GPU implementations.
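For reference, the core of Tabular Q-learning that such a PIM port would execute per experience sample is the standard update Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)); the C sketch below shows it with made-up state/action counts and hyperparameters, not SwiftRL's actual PIM kernel:

```c
/* Standard tabular Q-learning update, shown with invented state/action
 * counts and hyperparameters; this is the textbook rule, not SwiftRL's
 * PIM kernel code. */
#define NUM_STATES  500
#define NUM_ACTIONS 6
#define ALPHA 0.1f    /* learning rate */
#define GAMMA 0.99f   /* discount factor */

static float Q[NUM_STATES * NUM_ACTIONS];

/* Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) */
void q_update(int s, int a, float reward, int s_next, int done)
{
    float best_next = 0.0f;
    if (!done) {
        best_next = Q[s_next * NUM_ACTIONS];
        for (int an = 1; an < NUM_ACTIONS; an++)
            if (Q[s_next * NUM_ACTIONS + an] > best_next)
                best_next = Q[s_next * NUM_ACTIONS + an];
    }
    float target = reward + GAMMA * best_next;
    Q[s * NUM_ACTIONS + a] += ALPHA * (target - Q[s * NUM_ACTIONS + a]);
}
```

SARSA differs only in replacing the max over next actions with the Q-value of the action actually taken next.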
RowPress: Amplifying Read Disturbance in Modern DRAM Chips
Memory isolation is critical for system reliability, security, and safety.
Unfortunately, read disturbance can break memory isolation in modern DRAM
chips. For example, RowHammer is a well-studied read-disturb phenomenon where
repeatedly opening and closing (i.e., hammering) a DRAM row many times causes
bitflips in physically nearby rows.
This paper experimentally demonstrates and analyzes another widespread
read-disturb phenomenon, RowPress, in real DDR4 DRAM chips. RowPress breaks
memory isolation by keeping a DRAM row open for a long period of time, which
disturbs physically nearby rows enough to cause bitflips. We show that RowPress
amplifies DRAM's vulnerability to read-disturb attacks by significantly
reducing the number of row activations needed to induce a bitflip by one to two
orders of magnitude under realistic conditions. In extreme cases, RowPress
induces bitflips in a DRAM row when an adjacent row is activated only once. Our
detailed characterization of 164 real DDR4 DRAM chips shows that RowPress 1)
affects chips from all three major DRAM manufacturers, 2) gets worse as DRAM
technology scales down to smaller node sizes, and 3) affects a different set of
DRAM cells from RowHammer and behaves differently from RowHammer as temperature
and access pattern change.
We demonstrate in a real DDR4-based system with RowHammer protection that 1)
a user-level program induces bitflips by leveraging RowPress while conventional
RowHammer cannot do so, and 2) a memory controller that adaptively keeps the
DRAM row open for a longer period of time based on access pattern can
facilitate RowPress-based attacks. To prevent bitflips due to RowPress, we
describe and evaluate a new methodology that adapts existing RowHammer
mitigation techniques to also mitigate RowPress with low additional performance
overhead. We open source all our code and data to facilitate future research on
RowPress.
Comment: Extended version of the paper "RowPress: Amplifying Read Disturbance
in Modern DRAM Chips" at the 50th Annual International Symposium on Computer
Architecture (ISCA), 2023
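As one hedged illustration of how a counter-based RowHammer mitigation might be adapted to also cover RowPress (the linear weighting scheme and all constants below are assumptions, not the paper's evaluated mechanism), each activation can be weighted by how long it keeps the row open, so that long-open-time activations reach the mitigation threshold sooner:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of adapting a counter-based RowHammer mitigation to RowPress:
 * weight each activation by its row-open time, so a few long-open-time
 * activations can trigger neighbor refresh as readily as many short
 * ones. Constants and the linear weighting are invented for
 * illustration, not the paper's evaluated mitigation. */
#define BASE_THRESHOLD   4096u  /* hypothetical RowHammer activation threshold */
#define T_ON_NOMINAL_NS  36u    /* nominal row-open time (roughly tRAS) */

bool row_needs_neighbor_refresh(uint32_t *weighted_count,
                                uint32_t row_open_time_ns)
{
    uint32_t weight = row_open_time_ns / T_ON_NOMINAL_NS;
    if (weight == 0)
        weight = 1;
    *weighted_count += weight;
    if (*weighted_count >= BASE_THRESHOLD) {
        *weighted_count = 0;    /* neighbors refreshed; reset the counter */
        return true;
    }
    return false;
}
```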
Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings
Conventional virtual memory (VM) frameworks enable a virtual address to
flexibly map to any physical address. This flexibility necessitates large data
structures to store virtual-to-physical mappings, which leads to high address
translation latency and large translation-induced interference in the memory
hierarchy. On the other hand, restricting the address mapping so that a virtual
address can only map to a specific set of physical addresses can significantly
reduce address translation overheads by using compact and efficient translation
structures. However, restricting the address mapping flexibility across the
entire main memory severely limits data sharing across different processes and
increases data accesses to the swap space of the storage device, even in the
presence of free memory. We propose Utopia, a new hybrid virtual-to-physical
address mapping scheme that allows both flexible and restrictive hash-based
address mapping schemes to harmoniously co-exist in the system. The key idea of
Utopia is to manage physical memory using two types of physical memory
segments: restrictive and flexible segments. A restrictive segment uses a
restrictive, hash-based address mapping scheme that maps virtual addresses to
only a specific set of physical addresses and enables faster address
translation using compact translation structures. A flexible segment employs
the conventional fully-flexible address mapping scheme. By mapping data to a
restrictive segment, Utopia enables faster address translation with lower
translation-induced interference. Utopia improves performance by 24% in a
single-core system over the baseline system, whereas the best prior
state-of-the-art contiguity-aware translation scheme improves performance by
13%.
Comment: To appear in the 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 2023
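A minimal sketch of a restrictive, hash-based translation lookup in the spirit described above, assuming a set-associative table of candidate frames with an invented hash and geometry (not Utopia's actual design); a miss would fall back to the conventional flexible mapping and page table walk:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a restrictive, hash-based translation lookup: a virtual
 * page may reside only in a small set of candidate frames selected by
 * a hash of its virtual page number. Geometry and hash are invented
 * for illustration, not Utopia's actual design. */
#define NUM_SETS  (1u << 16)
#define SET_WAYS  4u

typedef struct {
    bool     valid;
    uint64_t vpn;   /* virtual page number occupying this slot */
    uint64_t pfn;   /* physical frame number backing it */
} frame_slot_t;

static frame_slot_t restrictive_segment[NUM_SETS][SET_WAYS];

static uint64_t hash_vpn(uint64_t vpn)
{
    return (vpn ^ (vpn >> 16)) & (NUM_SETS - 1);
}

/* Returns true and fills *pfn on a hit; on a miss, translation falls
 * back to the conventional flexible mapping and page table walk. */
bool translate_restrictive(uint64_t vpn, uint64_t *pfn)
{
    frame_slot_t *set = restrictive_segment[hash_vpn(vpn)];
    for (unsigned w = 0; w < SET_WAYS; w++) {
        if (set[w].valid && set[w].vpn == vpn) {
            *pfn = set[w].pfn;
            return true;
        }
    }
    return false;
}
```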