EnforceSNN: Enabling Resilient and Energy-Efficient Spiking Neural Network Inference considering Approximate DRAMs for Embedded Systems
Spiking Neural Networks (SNNs) have shown capabilities of achieving high
accuracy under unsupervised settings and low operational power/energy due to
their bio-plausible computations. Previous studies identified that DRAM-based
off-chip memory accesses dominate the energy consumption of SNN processing.
However, state-of-the-art works do not optimize the DRAM energy-per-access,
thereby hindering the SNN-based systems from achieving further energy
efficiency gains. To substantially reduce the DRAM energy-per-access, an
effective solution is to decrease the DRAM supply voltage, but it may lead to
errors in DRAM cells (i.e., so-called approximate DRAM). Towards this, we
propose EnforceSNN, a novel design framework that provides a solution
for resilient and energy-efficient SNN inference using reduced-voltage DRAM for
embedded systems. The key mechanisms of our EnforceSNN are: (1) employing
quantized weights to reduce the DRAM access energy; (2) devising an efficient
DRAM mapping policy to minimize the DRAM energy-per-access; (3) analyzing the
SNN error tolerance to understand its accuracy profile considering different
bit error rate (BER) values; (4) leveraging the information for developing an
efficient fault-aware training (FAT) that considers different BER values and
bit error locations in DRAM to improve the SNN error tolerance; and (5)
developing an algorithm to select the SNN model that offers good trade-offs
among accuracy, memory, and energy consumption. The experimental results show
that our EnforceSNN maintains the accuracy (i.e., no accuracy loss for BER
≤ 10^-3) as compared to the baseline SNN with accurate DRAM, while
achieving up to 84.9% of DRAM energy saving and up to 4.1x speed-up of DRAM
data throughput across different network sizes.
Comment: Accepted for publication at Frontiers in Neuroscience - Section
Neuromorphic Engineering
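As a rough illustration of how bit errors at a given BER can be modeled for quantized weights, the sketch below flips each stored bit independently with probability equal to the BER. This is not the EnforceSNN implementation: the function name inject_bit_errors, the 8-bit weight width, and the uniform, independent error model are all assumptions made for the example, and the abstract itself notes that the real framework also considers bit error locations in DRAM.

```python
import random


def inject_bit_errors(weights, ber, bits=8, seed=0):
    """Return a copy of `weights` with each bit flipped with probability `ber`.

    Hypothetical helper: assumes unsigned `bits`-wide integer weights and
    independent, uniformly distributed bit errors.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    corrupted = []
    for w in weights:
        for b in range(bits):
            if rng.random() < ber:
                w ^= 1 << b  # flip bit b of this weight
        corrupted.append(w)
    return corrupted


if __name__ == "__main__":
    weights = [0, 17, 128, 255]  # made-up 8-bit quantized weights
    print(inject_bit_errors(weights, ber=1e-3))
```

A fault-aware training loop could, under the same assumptions, apply such a function to the weights before each forward pass so the model sees the errors it will meet at inference time.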
An In-Memory Architecture for High-Performance Long-Read Pre-Alignment Filtering
With the recent move towards sequencing of accurate long reads, finding
solutions that support efficient analysis of these reads becomes more
necessary. The long execution time required for sequence alignment of long
reads negatively affects genomic studies relying on sequence alignment.
Although pre-alignment filtering as an extra step before alignment was recently
introduced to mitigate sequence alignment for short reads, these filters do not
work as efficiently for long reads. Moreover, even with efficient pre-alignment
filters, the overall end-to-end (i.e., filtering + original alignment)
execution time of alignment for long reads remains high, while the filtering
step is now a major portion of the end-to-end execution time.
Our paper makes three contributions. First, it identifies data movement of
sequences between memory units and computing units as the main source of
inefficiency for pre-alignment filters of long reads. This is because although
filters reject many of these long sequencing pairs before they get to the
alignment stage, they still require a huge cost regarding time and energy
consumption for the large data transferred between memory and processor.
Second, this paper introduces an adaptation of a short-read pre-alignment
filtering algorithm suitable for long reads. We call this LongGeneGuardian.
Finally, it presents FilterFuse as an architecture that supports
LongGeneGuardian inside the memory. FilterFuse exploits the
Computation-In-Memory computing paradigm, eliminating the cost of data movement
in LongGeneGuardian.
Our evaluations show that FilterFuse improves the execution time of filtering
by 120.47x for long reads compared to the state-of-the-art (SoTA) filter,
SneakySnake. FilterFuse also improves the end-to-end execution time of sequence
alignment by up to 49.14x and 5207.63x compared to SneakySnake with a SoTA
aligner and to the SoTA aligner alone, respectively.
DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips
To understand and improve DRAM performance, reliability, security and energy
efficiency, prior works study characteristics of commodity DRAM chips.
Unfortunately, state-of-the-art open source infrastructures capable of
conducting such studies are obsolete, poorly supported, or difficult to use, or
their inflexibility limits the types of studies they can conduct.
We propose DRAM Bender, a new FPGA-based infrastructure that enables
experimental studies on state-of-the-art DRAM chips. DRAM Bender offers three
key features at the same time. First, DRAM Bender enables directly interfacing
with a DRAM chip through its low-level interface. This allows users to issue
DRAM commands in arbitrary order and with finer-grained time intervals compared
to other open source infrastructures. Second, DRAM Bender exposes easy-to-use
C++ and Python programming interfaces, allowing users to quickly and easily
develop different types of DRAM experiments. Third, DRAM Bender is easily
extensible. The modular design of DRAM Bender allows extending it to (i)
support existing and emerging DRAM interfaces, and (ii) run on new commercial
or custom FPGA boards with little effort.
To demonstrate that DRAM Bender is a versatile infrastructure, we conduct
three case studies, two of which lead to new observations about the DRAM
RowHammer vulnerability. In particular, we show that data patterns supported by
DRAM Bender uncover a larger set of bit-flips on a victim row compared to the
data patterns commonly used by prior work. We demonstrate the extensibility of
DRAM Bender by implementing it on five different FPGAs with DDR4 and DDR3
support. DRAM Bender is freely and openly available at
https://github.com/CMU-SAFARI/DRAM-Bender.
Comment: To appear in TCAD 202
An Experimental Analysis of RowHammer in HBM2 DRAM Chips
RowHammer (RH) is a significant and worsening security, safety, and
reliability issue of modern DRAM chips that can be exploited to break memory
isolation. Therefore, it is important to understand real DRAM chips' RH
characteristics. Unfortunately, no prior work extensively studies the RH
vulnerability of modern 3D-stacked high-bandwidth memory (HBM) chips, which are
commonly used in modern GPUs.
In this work, we experimentally characterize the RH vulnerability of a real
HBM2 DRAM chip. We show that 1) different 3D-stacked channels of HBM2 memory
exhibit significantly different levels of RH vulnerability (up to 79%
difference in bit error rate), 2) the DRAM rows at the end of a DRAM bank (rows
with the highest addresses) exhibit significantly fewer RH bitflips than other
rows, and 3) a modern HBM2 DRAM chip implements undisclosed RH defenses that
are triggered by periodic refresh operations. We describe the implications of
our observations on future RH attacks and defenses and discuss future work for
understanding RH in 3D-stacked memories.
Comment: To appear at DSN Disrupt 202
LLM: Realizing Low-Latency Memory by Exploiting Embedded Silicon Photonics for Irregular Workloads
As emerging workloads exhibit irregular memory access patterns with poor data reuse and locality, they would benefit from a DRAM that achieves low latency without sacrificing bandwidth and energy efficiency. We propose LLM (Low Latency Memory), a codesign of the DRAM microarchitecture, the memory controller, and the LLC/DRAM interconnect that leverages embedded silicon photonics in 2.5D/3D integrated systems on chip. LLM relies on Wavelength Division Multiplexing (WDM)-based photonic interconnects to reduce contention throughout the memory subsystem. LLM also increases the bank-level parallelism, eliminates bus conflicts by using dedicated optical data paths, and reduces the access energy per bit with shorter global bitlines and smaller row buffers. We evaluate the design space of LLM for a variety of synthetic benchmarks and representative graph workloads on a full-system simulator (gem5). LLM exhibits low memory access latency for traffic with both regular and irregular access patterns. For irregular traffic, LLM achieves high bandwidth utilization (over 80% peak throughput compared to 20% of HBM2.0). For real workloads, LLM achieves 3× and 1.8× lower execution time compared to HBM2.0 and a state-of-the-art memory system with high memory level parallelism, respectively. This study also demonstrates that by reducing queuing on the data path, LLM can achieve on average 3.4× lower memory latency variation compared to HBM2.0.
A Case for Self-Managing DRAM Chips: Improving Performance, Efficiency, Reliability, and Security via Autonomous in-DRAM Maintenance Operations
The memory controller is in charge of managing DRAM maintenance operations
(e.g., refresh, RowHammer protection, memory scrubbing) in current DRAM chips.
Implementing new maintenance operations often necessitates modifications in the
DRAM interface, memory controller, and potentially other system components.
Such modifications are only possible with a new DRAM standard, which takes a
long time to develop, leading to slow progress in DRAM systems.
In this paper, our goal is to 1) ease, and thus accelerate, the process of
enabling new DRAM maintenance operations and 2) enable more efficient in-DRAM
maintenance operations. Our idea is to set the memory controller free from
managing DRAM maintenance. To this end, we propose Self-Managing DRAM (SMD), a
new low-cost DRAM architecture that enables implementing new in-DRAM
maintenance mechanisms (or modifying old ones) with no further changes in the
DRAM interface, memory controller, or other system components. We use SMD to
implement new in-DRAM maintenance mechanisms for three use cases: 1) periodic
refresh, 2) RowHammer protection, and 3) memory scrubbing. We show that SMD
enables easy adoption of efficient maintenance mechanisms that significantly
improve the system performance and energy efficiency while providing higher
reliability compared to conventional DDR4 DRAM. A combination of SMD-based
maintenance mechanisms that perform refresh, RowHammer protection, and memory
scrubbing achieves 7.6% speedup and consumes 5.2% less DRAM energy on average
across 20 memory-intensive four-core workloads. We make SMD source code openly
and freely available at [128].
PIRM: Processing In Racetrack Memories
The growth in data needs of modern applications has created significant
challenges for modern systems, leading to a "memory wall." Spintronic Domain Wall
Memory (DWM), related to Spin-Transfer Torque Memory (STT-MRAM), provides
near-SRAM read/write performance, energy savings and nonvolatility, potential
for extremely high storage density, and does not have significant endurance
limitations. However, DWM's benefits cannot address data access latency and
throughput limitations of memory bus bandwidth. We propose PIRM, a DWM-based
in-memory computing solution that leverages the properties of DWM nanowires and
allows them to serve as polymorphic gates. While normally DWM is accessed by
applying spin polarized currents orthogonal to the nanowire at access points to
read individual bits, transverse access along the DWM nanowire allows the
differentiation of the aggregate resistance of multiple bits in the nanowire,
akin to a multilevel cell. PIRM leverages this transverse reading to directly
provide bulk-bitwise logic of multiple adjacent operands in the nanowire,
simultaneously. Based on this in-memory logic, PIRM provides a technique to
conduct multi-operand addition and two operand multiplication using transverse
access. PIRM provides a 1.6x speedup compared to the leading DRAM PIM technique
for query applications that leverage bulk bitwise operations. Compared to the
leading PIM technique for DWM, PIRM improves performance by 6.9x and 2.3x and
energy by 5.5x and 3.4x for 8-bit addition and multiplication, respectively. For
arithmetic heavy benchmarks, PIRM reduces access latency by 2.1x, while
decreasing energy consumption by 25.2x for a reasonable 10% area overhead
versus non-PIM DWM.
Comment: This paper is accepted to the IEEE/ACM Symposium on
Microarchitecture, October 2022 under the title "CORUSCANT: Fast Efficient
Processing-in-Racetrack Memories"
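The idea of deriving bulk-bitwise logic from a transverse read can be sketched in software, under simplifying assumptions. The model below treats a transverse read as returning the number of 1s among the bits it spans (a stand-in for the aggregate resistance level) and derives AND, OR, and XOR from that count. The function names and the count-based model are illustrative assumptions, not the PIRM circuit design.

```python
def transverse_read(bits):
    """Model a transverse read as the count of 1s among the sensed bits.

    Assumption: the aggregate resistance maps one-to-one to this count.
    """
    return sum(bits)


def bulk_bitwise(operands, op):
    """Apply `op` ("and", "or", "xor") across equal-length bit vectors."""
    n = len(operands)
    result = []
    # Each column holds the same bit position of every operand, which is
    # what one transverse read would sense together in this model.
    for column in zip(*operands):
        ones = transverse_read(column)
        if op == "and":
            result.append(int(ones == n))  # all operands are 1
        elif op == "or":
            result.append(int(ones >= 1))  # at least one operand is 1
        elif op == "xor":
            result.append(ones % 2)        # odd number of 1s
        else:
            raise ValueError(f"unsupported operation: {op}")
    return result


if __name__ == "__main__":
    a, b, c = [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]
    for op in ("and", "or", "xor"):
        print(op, bulk_bitwise([a, b, c], op))
```

Because every operation is a function of the same count, a single sensing step per bit position can serve several logic functions in this model, which is the sense in which one structure behaves like a polymorphic gate.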
SpyHammer: Using RowHammer to Remotely Spy on Temperature
RowHammer is a DRAM vulnerability that can cause bit errors in a victim DRAM
row by just accessing its neighboring DRAM rows at a high-enough rate. Recent
studies demonstrate that new DRAM devices are becoming increasingly more
vulnerable to RowHammer, and many works demonstrate system-level attacks for
privilege escalation or information leakage. In this work, we leverage two key
observations about RowHammer characteristics to spy on DRAM temperature: 1)
RowHammer-induced bit error rate consistently increases (or decreases) as the
temperature increases, and 2) some DRAM cells that are vulnerable to RowHammer
cause bit errors only at a particular temperature. Based on these observations,
we propose a new RowHammer attack, called SpyHammer, that spies on the
temperature of critical systems such as industrial production lines, vehicles,
and medical systems. SpyHammer is the first practical attack that can spy on
DRAM temperature. SpyHammer can spy on absolute temperature with an error of
less than 2.5 °C at the 90th percentile of tested temperature points, for
12 real DRAM modules from 4 main manufacturers.
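The first observation (bit error rate changing consistently with temperature) suggests that a measured error rate can be mapped back to a temperature through a calibration curve. The sketch below does this by linear interpolation. It is not the SpyHammer attack procedure: the function name estimate_temperature and the calibration points are hypothetical values invented for the example, and it assumes the calibration BER values are strictly monotonic in temperature.

```python
def estimate_temperature(ber, calibration):
    """Estimate temperature (°C) from a measured bit error rate.

    `calibration` is a list of (temperature, ber) pairs. Assumes BER is
    strictly monotonic in temperature; values outside the calibrated
    range are clamped to the nearest calibrated temperature.
    """
    points = sorted(calibration, key=lambda p: p[1])  # order by BER
    if ber <= points[0][1]:
        return points[0][0]
    if ber >= points[-1][1]:
        return points[-1][0]
    for (t0, b0), (t1, b1) in zip(points, points[1:]):
        if b0 <= ber <= b1:
            # Linear interpolation between neighboring calibration points.
            return t0 + (t1 - t0) * (ber - b0) / (b1 - b0)


if __name__ == "__main__":
    # Hypothetical calibration data, not measurements from the paper.
    calibration = [(50.0, 0.010), (60.0, 0.014), (70.0, 0.018)]
    print(estimate_temperature(0.012, calibration))
```

Sorting by BER rather than by temperature lets the same code handle modules whose error rate falls as temperature rises, which the abstract mentions as a possibility.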
DNA Pre-alignment Filter using Processing Near Racetrack Memory
Recent DNA pre-alignment filter designs employ DRAM for storing the reference
genome and its associated meta-data. However, DRAM incurs increasingly high
background and refresh energy consumption as devices scale. To overcome
this problem, this paper explores a design with racetrack memory (RTM)--an
emerging non-volatile memory that promises higher storage density, faster
access latency, and lower energy consumption. Multi-bit storage cells in RTM
are inherently sequential and thus require data placement strategies to
mitigate the performance and energy impacts of shifting during data accesses.
We propose a near-memory pre-alignment filter with a novel data mapping and
several shift reduction strategies designed explicitly for RTM. On a set of
four input genomes from the 1000 Genome Project, our approach improves
performance and energy efficiency by 68% and 52%, respectively, compared to the
proposed state-of-the-art DRAM-based architecture.