Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs
Convolution is a fundamental operation in many applications, such as computer
vision, natural language processing, and image processing. Recent successes of
convolutional neural networks in various deep learning applications put even
higher demand on fast convolution. The high computation throughput and memory
bandwidth of graphics processing units (GPUs) make GPUs a natural choice for
accelerating convolution operations. However, maximally exploiting the
available memory bandwidth of GPUs for convolution is a challenging task. This
paper introduces a general model to address the mismatch between the memory
bank width of GPUs and computation data width of threads. Based on this model,
we develop two convolution kernels, one for the general case and the other for
a special case with one input channel. By carefully optimizing memory access
patterns and computation patterns, we design a communication-optimized kernel
for the special case and a communication-reduced kernel for the general case.
Experimental data based on implementations on Kepler GPUs show that our kernels
achieve average performance improvements of 5.16X and 35.5% over the latest
cuDNN library for the special case and the general case, respectively.
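The mismatch between bank width and per-thread data width can be illustrated with a small model. This is a minimal sketch, not the paper's model: it assumes 32 shared-memory banks that are each 8 bytes wide and a warp of 32 threads reading consecutive addresses, none of which is stated in the abstract. The function name `banks_touched` is hypothetical.

```python
def banks_touched(num_threads, bytes_per_thread, bank_width=8, num_banks=32):
    """Count the distinct banks hit when each thread reads a contiguous chunk.

    Assumed layout (not from the abstract): consecutive bank_width-byte words
    map to consecutive banks, wrapping around after num_banks words.
    """
    banks = set()
    for t in range(num_threads):
        start = t * bytes_per_thread
        for byte in range(start, start + bytes_per_thread):
            banks.add((byte // bank_width) % num_banks)
    return len(banks)


# 32 threads reading one 4-byte float each cover only half of the banks.
print(banks_touched(32, 4))
# 32 threads reading 8 bytes each (for example two floats) cover all banks.
print(banks_touched(32, 8))
```

Under these assumptions, 4-byte accesses leave half of the banks idle in each request, which is the kind of underuse a model of the width mismatch has to account for.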
U-SWIM: Universal Selective Write-Verify for Computing-in-Memory Neural Accelerators
Architectures that incorporate Computing-in-Memory (CiM) using emerging
non-volatile memory (NVM) devices have become strong contenders for deep neural
network (DNN) acceleration due to their impressive energy efficiency. Yet, a
significant challenge arises when using these emerging devices: they can show
substantial variations during the weight-mapping process. This can severely
impact DNN accuracy if not mitigated. A widely accepted remedy for imperfect
weight mapping is the iterative write-verify approach, which involves verifying
conductance values and adjusting devices if needed. In all existing
publications, this procedure is applied to every individual device, resulting
in a significant programming time overhead. In our research, we show that
only a small fraction of weights need write-verify on their corresponding
devices for DNN accuracy to be preserved, yielding a notable programming
speedup. Building on this, we introduce USWIM, a novel method
based on the second derivative. It leverages a single iteration of forward and
backward propagation to pinpoint the weights demanding write-verify. Through
extensive tests on diverse DNN designs and datasets, USWIM achieves up to a
10x programming speedup over the traditional exhaustive write-verify
method while maintaining a similar accuracy level. Furthermore, compared
to our earlier SWIM technique, USWIM shows a 7x speedup when dealing
with devices exhibiting non-uniform variations.
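The selection step can be sketched as a ranking problem. This is a minimal sketch, not the authors' implementation: it assumes each weight already has a sensitivity score (for instance one derived from second-derivative information) and that a fixed fraction of weights is chosen for write-verify. The function name `select_for_write_verify`, the scores, and the fraction are all hypothetical.

```python
import math


def select_for_write_verify(sensitivities, fraction):
    """Return the indices of the most sensitive weights, in ascending order.

    sensitivities: one hypothetical score per weight (higher = more sensitive).
    fraction: share of weights to write-verify, between 0 and 1.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    count = math.ceil(fraction * len(sensitivities))
    # Rank weight indices from most to least sensitive.
    ranked = sorted(range(len(sensitivities)),
                    key=lambda i: sensitivities[i], reverse=True)
    return sorted(ranked[:count])


# Hypothetical scores for eight weights; only a quarter get write-verify.
scores = [0.1, 2.0, 0.3, 0.05, 1.5, 0.2, 0.0, 0.4]
print(select_for_write_verify(scores, 0.25))
```

Only the selected devices would go through the iterative write-verify loop; the rest are programmed once, which is where the time saving comes from.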
Computing-In-Memory Neural Network Accelerators for Safety-Critical Systems: Can Small Device Variations Be Disastrous?
Computing-in-Memory (CiM) architectures based on emerging non-volatile memory
(NVM) devices have demonstrated great potential for deep neural network (DNN)
acceleration thanks to their high energy efficiency. However, NVM devices
suffer from various non-idealities, especially device-to-device variations due
to fabrication defects and cycle-to-cycle variations due to the stochastic
behavior of devices. As such, the DNN weights actually mapped to NVM devices
could deviate significantly from the expected values, leading to large
performance degradation. To address this issue, most existing works focus on
maximizing average performance under device variations. This objective would
work well for general-purpose scenarios. But for safety-critical applications,
the worst-case performance must also be considered. Unfortunately, this has
been rarely explored in the literature. In this work, we formulate the problem
of determining the worst-case performance of CiM DNN accelerators under the
impact of device variations. We further propose a method to effectively find
the specific combination of device variation in the high-dimensional space that
leads to the worst-case performance. We find that even with very small device
variations, the accuracy of a DNN can drop drastically, causing concerns when
deploying CiM accelerators in safety-critical applications. Finally, we show
that, surprisingly, none of the existing methods used to enhance average DNN
performance in CiM accelerators is very effective when extended to enhance the
worst-case performance, and further research is needed to address
this problem.
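A toy case shows why the worst case can differ sharply from the average. This is a minimal sketch, not the paper's method: it assumes a single linear scorer whose weights can each deviate by at most a fixed bound, where the worst-case deviation has a closed form. A DNN has no such closed form, which is why a search over the high-dimensional variation space is needed. The function name and all numbers are hypothetical.

```python
def worst_case_score(weights, inputs, bound):
    """Return (nominal, worst) scores for a linear scorer under bounded variation.

    Each weight w_i may become w_i + d_i with |d_i| <= bound. The score
    sum((w_i + d_i) * x_i) is lowest when d_i = -bound * sign(x_i), which
    lowers the nominal score by bound * sum(|x_i|).
    """
    nominal = sum(w * x for w, x in zip(weights, inputs))
    worst = nominal - bound * sum(abs(x) for x in inputs)
    return nominal, worst


# Hypothetical weights and inputs; a positive score means "correct class".
nominal, worst = worst_case_score([0.5, -0.25, 0.5], [1.0, 2.0, 1.0], 0.25)
print(nominal, worst)
```

In this toy setting the nominal score is positive but the worst-case score is negative, so a bounded variation that looks harmless on average is enough to flip the decision.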
Compact and High-Performance TCAM Based on Scaled Double-Gate FeFETs
Ternary content addressable memory (TCAM), widely used in network routers and
high-associativity caches, is gaining popularity in machine learning and
data-analytic applications. Ferroelectric FETs (FeFETs) are a promising
candidate for implementing TCAM owing to their high ON/OFF ratio,
non-volatility, and CMOS compatibility. However, conventional single-gate
FeFETs (SG-FeFETs) suffer from relatively high write voltage, low endurance,
and potential read disturbance, and they face scaling challenges. Recently, a
double-gate FeFET (DG-FeFET) has been proposed and outperforms SG-FeFETs in
many aspects. This paper investigates TCAM design challenges specific to
DG-FeFETs and introduces a novel 1.5T1Fe TCAM design based on DG-FeFETs. A
2-step search with early termination is employed to reduce the cell area and
improve energy efficiency. A shared driver design is proposed to reduce the
peripherals area. Detailed analysis and SPICE simulation show that the 1.5T1Fe
DG-TCAM leads to superior search speed and energy efficiency. The 1.5T1Fe TCAM
design can also be built with SG-FeFETs, achieving search latency and
energy improvements over the 2FeFET TCAM.
Comment: Accepted by Design Automation Conference (DAC) 202
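The idea of a two-step search with early termination can be sketched in software. This is a behavioral sketch only, not the 1.5T1Fe circuit: it assumes each stored word is split into two halves, with the second half searched only for rows whose first half matched. The abstract does not say how the two steps divide the word, so the split and the function names are hypothetical.

```python
def ternary_match(pattern, key):
    """True if every stored symbol equals the key bit or is the wildcard 'X'."""
    return all(p == "X" or p == k for p, k in zip(pattern, key))


def two_step_search(entries, key):
    """Return (matching row indices, number of rows that needed step two).

    entries: stored words over '0', '1', 'X', all the same length as key.
    """
    half = len(key) // 2
    matches = []
    second_step_rows = 0
    for row, pattern in enumerate(entries):
        # Step one: search the first half; a mismatch ends the search early.
        if not ternary_match(pattern[:half], key[:half]):
            continue
        # Step two: only rows that survived step one are searched again.
        second_step_rows += 1
        if ternary_match(pattern[half:], key[half:]):
            matches.append(row)
    return matches, second_step_rows


table = ["10X1", "0XX1", "1X00", "10XX"]
print(two_step_search(table, "1001"))
```

Rows that mismatch in the first step never take part in the second, which is the source of the energy saving that early termination targets.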