Maple: A Processing Element for Row-Wise Product Based Sparse Tensor Accelerators
Sparse tensor computation is a core component of numerous applications in areas such as data science, graph processing, and scientific computing.
Sparse tensors offer the potential to skip unnecessary computations involving zero values. In this paper, we propose a new strategy for extending row-wise
product sparse tensor accelerators. We propose a new processing element called
Maple that uses multiple multiply-accumulate (MAC) units to exploit local
clusters of non-zero values to increase parallelism and reduce data movement.
Maple operates on the compressed sparse row (CSR) format and processes only the non-zero elements of the input matrices, as dictated by their sparsity pattern.
Furthermore, Maple can serve as a basic building block in a variety of spatial tensor accelerators that operate on a row-wise product approach.
As a proof of concept, we utilize Maple in two reference accelerators, ExTensor and MatRaptor. Our experiments show that using Maple in MatRaptor and ExTensor yields 50% and 60% energy savings and 15% and 22% speedups over the baseline designs, respectively. Employing Maple also results in 5.9x and 15.5x smaller area in MatRaptor and ExTensor, respectively, compared with the baseline structures.
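
To make the dataflow concrete, below is a minimal software sketch of the row-wise (Gustavson) product over CSR operands that a Maple-style processing element targets. The function name and argument layout are illustrative assumptions, and the serial loop does not model Maple's parallel MAC units.

def spgemm_row_wise(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    """C = A @ B with A and B in CSR (indptr, indices, data) form."""
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C
        # Each non-zero A[i, k] scales row k of B; only non-zeros are
        # touched, so work follows the sparsity pattern, not the dense size.
        for a_pos in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[a_pos], a_val[a_pos]
            for b_pos in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[b_pos]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[b_pos]  # one MAC
        for j in sorted(acc):  # emit row i of C in column order
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val

# A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]  ->  C = [[0, 3], [8, 0]]
print(spgemm_row_wise([0, 1, 2], [0, 1], [1.0, 2.0],
                      [0, 1, 2], [1, 0], [3.0, 4.0], 2))

In this serial form, every MAC in the inner loop runs one at a time; Maple's multiple MAC units exploit local clusters of non-zeros to issue several of these operations in parallel.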
NATSA: A Near-Data Processing Accelerator for Time Series Analysis
Time series analysis is a key technique for extracting and predicting events
in domains as diverse as epidemiology, genomics, neuroscience, environmental
sciences, economics, and more. Matrix profile, the state-of-the-art algorithm
to perform time series analysis, computes the most similar subsequence for a
given query subsequence within a sliced time series. Matrix profile has low
arithmetic intensity, but it typically operates on large amounts of time series
data. In current computing systems, this data needs to be moved between the
off-chip memory units and the on-chip computation units for performing matrix
profile. This causes a major performance bottleneck as data movement is
extremely costly in terms of both execution time and energy.
In this work, we present NATSA, the first Near-Data Processing accelerator
for time series analysis. The key idea is to exploit modern 3D-stacked High
Bandwidth Memory (HBM) to enable efficient and fast specialized matrix profile
computation near memory, where time series data resides. NATSA provides three
key benefits: 1) quickly computing the matrix profile for a wide range of
applications by building specialized energy-efficient floating-point arithmetic
processing units close to HBM, 2) improving the energy efficiency and execution
time by reducing the need for data movement over slow and energy-hungry buses
between the computation units and the memory units, and 3) analyzing time
series data at scale by exploiting low-latency, high-bandwidth, and
energy-efficient memory access provided by HBM. Our experimental evaluation
shows that NATSA improves performance by up to 14.2x (9.9x on average) and
reduces energy by up to 27.2x (19.4x on average), over the state-of-the-art
multi-core implementation. NATSA also improves performance by 6.3x and reduces
energy by 10.2x over a general-purpose NDP platform with 64 in-order cores.
Comment: To appear in the 38th IEEE International Conference on Computer Design (ICCD 2020).
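
As a point of reference for the kernel described above, here is a naive software sketch of the matrix profile, written for clarity rather than speed. NATSA implements an optimized variant in specialized hardware; the function name and the half-window exclusion zone below are illustrative assumptions.

import numpy as np

def matrix_profile_naive(ts, m):
    """For each length-m subsequence of ts, the z-normalized Euclidean
    distance to its nearest non-trivial match elsewhere in ts."""
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)])
    # z-normalize so matches are invariant to offset and amplitude
    sd = np.maximum(subs.std(axis=1, keepdims=True), 1e-12)
    subs = (subs - subs.mean(axis=1, keepdims=True)) / sd
    profile = np.full(n, np.inf)
    excl = m // 2  # exclusion zone around i to skip trivial self-matches
    for i in range(n):
        dists = np.linalg.norm(subs - subs[i], axis=1)
        dists[max(0, i - excl):i + excl + 1] = np.inf
        profile[i] = dists.min()
    return profile

ts = np.random.default_rng(0).standard_normal(1000)
print(matrix_profile_naive(ts, m=64).argmin())  # start of best-matching motif

Every profile entry compares one subsequence against the entire series, so the computation streams large amounts of data per floating-point operation; this low arithmetic intensity is what makes moving the computation next to HBM attractive.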
SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads
In recent years, there have been tremendous advances in hardware acceleration
of deep neural networks. However, most of the research has focused on
optimizing accelerator microarchitecture for higher performance and energy
efficiency on a per-layer basis. We find that the accelerator may account for only 25-40% of overall single-batch inference latency, with the rest spent on data movement and in the deep learning software framework. Thus far, it has
been very difficult to study end-to-end DNN performance during early stage
design (before RTL is available) because there are no existing DNN frameworks
that support end-to-end simulation with easy custom hardware accelerator
integration. To address this gap in research infrastructure, we present SMAUG,
the first DNN framework that is purpose-built for simulation of end-to-end deep
learning applications. SMAUG offers researchers a wide range of capabilities
for evaluating DNN workloads, from diverse network topologies to easy
accelerator modeling and SoC integration. To demonstrate the power and value of
SMAUG, we present case studies showing how we can optimize overall performance and energy efficiency, achieving 1.8-5x speedups over a baseline system without changing any part of the accelerator microarchitecture, and how SMAUG can tune an SoC for a camera-powered deep learning pipeline.
Comment: 14 pages, 20 figures
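
A back-of-the-envelope calculation illustrates why this latency breakdown matters. The 25% and 40% fractions come from the finding above; the 2x accelerator speedup below is a hypothetical value.

def end_to_end_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: overall speedup when only the accelerator part gets faster."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

for frac in (0.25, 0.40):
    ceiling = 1.0 / (1.0 - frac)  # limit as the accelerator becomes infinitely fast
    print(f"accelerator = {frac:.0%} of latency: "
          f"2x faster accelerator -> {end_to_end_speedup(frac, 2.0):.2f}x overall, "
          f"ceiling {ceiling:.2f}x")

Even an infinitely fast accelerator would cap end-to-end gains at roughly 1.3-1.7x, which is why the 1.8-5x case-study speedups above come from optimizing the system around the accelerator rather than the accelerator microarchitecture itself.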