1,096 research outputs found
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.Comment: Published as a conference paper at ASPLOS 202
Using Recurrent Neural Networks to Optimize Dynamical Decoupling for Quantum Memory
We utilize machine learning models which are based on recurrent neural
networks to optimize dynamical decoupling (DD) sequences. DD is a relatively
simple technique for suppressing the errors in quantum memory for certain noise
models. In numerical simulations, we show that with minimum use of prior
knowledge and starting from random sequences, the models are able to improve
over time and eventually output DD-sequences with performance better than that
of the well known DD-families. Furthermore, our algorithm is easy to implement
in experiments to find solutions tailored to the specific hardware, as it
treats the figure of merit as a black box.Comment: 18 pages, comments are welcom
Reconfigurable acceleration of Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies in RNN computations can lead to system stall, resulting in low throughput and high latency.
This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, which considers the data dependencies and high computational costs of RNNs.
The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which allows large dimensions of weight matrices, while supporting element-based parallelism and vector-based parallelism. The presented reconfigurable RNN designs show significant speedup over CPU, GPU, and other FPGA designs.
The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. Performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power efficiency improvement has been achieved in addition to speeding up over CPU and GPU designs.
The third contribution of this thesis is a low latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances initiation interval of multi-layer RNN inferences to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection.
To facilitate the use of this approach, we open source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.Open Acces
E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a key technology for emerging
applications such as automatic speech recognition, machine translation or image
description. Long Short Term Memory (LSTM) networks are the most successful RNN
implementation, as they can learn long term dependencies to achieve high
accuracy. Unfortunately, the recurrent nature of LSTM networks significantly
constrains the amount of parallelism and, hence, multicore CPUs and many-core
GPUs exhibit poor efficiency for RNN inference. In this paper, we present
E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM
computation. The main goal of E-PUR is to support large recurrent neural
networks for low-power mobile devices. E-PUR provides an efficient hardware
implementation of LSTM networks that is flexible to support diverse
applications. One of its main novelties is a technique that we call Maximizing
Weight Locality (MWL), which improves the temporal locality of the memory
accesses for fetching the synaptic weights, reducing the memory requirements by
a large extent. Our experimental results show that E-PUR achieves real-time
performance for different LSTM networks, while reducing energy consumption by
orders of magnitude with respect to general-purpose processors and GPUs, and it
requires a very small chip area. Compared to a modern mobile SoC, an NVIDIA
Tegra X1, E-PUR provides an average energy reduction of 92x
- …