FPGA-Based Low-Power Speech Recognition with Recurrent Neural Networks
In this paper, a neural network based real-time speech recognition (SR)
system is developed using an FPGA for very low-power operation. The implemented
system employs two recurrent neural networks (RNNs); one is a
speech-to-character RNN for acoustic modeling (AM) and the other is for
character-level language modeling (LM). The system also employs a statistical
word-level LM to improve the recognition accuracy. The results of the AM, the
character-level LM, and the word-level LM are combined using a fairly simple
N-best search algorithm instead of the hidden Markov model (HMM) based network.
The RNNs are implemented using massively parallel processing elements (PEs) for
low latency and high throughput. The weights are quantized to 6 bits to store
all of them in the on-chip memory of an FPGA. The proposed algorithm is
implemented on a Xilinx XC7Z045, and the system can operate much faster than
real-time.
Comment: Accepted to SiPS 201
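As a rough illustration of the 6-bit weight quantization mentioned in this abstract, the sketch below maps floating-point weights to signed 6-bit integer codes. The symmetric uniform scheme and the names quantize_weights and dequantize are assumptions made for illustration; the abstract does not say which quantization scheme the authors use.

```python
def quantize_weights(weights, bits=6):
    """Map floats to signed integer codes of the given width (assumed symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # width of one quantization step
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.62, -0.31, 0.0, 0.155, -0.62]
codes, scale = quantize_weights(weights)
restored = dequantize(codes, scale)
```

Each code fits in 6 bits, so only the codes plus one scale factor per weight group would need to be stored on chip under this assumed scheme.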
Implementation of Neural Networks on FPGAs Using Fixed-Point Data Types (Implementación de redes neuronales en FPGAs usando tipos de datos de punto fijo)
The ability to estimate nonlinear functions makes neural networks one of the most widely used tools for sensor fusion, allowing the outputs of different sensors to be combined to obtain information that is not available a priori. In addition, the parallel processing capability of FPGAs (Field-Programmable Gate Arrays) makes them well suited to implementing ubiquitous neural networks, allowing results to be inferred faster than on a CPU (Central Processing Unit) without requiring an active internet connection. This article therefore proposes a workflow for designing, training, and implementing a neural network on a Xilinx PYNQ Z2 FPGA that uses fixed-point data types to perform sensor fusion. The workflow is tested by developing a neural network that combines the outputs of a 16-sensor artificial nose to estimate the concentrations of CH4 and C2H4.
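To make the idea of fixed-point data types concrete, here is a minimal sketch of a fixed-point neuron evaluated with integer arithmetic only. The 16-bit word with 8 fractional bits, the ReLU activation, and all function names are assumptions chosen for illustration; the abstract does not specify the formats or activations used in the proposed workflow.

```python
FRAC_BITS = 8     # assumed number of fractional bits
TOTAL_BITS = 16   # assumed word width


def to_fixed(x):
    """Convert a float to a saturated signed fixed-point integer."""
    lo = -(1 << (TOTAL_BITS - 1))
    hi = (1 << (TOTAL_BITS - 1)) - 1
    return max(lo, min(hi, round(x * (1 << FRAC_BITS))))


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << FRAC_BITS)


def fixed_mul(a, b):
    """Multiply two fixed-point values; the shift rescales the product.
    Note that >> rounds toward negative infinity for negative products."""
    return (a * b) >> FRAC_BITS


def neuron(inputs, weights, bias):
    """Fixed-point dot product plus bias, followed by ReLU (assumed activation)."""
    acc = to_fixed(bias)
    for x, w in zip(inputs, weights):
        acc += fixed_mul(to_fixed(x), to_fixed(w))
    return max(0, acc)


result = to_float(neuron([1.5, -2.0], [0.5, 0.25], 0.25))
```

In hardware the accumulator would normally be wider than the operands to avoid overflow; Python integers hide that detail here.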
Neuron-level fuzzy memoization in RNNs
The final publication is available at ACM via http://dx.doi.org/10.1145/3352460.3358309
Recurrent Neural Networks (RNNs) are a key technology for applications such as automatic speech recognition or machine translation. Unlike conventional feed-forward DNNs, RNNs remember past information to improve the accuracy of future predictions and, therefore, they are very effective for sequence processing problems.
For each application run, each recurrent layer is executed many times for processing a potentially large sequence of inputs (words, images, audio frames, etc.). In this paper, we make the observation that the output of a neuron exhibits small changes in consecutive invocations. We exploit this property to build a neuron-level fuzzy memoization scheme, which dynamically caches the output of each neuron and reuses it whenever it is predicted that the current output will be similar to a previously computed result, thereby avoiding the output computation.
The main challenge in this scheme is determining whether the new neuron's output for the current input in the sequence will be similar to a recently computed result. To this end, we extend the recurrent layer with a much simpler Bitwise Neural Network (BNN), and show that the BNN and RNN outputs are highly correlated: if two BNN outputs are very similar, the corresponding outputs in the original RNN layer are likely to exhibit negligible changes. The BNN provides a low-cost and effective mechanism for deciding when fuzzy memoization can be applied with a small impact on accuracy.
We evaluate our memoization scheme on top of a state-of-the-art accelerator for RNNs, for a variety of different neural networks from multiple application domains. We show that our technique avoids more than 24.2% of computations, resulting in 18.5% energy savings and 1.35x speedup on average.
Peer Reviewed. Postprint (author's final draft)
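The memoization idea in this abstract can be sketched in a few lines: a cheap binarized dot product serves as a proxy, and the cached full-precision output is reused when the proxy barely changes. The class MemoizedNeuron, the sign-based binarization, the tanh activation, and the threshold rule are all assumptions for illustration; the abstract does not give the paper's exact BNN construction or reuse criterion.

```python
import math


def sign_bits(values):
    """Binarize values to +1 / -1 (assumed binarization rule)."""
    return [1 if v >= 0 else -1 for v in values]


class MemoizedNeuron:
    """Hypothetical neuron that reuses its cached output when a cheap proxy is stable."""

    def __init__(self, weights, threshold):
        self.weights = weights
        self.weight_signs = sign_bits(weights)
        self.threshold = threshold
        self.cached_proxy = None
        self.cached_output = None
        self.reused = 0

    def _proxy(self, inputs):
        # Cheap binarized dot product standing in for the BNN output.
        return sum(s * x for s, x in zip(self.weight_signs, sign_bits(inputs)))

    def _full(self, inputs):
        # Full-precision neuron evaluation.
        return math.tanh(sum(w * x for w, x in zip(self.weights, inputs)))

    def __call__(self, inputs):
        proxy = self._proxy(inputs)
        if (self.cached_proxy is not None
                and abs(proxy - self.cached_proxy) <= self.threshold):
            self.reused += 1
            return self.cached_output   # skip the expensive computation
        self.cached_proxy = proxy
        self.cached_output = self._full(inputs)
        return self.cached_output


neuron = MemoizedNeuron([0.5, -0.25, 0.75], threshold=0)
first = neuron([1.0, 2.0, 3.0])
second = neuron([1.1, 2.1, 2.9])    # same sign pattern, so the cached value is reused
third = neuron([-1.0, 2.0, -3.0])   # proxy changes, so the output is recomputed
```

A larger threshold would trade accuracy for more reuse, which is the tuning knob this kind of scheme exposes.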
Using LSTM recurrent neural networks for monitoring the LHC superconducting magnets
The superconducting LHC magnets are coupled with an electronic monitoring
system which records and analyses voltage time series reflecting their
performance. A currently used system is based on a range of preprogrammed
triggers that launch protection procedures when a misbehavior of the magnets
is detected. All the procedures used in the protection equipment were designed
and implemented according to known working scenarios of the system and are
updated and monitored by human operators.
This paper proposes a novel approach to monitoring and fault protection of
the Large Hadron Collider (LHC) superconducting magnets which employs
state-of-the-art Deep Learning algorithms. Consequently, the authors of the
paper decided to examine the performance of LSTM recurrent neural networks for
modeling of voltage time series of the magnets. In order to address this
challenging task, different network architectures and hyper-parameters were used
to achieve the best possible performance of the solution. The regression
results were measured in terms of RMSE for different numbers of future steps and
history lengths taken into account for the prediction. The best result of
RMSE=0.00104 was obtained for a network of 128 LSTM cells within the internal
layer and a 16-step history buffer.
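The evaluation setup described here (a history buffer, a number of future steps, and RMSE as the metric) can be sketched as follows. The helper names make_windows and rmse and the synthetic series are hypothetical; no LSTM is trained here, and a naive repeat-last-value baseline stands in for the model.

```python
import math


def make_windows(series, history, horizon):
    """Split a series into (past samples, future samples) pairs."""
    pairs = []
    for start in range(len(series) - history - horizon + 1):
        past = series[start:start + history]
        future = series[start + history:start + history + horizon]
        pairs.append((past, future))
    return pairs


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic stand-in for a voltage time series (not real LHC data).
series = [0.1 * i for i in range(20)]
windows = make_windows(series, history=16, horizon=1)

# Naive baseline: predict that the next value equals the last observed one.
predictions = [past[-1] for past, _ in windows]
targets = [future[0] for _, future in windows]
baseline_error = rmse(predictions, targets)
```

Varying the history and horizon arguments reproduces the kind of sweep the abstract describes, with a trained model replacing the baseline.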
Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks
Quantized Neural Networks (QNNs), which use low bitwidth numbers for
representing parameters and performing computations, have been proposed to
reduce the computation complexity, storage size and memory usage. In QNNs,
parameters and activations are uniformly quantized, such that the
multiplications and additions can be accelerated by bitwise operations.
However, distributions of parameters in Neural Networks are often imbalanced,
such that the uniform quantization determined from extremal values may
underutilize the available bitwidth. In this paper, we propose a novel quantization
method that can ensure the balance of distributions of quantized values. Our
method first recursively partitions the parameters by percentiles into balanced
bins, and then applies uniform quantization. We also introduce computationally
cheaper approximations of percentiles to reduce the computation overhead
introduced. Overall, our method improves the prediction accuracies of QNNs
without introducing extra computation during inference, has negligible impact
on training speed, and is applicable to both Convolutional Neural Networks and
Recurrent Neural Networks. Experiments on standard datasets including ImageNet
and Penn Treebank confirm the effectiveness of our method. On ImageNet, the
top-5 error rate of our 4-bit quantized GoogLeNet model is 12.7%, which is
superior to the state of the art for QNNs.
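The two steps named in this abstract, recursive percentile partitioning into balanced bins followed by uniform quantization, can be sketched as below. The function balanced_quantize, the median-based splitting, and the mapping of bin indices to evenly spaced levels in [-1, 1] are assumptions for illustration; the paper's exact procedure and its cheaper percentile approximations are not given in the abstract.

```python
from collections import Counter


def balanced_quantize(values, bits):
    """Assign each value one of 2**bits codes so that the codes are equally populated."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    codes = [0] * len(values)

    def split(ids, depth, base):
        if depth == 0:
            for i in ids:
                codes[i] = base
            return
        mid = len(ids) // 2          # split at the median (50th percentile)
        half = 1 << (depth - 1)
        split(ids[:mid], depth - 1, base)
        split(ids[mid:], depth - 1, base + half)

    split(order, bits, 0)
    top = (1 << bits) - 1
    # Map bin indices to evenly spaced levels in [-1, 1] (assumed output range).
    levels = [2 * c / top - 1 for c in codes]
    return codes, levels


# Imbalanced example: most values cluster near zero, with a few outliers.
values = [0.01, 0.02, 0.03, 0.04, 0.5, 0.9, -0.8, -0.01]
codes, levels = balanced_quantize(values, bits=2)
occupancy = Counter(codes)   # each of the 4 codes is used the same number of times
```

A plain min-max uniform quantizer on the same data would place most values in one or two bins, which is the imbalance the abstract refers to.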
E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a key technology for emerging
applications such as automatic speech recognition, machine translation or image
description. Long Short Term Memory (LSTM) networks are the most successful RNN
implementation, as they can learn long term dependencies to achieve high
accuracy. Unfortunately, the recurrent nature of LSTM networks significantly
constrains the amount of parallelism and, hence, multicore CPUs and many-core
GPUs exhibit poor efficiency for RNN inference. In this paper, we present
E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM
computation. The main goal of E-PUR is to support large recurrent neural
networks for low-power mobile devices. E-PUR provides an efficient hardware
implementation of LSTM networks that is flexible to support diverse
applications. One of its main novelties is a technique that we call Maximizing
Weight Locality (MWL), which improves the temporal locality of the memory
accesses for fetching the synaptic weights, reducing the memory requirements by
a large extent. Our experimental results show that E-PUR achieves real-time
performance for different LSTM networks, while reducing energy consumption by
orders of magnitude with respect to general-purpose processors and GPUs, and it
requires a very small chip area. Compared to a modern mobile SoC, an NVIDIA
Tegra X1, E-PUR provides an average energy reduction of 92x.
FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA
Autoregressive convolutional neural networks (CNNs) have been widely
exploited for sequence generation tasks such as audio synthesis, language
modeling and neural machine translation. WaveNet is a deep autoregressive CNN
composed of several stacked layers of dilated convolution that is used for
sequence generation. While WaveNet produces state-of-the-art audio generation
results, the naive inference implementation is quite slow; it takes a few
minutes to generate just one second of audio on a high-end GPU. In this work,
we develop FastWave, the first accelerator platform for autoregressive
convolutional neural networks, and address the associated design challenges. We
design the Fast-Wavenet inference model in Vivado HLS and perform a wide range
of optimizations including fixed-point implementation, array partitioning and
pipelining. Our model uses a fully parameterized parallel architecture for fast
matrix-vector multiplication that enables per-layer customized latency
fine-tuning for further throughput improvement. Our experiments comparatively
assess the trade-off between throughput and resource utilization for various
optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA, which uses
only on-chip memory, achieves 66x faster generation speed compared to the CPU
implementation and 11x faster generation speed than the GPU implementation.
Comment: Published as a conference paper at ICCAD 201
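For readers unfamiliar with the dilated convolution that WaveNet stacks, the following is a minimal single-channel sketch of a causal dilated convolution. The function name, the zero-padding of samples before the start of the sequence, and the convention that weights[0] applies to the current sample are assumptions; this is not the FastWave HLS design.

```python
def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum over k of weights[k] * x[t - k * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:             # causal: only current and past samples are used
                acc += w * x[idx]
        out.append(acc)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
y = dilated_causal_conv(signal, weights=[1.0, 1.0], dilation=2)
```

Because each output depends only on past samples, generation must proceed one sample at a time, which is why the naive inference described in the abstract is slow.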
The model of an anomaly detector for HiLumi LHC magnets based on Recurrent Neural Networks and adaptive quantization
This paper examines the applicability of Recurrent Neural
Network models for detecting anomalous behavior of the CERN superconducting
magnets. In order to conduct the experiments, the authors designed and
implemented an adaptive signal quantization algorithm and a custom GRU-based
detector and developed a method for the detector parameters selection. Three
different datasets were used for testing the detector. Two artificially
generated datasets were used to assess the raw performance of the system
whereas the 231 MB dataset composed of the signals acquired from HiLumi magnets
was intended for real-life experiments and model training. Several different
setups of the developed anomaly detection system were evaluated and compared
with a state-of-the-art OC-SVM reference model operating on the same data. The
OC-SVM model was equipped with a rich set of feature extractors accounting for
a range of the input signal properties. It was determined in the course of the
experiments that the detector, along with its supporting design methodology,
reaches an F1 score equal or very close to 1 for almost all test sets. Due to the
profile of the data, the best_length setup of the detector turned out to
perform the best among all five tested configuration schemes of the detection
system. The quantization parameters have the biggest impact on the overall
performance of the detector with the best values of input/output grid equal to
16 and 8, respectively. The proposed detection solution significantly
outperformed the OC-SVM-based detector in most cases, with much more stable
performance across all the datasets.
Comment: Related to arXiv:1702.0083
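Since the results in this abstract are reported as F1, a short sketch of the standard F1 computation for binary anomaly labels may help. The function name and the example labels are hypothetical and do not come from the paper's datasets.

```python
def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for binary labels (1 = anomaly)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    if tp == 0:
        return 0.0                   # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: one hit, one false alarm, one missed anomaly.
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

An F1 of 1 therefore means the detector produced neither false alarms nor missed anomalies on that test set.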