Towards hardware acceleration of neuroevolution for multimedia processing applications on mobile devices
This paper addresses the problem of accelerating large artificial neural networks (ANNs) whose topology and weights can evolve via a genetic algorithm. The proposed digital hardware architecture can process any evolved network topology while providing a good trade-off between throughput, area and power consumption. Low power consumption is vital for longer battery life on mobile devices. The architecture uses multiple parallel arithmetic units in each processing element (PE). Memory partitioning and data caching are used to minimise the effects of PE pipeline stalling. A first-order minimax polynomial approximation scheme, tuned via a genetic algorithm, is used for the activation function generator. Efficient arithmetic circuitry, which leverages modified Booth recoding, column compressors and carry-save adders, is adopted throughout the design.
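The modified Booth recoding named in the abstract above can be illustrated in software. The following is a minimal Python sketch of radix-4 Booth recoding, not the paper's circuit: the function names `booth_radix4_digits` and `booth_multiply` and the default 8-bit width are assumptions made for illustration. Each digit lies in {-2, -1, 0, 1, 2}, so an n-bit multiplier yields only n/2 partial products.

```python
def booth_radix4_digits(multiplier: int, bits: int = 8) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Digits are returned least significant first, each in {-2, -1, 0, 1, 2}.
    `bits` is a hypothetical operand width chosen for this sketch.
    """
    if bits % 2:
        bits += 1  # sign-extend to an even width
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= multiplier <= high:
        raise ValueError("multiplier does not fit in the given width")

    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # digit = y[2i] + y[2i-1] - 2*y[2i+1]
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> int:
    """Multiply by summing one partial product per radix-4 Booth digit."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum(d * multiplicand * 4**i for i, d in enumerate(digits))
```

In hardware, the partial products produced this way would typically be summed by column compressors and carry-save adders; the sketch only shows why the recoding halves their count.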
Chipmunk: A Systolically Scalable 0.9 mm², 3.08 Gop/s/mW @ 1.2 mW Accelerator for Near-Sensor Recurrent Neural Network Inference
Recurrent neural networks (RNNs) are state-of-the-art in voice
awareness/understanding and speech recognition. On-device computation of RNNs
on low-power mobile and wearable devices would be key to applications such as
zero-latency voice-based human-machine interfaces. Here we present Chipmunk, a
small (<1 mm²) hardware accelerator for Long-Short Term Memory RNNs in UMC
65 nm technology, capable of operating at a measured peak efficiency of up to
3.08 Gop/s/mW at 1.24 mW peak power. To implement big RNN models without
incurring huge memory transfer overhead, multiple Chipmunk engines can
cooperate to form a single systolic array. In this way, the Chipmunk
architecture in a 75-tile configuration can achieve real-time phoneme
extraction on a demanding RNN topology proposed by Graves et al., consuming
less than 13 mW of average power.
FireFly: A High-Throughput and Reconfigurable Hardware Accelerator for Spiking Neural Networks
Spiking neural networks (SNNs) have been widely used due to their strong
biological interpretability and high energy efficiency. With the introduction
of the backpropagation algorithm and surrogate gradient, the structure of
spiking neural networks has become more complex, and the performance gap with
artificial neural networks has gradually decreased. However, most SNN hardware
implementations for field-programmable gate arrays (FPGAs) cannot meet
arithmetic or memory efficiency requirements, which significantly restricts the
development of SNNs. They do not delve into the arithmetic operations between
the binary spikes and synaptic weights or assume unlimited on-chip RAM
resources by using overly expensive devices on small tasks. To improve
arithmetic efficiency, we analyze the neural dynamics of spiking neurons,
generalize the SNN arithmetic operation to the multiplex-accumulate operation,
and propose a high-performance implementation of such operation by utilizing
the DSP48E2 hard block in Xilinx UltraScale FPGAs. To improve memory
efficiency, we design a memory system to enable efficient synaptic weights and
membrane voltage memory access with reasonable on-chip RAM consumption.
Combining the above two improvements, we propose an FPGA accelerator that can
process spikes generated by the firing neuron on-the-fly (FireFly). FireFly is
implemented on several FPGA edge devices with limited resources but still
guarantees a peak performance of 5.53 TSOP/s at 300 MHz. As a lightweight
accelerator, FireFly achieves the highest computational density efficiency
compared with existing research using large FPGA devices.
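The multiplex-accumulate operation described in the abstract above follows from the spikes being binary: multiplying a weight by a spike reduces to selecting either the weight or zero before accumulating. The Python sketch below is a software analogy only; the function name `multiplex_accumulate` and its arguments are hypothetical, and it does not model the DSP48E2 mapping or the memory system.

```python
def multiplex_accumulate(spikes: list[int], weights: list[int], membrane: int = 0) -> int:
    """Accumulate the weights of all inputs that spiked.

    `spikes` holds 0/1 values, so each product spike * weight becomes a
    two-way selection (a multiplexer) between the weight and zero.
    """
    if len(spikes) != len(weights):
        raise ValueError("spikes and weights must have the same length")
    for spike, weight in zip(spikes, weights):
        membrane += weight if spike else 0  # select, then accumulate
    return membrane
```

For example, `multiplex_accumulate([1, 0, 1, 1], [3, -2, 5, -1])` adds only the first, third and fourth weights, which matches the ordinary dot product of the two lists without performing any multiplication.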
In-Datacenter Performance Analysis of a Tensor Processing Unit
Many architects believe that major improvements in cost-energy-performance
must now come from domain-specific hardware. This paper evaluates a custom
ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since
2015 that accelerates the inference phase of neural networks (NN). The heart of
the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak
throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed
on-chip memory. The TPU's deterministic execution model is a better match to
the 99th-percentile response-time requirement of our NN applications than are
the time-varying optimizations of CPUs and GPUs (caches, out-of-order
execution, multithreading, multiprocessing, prefetching, ...) that help average
throughput more than guaranteed latency. The lack of such features helps
explain why, despite having myriad MACs and a big memory, the TPU is relatively
small and low power. We compare the TPU to a server-class Intel Haswell CPU and
an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Our workload, written in the high-level TensorFlow framework, uses production
NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters'
NN inference demand. Despite low utilization for some applications, the TPU is
on average about 15X - 30X faster than its contemporary GPU or CPU, with
TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the
TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and
200X the CPU.
Comment: 17 pages, 11 figures, 8 tables. To appear at the 44th International
Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 201
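The 92 TOPS peak figure quoted in the TPU abstract above can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not a calculation taken from the abstract: the MAC count comes from the abstract, counting each MAC as two operations is a common convention, and the 700 MHz clock is an assumption that the abstract does not state.

```python
MAC_UNITS = 65_536    # from the abstract: 65,536 8-bit MACs
OPS_PER_MAC = 2       # convention: one multiply plus one add per MAC
CLOCK_HZ = 700e6      # assumption: clock rate is not given in the abstract

# Peak throughput if every MAC does useful work on every cycle.
peak_tops = MAC_UNITS * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"peak throughput: {peak_tops:.2f} TOPS")
```

Under these assumptions the result rounds to the 92 TOPS stated in the abstract. It is an upper bound, and the abstract itself notes low utilization for some applications.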
MATIC: Learning Around Errors for Efficient Low-Voltage Neural Network Accelerators
As a result of the increasing demand for deep neural network (DNN)-based
services, efforts to develop dedicated hardware accelerators for DNNs are
growing rapidly. However, while accelerators with high performance and
efficiency on convolutional deep neural networks (Conv-DNNs) have been
developed, less progress has been made with regard to fully-connected DNNs
(FC-DNNs). In this paper, we propose MATIC (Memory Adaptive Training with
In-situ Canaries), a methodology that enables aggressive voltage scaling of
accelerator weight memories to improve the energy-efficiency of DNN
accelerators. To enable accurate operation with voltage overscaling, MATIC
combines the characteristics of destructive SRAM reads with the error
resilience of neural networks in a memory-adaptive training process.
Furthermore, PVT-related voltage margins are eliminated using bit-cells from
synaptic weights as in-situ canaries to track runtime environmental variation.
Demonstrated on a low-power DNN accelerator that we fabricate in 65 nm CMOS,
MATIC enables up to 60-80 mV of voltage overscaling (3.3x total energy
reduction versus the nominal voltage), or 18.6x application error reduction.
Comment: 6 pages, 12 figures, 3 tables. Published at Design, Automation and
Test in Europe Conference and Exhibition (DATE) 201