Towards hardware acceleration of neuroevolution for multimedia processing applications on mobile devices

Author: A.R. Omondi
B. Gaines
B. Widrow
B.D. Brown
D.B. Fogel
J. Holt
J.L. Hennessy
K. Stanley
K.O. Stanley
L. Reyneri
S. Kung
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

This paper addresses the problem of accelerating large artificial neural networks (ANNs) whose topology and weights can evolve through a genetic algorithm. The proposed digital hardware architecture can process any evolved network topology while providing a good trade-off between throughput, area and power consumption. Low power consumption is vital for longer battery life on mobile devices. The architecture uses multiple parallel arithmetic units in each processing element (PE). Memory partitioning and data caching are used to minimise the effects of PE pipeline stalling. A first-order minimax polynomial approximation scheme, tuned via a genetic algorithm, is used for the activation function generator. Efficient arithmetic circuitry, which leverages modified Booth recoding, column compressors and carry-save adders, is adopted throughout the design.
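The abstract names modified Booth recoding without giving circuit details. As a rough software illustration only, the sketch below shows the textbook radix-4 form of the technique, which halves the number of partial products a multiplier has to sum. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and nothing here is taken from the authors' design.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a signed two's-complement multiplier into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first.
    Assumes -2**(bits-1) <= multiplier < 2**(bits-1).
    """
    if bits % 2:
        bits += 1  # sign-extend to an even width so bits pair up cleanly
    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit zero bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        # Standard radix-4 rule: digit = -2*b[i+1] + b[i] + b[i-1]
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(a: int, b: int, bits: int) -> int:
    """Multiply a by b by summing one partial product per Booth digit of b."""
    total = 0
    for i, digit in enumerate(booth_radix4_digits(b, bits)):
        total += (digit * a) * 4 ** i  # each digit has weight 4**i
    return total


if __name__ == "__main__":
    print(booth_radix4_digits(7, 4))
    print(booth_multiply(-100, 77, 8))
```

In hardware, each digit selects 0, ±a or ±2a (a shift and optional negation), so an n-bit multiplier needs about n/2 partial products instead of n.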

Crossref

Irish Universities

DCU Online Research Access Service

Chipmunk: A Systolically Scalable 0.9 mm$^2$, 3.08 Gop/s/mW @ 1.2 mW Accelerator for Near-Sensor Recurrent Neural Network Inference

Author: Benini Luca
Cavigelli Lukas
Conti Francesco
Paulin Gianna
Susmelj Igor
Publication date: 01/01/2018

Recurrent neural networks (RNNs) are state-of-the-art in voice awareness/understanding and speech recognition. On-device computation of RNNs on low-power mobile and wearable devices would be key to applications such as zero-latency voice-based human-machine interfaces. Here we present Chipmunk, a small (<1 mm$^2$) hardware accelerator for Long Short-Term Memory RNNs in UMC 65 nm technology, capable of operating at a measured peak efficiency of up to 3.08 Gop/s/mW at 1.24 mW peak power. To implement big RNN models without incurring huge memory transfer overhead, multiple Chipmunk engines can cooperate to form a single systolic array. In this way, the Chipmunk architecture in a 75-tile configuration can achieve real-time phoneme extraction on a demanding RNN topology proposed by Graves et al., consuming less than 13 mW of average power.
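The abstract does not describe how work is divided among cooperating engines. Purely as an illustration of the general idea, the sketch below splits a matrix-vector product (the dominant operation in an LSTM step) into blocks, so that each block of weights could stay resident in one tile and only input slices and partial sums would move between tiles. The function name `tiled_matvec` and the tile sizes are hypothetical; this is not the Chipmunk dataflow.

```python
def tiled_matvec(W, x, tile_rows, tile_cols):
    """Compute y = W @ x block by block, as if each block lived in one tile.

    W is a list of rows, x a list of inputs. Each (r0, c0) block stands in
    for a hypothetical tile that holds its weights locally and contributes
    a partial sum to the outputs of its row band.
    """
    n_out, n_in = len(W), len(x)
    y = [0] * n_out
    for r0 in range(0, n_out, tile_rows):          # row band of tiles
        for c0 in range(0, n_in, tile_cols):       # tiles along that band
            for r in range(r0, min(r0 + tile_rows, n_out)):
                partial = 0
                for c in range(c0, min(c0 + tile_cols, n_in)):
                    partial += W[r][c] * x[c]
                y[r] += partial  # partial sums accumulate across the band
    return y


if __name__ == "__main__":
    W = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
    x = [1, 2, 3]
    print(tiled_matvec(W, x, tile_rows=2, tile_cols=2))
```

The result is identical to an untiled product; only the order of accumulation and the locality of the weights change.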

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

FireFly: A High-Throughput and Reconfigurable Hardware Accelerator for Spiking Neural Networks

Author: Li Jindong
Shen Guobin
Yi Zeng
Zhang Qian
Zhao Dongcheng
Publication date: 22/01/2023

Spiking neural networks (SNNs) have been widely used due to their strong biological interpretability and high energy efficiency. With the introduction of the backpropagation algorithm and surrogate gradients, the structure of spiking neural networks has become more complex, and the performance gap with artificial neural networks has gradually narrowed. However, most SNN hardware implementations for field-programmable gate arrays (FPGAs) cannot meet arithmetic or memory efficiency requirements, which significantly restricts the development of SNNs. Existing designs either do not examine the arithmetic operations between binary spikes and synaptic weights, or assume unlimited on-chip RAM by using overly expensive devices for small tasks. To improve arithmetic efficiency, we analyze the neural dynamics of spiking neurons, generalize the SNN arithmetic operation to the multiplex-accumulate operation, and propose a high-performance implementation of this operation using the DSP48E2 hard block in Xilinx UltraScale FPGAs. To improve memory efficiency, we design a memory system that enables efficient access to synaptic weights and membrane voltages with reasonable on-chip RAM consumption. Combining these two improvements, we propose an FPGA accelerator that can process spikes generated by the firing neuron on-the-fly (FireFly). FireFly is implemented on several FPGA edge devices with limited resources but still guarantees a peak performance of 5.53 TSOP/s at 300 MHz. As a lightweight accelerator, FireFly achieves the highest computational density efficiency compared with existing research using large FPGA devices.
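The multiplex-accumulate operation mentioned in the abstract can be pictured as follows: because a spike is binary, multiplying it by a weight reduces to selecting either the weight or zero before accumulating. The sketch below is a minimal software illustration under that reading; the function name `multiplex_accumulate` is hypothetical, and it says nothing about how the paper maps the operation onto DSP48E2 blocks.

```python
def multiplex_accumulate(spikes, weights, membrane=0):
    """Accumulate synaptic weights gated by binary spikes.

    For each synapse, a 2-to-1 multiplexer picks the weight when the
    spike is 1 and zero otherwise, so no multiplier is needed.
    """
    for spike, weight in zip(spikes, weights):
        membrane += weight if spike else 0
    return membrane


if __name__ == "__main__":
    spikes = [1, 0, 1, 1]
    weights = [3, -2, 5, -1]
    print(multiplex_accumulate(spikes, weights))
```

The result equals the dot product of the spike vector and the weight vector, which is why the operation can replace a conventional multiply-accumulate for SNN layers.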

arXiv.org e-Print Archive

In-Datacenter Performance Analysis of a Tensor Processing Unit

Author: Agrawal Gaurav
Bajwa Raminder
Bates Sarah
Bhatia Suresh
Boden Nan
Borchers Al
Boyle Rick
Cantin Pierre-luc
Chao Clifford
Clark Chris
Coriell Jeremy
Daley Mike
Dau Matt
Dean Jeffrey
Gelb Ben
Ghaemmaghami Tara Vazir
Gottipati Rajendra
Gulland William
Hagmann Robert
Ho C. Richard
Hogberg Doug
Hu John
Hundt Robert
Hurt Dan
Ibarz Julian
Jaffey Aaron
Jaworski Alek
Jouppi Norman P.
Kaplan Alexander
Khaitan Harshit
Koch Andy
Kumar Naveen
Lacy Steve
Laudon James
Law James
Le Diemthu
Leary Chris
Liu Zhuyuan
Lucke Kyle
Lundin Alan
MacKean Gordon
Maggiore Adriana
Mahony Maire
Miller Kieran
Nagarajan Rahul
Narayanaswami Ravi
Ni Ray
Nix Kathy
Norrie Thomas
Omernick Mark
Patil Nishant
Patterson David
Penukonda Narayana
Phelps Andy
Ross Jonathan
Ross Matt
Salek Amir
Samadiani Emad
Severn Chris
Sizikov Gregory
Snelham Matthew
Souter Jed
Steinberg Dan
Swing Andy
Tan Mercedes
Thorson Gregory
Tian Bo
Toma Horia
Tuttle Erick
Vasudevan Vijay
Walter Richard
Wang Walter
Wilcox Eric
Yoon Doe Hyun
Young Cliff
Publication date: 16/04/2017

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU. Comment: 17 pages, 11 figures, 8 tables. To appear at the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 201
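The 92 TOPS peak figure can be sanity-checked with simple arithmetic. The abstract gives only the MAC count; the 256 x 256 array shape and the 700 MHz clock used below are not stated in the abstract and are assumed here from the full paper, as is the usual convention of counting each multiply-accumulate as two operations.

```python
# Values stated in the abstract
MAC_COUNT_STATED = 65_536

# Assumptions not in the abstract (taken from the full paper / convention)
MAC_ROWS = 256             # assumed array height
MAC_COLS = 256             # assumed array width
OPS_PER_MAC = 2            # one multiply plus one add per cycle
CLOCK_HZ = 700_000_000     # assumed 700 MHz clock

mac_count = MAC_ROWS * MAC_COLS
peak_ops_per_second = mac_count * OPS_PER_MAC * CLOCK_HZ
peak_tops = peak_ops_per_second / 1e12

if __name__ == "__main__":
    print(f"MACs: {mac_count}, peak: {peak_tops:.2f} TOPS")
```

Under these assumptions the product comes to roughly 91.75 TOPS, consistent with the rounded 92 TOPS quoted in the abstract.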

arXiv.org e-Print Archive

Crossref

MATIC: Learning Around Errors for Efficient Low-Voltage Neural Network Accelerators

Author: alvira
bang
bishop
girard
gupta
krizhevsky
lecun
nissen
Publication date: 23/03/2018

As a result of the increasing demand for deep neural network (DNN)-based services, efforts to develop dedicated hardware accelerators for DNNs are growing rapidly. However, while accelerators with high performance and efficiency on convolutional deep neural networks (Conv-DNNs) have been developed, less progress has been made with regard to fully-connected DNNs (FC-DNNs). In this paper, we propose MATIC (Memory Adaptive Training with In-situ Canaries), a methodology that enables aggressive voltage scaling of accelerator weight memories to improve the energy efficiency of DNN accelerators. To enable accurate operation with voltage overscaling, MATIC combines the characteristics of destructive SRAM reads with the error resilience of neural networks in a memory-adaptive training process. Furthermore, PVT-related voltage margins are eliminated using bit-cells from synaptic weights as in-situ canaries to track runtime environmental variation. Demonstrated on a low-power DNN accelerator that we fabricate in 65 nm CMOS, MATIC enables up to 60-80 mV of voltage overscaling (3.3x total energy reduction versus the nominal voltage), or 18.6x application error reduction. Comment: 6 pages, 12 figures, 3 tables. Published at Design, Automation and Test in Europe Conference and Exhibition (DATE) 201

arXiv.org e-Print Archive

Crossref