
    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to an excessive increase in circuit latency. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We also investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of the modern Xilinx ZCU102 FPGA platform with five state-of-the-art image classification CNN benchmarks, which allows us to study the effects of our undervolting technique across both hardware and software variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain results from eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by the FPGA vendor to ensure correct functionality under worst-case environmental and circuit conditions. A further 43% of the power-efficiency gain is due to undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%. Comment: To appear at the DSN 2020 conference.
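    As a rough illustration of where such gains come from, the sketch below applies a first-order power model (static leakage plus dynamic power proportional to f·V²) to hypothetical operating points; the voltages, frequencies, and constants are placeholders for illustration, not measurements or the methodology from the paper.

    ```python
    # Illustrative first-order sketch, not the paper's measurement methodology.
    # Assumes total power = static leakage + dynamic power proportional to f * V^2,
    # and throughput (GOPs) proportional to clock frequency. All voltages,
    # frequencies, and constants below are hypothetical placeholders.

    def power_w(f_mhz: float, v: float, k: float = 1.0, p_static: float = 20.0) -> float:
        """Static leakage plus dynamic k * f * V^2, in arbitrary units."""
        return p_static + k * f_mhz * v ** 2

    def throughput_gops(f_mhz: float) -> float:
        """Throughput tracks clock frequency (1000 GOPs at a nominal 300 MHz)."""
        return 1000.0 * f_mhz / 300.0

    def efficiency(f_mhz: float, v: float) -> float:
        """Power efficiency in GOPs per (arbitrary-unit) watt."""
        return throughput_gops(f_mhz) / power_w(f_mhz, v)

    V_NOM, V_GUARD, V_CRIT = 0.85, 0.70, 0.60   # nominal, guardband edge, aggressive (V)
    F_NOM, F_SCALED = 300.0, 250.0              # nominal and underscaled clocks (MHz)

    base = efficiency(F_NOM, V_NOM)
    no_guardband = efficiency(F_NOM, V_GUARD)    # guardband eliminated, still fault-free
    aggressive = efficiency(F_NOM, V_CRIT)       # below the guardband: faults possible
    freq_scaled = efficiency(F_SCALED, V_CRIT)   # frequency underscaling avoids faults

    print(f"guardband elimination gain:  {no_guardband / base:.2f}x")
    print(f"further undervolting gain:   {aggressive / no_guardband:.2f}x")
    print(f"with frequency underscaling: {freq_scaled / no_guardband:.2f}x")
    ```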

    Variant X-Tree Clock Distribution Network and Its Performance Evaluations

    Get PDF

    FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture

    Full text link
    Neural Network (NN) accelerators built with emerging ReRAM (resistive random access memory) technologies have been investigated as one of the promising solutions to the memory wall challenge, owing to their unique capability of processing-in-memory within ReRAM-crossbar-based processing elements (PEs). However, the high-efficiency and high-density advantages of ReRAM have not been fully exploited, due to the heavy communication demands among PEs and the overhead of peripheral circuits. In this paper, we propose a full system stack solution composed of a reconfigurable architecture design, the Field Programmable Synapse Array (FPSA), and its software system, which includes a neural synthesizer, a temporal-to-spatial mapper, and placement & routing. We heavily leverage the software system to keep the hardware design compact and efficient. To satisfy the high-performance communication demand, we optimize communication with a reconfigurable routing architecture and the placement & routing tool. To improve computational density, we greatly simplify the PE circuit using a spiking scheme and adopt the neural synthesizer so that the resulting high-density computation resources can support different kinds of NN operations. In addition, we provide spiking memory blocks (SMBs) and configurable logic blocks (CLBs) in hardware and use the temporal-to-spatial mapper to balance the storage and computation requirements of NNs. Owing to the end-to-end software system, existing deep neural networks can be deployed to FPSA efficiently. Evaluations show that, compared to PRIME, a state-of-the-art ReRAM-based NN accelerator, FPSA improves computational density by 31x; for representative NNs, its inference performance achieves up to 1000x speedup. Comment: Accepted by ASPLOS 2019.
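    To make the processing-in-memory idea behind ReRAM crossbar PEs concrete, here is a minimal numerical model of an analog matrix-vector multiply on a crossbar, assuming a common differential-pair weight mapping; the crossbar size, conductance range, and voltage range are illustrative assumptions, not FPSA design parameters.

    ```python
    # Illustrative crossbar model (not FPSA's actual circuit): each cell stores a
    # conductance G[i, j]; applying voltages V[i] to the rows yields column currents
    # I[j] = sum_i V[i] * G[i, j] (Ohm's law + Kirchhoff's current law), so one
    # analog read-out performs a full matrix-vector multiply, I = V @ G.
    import numpy as np

    rng = np.random.default_rng(0)
    rows, cols = 128, 128                        # hypothetical 128x128 crossbar PE

    weights = rng.standard_normal((rows, cols))  # signed NN weights
    g_max = 1e-4                                 # assumed maximum cell conductance (S)

    # Map signed weights onto two positive conductance arrays (differential pair),
    # since physical conductances cannot be negative.
    scale = np.abs(weights).max()
    g_pos = np.clip(weights, 0, None) / scale * g_max
    g_neg = np.clip(-weights, 0, None) / scale * g_max

    v_in = rng.uniform(0.0, 0.2, size=rows)      # input activations as row voltages (V)

    i_out = v_in @ g_pos - v_in @ g_neg          # column currents read out by ADCs
    reference = (v_in @ weights) * (g_max / scale)
    assert np.allclose(i_out, reference)         # analog MVM matches the ideal product
    print(i_out[:4])
    ```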

    Ring oscillator clocks and margins

    Get PDF
    How much margin do we have to add to the delay lines of a bundled-data circuit? This paper attempts to give a methodical answer to this question, taking into account all sources of variability and the existing EDA machinery for timing analysis and sign-off. The paper is based on a study of the margins of a ring oscillator that substitutes for a PLL as clock generator. A timing model is proposed which shows that a 12% margin for delay lines can be sufficient to cover variability in a 65 nm technology. In a typical scenario, performance and energy improvements between 15% and 35% can be obtained by using a ring oscillator instead of a PLL. The paper concludes that a synchronous circuit with a ring oscillator clock shows benefits in performance and energy similar to those of bundled-data asynchronous circuits.
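    A back-of-the-envelope sketch of why a tracking ring-oscillator clock can outperform a fixed PLL clock: the PLL period must cover the worst-case sign-off corner, while the ring oscillator tracks the die's actual conditions and only needs the delay-line margin. The delays below are hypothetical numbers, not figures from the paper; only the 12% margin is taken from the abstract.

    ```python
    # Hypothetical example, not the paper's timing model.
    TYPICAL_DELAY_NS = 1.00      # assumed typical-corner critical-path delay
    WORST_CASE_DELAY_NS = 1.30   # assumed slow-corner delay the PLL must cover
    RO_MARGIN = 0.12             # 12% delay-line margin, as suggested in the paper

    pll_period = WORST_CASE_DELAY_NS                 # PLL frequency fixed at sign-off
    ro_period = TYPICAL_DELAY_NS * (1 + RO_MARGIN)   # RO tracks the actual silicon

    speedup = pll_period / ro_period - 1
    print(f"ring-oscillator clock period: {ro_period:.2f} ns")
    print(f"performance improvement over PLL: {speedup:.0%}")  # ~16% with these numbers
    ```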

    The PreAmplifier ShAper for the ALICE TPC-Detector

    Full text link
    In this paper, the PreAmplifier ShAper (PASA) for the Time Projection Chamber (TPC) of the ALICE experiment at the LHC is presented. The ALICE TPC PASA is an ASIC that integrates 16 identical channels, each consisting of a Charge Sensitive Amplifier (CSA) followed by a pole-zero network, a self-adaptive bias network, two second-order bridged-T filters, two non-inverting level shifters, and a start-up circuit. The circuit is optimized for a detector capacitance of 18-25 pF. For an input capacitance of 25 pF, the PASA features a conversion gain of 12.74 mV/fC, a peaking time of 160 ns, a FWHM of 190 ns, a power consumption of 11.65 mW/ch, and an equivalent noise charge of 244e + 17e/pF. The circuit recovers smoothly to the baseline in about 600 ns. An integral non-linearity of 0.19% with an output swing of about 2.1 V is also achieved. The total area of the chip is 18 mm^2, and it is implemented in AMS's C35B3C1 0.35 micron CMOS technology. Detailed characterization tests were performed on about 48000 PASA circuits before mounting them on the ALICE TPC front-end cards. After more than two years of operation of the ALICE TPC with p-p and Pb-Pb collisions, the PASA has been shown to fulfill all requirements.
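    The figures quoted above combine directly: the conversion gain sets the peak output voltage for a given input charge, and the equivalent noise charge grows linearly with detector capacitance. The arithmetic below uses the abstract's gain and ENC figures; the input charge and detector capacitance are hypothetical example values, not numbers from the paper.

    ```python
    # Worked arithmetic on the quoted PASA figures; q_in_fc and c_det_pf are
    # hypothetical example inputs, not values from the paper.
    CONVERSION_GAIN_MV_PER_FC = 12.74   # quoted conversion gain
    ENC_BASE_ELECTRONS = 244.0          # ENC at zero added capacitance
    ENC_SLOPE_E_PER_PF = 17.0           # ENC increase per pF of detector capacitance
    ELECTRON_CHARGE_FC = 1.602e-4       # one electron charge in fC

    q_in_fc = 15.0                      # example input charge (fC)
    c_det_pf = 25.0                     # example detector capacitance (pF)

    peak_mv = q_in_fc * CONVERSION_GAIN_MV_PER_FC
    enc_e = ENC_BASE_ELECTRONS + ENC_SLOPE_E_PER_PF * c_det_pf
    enc_fc = enc_e * ELECTRON_CHARGE_FC

    print(f"peak output for {q_in_fc} fC: {peak_mv:.1f} mV")            # ~191 mV
    print(f"ENC at {c_det_pf} pF: {enc_e:.0f} electrons ({enc_fc:.3f} fC)")
    print(f"signal-to-noise ratio: {q_in_fc / enc_fc:.0f}")
    ```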

    An Energy and Performance Exploration of Network-on-Chip Architectures

    Get PDF
    In this paper, we explore the designs of a circuit-switched router, a wormhole router, a quality-of-service (QoS) supporting virtual channel router, and a speculative virtual channel router, and accurately evaluate the energy-performance trade-offs they offer. Power results from the designs, placed and routed in a 90-nm CMOS process, show that all the architectures dissipate significant idle-state power. The additional energy required to route a packet through the router is then shown to be dominated by the datapath. This leads to the key result that, if this trend continues, the use of more elaborate control can be justified and will not be immediately limited by the energy budget. A performance analysis also shows that dynamic resource allocation leads to the lowest network latencies, while static allocation may be used to meet QoS goals. Combining the power and performance figures then allows an energy-latency product to be calculated to judge the efficiency of each of the networks. The speculative virtual channel router is shown to have an efficiency very similar to that of the wormhole router, while providing better performance, supporting its use for general-purpose designs. Finally, area metrics are also presented to allow a comparison of implementation costs.
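    The energy-latency product used for the comparison is simply energy per packet multiplied by average packet latency, with lower values indicating a more efficient network. The sketch below computes it for a few router types; all numbers are made-up placeholders, not results from the paper.

    ```python
    # Illustrative only: energy-latency product (ELP) comparison with invented numbers.
    routers = {
        # name: (energy per packet in pJ, average latency in cycles) - hypothetical
        "wormhole":            (120.0, 30.0),
        "speculative VC":      (135.0, 26.0),
        "QoS virtual channel": (160.0, 28.0),
        "circuit-switched":    (100.0, 45.0),
    }

    for name, (energy_pj, latency_cycles) in routers.items():
        elp = energy_pj * latency_cycles   # lower is better
        print(f"{name:>20s}: ELP = {elp:7.1f} pJ*cycles")
    ```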

    Statistical Power Supply Dynamic Noise Prediction in Hierarchical Power Grid and Package Networks

    Get PDF
    One of the most crucial design challenges for high-performance systems-on-chip is coping with power supply noise, which grows with higher operating frequencies, the large number of functional blocks, and technology scaling. In contrast with traditional post-physical-design static voltage drop analysis, this work focuses on a priori dynamic voltage drop evaluation. It takes into account transient currents and on-chip and package RLC parasitics while exploring the power grid design solution space: design countermeasures can thus be defined early, and long post-physical-design verification cycles can be shortened. As shown by an extensive set of results, a carefully extracted and modular grid library ensures a realistic evaluation of the impact of parasitics on noise and facilitates power network construction; furthermore, statistical analysis guarantees a correct current envelope evaluation, and SPICE simulations confirm the reliability of the results.
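    As a minimal first-order picture of the dynamic (rather than static) voltage drop such an analysis targets, the supply droop seen at a grid node combines a resistive IR term with an inductive L·di/dt term from the grid and package parasitics. The sketch below uses hypothetical parasitic and current values, not the paper's statistical model or grid library.

    ```python
    # Simple first-order sketch with hypothetical values, not the paper's methodology.
    R_GRID_OHM = 0.05    # assumed effective grid + package resistance (ohm)
    L_PKG_H = 0.1e-9     # assumed effective package inductance (0.1 nH)
    VDD = 1.0            # nominal supply voltage (V)

    i_peak_a = 1.0       # assumed peak transient current drawn by a block (A)
    di = 0.5             # assumed current step (A) ...
    dt = 1.0e-9          # ... occurring over 1 ns

    ir_drop = i_peak_a * R_GRID_OHM          # resistive (static-like) component
    ldi_dt_drop = L_PKG_H * di / dt          # inductive (dynamic) component
    total_drop = ir_drop + ldi_dt_drop

    print(f"IR drop:      {ir_drop * 1e3:.0f} mV")
    print(f"L*di/dt drop: {ldi_dt_drop * 1e3:.0f} mV")
    print(f"supply seen by the block: {VDD - total_drop:.3f} V "
          f"({total_drop / VDD:.0%} droop)")
    ```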