Cycle-accurate multicore performance models on FPGAs
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 159-165). The goal of this project is to improve computer architecture by accelerating cycle-accurate performance modeling of multicore processors using FPGAs. Contributions include a distributed technique for controlling simulation on a highly parallel substrate, hardware design techniques to reduce development effort, and a specific framework for modeling shared-memory multicore processors paired with realistic On-Chip Networks. By Michael Pellauer. Ph.D.
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by the FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%. Comment: To appear at the DSN 2020 conference.
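The gains quoted in this abstract can be checked with a little arithmetic. The sketch below assumes the guardband gain and the further below-guardband gain compose multiplicatively; the abstract does not state this, and the function name `combined_gain` is only illustrative.

```python
# Gains as reported in the abstract above.
GUARDBAND_GAIN = 2.6             # from eliminating the voltage guardband
BELOW_GUARDBAND_EXTRA = 0.43     # further gain below the guardband (with accuracy loss)
BELOW_GUARDBAND_EXTRA_FREQ = 0.25  # further gain when frequency underscaling prevents accuracy loss


def combined_gain(base: float, extra_fraction: float) -> float:
    """Compose two gains multiplicatively.

    ASSUMPTION: the abstract does not say how the two gains combine;
    multiplication is one plausible reading.
    """
    return base * (1.0 + extra_fraction)


if __name__ == "__main__":
    with_loss = combined_gain(GUARDBAND_GAIN, BELOW_GUARDBAND_EXTRA)
    without_loss = combined_gain(GUARDBAND_GAIN, BELOW_GUARDBAND_EXTRA_FREQ)
    print(f"undervolting below guardband: {with_loss:.2f}X")
    print(f"with frequency underscaling:  {without_loss:.2f}X")
```

Under that assumption both totals (about 3.72X and 3.25X) are consistent with the reported "more than 3X" gain.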
Seeing Shapes in Clouds: On the Performance-Cost trade-off for Heterogeneous Infrastructure-as-a-Service
In the near future, FPGAs will be available by the hour; however, this new
Infrastructure as a Service (IaaS) usage mode presents both an opportunity and
a challenge: The opportunity is that programmers can potentially trade
resources for performance on a much larger scale, for much shorter periods of
time than before. The challenge is in finding and traversing the trade-off for
heterogeneous IaaS that guarantees increased resources result in the greatest
possible increased performance. Such a trade-off is Pareto optimal. The Pareto
optimal trade-off for clusters of heterogeneous resources can be found by
solving multiple, multi-objective optimisation problems, resulting in an
optimal allocation of tasks to the available platforms. Solving these
optimisation programs can be done using simple heuristic approaches or formal
Mixed Integer Linear Programming (MILP) techniques. When pricing 128 financial
options using a Monte Carlo algorithm upon a heterogeneous cluster of Multicore
CPU, GPU and FPGA platforms, the MILP approach produces a trade-off that is up
to 110% faster than a heuristic approach, and over 50% cheaper. These results
suggest that high quality performance-resource trade-offs of heterogeneous IaaS
are best realised through a formal optimisation approach. Comment: Presented at Second International Workshop on FPGAs for Software
Programmers (FSP 2015) (arXiv:1508.06320)
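To make the notion of a Pareto-optimal trade-off concrete, here is a minimal sketch of a dominance filter over candidate allocations. It is not the MILP or heuristic method from the paper; it only illustrates what it means for one (cost, latency) point to dominate another. The platform labels and numbers are invented for illustration.

```python
from typing import List, Tuple

# (label, cost, latency); lower is better for both objectives.
Allocation = Tuple[str, float, float]


def pareto_front(points: List[Allocation]) -> List[Allocation]:
    """Return the allocations that no other allocation dominates.

    An allocation is dominated if another one is no worse in both cost and
    latency and strictly better in at least one of them.
    """
    front = []
    for name, cost, latency in points:
        dominated = any(
            (c <= cost and t <= latency) and (c < cost or t < latency)
            for _, c, t in points
        )
        if not dominated:
            front.append((name, cost, latency))
    return sorted(front, key=lambda p: p[1])  # order by increasing cost


if __name__ == "__main__":
    # HYPOTHETICAL data: not taken from the paper.
    candidates = [
        ("cpu-only", 1.0, 100.0),
        ("cpu+fpga", 2.0, 55.0),
        ("cpu+gpu", 2.5, 40.0),
        ("gpu-only", 3.0, 60.0),  # dominated by cpu+gpu (cheaper and faster)
        ("all", 5.0, 25.0),
    ]
    for name, cost, latency in pareto_front(candidates):
        print(f"{name}: cost={cost}, latency={latency}")
```

Along the resulting front, spending more always buys lower latency, which is the property the abstract asks of a good trade-off.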
High-Level Debugging and Verification for FPGA-Based Multicore Architectures
Simulators are key tools for computer architecture research. However, multicore architectures represent a highly complex challenge for software simulators, which may suffer from fidelity loss and long execution times. FPGAs can simulate multicore architectures with scalable performance and high accuracy, but the difficulty of debugging could hinder their adoption.
In this paper we propose several techniques for inspection, debugging and verification of multicore architectures, both for software-based and FPGA-based simulations. These debugging extensions are cycle-accurate and unobtrusive. As a proof of concept, we have developed a 24-core RISC multiprocessor that runs the Linux kernel, for which we provide three simulation modes: a fast, functional simulation; a detailed, cycle-accurate simulation; and an FPGA-based simulation. Our platform can run up to 24 cores and perform full-system verification at 17 million instructions per second. Peer Reviewed. Postprint (author's final draft)
A general technique for deterministic model-cycle-level debugging
Efficient use of FPGA resources requires FPGA-based performance models of complex hardware to implement one model cycle, i.e., one time-step of the original synchronous system, in several implementation cycles. Generally implementation cycles have no simple relationship with model cycles, and it is tricky to reconstruct the state of the synchronous system at the model-cycle boundaries if only implementation-cycle-level control and information is provided. A good debugging facility needs to provide: complete control over the functioning of the target design being simulated; fast and easy access to all the significant target design state for both monitoring and modification; and some means of accomplishing deterministic execution when the target design is a multicore processor running a parallel application. Moreover, these features need to be provided in a manner which does not incur substantial resource and performance penalties. In this paper, we present a debugging technique based on the LI-BDN theory. We show how the technique facilitates deterministic model-cycle-level debugging. We used it to build the debugging infrastructure for Arete, which is an FPGA-based cycle-accurate multicore simulator. The resource and performance penalties of our debugging technique are minimal; in Arete the debugging infrastructure has area and performance overheads of 5% and 6%, respectively. IBM Research
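The distinction between model cycles and implementation cycles can be illustrated with a toy simulation. This is a rough sketch only: it is neither the LI-BDN formalism nor Arete's debugging infrastructure, and every name in it (`Node`, `run_until_model_cycle`, the accumulate-style state update) is hypothetical. Each node spends a fixed number of implementation cycles per model cycle, and the driver stops every node at the same model-cycle boundary.

```python
from collections import deque
from typing import Iterable, List, Optional


class Node:
    """Toy node: one model cycle takes `latency` implementation cycles."""

    def __init__(self, name: str, latency: int, inputs: Iterable[int]) -> None:
        self.name = name
        self.latency = latency
        self.inbox = deque(inputs)   # one input token per model cycle
        self.state = 0               # architectural (model-visible) state
        self.model_cycle = 0         # completed model cycles
        self._pending: Optional[int] = None
        self._busy = 0

    def step(self) -> None:
        """Advance one implementation cycle."""
        if self._pending is None:
            if not self.inbox:
                return               # stall: no input token available yet
            self._pending = self.inbox.popleft()
            self._busy = self.latency
        self._busy -= 1
        if self._busy == 0:
            # Commit the model-cycle state update (a toy accumulate).
            self.state += self._pending
            self._pending = None
            self.model_cycle += 1


def run_until_model_cycle(nodes: List[Node], target: int,
                          max_impl_cycles: int = 10_000) -> int:
    """Step nodes until all reach `target`; return implementation cycles used."""
    impl_cycles = 0
    while any(n.model_cycle < target for n in nodes):
        if impl_cycles >= max_impl_cycles:
            raise RuntimeError("target model cycle not reached")
        for n in nodes:
            if n.model_cycle < target:   # hold nodes already at the boundary
                n.step()
        impl_cycles += 1
    return impl_cycles


if __name__ == "__main__":
    fast = Node("fast", latency=1, inputs=[1, 2, 3, 4, 5])
    slow = Node("slow", latency=3, inputs=[1, 2, 3, 4, 5])
    used = run_until_model_cycle([fast, slow], target=3)
    print(used, fast.state, slow.state)
```

In this sketch both nodes hold the same state at model cycle 3 even though they needed different numbers of implementation cycles to get there, which is the kind of boundary a model-cycle-level debugger would want to stop at.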