ASC: A stream compiler for computing with FPGAs
A Methodology to Design Pipelined Simulated Annealing Kernel Accelerators on Space-Borne Field-Programmable Gate Arrays
Increased levels of science objectives expected from spacecraft systems necessitate the ability to carry out fast on-board autonomous mission planning and scheduling. Heterogeneous radiation-hardened Field Programmable Gate Arrays (FPGAs) with embedded multiplier and memory modules are well suited to support the acceleration of scheduling algorithms. A methodology to design circuits specifically to accelerate Simulated Annealing Kernels (SAKs) in event scheduling algorithms is shown. The main contribution of this thesis is the low-complexity scoring calculation used by the heuristic mapping algorithm that balances resource allocation across a coarse-grained pipelined data-path. The methodology was exercised over various kernels with different cost functions and problem sizes. These test cases were benchmarked for execution time, resource usage, power, and energy on a Xilinx Virtex 4 LX QR 200 FPGA and a BAE RAD 750 microprocessor.
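For readers unfamiliar with the kernel being accelerated, the following is a minimal software sketch of a generic simulated annealing loop applied to a toy event-ordering problem. It is not the thesis's hardware design or its scoring calculation. The cost function, the swap move, the cooling parameters, and all names (`simulated_annealing`, `swap_neighbor`, `total_completion_time`) are assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(cost, neighbor, state, t_start=10.0, t_end=0.01,
                        alpha=0.95, steps_per_temp=50, seed=0):
    """Generic annealing loop; all parameter defaults are illustrative."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    current, current_cost = state, cost(state)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t) (Metropolis criterion).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # geometric cooling schedule
    return best, best_cost


def swap_neighbor(order, rng):
    """Hypothetical move: swap two events in the schedule."""
    i, j = rng.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order


def total_completion_time(order, durations=(4, 2, 7, 1, 3)):
    """Hypothetical cost: sum of completion times of sequential events."""
    elapsed = total = 0
    for event in order:
        elapsed += durations[event]
        total += elapsed
    return total


initial = [0, 1, 2, 3, 4]
best, best_cost = simulated_annealing(total_completion_time, swap_neighbor, initial)
print(best, best_cost)
```

The inner loop (propose a move, evaluate the cost, accept or reject) is the part that a pipelined data-path would target, since each iteration repeats the same sequence of operations.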
FPGA Energy Efficiency by Leveraging Thermal Margin
Cutting-edge FPGAs are not as energy efficient as conventionally presumed,
and therefore, aggressive power-saving techniques have become imperative. The
clock rate of an FPGA-mapped design is set based on worst-case conditions to
ensure reliable operation under all circumstances. This usually leaves a
considerable timing margin that can be exploited to reduce power consumption by
scaling voltage without lowering clock frequency. There are hurdles for such
opportunistic voltage scaling in FPGAs because (a) critical paths change with
designs, making timing evaluation difficult as voltage changes, (b) each FPGA
resource has a particular power-delay trade-off with voltage, and (c) data corruption
of configuration cells and memory blocks further hampers voltage scaling. In
this paper, we propose a systematic approach to leverage the available
thermal headroom of FPGA-mapped designs for power and energy improvement. By
comprehensively analyzing the timing and power consumption of FPGA building
blocks under varying temperatures and voltages, we propose a thermal-aware
voltage scaling flow that effectively utilizes the thermal margin to reduce
power consumption without degrading performance. We show the proposed flow can
be employed for energy optimization as well, whereby power consumption and
delay are compromised to accomplish the tasks with minimum energy. Lastly, we
propose a simulation framework to be able to examine the efficiency of the
proposed method for other applications that are inherently tolerant to a
certain amount of error, granting further power saving opportunity.
Experimental results over a set of industrial benchmarks indicate up to 36%
power reduction with the same performance, and 66% total energy saving when
energy is the optimization target. Comment: Accepted in IEEE International Conference on Computer Design (ICCD)
201
Hardware implementation of intelligent systems
MFPA: Mixed-Signal Field Programmable Array for Energy-Aware Compressive Signal Processing
Compressive Sensing (CS) is a signal processing technique which reduces the number of samples taken per frame to decrease energy, storage, and data transmission overheads, as well as the time taken for data acquisition in time-critical applications. The tradeoff in such an approach is increased complexity of signal reconstruction. While several algorithms have been developed for CS signal reconstruction, hardware implementation of these algorithms is still an area of active research. Prior work has sought to utilize parallelism available in reconstruction algorithms to minimize hardware overheads; however, such approaches are constrained by the underlying limitations of CMOS technology. Herein, the MFPA (Mixed-signal Field Programmable Array) approach is presented as a hybrid spin-CMOS reconfigurable fabric specifically designed for implementation of CS data sampling and signal reconstruction. The resulting fabric consists of 1) slice-organized analog blocks providing amplifiers, transistors, capacitors, and Magnetic Tunnel Junctions (MTJs) which are configurable to achieve the square/square root operations required for calculating vector norms, 2) digital functional blocks which feature 6-input clockless lookup tables for computation of matrix inverse, and 3) an MRAM-based nonvolatile crossbar array for carrying out low-energy matrix-vector multiplication operations. The various functional blocks are connected via a global interconnect and spin-based analog-to-digital converters. Simulation results demonstrate significant energy and area benefits compared to equivalent CMOS digital implementations for each of the functional blocks used: this includes an 80% reduction in energy and 97% reduction in transistor count for the nonvolatile crossbar array, 80% standby power reduction and 25% reduced area footprint for the clockless lookup tables, and roughly 97% reduction in transistor count for a multiplier built using components from the analog blocks. Moreover, the proposed fabric yields 77% energy reduction compared to CMOS when used to implement CS reconstruction, in addition to latency improvements.
LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations
We propose two tiers of modifications to FPGA logic cell architecture to
deliver a variety of performance and utilization benefits with only minor area
overheads. In the first tier, we augment existing commercial logic cell
datapaths with a 6-input XOR gate in order to improve the expressiveness of
each element, while maintaining backward compatibility. This new architecture
is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary
tier of vendor-specific modifications to both Xilinx and Intel FPGAs, which we
refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor
tree synthesis using generalized parallel counters (GPCs) is further improved
with the proposed modifications. Using both the Intel adaptive logic module and
the Xilinx slice at the 65nm technology node for a comparative study, it is
shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We
demonstrate that LUXOR can deliver an average reduction of 13-19% in logic
utilization on micro-benchmarks from a variety of domains. BNN benchmarks
benefit the most with an average reduction of 37-47% in logic utilization,
which is due to the highly-efficient mapping of the XnorPopcount operation on
our proposed LUXOR+ logic cells. Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA,
US
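As background for the XnorPopcount operation mentioned in the abstract above, the following is a small software sketch of how a binarized dot product reduces to an XNOR followed by a population count. It illustrates the arithmetic only, not the LUXOR+ logic cell mapping. The function names and the bit-packing convention (+1 stored as bit 1, -1 as bit 0) are assumptions made for this example.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_popcount(a_bits, w_bits, n):
    """Count the positions (out of n) where the two bit vectors agree."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask  # XNOR, restricted to n bits
    return bin(agree).count("1")       # population count


def binary_dot(a_bits, w_bits, n):
    """Dot product of two +1/-1 vectors: agreements minus disagreements."""
    return 2 * xnor_popcount(a_bits, w_bits, n) - n


activations = [1, -1, 1, 1, -1, -1]
weights = [1, 1, -1, 1, -1, 1]
reference = sum(a * w for a, w in zip(activations, weights))
result = binary_dot(pack(activations), pack(weights), len(activations))
print(result, reference)
```

The population count is a multi-operand addition of single bits, which is the kind of summation that compressor trees built from generalized parallel counters implement in hardware.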