1,201 research outputs found
Mapping and Scheduling of Directed Acyclic Graphs on An FPFA Tile
An architecture for a hand-held multimedia device requires components that are energy-efficient, flexible, and provide high performance. In the CHAMELEON [4] project we develop a coarse-grained reconfigurable device for DSP-like algorithms, the so-called Field Programmable Function Array (FPFA). FPFA devices are reminiscent of FPGAs, but with a matrix of Processing Parts (PPs) instead of CLBs. The design of the FPFA focuses on: (1) keeping each PP small to maximize the number of PPs that can fit on a chip; (2) providing sufficient flexibility; (3) low energy consumption; (4) exploiting the maximum amount of parallelism; (5) a strong support tool for FPFA-based applications. The challenge in providing compiler support for FPFA-based design stems from the flexibility of the FPFA structure. If we do not use the characteristics of the FPFA structure properly, the advantages of an FPFA may become its disadvantages. The GECKO project focuses on this problem. In this paper, we present a mapping and scheduling scheme for applications running on one FPFA tile. Applications are written in C, and the C code is translated to a Directed Acyclic Graph (DAG) [4]. This scheme can map a DAG directly onto the reconfigurable PPs of an FPFA tile. It tries to achieve low power consumption by exploiting locality of reference and high performance by exploiting maximum parallelism.
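The abstract above does not spell out its scheduling scheme, so the following is only a minimal sketch of the general idea of scheduling a DAG onto a fixed number of processing parts. It uses plain greedy list scheduling, which is not claimed to be the paper's method. The function name `schedule_dag`, the example graph `EXAMPLE_DEPS`, and the unit counts are all hypothetical.

```python
def schedule_dag(deps, num_units):
    """Greedy list scheduling (illustrative only, not the paper's algorithm).

    deps maps every node to the list of nodes it depends on; each node must
    appear as a key. Returns a list of cycles, each a list of nodes issued
    together, with at most num_units nodes per cycle.
    """
    remaining = {node: set(preds) for node, preds in deps.items()}
    done = set()
    cycles = []
    while remaining:
        # A node is ready once all of its predecessors have been issued.
        ready = sorted(n for n, preds in remaining.items() if preds <= done)
        if not ready:
            raise ValueError("dependency graph is cyclic or incomplete")
        issued = ready[:num_units]  # hypothetical limit: one node per unit
        cycles.append(issued)
        done.update(issued)
        for node in issued:
            del remaining[node]
    return cycles


# Hypothetical DAG: three loads, two multiplies, one final sum.
EXAMPLE_DEPS = {
    "a": [], "b": [], "c": [],
    "m1": ["a", "b"],
    "m2": ["b", "c"],
    "s": ["m1", "m2"],
}

if __name__ == "__main__":
    for cycle, nodes in enumerate(schedule_dag(EXAMPLE_DEPS, 2)):
        print(cycle, nodes)
```

With more units available, independent nodes share a cycle and the schedule shortens, which is the parallelism the abstract refers to. A real mapper would also weigh data locality when choosing which unit runs which node.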
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
Generic Connectivity-Based CGRA Mapping via Integer Linear Programming
Coarse-grained reconfigurable architectures (CGRAs) are programmable logic
devices with large coarse-grained ALU-like logic blocks, and multi-bit
datapath-style routing. CGRAs often have relatively restricted data routing
networks, so they attract CAD mapping tools that use exact methods, such as
Integer Linear Programming (ILP). However, tools that target general
architectures must use large constraint systems to fully describe an
architecture's flexibility, resulting in lengthy run-times. In this paper, we
propose to derive connectivity information from an otherwise generic device
model, and use this to create simpler ILPs, which we combine in an iterative
schedule and retain most of the exactness of a fully-generic ILP approach. This
new approach has a speed-up geometric mean of 5.88x when considering benchmarks
that do not hit a time-limit of 7.5 hours on the fully-generic ILP, and 37.6x
otherwise. This was measured using the set of benchmarks used to originally
evaluate the fully-generic approach and several more benchmarks representing
computation tasks, over three different CGRA architectures. All run-times of
the new approach are less than 20 minutes, with 90th percentile time of 410
seconds. The proposed mapping techniques are integrated into, and evaluated
using the open-source CGRA-ME architecture modelling and exploration framework.
Comment: 8 pages of content; 8 figures; 3 tables; to appear in FCCM 2019; Uses
the CGRA-ME framework at http://cgra-me.ece.utoronto.ca
Low Power Processor Architectures and Contemporary Techniques for Power Optimization - A Review
Technological evolution has significantly increased the number of transistors for a given die area and raised the switching speed from a few MHz to the GHz range. This decline in size and boost in performance demands a shrinking supply voltage and effective power dissipation in chips with millions of transistors. This has triggered a substantial amount of research in power reduction techniques for almost every aspect of the chip, and particularly for the processor cores contained in the chip. This paper presents an overview of techniques for achieving power efficiency mainly at the processor core level, but also visits related domains such as buses and memories. There are various processor parameters and features, such as supply voltage, clock frequency, cache and pipelining, which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Emerging power-efficient processor architectures are also overviewed, and research activities are discussed which should help the reader identify how these factors in a processor contribute to power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER
Reconfigurable Architectures for Wireless Systems: Design Exploration and Integration Challenges
Mobile devices are severely power and area limited due to battery capacity
and system size. In many of these example systems, advanced features require
computationally complex signal processing on high-speed data streams for enhanced networking capabilities. Thus, mapping high-level communication and networking algorithms to system architectures is a complex and challenging procedure. An important challenge is to characterize the area, time, and power requirements of these embedded system modules and to use this information effectively to determine the architecture of programmable, reconfigurable, and fixed-function modules. In this paper, we will focus on application examples in wireless networking which highlight these challenges in reconfigurable systems integration.
Nokia Corporation; Texas Instruments Incorporated; National Science Foundation
Digital implementation of the cellular sensor-computers
Two different kinds of cellular sensor-processor architectures are used nowadays in various
applications. The first is the traditional sensor-processor architecture, where the sensor and the
processor arrays are mapped into each other. The second is the foveal architecture, in which a
small active fovea is navigating in a large sensor array. This second architecture is introduced
and compared here. Both of these architectures can be implemented with analog and digital
processor arrays. The efficiency of the different implementation types, depending on the CMOS
technology used, is analyzed. It turns out that the finer the technology, the more favorable a
digital implementation is compared with an analog one
FPGA structures for high speed and low overhead dynamic circuit specialization
A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function through implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired and hence the name "field programmable". FPGAs are useful in small volume digital electronic products as the design of a digital custom chip is expensive. Changing the FPGA (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user's needs to implement the user-defined hardware. The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure a part in the configuration memory with partial bitstreams during run-time. The reconfiguration
is achieved by swapping in partial bitstreams into the configuration memory via a configuration interface called Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA used to access the
configuration memory internally by an embedded processor. The reconfiguration technique adds flexibility to use specialized circuits that are more compact and more efficient than their bulky counterparts. An example of such an implementation is the use of specialized multipliers instead of big generic multipliers in an FIR implementation with constant coefficients. To specialize these circuits and reconfigure during the run-time, researchers at the HES group proposed the novel technique called parameterized reconfiguration that can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS) that is built on top of the Partial Reconfiguration method. It uses
the run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS these inputs are implemented as constants, and the application is optimized for the constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for a new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration. A detailed study of overheads of DCS and providing suitable solutions with appropriate custom FPGA structures is the primary goal of the dissertation. I also suggest different improvements to the FPGA configuration memory architecture. After offering the custom FPGA structures, I investigated the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays. By doing so, I hope I can convince the developer to use DCS (which now comes with minimal costs) in real-world applications. I start the investigations of overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. The study of how DCS behaves and what is its overhead in the evolution of the three FPGA platforms is the non-trivial basis to discover the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead (reconfiguration time) of DCS. 
These structures not only reduce the reconfiguration time but also help curb the power-hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types, depending on the level of parameterization: the conventional VCGRA, the partially parameterized VCGRA and the fully parameterized VCGRA. I have designed two variants of VCGRA grids for HPC image processing applications,
namely, the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to improve the reconfiguration time of DCS drastically. However, this improvement comes with a
significant overhead of hardware resources which will need to be solved in future research on commercial FPGA configuration memory architectures
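
The abstract above describes DCS as expressing configuration bits as Boolean functions of the parameters, then evaluating those functions whenever a parameter changes. The toy model below is a sketch of that idea only, not the HES group's tool flow or any vendor API. The names `PARAM_CONFIG`, `specialize`, and `lut_eval`, and the two one-input LUTs, are hypothetical.

```python
# Toy model: two 1-input LUTs whose truth-table bits depend on parameters.
# LUT0 computes (x AND p0); LUT1 computes (x XOR p1). Each entry below is
# one configuration bit written as a Boolean function of the parameters.
PARAM_CONFIG = [
    lambda p: False,        # LUT0, entry for x = 0
    lambda p: p["p0"],      # LUT0, entry for x = 1
    lambda p: p["p1"],      # LUT1, entry for x = 0
    lambda p: not p["p1"],  # LUT1, entry for x = 1
]


def specialize(param_config, params):
    """Evaluate every Boolean function to obtain concrete configuration bits."""
    return [int(bool(bit_fn(params))) for bit_fn in param_config]


def lut_eval(truth_table, x):
    """Look up the output of a 1-input LUT for data input x (0 or 1)."""
    return truth_table[x]


if __name__ == "__main__":
    bits = specialize(PARAM_CONFIG, {"p0": True, "p1": False})
    print(bits)
    # After specialization the parameters are no longer circuit inputs.
    print(lut_eval(bits[0:2], 1), lut_eval(bits[2:4], 1))
```

In this model a parameter change costs one evaluation pass plus rewriting the affected bits, which corresponds to the reconfiguration overhead the dissertation sets out to reduce.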