11 research outputs found

    Accurate Power Analysis for Near-Vt RRAM-based FPGA

    Resistive Random Access Memory (RRAM)-based FPGA architectures employ RRAMs not only as memories to store the configuration but also embed them in the datapaths of programmable routing resources to propagate signals with improved performance. Sources of power consumption have been intensively studied for conventional Static Random Access Memory (SRAM)-based FPGAs. However, very few works have so far focused on the power characteristics of RRAM-based FPGAs. In this paper, we first analyze the power characteristics of RRAM-based multiplexers at the circuit level and then use electrical simulations to study the power consumption of RRAM-based FPGA architectures. Experimental results show that RRAM-based FPGAs achieve a Power-Delay Product reduced by 50% compared to SRAM-based FPGAs at nominal voltage and by 20% compared to near-Vt SRAM-based FPGAs.
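
    For reference, the metric behind these figures is straightforward to compute; the sketch below uses hypothetical power and delay values, with only the 50% reduction mirroring the abstract's figure.

    ```python
    # Minimal sketch of how a Power-Delay Product (PDP) comparison is
    # computed. The absolute power/delay numbers below are hypothetical;
    # only the resulting 50% reduction mirrors the figure quoted above.
    sram_nominal = {"power_mW": 10.0, "delay_ns": 2.0}   # hypothetical SRAM FPGA
    rram_nominal = {"power_mW": 6.25, "delay_ns": 1.6}   # hypothetical RRAM FPGA

    def pdp(point):
        """PDP in mW*ns, i.e., picojoules."""
        return point["power_mW"] * point["delay_ns"]

    reduction = 1.0 - pdp(rram_nominal) / pdp(sram_nominal)
    print(f"PDP reduction: {reduction:.0%}")             # -> PDP reduction: 50%
    ```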

    Optimization Opportunities in RRAM-based FPGA Architectures

    Static Random Access Memory (SRAM)-based routing multiplexers, whatever structure is employed, share a common limitation: their area, delay and power increase linearly with the input size. As a result, most SRAM-based FPGA architectures avoid the use of large multiplexers. Resistive Random Access Memory (RRAM)-based multiplexers, built with a one-level structure, have a unique advantage over SRAM-based multiplexers: their ideal delay is independent of the input size. This property allows RRAM-based FPGA architectures to use larger multiplexers than their SRAM-based counterparts without incurring any delay overhead. In this paper, by carefully considering the properties of RRAM multiplexers, we show that current state-of-the-art architectural parameters for SRAM-based FPGAs cannot preserve optimality in the context of RRAM-based FPGAs. As a result, we propose that in RRAM-based FPGAs, (a) the routing tracks should be interconnected to Look-Up Table (LUT) inputs via a one-level crossbar, instead of through Connection Blocks and local routing; (b) the Switch Blocks should employ larger multiplexers; and (c) length-2 wires should be used instead of length-4 wires. When operated at nominal voltage, the proposed RRAM-based FPGA architecture reduces area by 26%, delay by 39% and channel width by 13%, as compared to an SRAM-based FPGA with a classical architecture. When operated in the near-Vt regime, it improves Area-Delay Product by 42% and Power-Delay Product by 5x as compared to a classical SRAM-based FPGA at nominal voltage.
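
    The scaling argument can be captured in a first-order model: SRAM-based multiplexer delay grows linearly with input size, whereas the ideal delay of a one-level RRAM-based multiplexer is a constant. The coefficients below are hypothetical placeholders, not fitted circuit data.

    ```python
    # First-order model of the scaling claim: SRAM-based multiplexer delay
    # grows linearly with input size (per the abstract), while a one-level
    # RRAM-based multiplexer is ideally size-independent. Coefficients are
    # hypothetical placeholders, not fitted circuit data.
    def sram_mux_delay_ps(n_inputs, t_per_input=5.0):
        return t_per_input * n_inputs      # linear in input size

    def rram_mux_delay_ps(n_inputs, t_fixed=40.0):
        return t_fixed                     # one level: independent of input size

    for n in (4, 16, 64):
        print(f"{n:2d} inputs: SRAM {sram_mux_delay_ps(n):5.1f} ps, "
              f"RRAM {rram_mux_delay_ps(n):5.1f} ps")
    ```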

    FPGA-SPICE: A Simulation-based Power Estimation Framework for FPGAs

    Mainstream Field Programmable Gate Array (FPGA) power estimation tools are based on probabilistic activity estimation and analytical power models. The power consumption of the programmable resources of FPGAs is highly sensitive to their configurations. Due to their highly flexible nature, the configurations of FPGA routing multiplexers and Look-Up Tables (LUTs) differ greatly from one design to another, but current analytical power models cannot accurately capture the associated power differences. In this paper, we introduce a simulation-based power estimation framework for FPGAs, called FPGA-SPICE, which supports any FPGA architecture that can be described with an architectural description language. Our power estimation engine automatically generates accurate SPICE netlists according to the FPGA configurations and enables precise power analysis of FPGA architectures. SPICE testbenches can be generated at different levels of complexity, denoted as full-chip-level, grid-level and component-level testbenches. Full-chip-level testbenches dump the netlists associated with the complete FPGA fabric. To reduce simulation time, FPGA-SPICE can split the full-chip-level testbenches into grid-level testbenches, each of which consists of a complete logic block netlist, or component-level testbenches, which consider individual circuit elements, i.e., multiplexers, LUTs, flip-flops, etc., separately. We show that the grid/component-level approach can achieve a 14x speed-up with a moderate 14% accuracy loss compared to the full-chip level. We also use FPGA-SPICE to study the power characteristics of a commercial FPGA architecture at different technology nodes. Experimental results show that the global routing architecture consumes 50% of the total power, the local routing architecture accounts for 40%, and the remaining 10% comes from the LUTs and flip-flops.
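
    The decomposition idea can be sketched as follows: rather than simulating one monolithic full-chip netlist, total power is approximated as the sum of smaller, independently simulated testbenches. The function names and the stubbed simulator are hypothetical; FPGA-SPICE itself emits real SPICE decks.

    ```python
    # Conceptual sketch of the testbench hierarchy: full-chip power is
    # approximated by summing smaller, independently simulated partitions
    # (grids or individual components). Names and the simulate() callback
    # are hypothetical stand-ins for actual SPICE runs.
    def estimate_power_mw(partitions, simulate):
        """Sum per-partition simulated power (grid- or component-level)."""
        return sum(simulate(p) for p in partitions)

    # Tiny demo with a stubbed simulator; at grid level the partitions
    # might be ["grid_0_0", "grid_0_1", ...], at component level
    # ["mux_a", "lut_b", "ff_c", ...]. Each small deck simulates far
    # faster than the monolithic netlist, which is where the reported
    # 14x speed-up (at 14% accuracy loss) comes from.
    fake_results = {"grid_0_0": 1.2, "grid_0_1": 0.9, "grid_1_0": 1.1}
    print(f"{estimate_power_mw(fake_results, fake_results.get):.1f} mW")
    ```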

    A high-performance low-power near-Vt RRAM-based FPGA

    FPGA-SPICE: A Simulation-Based Architecture Evaluation Framework for FPGAs

    In this paper, we developed a simulation-based architecture evaluation framework for field-programmable gate arrays (FPGAs), called FPGA-SPICE, which enables automatic layout-level estimation and electrical simulations of FPGA architectures. FPGA-SPICE can automatically generate Verilog and SPICE netlists based on realistic FPGA configurations and a high-level eXtensible Markup Language (XML)-based FPGA architectural description language. The output Verilog netlists can be used to generate layouts of full FPGA fabrics through a semicustom design flow. SPICE simulation decks can be generated at three levels of complexity, namely, full-chip-level, grid-level, and component-level, providing different tradeoffs between accuracy and simulation time. In order to enable such levels of analysis, we presented two SPICE netlist partitioning techniques: loads extraction and parasitic net activity estimation. Electrical simulations showed that, averaged over the selected benchmarks, the grid-/component-level approach can achieve a 6.1x/7.5x execution speed-up with 9.9%/8.3% accuracy loss, respectively, compared to full-chip-level simulation. FPGA-SPICE was showcased through three different case studies: 1) an area breakdown analysis for static random access memory-based FPGAs, showing that configuration memories are a dominant factor; 2) a power breakdown comparison to analytical models, analyzing the source of accuracy loss; and 3) a robustness evaluation against process corners, studying their impact on the energy consumption of full FPGA fabrics.
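
    For context, net-level power estimates of this kind rest on the standard switching-power model, P_dyn = sum_i alpha_i * C_i * Vdd^2 * f; in the partitioned flow, loads extraction supplies the capacitances of nets cut at partition boundaries, and parasitic net activity estimation supplies their activities. A sketch with hypothetical values:

    ```python
    # The standard switching-power model underlying net-level estimates:
    # P_dyn = sum_i alpha_i * C_i * Vdd^2 * f. "Loads extraction" supplies
    # the capacitances C_i of nets cut at partition boundaries; "parasitic
    # net activity estimation" supplies the activities alpha_i. All the
    # numbers below are hypothetical.
    VDD_V = 0.9
    FREQ_HZ = 100e6

    nets = [          # (switching activity alpha, capacitance in farads)
        (0.20, 5e-15),
        (0.05, 12e-15),
        (0.45, 3e-15),
    ]

    p_dyn_w = sum(alpha * c * VDD_V**2 * FREQ_HZ for alpha, c in nets)
    print(f"dynamic power: {p_dyn_w * 1e6:.3f} uW")
    ```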

    Guarded Evaluation: An Algorithm for Dynamic Power Reduction in FPGAs

    Guarded evaluation is a power reduction technique that involves identifying sub-circuits (within a larger circuit) whose inputs can be held constant (guarded) at specific times during circuit operation, thereby reducing switching activity and lowering dynamic power. The concept is rooted in the property that, under certain conditions, some signals within digital designs are not "observable" at design outputs, making the circuitry that generates such signals a candidate for guarding. Guarded evaluation has been demonstrated successfully for custom ASICs; in this work, we apply the technique to FPGAs. In ASICs, guarded evaluation entails adding additional hardware to the design, increasing silicon area and cost. Here, we apply the technique in a way that imposes minimal area overhead by leveraging existing unused circuitry within the FPGA: the LUT functionality is modified to incorporate the guards and reduce toggle rates. The primary challenge in guarded evaluation is determining the specific conditions under which a sub-circuit's inputs can be held constant without impacting the larger circuit's functional correctness. We propose a simple solution to this problem based on discovering gating inputs using "non-inverting paths" and trimming inputs using "partial non-inverting paths" in the circuit's AND-Inverter graph representation. Experimental results show that guarded evaluation can reduce switching activity by as much as 32% for FPGAs with 6-LUT architectures and 25% for 4-LUT architectures, on average, and can reduce power consumption in the FPGA interconnect by 29% for 6-LUTs and 27% for 4-LUTs. Clustered architectures with four and ten LUTs per cluster produced the best power reduction results. We implement guarded evaluation at various stages of the FPGA CAD flow, as a post-technology-mapping, post-packing and post-placement optimization, and analyze the reductions at each stage. Applied post technology mapping, the algorithm inserted the largest number of guards and hence achieved the highest activity and interconnect reductions. However, guarding signals comes at the cost of increased fanout and stress on routing resources. Packing and placement provide the algorithm with additional information about the circuit, which is leveraged to insert high-quality guards with minimal impact on routing. Experimental results show that the post-packing and post-placement methods achieve reductions comparable to post-mapping, with considerably less impact on the critical path delay and routability of the circuit.
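
    The guard-discovery idea can be sketched on a toy AND-Inverter graph: along a non-inverting path, forcing the sibling input of any AND node on the path to 0 blocks the path, so the logic feeding it is unobservable there and becomes a guarding candidate. A simplified illustration (not the paper's full algorithm, which also handles partial non-inverting paths):

    ```python
    # Toy sketch of the observability idea behind guard discovery: in an
    # AND-Inverter graph (AIG), if a path from `start` to the output is
    # non-inverting, then holding the sibling input of any AND node on
    # that path at 0 blocks the path, making `start` unobservable there.
    # A simplification for illustration, not the paper's full algorithm.

    # AIG node: (left_fanin, left_inverted, right_fanin, right_inverted)
    aig = {
        "n1": ("a", False, "b", False),    # n1 = a AND b
        "n2": ("n1", False, "g", False),   # n2 = n1 AND g (primary output)
    }

    def gating_candidates(aig, path_nodes, start):
        """Collect the sibling input of each AND node on a non-inverting
        path; holding any of them at 0 blocks the whole path."""
        guards, prev = [], start
        for node in path_nodes:
            left, _, right, _ = aig[node]
            guards.append(right if left == prev else left)
            prev = node
        return guards

    print(gating_candidates(aig, ["n1", "n2"], "a"))    # -> ['b', 'g']
    ```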

    Beyond the arithmetic constraint: depth-optimal mapping of logic chains in reconfigurable fabrics

    Look-up table (LUT)-based FPGAs have migrated from a niche technology for design prototyping to a valuable end-product component and, in some cases, a replacement for general purpose processors and ASICs alike. One way architects have bridged the performance gap between FPGAs and ASICs is through the inclusion of specialized components such as multipliers, RAM modules, and microcontrollers. Another dedicated structure that has become standard in reconfigurable fabrics is the arithmetic carry chain. Currently, it is only used to map arithmetic operations as identified by HDL macros; for non-arithmetic operations, it is an idle but potentially powerful resource. Obstacles to using the carry chain for generic logic operations include lack of architectural and computer-aided design support. Current carry-select architectures facilitate carry chain reuse, although they do so only for (K-1)-input operations. Additionally, hardware description language (HDL) macros are the only recourse for a designer wishing to map generic logic chains in a carry-select architecture. A novel architecture that allows the full K-input operational capacity of the carry chain to be harnessed is presented as a solution to current architectural limitations. It is shown to have negligible impact on logic element area and delay. Using only two additional 2:1 pass-transistor multiplexers, it enables the transmission of a K-input operation to the carry chain and general routing simultaneously. To successfully identify logic chains in an arbitrary Boolean network, ChainMap is presented as a novel technology mapping algorithm. ChainMap creates delay-optimal generic logic chains in polynomial time without HDL macros. It maps both arithmetic and non-arithmetic logic chains whenever depth-increasing nodes, which increase logic depth but not routing depth, are encountered. Use of the chain is not reserved for arithmetic, but rather for any set of gates exhibiting similar characteristics. By using the carry chain as a generic, near-zero-delay adjacent-cell interconnection structure, a potential average speedup of 1.4x is revealed. Post-place-and-route experiments indicate that ChainMap solutions perform similarly to HDL chains when cluster resources are abundant and significantly better in cluster-constrained arrays.
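
    The role of near-zero-delay chain links can be illustrated with a toy depth computation in which chain edges cost roughly nothing while routed edges cost one unit. The graph and unit delays below are hypothetical; the 1.4x average speedup quoted above is the paper's result, not derived from this sketch.

    ```python
    # Toy illustration of why a near-zero-delay carry chain helps: compute
    # circuit depth where general-routing edges cost 1 unit and chain
    # edges cost ~0. The example graph is hypothetical.
    import functools

    edges = {                     # node -> [(fanin, uses_chain)]
        "n1": [("in0", False)],
        "n2": [("n1", True)],     # n1 -> n2 rides the carry chain
        "n3": [("n2", True)],
        "out": [("n3", False)],
    }

    @functools.lru_cache(maxsize=None)
    def depth(node):
        fanins = edges.get(node, [])
        if not fanins:
            return 0
        return max(depth(f) + (0 if chained else 1) for f, chained in fanins)

    print(depth("out"))   # 2 with chain edges; 4 if every edge were routed
    ```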

    Software-based Approximate Computation Of Signal Processing Tasks

    This thesis introduces a new dimension in performance scaling of signal processing systems by proposing software frameworks that achieve increased processing throughput when producing approximate results. The first contribution of this work is a new theory for accelerated computation of multimedia processing based on the concept of tight packing (Chapter 2). Use of this theory accelerates small-dynamic-range linear signal processing tasks (such as convolution and transform decomposition) that map integers to integers, without incurring any accuracy loss. The concept of tight packing is combined with incremental computation that processes inputs in a bitplane-by-bitplane manner (Chapter 3), thereby leading to substantial throughput/distortion scalability within filtering, transform-decomposition and motion-estimation tasks. This framework also provides for region-of-interest computation and has inherent robustness to arbitrary termination of processing, imposed, for example, by a task scheduler. Finally, the concept of packed processing is extended to floating-point (lossy) matrix computations, with particular focus on the generic matrix multiplication (GEMM) routine of BLAS-3 (Chapters 4 and 5). This routine is a fundamental building block for several linear algebra and digital signal processing systems, such as face recognition and neural-network training for metadata-based retrieval systems. In order to compete with the best-performing software designs for GEMM, an implementation using single instruction, multiple data (SIMD) instructions is presented and analyzed. The proposed approach demonstrates substantial performance scaling in practice; specifically, it is shown to achieve up to twice the processing throughput of the best designs for GEMM when producing approximate results (on the same hardware). In summary, the proposed approximate computation of signal processing tasks can be selectively disabled, producing conventional full-precision/lower-throughput processing when deemed necessary. Importantly, the proposed software designs run on off-the-shelf computer hardware and provide for on-demand reconfiguration, depending on the input data and the precision specification (from full precision to noisy computation). Thus, the proposed approximate computation framework allows for backward compatibility and can be offered as an add-on service, creating significant competitive advantages for application developers. It can be used in mobile or high-performance computing systems when the precision of computation is not of critical importance (error-tolerant systems), or when the input data is intrinsically noisy.
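
    The lossless integer case of tight packing can be sketched as follows: two small-dynamic-range multiplications share a single machine multiply by packing both operands into one word with guard bits. The function name and the slot width K = 16 are assumptions for illustration; the thesis develops the general theory and its SIMD implementation.

    ```python
    # Minimal sketch of lossless tight packing: two small-dynamic-range
    # integer multiplications carried out with one machine multiply.
    # Assumes non-negative operands, w >= 1, and a slot width K such
    # that each partial product fits in K bits (K = 16 here).
    K = 16                          # bits reserved per packed slot
    MASK = (1 << K) - 1

    def packed_scale(a0, a1, w):
        """Compute (a0*w, a1*w) using a single multiplication."""
        # lossless only while both partial products fit in K bits
        assert w >= 1 and 0 <= a0 * w <= MASK and 0 <= a1 * w <= MASK
        packed = a0 + (a1 << K)     # pack both operands into one word
        prod = packed * w           # one multiply yields both products
        return prod & MASK, prod >> K

    print(packed_scale(37, 91, 113))   # -> (4181, 10283)
    ```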

    Improving FPGA Performance and Area Using an Adaptive Logic Module

    This paper proposes a new adaptable FPGA logic element based on fracturable 6-LUTs, which challenges the longstanding belief that a 4-LUT offers the most efficient area/delay tradeoff. We describe theory and benchmarking results showing a 15% performance increase with a 12% area decrease vs. a standard BLE4. The ALM structure is one of a number of architectural improvements giving Altera's 90nm Stratix II architecture a 50% performance advantage over its 130nm Stratix predecessor.
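
    The fracturable-6-LUT idea rests on Shannon decomposition: a 6-LUT is two 5-LUTs plus a 2:1 multiplexer on the sixth input, so one 64-bit truth table can implement either a single 6-input function or two 5-input functions. A minimal model follows; the truth-table bit ordering is an assumption of this sketch, and real ALMs add shared inputs and arithmetic support.

    ```python
    # Minimal model of a fracturable 6-LUT via Shannon decomposition:
    # the same 64-bit truth table serves one 6-input function or two
    # independent 5-input functions. Bit ordering is an assumption.
    def lut5(tt32, inputs):                 # 32-bit truth table, 5 inputs
        idx = sum(b << i for i, b in enumerate(inputs))
        return (tt32 >> idx) & 1

    def lut6(tt64, inputs):                 # fractured evaluation
        lo, hi = tt64 & 0xFFFFFFFF, tt64 >> 32
        o0 = lut5(lo, inputs[:5])           # 5-LUT cofactor for x5 = 0
        o1 = lut5(hi, inputs[:5])           # 5-LUT cofactor for x5 = 1
        return o1 if inputs[5] else o0      # 2:1 mux on x5

    # Used as one 6-LUT: AND of all six inputs (only index 63 is 1).
    tt_and6 = 1 << 63
    print(lut6(tt_and6, [1, 1, 1, 1, 1, 1]))   # -> 1
    ```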

    Closing the Gap between FPGA and ASIC:Balancing Flexibility and Efficiency

    Despite the many advantages of Field-Programmable Gate Arrays (FPGAs), they have failed to take over the IC design market from Application-Specific Integrated Circuits (ASICs) for high-volume and even medium-volume applications, as FPGAs come with significant costs in area, delay, and power consumption. There are two main reasons why FPGAs have a huge efficiency gap with ASICs: (1) FPGAs are extremely flexible, as they have fully programmable soft-logic blocks and routing networks, and (2) FPGAs have hard-logic blocks that are only usable by a subset of applications. In other words, current FPGAs have a heterogeneous structure composed of flexible soft-logic and efficient hard-logic blocks, which suffer from inefficiency and inflexibility, respectively. The inefficiency of the soft-logic is a challenge for any application that is mapped to FPGAs, and the lack of flexibility in the hard-logic results in a waste of resources when an application cannot use it. In this thesis, we approach the inefficiency problem of FPGAs by bridging the efficiency/flexibility gap of the hard- and soft-logic. The main goal of this thesis is to compromise on the efficiency of the hard-logic for flexibility, on the one hand, and to compromise on the flexibility of the soft-logic for efficiency, on the other. In other words, this thesis deals with two issues: (1) adding more generality to the hard-logic of FPGAs, and (2) improving the soft-logic by adapting it to the generic requirements of applications. In the first part of the thesis, we introduce new techniques that expand the functionality of FPGA hard-logic. The hard-logic includes the dedicated resources that are tightly coupled with the soft-logic (i.e., adder circuitry and carry chains) as well as the stand-alone ones (i.e., DSP blocks). These specialized resources are intended to accelerate critical arithmetic operations that appear in the pre-synthesis representation of applications; we introduce mapping and architectural solutions that enable both types of hard-logic to support additional arithmetic operations. We first present a mapping technique that extends the application of FPGA carry chains to carry-save arithmetic, and then, to increase the generality of the hard-logic, we introduce novel architectures; using these architectures, more applications can take advantage of FPGA hard-logic. In the second part of the thesis, we improve the efficiency of FPGA soft-logic by exploiting the circuit patterns that emerge after logic synthesis, i.e., connection and logic patterns. Using these patterns, we design new soft-logic blocks that have less flexibility, but more efficiency, than current ones. We first introduce logic chains: fixed connections that are integrated between the soft-logic blocks of FPGAs and are well suited to the long chains of logic that appear post-synthesis. Logic chains provide fast and low-cost connectivity, increase the bandwidth of the logic blocks without changing their interface with the routing network, and improve the logic density of soft-logic blocks. In addition to logic chains, and as a complementary contribution, we present a non-LUT soft-logic block that comprises simple, pre-connected cells. The structure of this logic block is inspired by the logic patterns that appear post-synthesis. This block has a complexity that is only linear in the number of inputs, it offers the potential for multiple independent outputs, and its delay is only logarithmic in the number of inputs. Although this new block is less flexible than a LUT, we show (1) that effective mapping algorithms exist, (2) that, due to their simplicity, poor utilization is less of an issue than with LUTs, and (3) that a few LUTs can still be used in extremely unfavorable cases. In summary, to bridge the gap between FPGAs and ASICs, we approach the problem from two complementary directions, balancing the flexibility and efficiency of the logic blocks of FPGAs. We were only able to explore a few design points in this thesis; future work could focus on further exploration of the design space.
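
    The carry-save arithmetic that the carry-chain mapping technique targets can be illustrated with a 3:2 compressor: three operands are reduced to a sum word and a carry word with no carry propagation, leaving a single carry-propagating addition for the end. A minimal sketch of the arithmetic itself; the FPGA mapping is the thesis contribution.

    ```python
    # Carry-save addition (3:2 compressor): reduce three operands to a
    # sum word and a carry word without propagating carries.
    def carry_save(a, b, c):
        s = a ^ b ^ c                               # bitwise sum
        cy = ((a & b) | (a & c) | (b & c)) << 1     # saved carries
        return s, cy

    a, b, c = 25, 17, 9
    s, cy = carry_save(a, b, c)
    assert s + cy == a + b + c    # one final adder resolves the carries
    print(s, cy)                  # -> 1 50
    ```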