6 research outputs found

    Multi-Bit Carry Chains for High-Performance Reconfigurable Fabrics

    Get PDF
    Ripple-carry architectures are the norm in today\u27s reconfigurable fabrics. They are simple, require minimal routing, and are easily formed across arbitrary cells in a fabric. However, their computation delay grows linearly with operand width. Many different fabric carry-chains have been presented in literature offering non-linear delays, but generally require a significant investment in routing and processing area. Carry-skip chains are well-known in arithmetic logic design, and although they too possess a linear delay, their performance is 2times or more faster than simple ripple-carry schemes. They require an expanded carry chain and minimal extra logic, but offer impressive speed-ups for arithmetic. This paper presents a reconfigurable cell that supports carry-skip arithmetic using a multi-bit carry chain achieving 2 middot k middot b+ n/b performance, where b is the block size and k is an architecture constant. The cell is specialized for arithmetic and Boolean operations with reduced configuration memory. Additional resources are provided to reuse the multi-bit carry chain for 3-source operand arithmetic to explore how multi-bit chains can be reused

    Beyond the arithmetic constraint: depth-optimal mapping of logic chains in reconfigurable fabrics

    Get PDF
    Look-up table based FPGAs have migrated from a niche technology for design prototyping to a valuable end-product component and, in some cases, a replacement for general purpose processors and ASICs alike. One way architects have bridged the performance gap between FPGAs and ASICs is through the inclusion of specialized components such as multipliers, RAM modules, and microcontrollers. Another dedicated structure that has become standard in reconfigurable fabrics is the arithmetic carry chain. Currently, it is only used to map arithmetic operations as identified by HDL macros. For non-arithmetic operations, it is an idle but potentially powerful resource.;Obstacles to using the carry chain for generic logic operations include lack of architectural and computer-aided design support. Current carry-select architectures facilitate carry chain reuse, although they do so only for (K-1)-input operations. Additionally, hardware description language (HDL) macros are the only recourse for a designer wishing to map generic logic chains in a carry-select architecture. A novel architecture that allows the full K-input operational capacity of the carry chain to be harnessed is presented as a solution to current architectural limitations. It is shown to have negligible impact on logic element area and delay. Using only two additional 2:1 pass transistor multiplexers, it enables the transmission of a K-input operation to the carry chain and general routing simultaneously. To successfully identify logic chains in an arbitrary Boolean network, ChainMap is presented as a novel technology mapping algorithm. ChainMap creates delay-optimal generic logic chains in polynomial time without HDL macros. It maps both arithmetic and non-arithmetic logic chains whenever depth increasing nodes, which increase logic depth but not routing depth, are encountered. Use of the chain is not reserved for arithmetic, but rather any set of gates exhibiting similar characteristics. By using the carry chain as a generic, near zero-delay adjacent cell interconnection structure a potential average optimal speedup of 1.4x is revealed. Post place and route experiments indicate that ChainMap solutions perform similarly to HDL chains when cluster resources are abundant and significantly better in cluster-constrained arrays

    THE VLIW-SUPERCISC COMPILER: EXPLOITINGPARALLELISM FROM C-BASED APPLICATIONS

    Get PDF
    A common approach to decreasing embedded application execution time is creating a homogeneous parallel processor architecture. The parallelism of any such architecture is limited to the number of instructions that can be scheduled in the same cycle. This number of instructions scheduled in a cycle, or instruction-level parallelism (ILP), is limited by the ability to extract parallelism from the application. Other techniques attempt to improve performance with hardware acceleration. Often, segments of highly computational extensive code are extracted and custom hardware is created to replace the software execution. This technique requires many resources and still does not address the segments of code outside of the computationally extensive kernel.To solve this problem, hardware acceleration for computationally intensive segments of code in addition to accelerating the entire application with very long instruction word, VLIW, techniques is proposed. (1) A compilation flow that targets a 4-wide VLIW processor architecture is presented. This system was used to investigate the available speed-up of VLIW architectures. The architecture was modified to combine the VLIW processor with the capability to execute application specific customized instructions. To create the custom instruction hardware, a control and data flow graph (CDFG) framework was created. The CDFG framework was created to provide a framework for compiler transformations and hardware generation. In order to remove control flow from segments of code selected for hardware generation, (2) the technique of hardware predication was developed. Hardware predication allows if-then and if-then-else control flow constructs to be transformed into strict data flow through the use of multiplexors. From the transformed CDFGs, (3) a VHDL generation pass was created that translates the compiler data structures into synthesizable VHDL. The resulting architecture contains the VLIW processor and tightly coupled application specific hardware. This architecture was analyzed for performance changes comparedto the initial VLIW architecture, and a traditional processor. Lastly, (4) the architecture was analyzed for power and energy savings. A post static timing pass was added to the compilation flow for the insertion of hardware to delay early switching of operations.By measuring only the execution of the hardware function and comparing the performance to the equivalent code executed in software, a performance multiplier of up to 322 times is seen when synthesized onto an Altera Stratix II ES2S180F1508C4 FPGA. The average performance increase seen was 63 times faster. For the entire application, the speedup reached nearly 30X and was on average 12X better than a single processor implementation. The power and energy required by the VLIW processor core and the hardware functions for the computational kernels after 160nm OKI standard cell ASIC synthesis show a maximum power savings of 417 times that of execution on the processor with an average of 133 times savings in power consumption. With the increased execution time and the savings in power the energy savings will see a multiplicative effect. The energy improvement is therefore several orders of magnitude for the hardware functions, the savings range from over 1,000X to approximately 60,000X

    Closing the Gap between FPGA and ASIC:Balancing Flexibility and Efficiency

    Get PDF
    Despite many advantages of Field-Programmable Gate Arrays (FPGAs), they fail to take over the IC design market from Application-Specific Integrated Circuits (ASICs) for high-volume and even medium-volume applications, as FPGAs come with significant cost in area, delay, and power consumption. There are two main reasons that FPGAs have huge efficiency gap with ASICs: (1) FPGAs are extremely flexible as they have fully programmable soft-logic blocks and routing networks, and (2) FPGAs have hard-logic blocks that are only usable by a subset of applications. In other words, current FPGAs have a heterogeneous structure comprised of the flexible soft-logic and the efficient hard-logic blocks that suffer from inefficiency and inflexibility, respectively. The inefficiency of the soft-logic is a challenge for any application that is mapped to FPGAs, and lack of flexibility in the hard-logic results in a waste of resources when an application cannot use the hard-logic. In this thesis, we approach the inefficiency problem of FPGAs by bridging the efficiency/flexibility gap of the hard- and soft-logic. The main goal of this thesis is to compromise on efficiency of the hard-logic for flexibility, on the one hand, and to compromise on flexibility of the soft-logic for efficiency, on the other hand. In other words, this thesis deals with two issues: (1) adding more generality to the hard-logic of FPGAs, and (2) improving the soft-logic by adapting it to the generic requirements of applications. In the first part of the thesis, we introduce new techniques that expand the functionality of FPGAs hard-logic. The hard-logic includes the dedicated resources that are tightly coupled with the soft-logic –i.e., adder circuitry and carry chains –as well as the stand-alone ones –i.e., DSP blocks. These specialized resources are intended to accelerate critical arithmetic operations that appear in the pre-synthesis representation of applications; we introduce mapping and architectural solutions, which enable both types of the hard-logic to support additional arithmetic operations. We first present a mapping technique that extends the application of FPGAs carry chains for carry-save arithmetic, and then to increase the generality of the hard-logic, we introduce novel architectures; using these architectures, more applications can take advantage of FPGAs hard-logic. In the second part of the thesis, we improve the efficiency of FPGAs soft-logic by exploiting the circuit patterns that emerge after logic synthesis, i.e., connection and logic patterns. Using these patterns, we design new soft-logic blocks that have less flexibility, but more efficiency than current ones. In this part, we first introduce logic chains, fixed connections that are integrated between the soft-logic blocks of FPGAs and are well-suited for long chains of logic that appear post-synthesis. Logic chains provide fast and low cost connectivity, increase the bandwidth of the logic blocks without changing their interface with the routing network, and improve the logic density of soft-logic blocks. In addition to logic chains and as a complementary contribution, we present a non-LUT soft-logic block that comprises simple and pre-connected cells. The structure of this logic block is inspired from the logic patterns that appear post-synthesis. This block has a complexity that is only linear in the number of inputs, it sports the potential for multiple independent outputs, and the delay is only logarithmic in the number of inputs. Although this new block is less flexible than a LUT, we show (1) that effective mapping algorithms exist, (2) that, due to their simplicity, poor utilization is less of an issue than with LUTs, and (3) that a few LUTs can still be used in extreme unfortunate cases. In summary, to bridge the gap between FPGAs and ASICs, we approach the problem from two complementary directions, which balance flexibility and efficiency of the logic blocks of FPGAs. However, we were able to explore a few design points in this thesis, and future work could focus on further exploration of the design space

    High-Performance Carry Chains for FPGAs

    No full text
    Carry chains are an important consideration for most computations, including FPGAs. Current FPGAs dedicate a portion of their logic to support these demands via a simple ripple carry scheme. In this paper we demonstrate how more advanced carry constructs can be embedded into FPGAs, providing significantly higher performance carry computations. We redesign the standard ripple carry chain to reduce the number of logic levels in each cell. We also develop entirely new carry structures based on high performance adders such as Carry Select, Carry Lookahead, and Brent-Kung. Overall, these optimizations achieve a speedup in carry performance of 3.8 times over current architectures. Introduction Although originally intended as a way to efficiently handle random logic tasks in standard hardware systems, FPGAs have become the driving force behind a new computing paradigm. By mapping algorithms to these FPGAs significant performance benefits can be achieved. However, in order to achieve ..
    corecore