107 research outputs found

    LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations

    Full text link
    We propose two tiers of modifications to FPGA logic cell architecture to deliver a variety of performance and utilization benefits with only minor area overheads. In the irst tier, we augment existing commercial logic cell datapaths with a 6-input XOR gate in order to improve the expressiveness of each element, while maintaining backward compatibility. This new architecture is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary tier of vendor-speciic modifications to both Xilinx and Intel FPGAs, which we refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor tree synthesis using generalized parallel counters (GPCs) is further improved with the proposed modifications. Using both the Intel adaptive logic module and the Xilinx slice at the 65nm technology node for a comparative study, it is shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We demonstrate that LUXOR can deliver an average reduction of 13-19% in logic utilization on micro-benchmarks from a variety of domains.BNN benchmarks benefit the most with an average reduction of 37-47% in logic utilization, which is due to the highly-efficient mapping of the XnorPopcount operation on our proposed LUXOR+ logic cells.Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA, US

    LUT Based Generalized Parallel Counters for State-of-art FPGAs

    Get PDF
    Generalized Parallel Counters (GPCs) are frequently used in constructing high speed compressor trees. Previous work has focused on achieving efficient mapping of GPCs on FPGAs by using a combination of general Look-up table (LUT) fabric and specialized fast carry chains. The resulting structures are purely combinational and cannot be efficiently pipelined to achieve the potential FPGA performance. In this paper, we take an alternate approach and try to eliminate the fast carry chain from the GPC structure. We present a heuristic that maps GPCs on FPGAS using only general LUT fabric. The resultant GPCs are then easily re-timed by placing registers at the fan-out nodes of each LUT. We have used our heuristic on various GPCs reported in prior work. Our heuristic successfully eliminates the carry chain from the GPC structure with the same LUT count in most of the cases. Experimental results using Xilinx Kintex-7 FPGAs show a considerable reduction in critical path and dynamic power dissipation with same area utilization in most of the cases

    Design of approximate overclocked datapath

    Get PDF
    Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design. In this thesis, we describe an alternative circuit design methodology when considering trade-offs between accuracy, performance and silicon area. We compare two different approaches that could trade accuracy for performance. One is the traditional approach where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantisation errors. Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations.Open Acces

    Techniques for Efficient Implementation of FIR and Particle Filtering

    Full text link

    Closing the Gap between FPGA and ASIC:Balancing Flexibility and Efficiency

    Get PDF
    Despite many advantages of Field-Programmable Gate Arrays (FPGAs), they fail to take over the IC design market from Application-Specific Integrated Circuits (ASICs) for high-volume and even medium-volume applications, as FPGAs come with significant cost in area, delay, and power consumption. There are two main reasons that FPGAs have huge efficiency gap with ASICs: (1) FPGAs are extremely flexible as they have fully programmable soft-logic blocks and routing networks, and (2) FPGAs have hard-logic blocks that are only usable by a subset of applications. In other words, current FPGAs have a heterogeneous structure comprised of the flexible soft-logic and the efficient hard-logic blocks that suffer from inefficiency and inflexibility, respectively. The inefficiency of the soft-logic is a challenge for any application that is mapped to FPGAs, and lack of flexibility in the hard-logic results in a waste of resources when an application cannot use the hard-logic. In this thesis, we approach the inefficiency problem of FPGAs by bridging the efficiency/flexibility gap of the hard- and soft-logic. The main goal of this thesis is to compromise on efficiency of the hard-logic for flexibility, on the one hand, and to compromise on flexibility of the soft-logic for efficiency, on the other hand. In other words, this thesis deals with two issues: (1) adding more generality to the hard-logic of FPGAs, and (2) improving the soft-logic by adapting it to the generic requirements of applications. In the first part of the thesis, we introduce new techniques that expand the functionality of FPGAs hard-logic. The hard-logic includes the dedicated resources that are tightly coupled with the soft-logic –i.e., adder circuitry and carry chains –as well as the stand-alone ones –i.e., DSP blocks. These specialized resources are intended to accelerate critical arithmetic operations that appear in the pre-synthesis representation of applications; we introduce mapping and architectural solutions, which enable both types of the hard-logic to support additional arithmetic operations. We first present a mapping technique that extends the application of FPGAs carry chains for carry-save arithmetic, and then to increase the generality of the hard-logic, we introduce novel architectures; using these architectures, more applications can take advantage of FPGAs hard-logic. In the second part of the thesis, we improve the efficiency of FPGAs soft-logic by exploiting the circuit patterns that emerge after logic synthesis, i.e., connection and logic patterns. Using these patterns, we design new soft-logic blocks that have less flexibility, but more efficiency than current ones. In this part, we first introduce logic chains, fixed connections that are integrated between the soft-logic blocks of FPGAs and are well-suited for long chains of logic that appear post-synthesis. Logic chains provide fast and low cost connectivity, increase the bandwidth of the logic blocks without changing their interface with the routing network, and improve the logic density of soft-logic blocks. In addition to logic chains and as a complementary contribution, we present a non-LUT soft-logic block that comprises simple and pre-connected cells. The structure of this logic block is inspired from the logic patterns that appear post-synthesis. This block has a complexity that is only linear in the number of inputs, it sports the potential for multiple independent outputs, and the delay is only logarithmic in the number of inputs. Although this new block is less flexible than a LUT, we show (1) that effective mapping algorithms exist, (2) that, due to their simplicity, poor utilization is less of an issue than with LUTs, and (3) that a few LUTs can still be used in extreme unfortunate cases. In summary, to bridge the gap between FPGAs and ASICs, we approach the problem from two complementary directions, which balance flexibility and efficiency of the logic blocks of FPGAs. However, we were able to explore a few design points in this thesis, and future work could focus on further exploration of the design space

    Rethinking FPGA Architectures for Deep Neural Network applications

    Get PDF
    The prominence of machine learning-powered solutions instituted an unprecedented trend of integration into virtually all applications with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the edges of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where the latency and energy consumption are strictly limited, their pre-machine learning optimised architectures remain a barrier to the overall efficiency and performance. Realizing this shortcoming, this thesis demonstrates an architectural study aiming at solutions that enable hidden potentials in the FPGA technology, primarily for machine learning algorithms. Particularly, it shows how slight alterations to the state-of-the-art architectures could significantly enhance the FPGAs toward becoming more machine learning-friendly while maintaining the near-promised performance for the rest of the applications. Eventually, it presents a novel systematic approach to deriving new block architectures guided by designing limitations and machine learning algorithm characteristics through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Eventually, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested together with a family of new embedded blocks, called MLBlocks

    Domain-specific and reconfigurable instruction cells based architectures for low-power SoC

    Get PDF

    FPGA Implementation of Fast Binary Multiplication Based on Customized Basic Cells

    Get PDF
    Multiplication is considered one of the most time-consuming and a key operation in wide variety of embedded applications. Speeding up this operation has a significant impact on the overall performance of these applications. A vast number of multiplication approaches are found in the literature where the goal is always to achieve a higher performance. One of these approaches relies on using smaller multiplier blocks which are built based on direct Boolean algebra equations to build large multipliers. In this work, we present a methodology for designing binary multipliers where different sizes customized partial products generation (CPPG) cells are designed and used as smaller building blocks. The sizes of the designed CPPG cells are 2×2, 3×3, 4×4, 5×5, and 6×6. We use these cells to build 8×8, 16×16, 32×32, 64×64, and 128×128 binary multipliers. All of the CPPG cells and the binary multipliers are described using the VHDL language, tested, and implemented using XILINX ISE 14.6 tools targeting different FPGA families. The implementation results show that the best performance is achieved when cell 3×3 is used and Virtex-7 FPGA is targeted. The binary multipliers that are designed using the proposed CPPG cells achieve better performance when compared with the binary multipliers presented in the literature. As an application that utilizes the proposed multiplier, a Multiply-Accumulate (MAC) unit is designed and implemented in Spartan-3E. The implementation results of the MAC unit demonstrate the effectiveness of the proposed multiplier
    corecore