
    Pipelining Saturated Accumulation

    Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator that can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
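
    The key idea can be sketched in software: a saturated add can be viewed as a clamped affine map, and the composition of two such maps is again a clamped add, so the operation is associative and a parallel-prefix scan applies. The sketch below assumes a 16-bit signed range and illustrative names; it is not the paper's hardware datapath.

```python
# Minimal software sketch (not the paper's hardware design): a saturated add
# is modelled as the map f(x) = clamp(x + a, LO, HI); composing two such maps
# yields another clamped add, so the operation is associative and the running
# results can be produced by a parallel-prefix scan. The 16-bit signed range
# and all names below are illustrative assumptions.

LO, HI = -(1 << 15), (1 << 15) - 1          # 16-bit signed saturation bounds

def clamp(x, lo=LO, hi=HI):
    return max(lo, min(hi, x))

def lift(a):
    """Represent 'saturated add of a' as the triple (shift, lo, hi)."""
    return (a, LO, HI)

def compose(f, g):
    """Return the map 'apply f, then g'; the result is again a clamped add."""
    a1, lo1, hi1 = f
    a2, lo2, hi2 = g
    return (a1 + a2, clamp(lo1 + a2, lo2, hi2), clamp(hi1 + a2, lo2, hi2))

def apply_map(f, x):
    a, lo, hi = f
    return clamp(x + a, lo, hi)

def saturated_prefix_sums(values, init=0):
    """All running saturated sums; the serial loop below could be replaced by
    a log-depth parallel-prefix combination because compose() is associative."""
    out, acc = [], None
    for v in values:
        acc = lift(v) if acc is None else compose(acc, lift(v))
        out.append(apply_map(acc, init))
    return out

if __name__ == "__main__":
    data = [30000, 10000, -5000, -70000, 2000]
    ref, x = [], 0
    for v in data:                               # sequential reference
        x = clamp(x + v)
        ref.append(x)
    assert saturated_prefix_sums(data) == ref
    print(ref)                                   # [30000, 32767, 27767, -32768, -30768]
```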

    Performance Modeling of Virtualized Custom Logic Computations

    Virtualization of custom logic computations (i.e., sharing a fixed function across distinct data streams) provides a means of reusing hardware resources, particularly when resources are limited. This is common practice in traditional processors, where more than one user can share processor resources. In this paper, we virtualize a custom logic block using C-slow techniques to support fine-grained context switching. We then develop and present an analytic model for several performance measures (throughput, latency, input-queue occupancy) for both fine-grained and coarse-grained context switching (to a secondary memory). Next, we calibrate the analytic performance model with empirical measurements. We then validate the model via discrete-event simulation and use it to predict performance and develop optimal schedules for virtualized logic computations. We present results for a Taylor series expansion of a cosine function with added feedback and for an AES encryption cipher.
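
    As a rough illustration of fine-grained context switching across streams (the paper's analytic model is not reproduced here), the toy cycle-level simulation below interleaves several input streams on one C-slowed pipeline in round-robin order and reports per-stream throughput, mean latency, and input-queue occupancy; the stream count, pipeline depth, and arrival rates are made-up parameters.

```python
# Toy cycle-level simulation (an illustrative sketch, not the paper's analytic
# model): C data streams share one C-slowed pipeline of depth D via round-robin,
# fine-grained context switching. Stream count, arrival rates, and the FIFO
# discipline are assumptions made for the example.

import random
from collections import deque

def simulate(num_streams=4, depth=8, arrival_prob=0.2, cycles=100_000, seed=1):
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_streams)]
    completed = [0] * num_streams
    latency_sum = [0] * num_streams
    occupancy_sum = 0

    for t in range(cycles):
        # New work arrives independently at each stream's input queue.
        for q in queues:
            if rng.random() < arrival_prob:
                q.append(t)
        occupancy_sum += sum(len(q) for q in queues)

        # Fine-grained context switch: cycle t issues for stream (t mod C).
        s = t % num_streams
        if queues[s]:
            arrived = queues[s].popleft()
            finish = t + depth              # result emerges `depth` cycles later
            completed[s] += 1
            latency_sum[s] += finish - arrived

    throughput = [c / cycles for c in completed]               # results per cycle
    latency = [latency_sum[i] / completed[i] for i in range(num_streams)]
    occupancy = occupancy_sum / cycles / num_streams           # mean queue length
    return throughput, latency, occupancy

if __name__ == "__main__":
    thr, lat, occ = simulate()
    print("per-stream throughput:", [round(x, 3) for x in thr])
    print("per-stream mean latency:", [round(x, 1) for x in lat])
    print("mean input-queue occupancy:", round(occ, 2))
```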

    A hierarchical mathematical model for automatic pipelining and allocation using elastic systems

    The advent of FPGA-based accelerators has encouraged the use of high-level synthesis (HLS) for rapid prototyping and design space exploration. In this context, design optimization at the behavioral level becomes a critical task for the delivery of high-quality solutions. Time elasticity opens a new avenue of optimizations that can be applied after HLS and before logic synthesis, proposing new sequential transformations that expand beyond classical retiming and enlarge the register-transfer level (RTL) exploration space. This paper proposes a mathematical model for RTL transformations that exploit elasticity to select the best implementation for each functional unit and add pipeline registers to increase performance. Two simple examples are used to validate the effectiveness and potential benefits of the model.

    The application of genetic algorithms to high-level synthesis


    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    The sustained demand for faster, more powerful chips has been met by manufacturing processes that allow increasing numbers of computation units to be integrated onto a single die. The resulting outcome, especially in the embedded domain, is often called a SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC). MP-SoC design brings a large number of challenges to the foreground, one of the most prominent being the design of the chip interconnect. With the number of on-chip blocks presently in the tens and quickly approaching the hundreds, the question of how best to provide on-chip communication resources is keenly felt. NETWORKS-ON-CHIP (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they provide a structured answer to present and future communication requirements, and their point-to-point connections and packet-switching paradigm also help minimize wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline, and several main areas require deep investigation for NoCs to become viable solutions:
    • The NoC architecture must strike the best trade-off among performance, features, and the tight area and power constraints of the on-chip domain.
    • Simulation and verification infrastructure must be put in place to explore, validate, and optimize NoC performance.
    • NoCs offer a huge design space thanks to their extreme customizability in terms of topology and architectural parameters; design tools are needed to prune this space and pick the best solutions.
    • Given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to assess their suitability for next-generation designs and their area and power costs.
    This dissertation performs a design space exploration of network-on-chip architectures in order to point out the trade-offs associated with the design of each individual network building block and with the design of the network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with one another and with early network-on-chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures, as well as the challenges that must be addressed to make this new interconnect technology a reality. Among the latter, technology-related challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy, in particular leakage power dissipation and the containment of process variations and their effects. The above objectives were achieved by means of a NoC simulation environment for cycle-accurate modelling and simulation and a back-end flow for studying the physical implementation effects of NoCs. All the results provided by this work have been validated on actual silicon layout.

    A High Throughput/Gate AES Hardware Architecture by Compressing Encryption and Decryption Datapaths --- Toward Efficient CBC-Mode Implementation

    This paper proposes a highly efficient AES hardware architecture that supports both encryption and decryption in the CBC mode. Some conventional AES architectures employ pipelining techniques to enhance throughput and efficiency. However, such pipelined architectures are frequently unsuitable because many practical cryptographic applications work in the CBC mode, where block-wise parallelism is not available for encryption. In this paper, we present an efficient AES encryption/decryption hardware design suitable for such block-chaining modes. In particular, new operation-reordering and register-retiming techniques allow us to unify the inversion circuits for encryption and decryption (i.e., SubBytes and InvSubBytes) without any delay overhead. A new unification technique for linear mappings further reduces both the total area and the critical delay. Our design employs a common loop architecture and can therefore perform efficiently even in the CBC mode. We also present a shared key-scheduling datapath that can work on the fly in the proposed architecture. To the best of our knowledge, the proposed architecture has the shortest critical path delay and is the most efficient in terms of throughput per area among conventional AES encryption/decryption architectures with tower-field S-boxes. We evaluate the performance of the proposed and several conventional datapaths using logic synthesis results with the TSMC 65-nm standard-cell library and the NanGate 45- and 15-nm open-cell libraries. The results confirm that our proposed architecture achieves approximately 53--72% higher efficiency (i.e., higher bps/GE) than any conventional counterpart.
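
    The CBC bottleneck the paper targets can be seen in a few lines: each ciphertext block feeds the next block's input, so encryption cannot fill a block-level pipeline. The sketch below uses a placeholder block cipher (not AES, and not the paper's datapath) purely to show the serial feedback.

```python
# Minimal sketch of why CBC encryption resists block-level pipelining: each
# block's input depends on the previous ciphertext block, so blocks must be
# processed serially. `toy_block_cipher` is a placeholder permutation, not
# AES, and the 16-byte block size is the only AES-like assumption here.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def toy_block_cipher(block: bytes, key: bytes) -> bytes:
    # Stand-in for AES encryption of one 16-byte block (NOT secure).
    return bytes((b + key[i % len(key)]) % 256 for i, b in enumerate(block))

def cbc_encrypt(plaintext_blocks, key, iv):
    ciphertext, prev = [], iv
    for p in plaintext_blocks:                  # serial: feedback through prev
        c = toy_block_cipher(xor_bytes(p, prev), key)
        ciphertext.append(c)
        prev = c                                # next block must wait for this
    return ciphertext

def ecb_encrypt(plaintext_blocks, key):
    # Contrast: no inter-block dependency, so blocks could fill a pipeline
    # (or be processed in parallel) without waiting on each other.
    return [toy_block_cipher(p, key) for p in plaintext_blocks]

if __name__ == "__main__":
    key = bytes(range(16))
    iv = bytes(16)
    blocks = [bytes([i] * 16) for i in range(4)]
    print(cbc_encrypt(blocks, key, iv))
```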

    A general model for performance optimization of sequential systems

    Retiming, c-slow retiming, and recycling are different transformations for the performance optimization of sequential circuits. Exact models have already been proposed for retiming and c-slow retiming, but no exact model for recycling was previously known. This paper presents a general formulation that covers the combination of the three schemes for performance optimization. It provides an exact model, based on integer linear programming, that draws on the structural theory of marked graphs. A set of experiments has been designed to show the performance benefits obtained by combining retiming and recycling. The results also show the applicability of the method to large circuits.
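
    As background for the kind of transformation the model unifies (the paper's ILP formulation is not given in the abstract), the sketch below demonstrates plain retiming on a made-up three-node feedback circuit: a retiming vector moves a register around the loop and shortens the longest register-free path, i.e., the clock period.

```python
# Illustrative sketch only: this is classical retiming, one of the three
# transformations the paper's unified model covers, not the paper's ILP.
# The 3-node feedback circuit, its delays, and the retiming vector are
# made-up values for the demo.

from collections import defaultdict

# Circuit graph: edge (u, v, w) means u feeds v through w registers.
delays = {"v1": 2, "v2": 2, "v3": 2}                 # combinational delay per node
edges = [("v1", "v2", 0), ("v2", "v3", 0), ("v3", "v1", 2)]

def retime(edges, r):
    """Apply retiming r: w_r(u->v) = w(u->v) + r[v] - r[u]."""
    out = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
    if any(w < 0 for _, _, w in out):
        raise ValueError("illegal retiming: negative register count")
    return out

def clock_period(edges, delays):
    """Longest register-free path delay (zero-weight edges form a DAG)."""
    zero = defaultdict(list)
    for u, v, w in edges:
        if w == 0:
            zero[u].append(v)
    best = {}
    def longest_from(u):
        if u not in best:
            best[u] = delays[u] + max((longest_from(v) for v in zero[u]), default=0)
        return best[u]
    return max(longest_from(u) for u in delays)

if __name__ == "__main__":
    print("period before:", clock_period(edges, delays))              # 6
    r = {"v1": 0, "v2": 1, "v3": 1}                                    # move one register
    print("period after :", clock_period(retime(edges, r), delays))   # 4
```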

    Exploiting parallelism within multidimensional multirate digital signal processing systems

    The intense processing-rate requirements of multidimensional Digital Signal Processing systems in practical applications justify Application-Specific Integrated Circuit designs and parallel processing implementations. In this dissertation, we propose novel theories, methodologies, and architectures for designing high-performance VLSI implementations of general multidimensional multirate Digital Signal Processing systems by exploiting the parallelism within those applications. To systematically exploit the parallelism within multidimensional multirate DSP algorithms, we develop novel transformations including (1) nonlinear I/O data space transforms, (2) intercalation transforms, and (3) multidimensional multirate unfolding transforms. These transformations are applied to the algorithms, leading to systematic methodologies for high-performance architectural design. With these design methodologies, we develop several architectures with parallel and distributed processing features for implementing multidimensional multirate applications. Experimental results show that these architectures are much more efficient in terms of execution time and/or hardware cost than existing hardware implementations.
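
    The dissertation's specific transforms are not detailed in the abstract; as a generic example of exposing parallelism in a multirate computation, the sketch below shows a standard polyphase decomposition of a decimate-by-2 FIR filter, whose two half-rate branches are independent and could run concurrently. Polyphase decomposition is named here as a stand-in, not as the dissertation's method, and the filter taps and input data are made-up values.

```python
# Illustrative only: standard polyphase decomposition of a decimate-by-2 FIR.
# The even- and odd-phase branches are independent half-rate computations, so
# they can be executed in parallel; this is a generic multirate example, not
# the dissertation's transforms.

def conv_at(h, x, n):
    """y[n] = sum_k h[k] * x[n - k], treating x as zero outside its range."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))

def decimating_fir_direct(h, x, num_out):
    """Filter at the full rate, then keep every 2nd output: y[m] = sum_k h[k] x[2m - k]."""
    return [conv_at(h, x, 2 * m) for m in range(num_out)]

def decimating_fir_polyphase(h, x, num_out):
    """Split into two half-rate branches that could execute in parallel."""
    h_even, h_odd = h[0::2], h[1::2]      # polyphase components of the filter
    u = x[0::2]                           # even input phase: x[2i]
    v = [0] + x[1::2]                     # odd input phase:  x[2i - 1], with x[-1] = 0
    return [conv_at(h_even, u, m) + conv_at(h_odd, v, m) for m in range(num_out)]

if __name__ == "__main__":
    h = [1, 2, 3, 2, 1]
    x = list(range(10))
    assert decimating_fir_direct(h, x, 4) == decimating_fir_polyphase(h, x, 4)
    print(decimating_fir_polyphase(h, x, 4))
```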