340 research outputs found

    Pipelining Saturated Accumulation

    Get PDF
    Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3(XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM

    Pipelining saturated accumulation

    Full text link

    Performance Improvement for Reconfigurable Processor System Design in IoT Health Care Monitoring Applications

    Get PDF
    This research focuses on critical hardware components of an Internet of Things (IoT) system for reconfigurable processing systems. Single-Instruction Multiple-Data (SIMD) processors have recently been utilized to preprocess data at energy-constrained sensor nodes or IoT gateways, saving significant energy and bandwidth for transmission. Using traditional CPU-based systems to implement machine learning algorithms is inefficient in terms of energy consumption. In the proposed method Single-Instruction Multiple-Data (SIMD) processors are assembled by scaling the largest possible operand value subunits into direct access to the internal memory, where the carry output of each unit is conditionally fed into the next unit based on the implementation of the SIMD Processor design for Internet of Things applications. Each method has evaluated sub-operations that contribute considerably to the overall potential of the design. If the single register file can complete the intended action, a zero (one)-signal is applied to each unit\u27s carry input. Multiplexers combine two or more adders, sending the carry signal from one unit into another if additional units are necessary to compute the sum. The outcome results compare high-speed end device techniques in terms of area and power consumption. The proposed SIMD processor-based IoT healthcare monitoring system with a MIMD processor\u27s performance analysis of comparison clearly demonstrates that the system produces decent outcomes. The suggested system has an area overhead of 85 m2, a power usage of 4.10 W, and a time delay of 20 ns

    Efficient Computation and FPGA implementation of Fully Homomorphic Encryption with Cloud Computing Significance

    Get PDF
    Homomorphic Encryption provides unique security solution for cloud computing. It ensures not only that data in cloud have confidentiality but also that data processing by cloud server does not compromise data privacy. The Fully Homomorphic Encryption (FHE) scheme proposed by Lopez-Alt, Tromer, and Vaikuntanathan (LTV), also known as NTRU(Nth degree truncated polynomial ring) based method, is considered one of the most important FHE methods suitable for practical implementation. In this thesis, an efficient algorithm and architecture for LTV Fully Homomorphic Encryption is proposed. Conventional linear feedback shift register (LFSR) structure is expanded and modified for performing the truncated polynomial ring multiplication in LTV scheme in parallel. Novel and efficient modular multiplier, modular adder and modular subtractor are proposed to support high speed processing of LFSR operations. In addition, a family of special moduli are selected for high speed computation of modular operations. Though the area keeps the complexity of O(Nn^2) with no advantage in circuit level. The proposed architecture effectively reduces the time complexity from O(N log N) to linear time, O(N), compared to the best existing works. An FPGA implementation of the proposed architecture for LTV FHE is achieved and demonstrated. An elaborate comparison of the existing methods and the proposed work is presented, which shows the proposed work gains significant speed up over existing works

    ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

    Get PDF
    Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry standard precisions embody accuracy and performance compromises for general purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple precision ALUs focused on predefined, static precisions. Little previous work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic precision ALU architectures for both fixed-point and floating-point to enable better performance, energy efficiency, and numeric accuracy. These new architectures enable dynamically defined precision, including support for vectorization. The new architectures also prevent performance and energy loss due to applying unnecessarily high precision on computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks, with this dissertation including demonstration implementations for floating point adder, multiply, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic precision ALU speed is nearly the same as traditional ALU approaches, although the dynamic precision ALU is nearly twice as large

    Efficient schemes to size transistors for optimal delay by solving fanout branches with balancing algorithm

    Get PDF
    High performance digital system requires minimal logic and properly sized transistor to operate in all PVT corners. Specifically, high-speed data-path design is mostly about optimizing the system for better timing. In this work, the author proposed a better timing model to analyze parallel data-paths better for performance comparison. Moreover, a novel transistor sizing technique is also proposed as part of the work to minimize delay in parallel data-path circuits in the presence of practical wire capacitance. With this technique it is easier to calculate the optimal capacitance distribution in a fanout branch path that equalizes the delays in all branches as well as minimizes the overall delay starting from the primary inputs to the primary outputs of a circuit. The problem is widely termed as the "Load distribution problem at branch". A collection of fast algorithms were designed to accurately solve the load distribution problem for branch in digital circuits for optimal delay. The author used prior work on Unified Logical Effort[1] as a tool for delay estimation and transistor sizing. This research work also shows the impact of branching on critical path. Experiments are run on industry standard circuits using different types of tools developed to model the circuit. The new developed theories are tested on the circuit models , that are also included in this work
    • …
    corecore