276 research outputs found

    IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer Divider

    Get PDF
    Together the arithmetic logic unit (ALU) and floating-point unit (FPU) perform all of the mathematical and logic operations of computer processors. Because they are used so prominently, they fall in the critical path of the central processing unit - often becoming the bottleneck, or limiting factor for performance. As such, the design of a high-speed ALU and FPU is vital to creating a processor capable of performing up to the demanding standards of today\u27s computer users. In this paper, both a 64-bit ALU and a 64-bit FPU are designed based on the reduced instruction set computer architecture. The ALU performs the four basic mathematical operations - addition, subtraction, multiplication and division - in both unsigned and two\u27s complement format, basic logic operations and shifting. The division algorithm is a novel approach, using a comparison multiples based SRT divider to create a variable latency integer divider. The floating-point unit performs the double-precision floating-point operations add, subtract, multiply and divide, in accordance with the IEEE 754 standard for number representation and rounding. The ALU and FPU were implemented in VHDL, simulated in ModelSim, and constrained and synthesized using Synopsys Design Compiler (2006.06). They were synthesized using TSMC 0.1 3nm CMOS technology. The timing, power and area synthesis results were recorded, and, where applicable, compared to those of the corresponding DesignWare components.The ALU synthesis reported an area of 122,215 gates, a power of 384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. The FPU synthesis reported an area 84,440 gates, a delay of 2.82 ns and an operating frequency of 355 MHz. It has a maximum dynamic power of 153.9 mW

    High sample-rate Givens rotations for recursive least squares

    Get PDF
    The design of an application-specific integrated circuit of a parallel array processor is considered for recursive least squares by QR decomposition using Givens rotations, applicable in adaptive filtering and beamforming applications. Emphasis is on high sample-rate operation, which, for this recursive algorithm, means that the time to perform arithmetic operations is critical. The algorithm, architecture and arithmetic are considered in a single integrated design procedure to achieve optimum results. A realisation approach using standard arithmetic operators, add, multiply and divide is adopted. The design of high-throughput operators with low delay is addressed for fixed- and floating-point number formats, and the application of redundant arithmetic considered. New redundant multiplier architectures are presented enabling reductions in area of up to 25%, whilst maintaining low delay. A technique is presented enabling the use of a conventional tree multiplier in recursive applications, allowing savings in area and delay. Two new divider architectures are presented showing benefits compared with the radix-2 modified SRT algorithm. Givens rotation algorithms are examined to determine their suitability for VLSI implementation. A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed enabling the sample-rate to be increased by a factor of approximately 6 and offering area reductions up to a factor of 2 over previous approaches. An estimated sample-rate of 136 MHz could be achieved using a standard cell approach and O.35pm CMOS technology. The enhanced SGR algorithm has been compared with a CORDIC approach and shown to benefit by a factor of 3 in area and over 11 in sample-rate. When compared with a recent implementation on a parallel array of general purpose (GP) DSP chips, it is estimated that a single application specific chip could offer up to 1,500 times the computation obtained from a single OP DSP chip

    Algorithms and architectures for decimal transcendental function computation

    Get PDF
    Nowadays, there are many commercial demands for decimal floating-point (DFP) arithmetic operations such as financial analysis, tax calculation, currency conversion, Internet based applications, and e-commerce. This trend gives rise to further development on DFP arithmetic units which can perform accurate computations with exact decimal operands. Due to the significance of DFP arithmetic, the IEEE 754-2008 standard for floating-point arithmetic includes it in its specifications. The basic decimal arithmetic unit, such as decimal adder, subtracter, multiplier, divider or square-root unit, as a main part of a decimal microprocessor, is attracting more and more researchers' attentions. Recently, the decimal-encoded formats and DFP arithmetic units have been implemented in IBM's system z900, POWER6, and z10 microprocessors. Increasing chip densities and transistor count provide more room for designers to add more essential functions on application domains into upcoming microprocessors. Decimal transcendental functions, such as DFP logarithm, antilogarithm, exponential, reciprocal and trigonometric, etc, as useful arithmetic operations in many areas of science and engineering, has been specified as the recommended arithmetic in the IEEE 754-2008 standard. Thus, virtually all the computing systems that are compliant with the IEEE 754-2008 standard could include a DFP mathematical library providing transcendental function computation. Based on the development of basic decimal arithmetic units, more complex DFP transcendental arithmetic will be the next building blocks in microprocessors. In this dissertation, we researched and developed several new decimal algorithms and architectures for the DFP transcendental function computation. These designs are composed of several different methods: 1) the decimal transcendental function computation based on the table-based first-order polynomial approximation method; 2) DFP logarithmic and antilogarithmic converters based on the decimal digit-recurrence algorithm with selection by rounding; 3) a decimal reciprocal unit using the efficient table look-up based on Newton-Raphson iterations; and 4) a first radix-100 division unit based on the non-restoring algorithm with pre-scaling method. Most decimal algorithms and architectures for the DFP transcendental function computation developed in this dissertation have been the first attempt to analyze and implement the DFP transcendental arithmetic in order to achieve faithful results of DFP operands, specified in IEEE 754-2008. To help researchers evaluate the hardware performance of DFP transcendental arithmetic units, the proposed architectures based on the different methods are modeled, verified and synthesized using FPGAs or with CMOS standard cells libraries in ASIC. Some of implementation results are compared with those of the binary radix-16 logarithmic and exponential converters; recent developed high performance decimal CORDIC based architecture; and Intel's DFP transcendental function computation software library. The comparison results show that the proposed architectures have significant speed-up in contrast to the above designs in terms of the latency. The algorithms and architectures developed in this dissertation provide a useful starting point for future hardware-oriented DFP transcendental function computation researches

    A Survey on Approximate Multiplier Designs for Energy Efficiency: From Algorithms to Circuits

    Full text link
    Given the stringent requirements of energy efficiency for Internet-of-Things edge devices, approximate multipliers, as a basic component of many processors and accelerators, have been constantly proposed and studied for decades, especially in error-resilient applications. The computation error and energy efficiency largely depend on how and where the approximation is introduced into a design. Thus, this article aims to provide a comprehensive review of the approximation techniques in multiplier designs ranging from algorithms and architectures to circuits. We have implemented representative approximate multiplier designs in each category to understand the impact of the design techniques on accuracy and efficiency. The designs can then be effectively deployed in high-level applications, such as machine learning, to gain energy efficiency at the cost of slight accuracy loss.Comment: 38 pages, 37 figure

    Serial-data computation in VLSI

    Get PDF

    Design of approximate overclocked datapath

    Get PDF
    Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design. In this thesis, we describe an alternative circuit design methodology when considering trade-offs between accuracy, performance and silicon area. We compare two different approaches that could trade accuracy for performance. One is the traditional approach where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantisation errors. Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations.Open Acces

    Power-Aware Design Methodologies for FPGA-Based Implementation of Video Processing Systems

    Get PDF
    The increasing capacity and capabilities of FPGA devices in recent years provide an attractive option for performance-hungry applications in the image and video processing domain. FPGA devices are often used as implementation platforms for image and video processing algorithms for real-time applications due to their programmable structure that can exploit inherent spatial and temporal parallelism. While performance and area remain as two main design criteria, power consumption has become an important design goal especially for mobile devices. Reduction in power consumption can be achieved by reducing the supply voltage, capacitances, clock frequency and switching activities in a circuit. Switching activities can be reduced by architectural optimization of the processing cores such as adders, multipliers, multiply and accumulators (MACS), etc. This dissertation research focuses on reducing the switching activities in digital circuits by considering data dependencies in bit level, word level and block level neighborhoods in a video frame. The bit level data neighborhood dependency consideration for power reduction is illustrated in the design of pipelined array, Booth and log-based multipliers. For an array multiplier, operands of the multipliers are partitioned into higher and lower parts so that the probability of the higher order parts being zero or one increases. The gating technique for the pipelined approach deactivates part(s) of the multiplier when the above special values are detected. For the Booth multiplier, the partitioning and gating technique is integrated into the Booth recoding scheme. In addition, a delay correction strategy is developed for the Booth multiplier to reduce the switching activities of the sign extension part in the partial products. A novel architecture design for the computation of log and inverse-log functions for the reduction of power consumption in arithmetic circuits is also presented. This also utilizes the proposed partitioning and gating technique for further dynamic power reduction by reducing the switching activities. The word level and block level data dependencies for reducing the dynamic power consumption are illustrated by presenting the design of a 2-D convolution architecture. Here the similarities of the neighboring pixels in window-based operations of image and video processing algorithms are considered for reduced switching activities. A partitioning and detection mechanism is developed to deactivate the parallel architecture for window-based operations if higher order parts of the pixel values are the same. A neighborhood dependent approach (NDA) is incorporated with different window buffering schemes. Consideration of the symmetry property in filter kernels is also applied with the NDA method for further reduction of switching activities. The proposed design methodologies are implemented and evaluated in a FPGA environment. It is observed that the dynamic power consumption in FPGA-based circuit implementations is significantly reduced in bit level, data level and block level architectures when compared to state-of-the-art design techniques. A specific application for the design of a real-time video processing system incorporating the proposed design methodologies for low power consumption is also presented. An image enhancement application is considered and the proposed partitioning and gating, and NDA methods are utilized in the design of the enhancement system. Experimental results show that the proposed multi-level power aware methodology achieves considerable power reduction. Research work is progressing In utilizing the data dependencies in subsequent frames in a video stream for the reduction of circuit switching activities and thereby the dynamic power consumption

    Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field

    Get PDF
    Plane-wave decomposition by sparse recovery is a reliable and accurate technique for plane-wave decomposition which can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate the plane-wave decomposition by sparse recovery. The method consists of two main algorithms which are spherical Fourier transformation (SFT) and sparse recovery. Comparing the two algorithms, the sparse recovery is the most computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform. Then the multithreaded computing platform could be fully utilized for the sparse recovery. On the other hand, implementing the SFT on an FPGA helps to flexibly integrate the microphones and improve the portability of the microphone array. For implementing the SFT on an FPGA, we develop a scalable FPGA design model that enables the quick design of the SFT architecture on FPGAs. The model considers the number of microphones, the number of SFT channels and the cost of the FPGA and provides the design of a resource optimized and cost-effective FPGA architecture as the output. Then we investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of modifying the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms. We introduce novel sparse-recovery techniques which use non-uniform dictionaries to improve the performance of the sparse recovery on a parallel architecture

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements
    • …
    corecore