604 research outputs found

    Low-Power, Low-Cost, & High-Performance Digital Designs : Multi-bit Signed Multiplier design using 32nm CMOS Technology

    Get PDF
    Binary multipliers are ubiquitous in digital hardware. Digital multipliers along with the adders play a major role in computing, communicating, and controlling devices. Multipliers are used majorly in the areas of digital signal and image processing, central processing unit (CPU) of the computers, high-performance and parallel scientific computing, machine learning, physical layer design of the communication equipment, etc. The predominant presence and increasing demand for low-power, low-cost, and high-performance digital hardware led to this work of developing optimized multiplier designs. Two optimized designs are proposed in this work. One is an optimized 8 x 8 Booth multiplier architecture which is implemented using 32nm CMOS technology. Synthesis (pre-layout) and post-layout results show that the delay is reduced by 24.7% and 25.6% respectively, the area is reduced by 5.5% and 15% respectively, the power consumption is reduced by 21.5% and 26.6% respectively, and the area-delay-product is reduced by 28.8% and 36.8% respectively when compared to the performance results obtained for the state-of-the-art 8 x 8 Booth multiplier designed using 32nm CMOS technology with 1.05 V supply voltage at 500 MHz input frequency. Another is a novel radix-8 structure with 3-bit grouping to reduce the number of partial products along with the effective partial product reduction schemes for 8 x 8, 16 x 16, 32 x 32, and 64 x 64 signed multipliers. Comparing the performance results of the (synthesized, post-layout) designs of sizes 32 x 32, and 64 x 64 based on the simple novel radix-8 structure with the estimated performance measurements for the optimized Booth multiplier design presented in this work, reduction in delay by (2.64%, 0.47%) and (2.74%, 18.04%) respectively, and reduction in area-delay-product by (12.12%, -5.17%) and (17.82%, 12.91%) respectively can be observed. With the use of the higher radix structure, delay, area, and power consumption can be further reduced. Appropriate adder deployment, further exploring the optimized grouping or compression strategies, and applying more low-power design techniques such as power-gating, multi-Vt MOS transistor utilization, multi-VDD domain creation, etc., help, along with the higher radix structures, realizing the more efficient multiplier designs

    Increasing adder efficiency by exploiting input statistics

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2008.Includes bibliographical references (p. 49-50).Current techniques for characterizing the power consumption of adders rely on assuming that the inputs are completely random. However, the inputs generated by realistic applications are not random, and in fact include a great deal of structure. Input bits are more likely to remain in the same logical states from addition to addition than would be expected by chance and bits, especially the most significant bits, are very likely to be in the same state as their neighbors. Taking this data, I look at ways that it can be used to improve the design of adders. The first method I look at involves looking at how different adder architectures respond to the different characteristics of input data from the more significant and less significant bits of the adder, and trying to use these responses to create a hybrid adder. Unfortunately the differences are not sufficient for this approach to be effective. I next look at the implications of the data I collected for the optimization of Kogge- Stone adder trees, and find that in certain circumstances the use of experimentally derived activity maps rather than ones based on simple assumptions can increase adder performance by as much as 30%.by Andrew Lawrence Clough.M.Eng

    VLSI design of high-speed adders for digital signal processing applications.

    Get PDF

    Design of ALU and Cache Memory for an 8 bit ALU

    Get PDF
    The design of an ALU and a Cache memory for use in a high performance processor was examined in this thesis. Advanced architectures employing increased parallelism were analyzed to minimize the number of execution cycles needed for 8 bit integer arithmetic operations. In addition to the arithmetic unit, an optimized SRAM memory cell was designed to be used as cache memory and as fast Look Up Table. The ALU consists of stand alone units for bit parallel computation of basic integer arithmetic operations. Addition and subtraction were performed using Kogge Stone parallel prefix hardware operating at 330MHz. A high performance multiplier was built using Radix 4 Modified Booth Encoder (MBE) and a Wallace Tree summation array. The multiplier requires single clock cycle for 8 bit integer multiplication and operates at a maximum frequency of 100MHz. Multiplicative division hardware was built for executing both integer division and square root. The division hardware computes 8-bit division and square root in 4 clock cycles. Multiplier forms the basic building block of all these functional units, making high level of resource sharing feasible with this architecture. The optimal operating frequency for the arithmetic unit is 70MHz. A 6T CMOS SRAM cell measuring 90 µm2 was designed using minimum size transistors. The layout allows for horizontal overlap resulting in effective area of 76 µm2 for an 8x8 array. By substituting equivalent bit line capacitance of P4 L1 Cache, the memory was simulated to have a read time of 3.27ns. An optimized set of test vectors were identified to enable high fault coverage without the need for any additional test circuitry. Sixteen test cases were identified that would toggle all the nodes and provide all possible inputs to the sub units of the multiplier. A correlation based semi automatic method was investigated to facilitate test case identification for large multipliers. This method of testability eliminates performance and area overhead associated with conventional testability hardware. Bottom up design methodology was employed for the design. The performance and area metrics are presented along with estimated power consumption. A set of Monte Carlo analysis was carried out to ensure the dependability of the design under process variations as well as fluctuations in operating conditions. The arithmetic unit was found to require a total die area of 2mm2 (approx.) in 0.35 micron process

    Implementing Energy Parsimonious Circuits through Inexact Designs

    Get PDF
    Inexact Circuits or circuits in which accuracy of the output can be traded for cost (energy, delay and/or area) savings, have been receiving increasing attention of late due to invariable inaccuracies in nanometer-scale circuits and a concomitant growing desire for ultra low energy embedded systems. Most of the previous approaches to realize inexact circuits relied on scaling of circuit-level operational parameters (such as supply voltage) to achieve the cost and accuracy tradeoffs, and suffered from serious drawbacks of significant implementation overheads that drastically reduced the gains. In this thesis, two novel architecture-level approaches called Probabilisttc Pruning and Probabilistic Logic Minimization are proposed to realize inexact circuits with zero overhead. Extensive simulations on various architectures of datapath elements and a prototype chip fabrication demonstrate that normalized gains as large as 2X-9.5X in Energy-Delay-Area product can be obtained for relative error as low as 10 -6 % - 1% compared to corresponding conventional correct designs

    Design of Energy-Efficient Approximate Arithmetic Circuits

    Get PDF
    Energy consumption has become one of the most critical design challenges in integrated circuit design. Arithmetic computing circuits, in particular array-based arithmetic computing circuits such as adders, multipliers, squarers, have been widely used. In many cases, array-based arithmetic computing circuits consume a significant amount of energy in a chip design. Hence, reduction of energy consumption of array-based arithmetic computing circuits is an important design consideration. To this end, designing low-power arithmetic circuits by intelligently trading off processing precision for energy saving in error-resilient applications such as DSP, machine learning and neuromorphic circuits provides a promising solution to the energy dissipation challenge of such systems. To solve the chip’s energy problem, especially for those applications with inherent error resilience, array-based approximate arithmetic computing (AAAC) circuits that produce errors while having improved energy efficiency have been proposed. Specifically, a number of approximate adders, multipliers and squarers have been presented in the literature. However, the chief limitation of these designs is their un-optimized processing accuracy, which is largely due to the current lack of systemic guidance for array-based AAAC circuit design pertaining to optimal tradeoffs between error, energy and area overhead. Therefore, in this research, our first contribution is to propose a general model for approximate array-based approximate arithmetic computing to guide the minimization of processing error. As part of this model, the Error Compensation Unit (ECU) is identified as a key building block for a wide range of AAAC circuits. We develop theoretical analysis geared towards addressing two critical design problems of the ECU, namely, determination of optimal error compensation values and identification of the optimal error compensation scheme. We demonstrate how this general AAAC model can be leveraged to derive practical design insights that may lead to optimal tradeoffs between accuracy, energy dissipation and area overhead. To further minimize energy consumption, delay and area of AAAC circuits, we perform ECU logic simplification by introducing don't cares. By applying the proposed model, we propose an approximate 16x16 fixed-width Booth multiplier that consumes 44.85% and 28.33% less energy and area compared with theoretically the most accurate fixed-width Booth multiplier when implemented using a 90nm CMOS standard cell library. Furthermore, it reduces average error, max error and mean square error by 11.11%, 28.11% and 25.00%, respectively, when compared with the best reported approximate Booth multiplier and outperforms the best reported approximate design significantly by 19.10% in terms of the energy-delay-mean square error product (EDE_(ms)). Using the same approach, significant energy consumption, area and error reduction is achieved for a squarer unit, with more than 20.00% EDE_(ms) reduction over existing fixed-width squarer designs. To further reduce error and cost by utilizing extra signatures and don't cares, we demonstrate a 16-bit fixed-width squarer that improves the energy-delay-max error (EDE_(max)) by 15.81%

    Energy Aware Design and Analysis for Synchronous and Asynchronous Circuits

    Get PDF
    Power dissipation has become a major concern for IC designers. Various low power design techniques have been developed for synchronous circuits. Asynchronous circuits, however. have gained more interests recently due to their benefits in lower noise, easy timing control, etc. But few publications on energy reduction techniques for asynchronous logic are available. Power awareness indicates the ability of the system power to scale with changing conditions and quality requirements. Scalability is an important figure-of-merit since it allows the end user to implement operational policy. just like the user of mobile multimedia equipment needs to select between better quality and longer battery operation time. This dissertation discusses power/energy optimization and performs analysis on both synchronous and asynchronous logic. The major contributions of this dissertation include: 1 ) A 2-Dimensional Pipeline Gating technique for synchronous pipelined circuits to improve their power awareness has been proposed. This technique gates the corresponding clock lines connected to registers in both vertical direction (the data flow direction) and horizontal direction (registers within each pipeline stage) based on current input precision. 2) Two energy reduction techniques, Signal Bypassing & Insertion and Zero Insertion. have been developed for NCL circuits. Both techniques use Nulls to replace redundant Data 0\u27s based on current input precision in order to reduce the switching activity while Signal Bypassing & Insertion is for non-pipelined NCI, circuits and Zero Insertion is for pipelined counterparts. A dynamic active-bit detection scheme is also developed as an expansion. 3) Two energy estimation techniques, Equivalent Inverter Modeling based on Input Mapping in transistor-level and Switching Activity Modeling in gate-level, have been proposed. The former one is for CMOS gates with feedbacks and the latter one is for NCL circuits

    Energy-efficient embedded machine learning algorithms for smart sensing systems

    Get PDF
    Embedded autonomous electronic systems are required in numerous application domains such as Internet of Things (IoT), wearable devices, and biomedical systems. Embedded electronic systems usually host sensors, and each sensor hosts multiple input channels (e.g., tactile, vision), tightly coupled to the electronic computing unit (ECU). The ECU extracts information by often employing sophisticated methods, e.g., Machine Learning. However, embedding Machine Learning algorithms poses essential challenges in terms of hardware resources and energy consumption because of: 1) the high amount of data to be processed; 2) computationally demanding methods. Leveraging on the trade-off between quality requirements versus computational complexity and time latency could reduce the system complexity without affecting the performance. The objectives of the thesis are to develop: 1) energy-efficient arithmetic circuits outperforming state of the art solutions for embedded machine learning algorithms, 2) an energy-efficient embedded electronic system for the \u201celectronic-skin\u201d (e-skin) application. As such, this thesis exploits two main approaches: Approximate Computing: In recent years, the approximate computing paradigm became a significant major field of research since it is able to enhance the energy efficiency and performance of digital systems. \u201cApproximate Computing\u201d(AC) turned out to be a practical approach to trade accuracy for better power, latency, and size . AC targets error-resilient applications and offers promising benefits by conserving some resources. Usually, approximate results are acceptable for many applications, e.g., tactile data processing,image processing , and data mining ; thus, it is highly recommended to take advantage of energy reduction with minimal variation in performance . In our work, we developed two approximate multipliers: 1) the first one is called \u201cMETA\u201d multiplier and is based on the Error Tolerant Adder (ETA), 2) the second one is called \u201cApproximate Baugh-Wooley(BW)\u201d multiplier where the approximations are implemented in the generation of the partial products. We showed that the proposed approximate arithmetic circuits could achieve a relevant reduction in power consumption and time delay around 80.4% and 24%, respectively, with respect to the exact BW multiplier. Next, to prove the feasibility of AC in real world applications, we explored the approximate multipliers on a case study as the e-skin application. The e-skin application is defined as multiple sensing components, including 1) structural materials, 2) signal processing, 3) data acquisition, and 4) data processing. Particularly, processing the originated data from the e-skin into low or high-level information is the main problem to be addressed by the embedded electronic system. Many studies have shown that Machine Learning is a promising approach in processing tactile data when classifying input touch modalities. In our work, we proposed a methodology for evaluating the behavior of the system when introducing approximate arithmetic circuits in the main stages (i.e., signal and data processing stages) of the system. Based on the proposed methodology, we first implemented the approximate multipliers on the low-pass Finite Impulse Response (FIR) filter in the signal processing stage of the application. We noticed that the FIR filter based on (Approx-BW) outperforms state of the art solutions, while respecting the tradeoff between accuracy and power consumption, with an SNR degradation of 1.39dB. Second, we implemented approximate adders and multipliers respectively into the Coordinate Rotational Digital Computer (CORDIC) and the Singular Value Decomposition (SVD) circuits; since CORDIC and SVD take a significant part of the computationally expensive Machine Learning algorithms employed in tactile data processing. We showed benefits of up to 21% and 19% in power reduction at the cost of less than 5% accuracy loss for CORDIC and SVD circuits when scaling the number of approximated bits. 2) Parallel Computing Platforms (PCP): Exploiting parallel architectures for near-threshold computing based on multi-core clusters is a promising approach to improve the performance of smart sensing systems. In our work, we exploited a novel computing platform embedding a Parallel Ultra Low Power processor (PULP), called \u201cMr. Wolf,\u201d for the implementation of Machine Learning (ML) algorithms for touch modalities classification. First, we tested the ML algorithms at the software level; for RGB images as a case study and tactile dataset, we achieved accuracy respectively equal to 97% and 83.5%. After validating the effectiveness of the ML algorithm at the software level, we performed the on-board classification of two touch modalities, demonstrating the promising use of Mr. Wolf for smart sensing systems. Moreover, we proposed a memory management strategy for storing the needed amount of trained tensors (i.e., 50 trained tensors for each class) in the on-chip memory. We evaluated the execution cycles for Mr. Wolf using a single core, 2 cores, and 3 cores, taking advantage of the benefits of the parallelization. We presented a comparison with the popular low power ARM Cortex-M4F microcontroller employed, usually for battery-operated devices. We showed that the ML algorithm on the proposed platform runs 3.7 times faster than ARM Cortex M4F (STM32F40), consuming only 28 mW. The proposed platform achieves 15 7 better energy efficiency than the classification done on the STM32F40, consuming 81mJ per classification and 150 pJ per operation

    Computer arithmetic based on the Continuous Valued Number System

    Get PDF
    • …
    corecore