
    VLSI ARCHITECTURE OF PARALLEL MULTIPLIER–ACCUMULATOR BASED ON RADIX-2 MODIFIED BOOTH ALGORITHM

    A new architecture of multiplier-and-accumulator (MAC) for high-speed arithmetic is presented. By combining multiplication with accumulation and devising a hybrid type of carry-save adder (CSA), the performance is improved. Since the accumulator, which has the largest delay in the MAC, is merged into the CSA, the overall performance is elevated. The proposed CSA tree uses a 1's-complement-based radix-2 modified Booth's algorithm (MBA) and a modified array for sign extension in order to increase the bit density of the operands. The proposed MAC shows properties superior to the standard design in many respects, and roughly twice the performance of previous work at a similar clock frequency. We expect that the proposed MAC can be adopted in fields requiring high performance, such as signal processing
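
    As an illustration of the recoding family the MBA builds on, the following is a minimal Python sketch of modified Booth digit selection, shown here in its common radix-4 form; the function name and bit width are illustrative, not taken from the paper.

        # Minimal sketch of modified Booth recoding: each overlapping 3-bit
        # window of the multiplier selects a signed digit in {-2,-1,0,+1,+2},
        # halving the number of partial products versus a plain AND array.
        def booth_digits(y, bits=8):
            u = (y & ((1 << bits) - 1)) << 1   # two's-complement bits, implicit 0 at LSB
            digits = []
            for i in range(0, bits, 2):
                t = (u >> i) & 0b111           # window (b[i+1], b[i], b[i-1])
                digits.append(((t >> 1) & 1) + (t & 1) - 2 * ((t >> 2) & 1))
            return digits                      # least-significant digit first

        # Each digit carries a weight of 4**k, so the digits reconstruct y exactly.
        for y in range(-128, 128):
            assert sum(d * 4**k for k, d in enumerate(booth_digits(y))) == y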

    Analysis and Evaluation of MAC Operators for Fast Fourier Transformation

    Arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this paper, an optimized design of the fused Add-Multiply (FAM) operators is investigated to improve performance. The straightforward design of an AM unit, implemented by cascading an adder and then driving its output to the input of a multiplier, significantly increases both the area and the critical path delay of the circuit. Direct recoding of the sum of two numbers into its Modified Booth (MB) form leads to a more efficient implementation of the fused Add-Multiply unit compared to the conventional one; earlier recoding schemes depend on complex bit-level manipulations implemented by dedicated gate-level circuits. The new recoding scheme, together with a modified CSA tree, reduces the critical path delay and the power consumption. This paper focuses on further reducing the latency and power consumption of the CSA-tree multiplier, accomplished through the use of a modified Booth Add-Multiply operator and 4:2 compressor adders. Three alternative designs of the proposed S-MB approach, using conventional and signed-bit Full Adders (FAs) and Half Adders (HAs) as building blocks, are investigated
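
    Since the design leans on 4:2 compressor adders, a small functional model may help. This is a sketch of the textbook two-full-adder realization; the names are illustrative, not from the paper.

        # Functional model of a 4:2 compressor built from two full adders:
        # it reduces four partial-product bits plus a carry-in to one sum bit
        # and two carry bits of double weight.
        from itertools import product

        def full_adder(a, b, c):
            return a ^ b ^ c, (a & b) | (b & c) | (a & c)

        def compressor_4_2(x1, x2, x3, x4, cin):
            s1, cout = full_adder(x1, x2, x3)   # cout feeds the neighboring column
            s, carry = full_adder(s1, x4, cin)
            return s, carry, cout

        # Defining property: inputs and outputs have the same arithmetic weight.
        for bits in product((0, 1), repeat=5):
            s, carry, cout = compressor_4_2(*bits)
            assert sum(bits) == s + 2 * (carry + cout)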

    Design of ALU and Cache Memory for an 8-bit ALU

    The design of an ALU and a cache memory for use in a high-performance processor was examined in this thesis. Advanced architectures employing increased parallelism were analyzed to minimize the number of execution cycles needed for 8-bit integer arithmetic operations. In addition to the arithmetic unit, an optimized SRAM memory cell was designed for use as cache memory and as a fast look-up table. The ALU consists of stand-alone units for bit-parallel computation of basic integer arithmetic operations. Addition and subtraction were performed using Kogge-Stone parallel prefix hardware operating at 330 MHz. A high-performance multiplier was built using a Radix-4 Modified Booth Encoder (MBE) and a Wallace tree summation array. The multiplier requires a single clock cycle for 8-bit integer multiplication and operates at a maximum frequency of 100 MHz. Multiplicative division hardware was built for executing both integer division and square root; it computes 8-bit division and square root in 4 clock cycles. The multiplier forms the basic building block of all these functional units, making a high level of resource sharing feasible with this architecture. The optimal operating frequency for the arithmetic unit is 70 MHz. A 6T CMOS SRAM cell measuring 90 µm² was designed using minimum-size transistors. The layout allows for horizontal overlap, resulting in an effective area of 76 µm² for an 8x8 array. By substituting the equivalent bit-line capacitance of the P4 L1 cache, the memory was simulated to have a read time of 3.27 ns. An optimized set of test vectors was identified to enable high fault coverage without the need for any additional test circuitry. Sixteen test cases were identified that toggle all the nodes and provide all possible inputs to the sub-units of the multiplier. A correlation-based semi-automatic method was investigated to facilitate test-case identification for large multipliers. This method of testability eliminates the performance and area overhead associated with conventional testability hardware. A bottom-up design methodology was employed. The performance and area metrics are presented along with estimated power consumption. Monte Carlo analyses were carried out to ensure the dependability of the design under process variations as well as fluctuations in operating conditions. The arithmetic unit was found to require a total die area of approximately 2 mm² in a 0.35-micron process
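
    To illustrate the parallel-prefix idea behind the Kogge-Stone adder, here is a bit-level Python sketch; it models the algorithm, not the thesis RTL, and the names and 8-bit width are illustrative.

        # Bit-level model of a Kogge-Stone adder: generate/propagate pairs are
        # combined in log2(width) prefix stages, so every carry is known after
        # a logarithmic number of levels rather than a linear ripple.
        def kogge_stone_add(a, b, width=8):
            g = [(a >> i) & (b >> i) & 1 for i in range(width)]   # generate
            p = [((a ^ b) >> i) & 1 for i in range(width)]        # propagate
            p0 = p[:]                       # keep bit-level propagate for the sum
            d = 1
            while d < width:                # prefix stages: spans 1, 2, 4, ...
                for i in range(width - 1, d - 1, -1):
                    g[i] |= p[i] & g[i - d]
                    p[i] &= p[i - d]
                d <<= 1
            c = [0] + g[:-1]                # carry into bit i covers bits [i-1:0]
            return sum((p0[i] ^ c[i]) << i for i in range(width))

        # Exhaustive check against ordinary 8-bit modular addition.
        for a in range(256):
            for b in range(256):
                assert kogge_stone_add(a, b) == (a + b) & 0xFF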

    Improving the Hardware Performance of Arithmetic Circuits using Approximate Computing

    An application that can produce a useful result despite some level of computational error is said to be error resilient. Approximate computing can be applied to error-resilient applications by intentionally introducing error into the computation in order to improve performance, and it has been shown that approximation is especially well-suited for application in arithmetic computing hardware. In this thesis, novel approximate arithmetic architectures are proposed for three different operations, namely multiplication, division, and the multiply-accumulate (MAC) operation. For all designs, accuracy is evaluated in terms of mean relative error distance (MRED) and normalized mean error distance (NMED), while hardware performance is reported in terms of critical path delay, area, and power consumption. Three approximate Booth multipliers (ABM-M1, ABM-M2, ABM-M3) are designed in which two novel inexact partial product generators are used to reduce the dimensions of the partial product matrix. The proposed multipliers are compared to other state-of-the-art designs in terms of both accuracy and hardware performance, and are found to reduce power consumption by up to 56% when compared to the exact multiplier. The function of the multipliers is verified in several image processing applications. Two approximate restoring dividers (AXRD-M1, AXRD-M2) are proposed along with a novel inexact restoring divider cell. In the first divider, the conventional cells are replaced with the proposed inexact cells in several columns. The second divider computes only a subset of the trial subtractions, after which the divisor and partial remainder are rounded and encoded so that they may be used to estimate the remaining quotient bits. The proposed dividers are evaluated for accuracy and hardware performance alongside several benchmarking designs, and their function is verified using change detection and foreground extraction applications. An approximate MAC unit is presented in which the multiplication is implemented using a modified version of ABM-M3. The delay is reduced by using a fused architecture where the accumulator is summed as part of the multiplier compression. The accuracy and hardware savings of the MAC unit are measured against several works from the literature, and the design is utilized in a number of convolution operations
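
    Under the usual definitions of these metrics (error distance averaged over a test set, with NMED normalized by the largest exact output), a sketch of the accuracy evaluation might look like the following; the function names, the toy approximate multiplier, and the normalization choice are assumptions for illustration, not taken from the thesis.

        # Sketch of accuracy evaluation for an approximate multiplier: MRED is
        # the mean of |approx - exact| / |exact|, and NMED is the mean error
        # distance normalized by the maximum exact output magnitude.
        from statistics import mean

        def mred_nmed(pairs, max_output):
            eds = [abs(approx - exact) for exact, approx in pairs]
            mred = mean(ed / abs(exact) for ed, (exact, _) in zip(eds, pairs) if exact != 0)
            nmed = mean(eds) / max_output
            return mred, nmed

        # Example with a toy "approximate" multiplier that drops the low result bit.
        pairs = [(a * b, (a * b) & ~1) for a in range(1, 16) for b in range(1, 16)]
        print(mred_nmed(pairs, max_output=15 * 15))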

    Energy-Efficient Neural Network Architectures

    Emerging systems for artificial intelligence (AI) are expected to rely on deep neural networks (DNNs) to achieve high accuracy for a broad variety of applications, including computer vision, robotics, and speech recognition. Due to the rapid growth of network size and depth, however, DNNs typically result in high computational costs and introduce considerable power and performance overheads. Dedicated chip architectures that implement DNNs with high energy efficiency are essential for adding intelligence to interactive edge devices, enabling them to complete increasingly sophisticated tasks while extending battery life. They are also vital for improving performance in cloud servers that support demanding AI computations. This dissertation focuses on architectures and circuit technologies for designing energy-efficient neural network accelerators. First, a deep-learning processor is presented for achieving ultra-low-power operation. Using a heterogeneous architecture that includes a low-power always-on front-end and a selectively-enabled high-performance back-end, the processor dynamically adjusts computational resources at runtime to support conditional execution in neural networks and meet performance targets with increased energy efficiency. Featuring a reconfigurable datapath and a memory architecture optimized for energy efficiency, the processor supports multilevel dynamic activation of neural network segments, performing object detection tasks with 5.3x lower energy consumption in comparison with a static execution baseline. Fabricated in 40nm CMOS, the processor test-chip dissipates 0.23mW at 5.3 fps. It demonstrates energy scalability up to 28.6 TOPS/W and can be configured to run a variety of workloads, including severely power-constrained ones such as always-on monitoring in mobile applications. To further improve the energy efficiency of the proposed heterogeneous architecture, a new charge-recovery logic family, called zero-short-circuit current (ZSCC) logic, is proposed to decrease the power consumption of the always-on front-end. By relying on dedicated circuit topologies and a four-phase clocking scheme, ZSCC operates with significantly reduced short-circuit currents, realizing order-of-magnitude power savings at relatively low clock frequencies (on the order of a few MHz). The efficiency and applicability of ZSCC are demonstrated through an ANSI S1.11 1/3 octave filter bank chip for binaural hearing aids with two microphones per ear. Fabricated in a 65nm CMOS process, this charge-recovery chip consumes 13.8µW at a 1.75MHz clock frequency, achieving a 9.7x power reduction per input in comparison with a 40nm monophonic single-input chip that represents the published state of the art. The ability of ZSCC to further increase the energy efficiency of the heterogeneous neural network architecture is demonstrated through the design and evaluation of a ZSCC-based front-end. Simulation results show a 17x power reduction compared with a conventional static CMOS implementation of the same architecture.
    Ph.D. dissertation, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147614/1/hsiwu_1.pd

    Application-specific instruction set processor for speech recognition.

    Cheung Man Ting. Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 69-71). Abstracts in English and Chinese.
    Chapter 1 --- Introduction --- p.1
    Chapter 1.1 --- The Emergence of ASIP --- p.1
    Chapter 1.1.1 --- Related Work --- p.3
    Chapter 1.2 --- Motivation --- p.6
    Chapter 1.3 --- ASIP Design Methodologies --- p.7
    Chapter 1.4 --- Fundamentals of Speech Recognition --- p.8
    Chapter 1.5 --- Thesis outline --- p.10
    Chapter 2 --- Automatic Speech Recognition --- p.11
    Chapter 2.1 --- Overview of ASR system --- p.11
    Chapter 2.2 --- Theory of Front-end Feature Extraction --- p.12
    Chapter 2.3 --- Theory of HMM-based Speech Recognition --- p.14
    Chapter 2.3.1 --- Hidden Markov Model (HMM) --- p.14
    Chapter 2.3.2 --- The Typical Structure of the HMM --- p.14
    Chapter 2.3.3 --- Discrete HMMs and Continuous HMMs --- p.15
    Chapter 2.3.4 --- The Three Basic Problems for HMMs --- p.17
    Chapter 2.3.5 --- Probability Evaluation --- p.18
    Chapter 2.4 --- The Viterbi Search Engine --- p.19
    Chapter 2.5 --- Isolated Word Recognition (IWR) --- p.22
    Chapter 3 --- Design of ASIP Platform --- p.24
    Chapter 3.1 --- Instruction Fetch --- p.25
    Chapter 3.2 --- Instruction Decode --- p.26
    Chapter 3.3 --- Datapath --- p.29
    Chapter 3.4 --- Register File Systems --- p.30
    Chapter 3.4.1 --- Memory Hierarchy --- p.30
    Chapter 3.4.2 --- Register File Organization --- p.31
    Chapter 3.4.3 --- Special Registers --- p.34
    Chapter 3.4.4 --- Address Generation --- p.34
    Chapter 3.4.5 --- Load and Store --- p.36
    Chapter 4 --- Implementation of Speech Recognition on ASIP --- p.37
    Chapter 4.1 --- Hardware Architecture Exploration --- p.37
    Chapter 4.1.1 --- Floating Point and Fixed Point --- p.37
    Chapter 4.1.2 --- Multiplication and Accumulation --- p.38
    Chapter 4.1.3 --- Pipelining --- p.41
    Chapter 4.1.4 --- Memory Architecture --- p.43
    Chapter 4.1.5 --- Saturation Logic --- p.44
    Chapter 4.1.6 --- Specialized Addressing Modes --- p.44
    Chapter 4.1.7 --- Repetitive Operation --- p.47
    Chapter 4.2 --- Software Algorithm Implementation --- p.49
    Chapter 4.2.1 --- Implementation Using Base Instruction Set --- p.49
    Chapter 4.2.2 --- Implementation Using Refined Instruction Set --- p.54
    Chapter 5 --- Simulation Results --- p.56
    Chapter 6 --- Conclusions and Future Work --- p.60
    Appendices --- p.62
    Chapter A --- Base Instruction Set --- p.62
    Chapter B --- Special Registers --- p.65
    Chapter C --- Chip Microphotograph of ASIP --- p.67
    Chapter D --- The Testing Board of ASIP --- p.68
    Bibliography --- p.6

    Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation

    Deep learning has achieved great success in recent years. In many fields of application, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve performance that is even better than human level. However, behind this superior performance is the expensive hardware cost required to implement deep learning operations, which are both computation intensive and memory intensive. Many research works in the literature have focused on improving the efficiency of deep learning operations. In this thesis, special focus is put on improving deep learning computation, and several efficient arithmetic unit architectures are proposed and optimized for it. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep-learning-specific arithmetic units; (3) the optimization of deep learning computation using a 3D memory architecture. Deep learning models are usually trained on graphics processing units (GPUs), and the computations are done with single-precision floating-point numbers. However, recent works have shown that deep learning computation can be accomplished with low-precision numbers. Half-precision numbers are becoming more and more popular in deep learning computation due to their lower hardware cost compared to single-precision numbers. In conventional floating-point arithmetic units, single precision and beyond are well supported to achieve better precision. For deep learning computation, however, since the computations are intensive, low-precision computation is desired to achieve better throughput. As the popularity of half precision rises, half-precision operations also need to be supported. Moreover, deep learning computation contains many dot-product operations, so support for mixed-precision dot-product operations can be explored in a multiple-precision architecture. In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half-, single-, double-, and quadruple-precision FMA operations, as well as 2-term mixed-precision dot-product operations. Compared to the conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot-product bring only minor resource overhead. The proposed FMA can be used as a general-purpose arithmetic unit; due to its support of parallel half-precision computations and mixed-precision dot-product computations, it is especially suitable for deep learning computation. For the design of deep-learning-specific computation units, more optimizations can be performed. First, a merged fixed-point and floating-point multiply-accumulate (MAC) unit is proposed. As deep learning computation can be accomplished with low-precision number formats, support for high-precision floating-point operations can be eliminated. In this design, the half-precision floating-point format is supported to provide the large dynamic range needed to handle small gradients in deep learning training. For deep learning inference, 8-bit fixed-point 2-term dot-product computation is supported. Second, a flexible multiple-precision MAC unit architecture is proposed. The proposed MAC unit supports both fixed-point and floating-point operations. For the floating-point format, the proposed unit supports one 16-bit MAC operation or the sum of two 8-bit multiplications plus a 16-bit addend. To make the proposed MAC unit more versatile, the bit-widths of the exponent and mantissa can be flexibly exchanged, and by setting the bit-width of the exponent to zero, the proposed MAC unit also supports fixed-point operations. For the fixed-point format, the proposed unit supports one 16-bit MAC or the sum of two 8-bit multiplications plus a 16-bit addend; moreover, it can be further divided to support the sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the proposed MAC unit supports accumulation of eight 1-bit logic AND operations to enable support for binary neural networks. Finally, a MAC architecture based on the posit format, a promising numerical format for deep learning computation, is proposed to facilitate the use of posits in deep learning computation. In addition to the above-mentioned arithmetic units, an improved hybrid memory cube (HMC) architecture is proposed for weight-sharing deep neural network processing. By modifying the HMC instruction set and logic layer, the major part of the deep learning computation can be accomplished inside memory. The proposed design reduces the memory bandwidth requirements and thus the energy consumed by memory data transfer
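
    The resource sharing behind these precision modes can be seen arithmetically: a wide product decomposes into narrow partial products, so one multiplier array can serve several precisions. The Python sketch below is a numerical illustration of that identity under our own naming; it models the arithmetic, not the proposed hardware.

        # A 16x16 product assembled from four 8x8 partial products: the same
        # identity lets one multiplier array compute either a single wide MAC
        # or several narrow multiplications in parallel.
        def mul16_from_8x8(a, b):
            a_hi, a_lo = a >> 8, a & 0xFF
            b_hi, b_lo = b >> 8, b & 0xFF
            return ((a_hi * b_hi) << 16) + ((a_hi * b_lo + a_lo * b_hi) << 8) + a_lo * b_lo

        def mac16(a, b, addend):              # one 16-bit MAC
            return mul16_from_8x8(a, b) + addend

        def dot2_8(a0, b0, a1, b1, addend):   # sum of two 8-bit products plus addend
            return a0 * b0 + a1 * b1 + addend

        assert mac16(0x1234, 0xBEEF, 42) == 0x1234 * 0xBEEF + 42
        print(dot2_8(12, 34, 56, 78, 100))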

    Design and realization of a high speed 64 x 64-bit multiplier for low power applications

    Wireless communication systems, including third-generation cellular radio systems and wireless LANs, have become tremendously popular in recent years. These systems can be implemented using various platforms, such as digital signal processors, ASICs and FPGAs. Most digital signal processing systems incorporate a multiplication unit to implement algorithms such as correlation, convolution, filtering and frequency analysis. These algorithms are used in applications such as finite impulse response (FIR) filters, infinite impulse response (IIR) filters, discrete cosine transforms (DCT) and fast Fourier transforms (FFT). Moreover, there has been a rapid increase in the popularity of portable and wireless electronic devices, like laptop computers, portable video players and cellular phones, which rely on embedded digital signal processors. Since the desire is to design digital systems for communication applications at the best performance without power sacrifices, the need for high-performance, low-power multipliers is inevitable. Since multiplication is one of the most critical operations in many computational systems, many algorithms have been proposed in the literature to perform multiplication, each offering different advantages and having tradeoffs in terms of speed, circuit complexity, area and power consumption. This thesis focuses on an ASIC implementation of a multiplexer-based multiplication method, an efficient algorithm which is applicable to low-power applications. Recently, it has been shown that the multiplexer-based multiplier outperforms the modified Booth multiplier in both speed and power dissipation by 13% to 26%, owing to its small internal capacitance. After analyzing the performance characteristics of conventional multiplier types, it is observed that the one designed using the multiplexer-based multiplication algorithm is more advantageous, especially when the size of the multiplied numbers is small. To verify the advantages of this algorithm, we performed an implementation in which the bit sizes of the multiplicand and the multiplier are comparatively large. Thus, a 64 x 64-bit multiplier block was realized in 0.35 µm CMOS technology using Cadence Design Framework tools. The final multiplier structure has a delay of 12.8 ns with an approximate dynamic power consumption of 1 mW. Using the same algorithm, a 32 x 32-bit multiplier block was also designed and sent for fabrication
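
    One common multiplexer-based formulation selects, for every 2-bit slice of the multiplier, one of the precomputed multiples {0, A, 2A, 3A}. The Python sketch below is a functional model of that idea under our own naming; the thesis does not specify its algorithm at this level, so this is an assumption for illustration, not the exact method implemented.

        # Functional model of multiplexer-based multiplication: each 2-bit
        # group of the multiplier drives a 4:1 mux choosing among precomputed
        # multiples of the multiplicand, avoiding Booth's negative encodings.
        def mux_multiply(a, b, width=64):
            multiples = (0, a, a << 1, (a << 1) + a)   # {0, A, 2A, 3A}
            product = 0
            for i in range(0, width, 2):
                product += multiples[(b >> i) & 0b11] << i
            return product

        for a in (0, 1, 0xDEADBEEF, (1 << 64) - 1):
            for b in (0, 7, 0xCAFEBABE, (1 << 64) - 1):
                assert mux_multiply(a, b) == a * b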