45 research outputs found

    Design of ALU and Cache Memory for an 8 bit ALU

    Get PDF
    The design of an ALU and a Cache memory for use in a high performance processor was examined in this thesis. Advanced architectures employing increased parallelism were analyzed to minimize the number of execution cycles needed for 8 bit integer arithmetic operations. In addition to the arithmetic unit, an optimized SRAM memory cell was designed to be used as cache memory and as fast Look Up Table. The ALU consists of stand alone units for bit parallel computation of basic integer arithmetic operations. Addition and subtraction were performed using Kogge Stone parallel prefix hardware operating at 330MHz. A high performance multiplier was built using Radix 4 Modified Booth Encoder (MBE) and a Wallace Tree summation array. The multiplier requires single clock cycle for 8 bit integer multiplication and operates at a maximum frequency of 100MHz. Multiplicative division hardware was built for executing both integer division and square root. The division hardware computes 8-bit division and square root in 4 clock cycles. Multiplier forms the basic building block of all these functional units, making high level of resource sharing feasible with this architecture. The optimal operating frequency for the arithmetic unit is 70MHz. A 6T CMOS SRAM cell measuring 90 µm2 was designed using minimum size transistors. The layout allows for horizontal overlap resulting in effective area of 76 µm2 for an 8x8 array. By substituting equivalent bit line capacitance of P4 L1 Cache, the memory was simulated to have a read time of 3.27ns. An optimized set of test vectors were identified to enable high fault coverage without the need for any additional test circuitry. Sixteen test cases were identified that would toggle all the nodes and provide all possible inputs to the sub units of the multiplier. A correlation based semi automatic method was investigated to facilitate test case identification for large multipliers. This method of testability eliminates performance and area overhead associated with conventional testability hardware. Bottom up design methodology was employed for the design. The performance and area metrics are presented along with estimated power consumption. A set of Monte Carlo analysis was carried out to ensure the dependability of the design under process variations as well as fluctuations in operating conditions. The arithmetic unit was found to require a total die area of 2mm2 (approx.) in 0.35 micron process

    Scalable Test Generators for High-Speed Datapath Circuits

    Full text link
    This paper explores the design of efficient test sets and test-pattern generators for on-line BIST. The target applications are high-performance, scalable datapath circuits for which fast and complete fault coverage is required. Because of the presence of carry-lookahead, most existing BIST methods are unsuitable for these applications. High-level models are used to identify potential test sets for a small version of the circuit to be tested. Then a regular test set is extracted and a test generator TG is designed to meet the following goals: scalability, small test set size, full fault coverage, and very low hardware overhead. TG takes the form of a twisted ring counter with a small decoder array. We apply our technique to various datapath circuits including a carry-lookahead adder, an arithmetic-logic unit, and a multiplier-adder.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/43010/1/10836_2004_Article_154697.pd

    Reliable and Fault-Resilient Schemes for Efficient Radix-4 Complex Division

    Get PDF
    Complex division is commonly used in various applications in signal processing and control theory including astronomy and nonlinear RF measurements. Nevertheless, unless reliability and assurance are embedded into the architectures of such structures, the suboptimal (and thus erroneous) results could undermine the objectives of such applications. As such, in this thesis, we present schemes to provide complex number division architectures based on (Sweeney, Robertson, and Tocher) SRT-division with fault diagnosis mechanisms. Different fault resilient architectures are proposed in this thesis which can be tailored based on the eventual objectives of the designs in terms of area and time requirements, among which we pinpoint carefully the schemes based on recomputing with shifted operands (RESO) to be able to detect both natural and malicious faults and with proper modification achieve high throughputs. The design also implements a minimized look up table approach which favors in error detection based designs and provides high fault coverage with relatively-low overhead. Additionally, to benchmark the effectiveness of the proposed schemes, extensive fault diagnosis assessments are performed for the proposed designs through fault simulations and FPGA implementations; the design is implemented on Xilinx Spartan-VI and Xilinx Virtex-VI FPGA families

    THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

    Get PDF
    A 16 bit floating point (FP) Arithmetic Logic Unit (ALU) was designed and implemented in 0.35µm CMOS technology. Typical uses of the 16 bit FP ALU include graphics processors and embedded multimedia applications. The ALU of the modern microprocessors use a fused multiply add (FMA) design technique. An advantage of the FMA is to remove the need for a comparator which is required for a normal FP adder. The FMA consists of a multiplier, shifters, adders and rounding circuit. A fast multiplier based on the Wallace tree configuration was designed. The number of partial products was greatly reduced by the use of the modified booth encoder. The Wallace tree was chosen to reduce the number of reduction layers of partial products. The multiplier also involved the design of a pass transistor based 4:2 compressor. The average delay of the pass transistor based compressor was 55ps and was found to be 7 times faster than the full adder based 4:2 compressor. The shifters consist of separate left and right shifters using multiplexers. The shift amount is calculated using the exponents of the three operands. The addition operation is implemented using a carry skip adder (CSK). The average delay of the CSK was 1.05ns and was slower than the carry look ahead adder by about 400ps. The advantages of the CSK are reduced power, gate count and area when compared to the similar sized carry look ahead adder. The adder computes the addition of the multiplier result and the shifted value of the addend. In most modern computers, division is performed using software thereby eliminating the need for a separate hardware unit. FMA hardware unit was utilized to perform FP division. The FP divider uses the Newton Raphson algorithm to solve division by iteration. The initial approximated value with five bit accuracy was assumed to be pre-stored in cache memory and a separate clock cycle for cache read was assumed before the start of the FP division operation. In order to significantly reduce the area of the design, only one multiplier was used. Rounding to nearest technique was implemented using an 11 bit variable CSK adder. This is the best rounding technique when compared to other rounding techniques. In both the FMA and division, rounding was performed after the computation of the final result during the last clock cycle of operation. Testability analysis is performed for the multiplier which is the most complex and critical part of the FP ALU. The specific aim of testability was to ensure the correct operation of the multiplier and thus guarantee the correctness of the FMA circuit at the layout stage. The multiplier\u27s output was tested by identifying the minimal number of input vectors which toggle the inputs of the 4:2 compressors of the multiplier. The test vectors were identified in a semi automated manner using Perl scripting language. The multiplier was tested with a test set of thirty one vectors. The fault coverage of the multiplier was found to be 90.09%. The layout was implemented using IC station of Mentor Graphics CAD tool and resulted in a chip area of 1.96mm2. The specifications for basic arithmetic operations were met successfully. FP Division operation was completed within six clock cycles. The other arithmetic operations like FMA, FP addition, FP subtraction and FP multiplication were completed within three clock cycles

    Low power predictable memory and processing architectures

    Get PDF
    Great demand in power optimized devices shows promising economic potential and draws lots of attention in industry and research area. Due to the continuously shrinking CMOS process, not only dynamic power but also static power has emerged as a big concern in power reduction. Other than power optimization, average-case power estimation is quite significant for power budget allocation but also challenging in terms of time and effort. In this thesis, we will introduce a methodology to support modular quantitative analysis in order to estimate average power of circuits, on the basis of two concepts named Random Bag Preserving and Linear Compositionality. It can shorten simulation time and sustain high accuracy, resulting in increasing the feasibility of power estimation of big systems. For power saving, firstly, we take advantages of the low power characteristic of adiabatic logic and asynchronous logic to achieve ultra-low dynamic and static power. We will propose two memory cells, which could run in adiabatic and non-adiabatic mode. About 90% dynamic power can be saved in adiabatic mode when compared to other up-to-date designs. About 90% leakage power is saved. Secondly, a novel logic, named Asynchronous Charge Sharing Logic (ACSL), will be introduced. The realization of completion detection is simplified considerably. Not just the power reduction improvement, ACSL brings another promising feature in average power estimation called data-independency where this characteristic would make power estimation effortless and be meaningful for modular quantitative average case analysis. Finally, a new asynchronous Arithmetic Logic Unit (ALU) with a ripple carry adder implemented using the logically reversible/bidirectional characteristic exhibiting ultra-low power dissipation with sub-threshold region operating point will be presented. The proposed adder is able to operate multi-functionally

    Low-Power Design of Digital VLSI Circuits around the Point of First Failure

    Get PDF
    As an increase of intelligent and self-powered devices is forecasted for our future everyday life, the implementation of energy-autonomous devices that can wirelessly communicate data from sensors is crucial. Even though techniques such as voltage scaling proved to effectively reduce the energy consumption of digital circuits, additional energy savings are still required for a longer battery life. One of the main limitations of essentially any low-energy technique is the potential degradation of the quality of service (QoS). Thus, a thorough understanding of how circuits behave when operated around the point of first failure (PoFF) is key for the effective application of conventional energy-efficient methods as well as for the development of future low-energy techniques. In this thesis, a variety of circuits, techniques, and tools is described to reduce the energy consumption in digital systems when operated either in the safe and conservative exact region, close to the PoFF, or even inside the inexact region. A straightforward approach to reduce the power consumed by clock distribution while safely operating in the exact region is dual-edge-triggered (DET) clocking. However, the DET approach is rarely taken, primarily due to the perceived complexity of its integration. In this thesis, a fully automated design flow is introduced for applying DET clocking to a conventional single-edge-triggered (SET) design. In addition, the first static true-single-phase-clock DET flip-flop (DET-FF) that completely avoids clock-overlap hazards of DET registers is proposed. Even though the correct timing of synchronous circuits is ensured in worst-case conditions, the critical path might not always be excited. Thus, dynamic clock adjustment (DCA) has been proposed to trim any available dynamic timing margin by changing the operating clock frequency at runtime. This thesis describes a dynamically-adjustable clock generator (DCG) capable of modifying the period of the produced clock signal on a cycle-by-cycle basis that enables the DCA technique. In addition, a timing-monitoring sequential (TMS) that detects input transitions on either one of the clock phases to enable the selection of the best timing-monitoring strategy at runtime is proposed. Energy-quality scaling techniques aimat trading lower energy consumption for a small degradation on the QoS whenever approximations can be tolerated. In this thesis, a low-power methodology for the perturbation of baseline coefficients in reconfigurable finite impulse response (FIR) filters is proposed. The baseline coefficients are optimized to reduce the switching activity of the multipliers in the FIR filter, enabling the possibility of scaling the power consumption of the filter at runtime. The area as well as the leakage power of many system-on-chips is often dominated by embedded memories. Gain-cell embedded DRAM (GC-eDRAM) is a compact, low-power and CMOS-compatible alternative to the conventional static random-access memory (SRAM) when a higher memory density is desired. However, due to GC-eDRAMs relying on many interdependent variables, the adaptation of existing memories and the design of future GCeDRAMs prove to be highly complex tasks. Thus, the first modeling tool that estimates timing, memory availability, bandwidth, and area of GC-eDRAMs for a fast exploration of their design space is proposed in this thesis

    Evolutionary design of digital VLSI hardware

    Get PDF
    corecore