364 research outputs found

Design of ALU and Cache Memory for an 8 bit ALU

Author: Chandran Pravin chander
Publication venue: Clemson University Libraries
Publication date: 01/12/2007

The design of an ALU and a cache memory for use in a high performance processor was examined in this thesis. Advanced architectures employing increased parallelism were analyzed to minimize the number of execution cycles needed for 8 bit integer arithmetic operations. In addition to the arithmetic unit, an optimized SRAM memory cell was designed to be used as cache memory and as a fast look-up table. The ALU consists of stand alone units for bit parallel computation of basic integer arithmetic operations. Addition and subtraction were performed using Kogge Stone parallel prefix hardware operating at 330MHz. A high performance multiplier was built using a Radix 4 Modified Booth Encoder (MBE) and a Wallace Tree summation array. The multiplier requires a single clock cycle for 8 bit integer multiplication and operates at a maximum frequency of 100MHz. Multiplicative division hardware was built for executing both integer division and square root. The division hardware computes 8-bit division and square root in 4 clock cycles. The multiplier forms the basic building block of all these functional units, making a high level of resource sharing feasible with this architecture. The optimal operating frequency for the arithmetic unit is 70MHz. A 6T CMOS SRAM cell measuring 90 µm² was designed using minimum size transistors. The layout allows for horizontal overlap, resulting in an effective area of 76 µm² for an 8x8 array. By substituting the equivalent bit line capacitance of a P4 L1 cache, the memory was simulated to have a read time of 3.27ns. An optimized set of test vectors was identified to enable high fault coverage without the need for any additional test circuitry. Sixteen test cases were identified that would toggle all the nodes and provide all possible inputs to the sub units of the multiplier. A correlation based semi automatic method was investigated to facilitate test case identification for large multipliers. 
This method of testability eliminates the performance and area overhead associated with conventional testability hardware. A bottom up design methodology was employed for the design. The performance and area metrics are presented along with estimated power consumption. A set of Monte Carlo analyses was carried out to ensure the dependability of the design under process variations as well as fluctuations in operating conditions. The arithmetic unit was found to require a total die area of approximately 2 mm² in a 0.35 micron process

Clemson University: TigerPrints
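
The abstract above names Kogge Stone parallel prefix hardware for addition and subtraction. As a rough illustration only, the following Python sketch models the generate/propagate prefix recurrence at the bit level. It is not taken from the thesis; the function name `kogge_stone_add` and the default width of 8 are assumptions chosen for the example.

```python
def kogge_stone_add(a: int, b: int, width: int = 8) -> tuple[int, int]:
    """Add two unsigned `width`-bit integers with a Kogge-Stone prefix network.

    Returns (sum modulo 2**width, carry_out). Illustrative model only.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    generate = a & b        # bit i creates a carry on its own
    propagate = a ^ b       # bit i passes an incoming carry along
    half_sum = propagate    # keep the per-bit sum before carries are applied

    dist = 1
    while dist < width:
        # Prefix operator: merge each group with the group `dist` bits below it.
        generate = (generate | (propagate & (generate << dist))) & mask
        propagate = (propagate & (propagate << dist)) & mask
        dist <<= 1

    # After log2(width) stages, generate[i] is the carry out of bit i.
    carries_in = (generate << 1) & mask
    carry_out = (generate >> (width - 1)) & 1
    return half_sum ^ carries_in, carry_out
```

For 8-bit operands the loop runs three times (distances 1, 2 and 4), which reflects the logarithmic depth that makes this adder family attractive for high clock rates.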

IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer Divider

Author: Williams Ryan D
Publication venue: RIT Scholar Works
Publication date: 01/01/2007

Together the arithmetic logic unit (ALU) and floating-point unit (FPU) perform all of the mathematical and logic operations of computer processors. Because they are used so prominently, they fall in the critical path of the central processing unit - often becoming the bottleneck, or limiting factor for performance. As such, the design of a high-speed ALU and FPU is vital to creating a processor capable of performing up to the demanding standards of today's computer users. In this paper, both a 64-bit ALU and a 64-bit FPU are designed based on the reduced instruction set computer architecture. The ALU performs the four basic mathematical operations - addition, subtraction, multiplication and division - in both unsigned and two's complement format, basic logic operations and shifting. The division algorithm is a novel approach, using a comparison multiples based SRT divider to create a variable latency integer divider. The floating-point unit performs the double-precision floating-point operations add, subtract, multiply and divide, in accordance with the IEEE 754 standard for number representation and rounding. The ALU and FPU were implemented in VHDL, simulated in ModelSim, and constrained and synthesized using Synopsys Design Compiler (2006.06). They were synthesized using TSMC 0.13 µm CMOS technology. The timing, power and area synthesis results were recorded, and, where applicable, compared to those of the corresponding DesignWare components. The ALU synthesis reported an area of 122,215 gates, a power of 384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. The FPU synthesis reported an area of 84,440 gates, a delay of 2.82 ns and an operating frequency of 355 MHz. It has a maximum dynamic power of 153.9 mW

RIT Scholar Works

Low Latency Prefix Accumulation Driven Compound MAC Unit for Efficient FIR Filter Implementation

Author: Giriprasad M N
Hemantha G Reddy
Varadarajan S
Publication venue: NISCAIR-CSIR, India
Publication date: 01/02/2020

This article presents a hierarchical single compound adder-based MAC with assertion-based error correction for speculation variations in the prefix addition for FIR filter design. VLSI implementation results for approximation in prefix adders show significant delay and complexity reductions, all at the cost of added latency when speculation fails during carry propagation, which is the main reason preventing the use of speculation in parallel-prefix adders in DSP applications. The speculative adder, which is based on the Han-Carlson parallel prefix adder structure, accomplishes a better reduction in latency. The work introduces a structured and efficient shift-add technique and explores latency reduction by incorporating approximation in addition. The improvements in latency and performance achieved by the proposed MAC unit are shown through synthesis on FPGA hardware. Results show that the proposed method outpaces previously proposed MAC designs that use multiplication methods for attaining high speed

NOPR
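
For orientation, the multiply-accumulate (MAC) recurrence that an FIR filter relies on can be sketched in a few lines of Python. This is a plain direct-form model and makes no attempt to reproduce the speculative Han-Carlson adder or the shift-add technique described in the abstract above; the function name `fir_mac` is an assumption made for the example.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR filter: y[n] = sum over k of coeffs[k] * x[n - k].

    Samples before the start of the input are treated as zero.
    """
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in samples:
        # Shift the newest sample into the delay line.
        delay_line = [x] + delay_line[:-1]
        acc = 0
        for c, d in zip(coeffs, delay_line):
            acc += c * d  # one multiply-accumulate per filter tap
        outputs.append(acc)
    return outputs
```

Feeding a unit impulse returns the coefficients themselves, for example `fir_mac([1, 0, 0, 0], [3, 2, 1])` gives `[3, 2, 1, 0]`. The inner accumulation is the step whose adder latency a MAC design of this kind aims to reduce.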

ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

Author: Liang Getao
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2015

Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry standard precisions embody accuracy and performance compromises for general purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple precision ALUs focused on predefined, static precisions. Little previous work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic precision ALU architectures for both fixed-point and floating-point to enable better performance, energy efficiency, and numeric accuracy. These new architectures enable dynamically defined precision, including support for vectorization. The new architectures also prevent performance and energy loss due to applying unnecessarily high precision on computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks, with this dissertation including demonstration implementations for floating point adder, multiply, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic precision ALU speed is nearly the same as traditional ALU approaches, although the dynamic precision ALU is nearly twice as large

University of Tennessee, Knoxville: Trace

Efficient Computation and FPGA implementation of Fully Homomorphic Encryption with Cloud Computing Significance

Author: Zeng Qiang
Publication venue: University of Windsor Leddy Library
Publication date: 20/12/2018

Homomorphic Encryption provides a unique security solution for cloud computing. It ensures not only that data in the cloud remain confidential but also that data processing by the cloud server does not compromise data privacy. The Fully Homomorphic Encryption (FHE) scheme proposed by Lopez-Alt, Tromer, and Vaikuntanathan (LTV), also known as the NTRU (Nth degree truncated polynomial ring) based method, is considered one of the most important FHE methods suitable for practical implementation. In this thesis, an efficient algorithm and architecture for LTV Fully Homomorphic Encryption is proposed. The conventional linear feedback shift register (LFSR) structure is expanded and modified to perform the truncated polynomial ring multiplication of the LTV scheme in parallel. A novel and efficient modular multiplier, modular adder and modular subtractor are proposed to support high speed processing of LFSR operations. In addition, a family of special moduli is selected for high speed computation of modular operations. Although the area complexity remains O(Nn^2), with no advantage at the circuit level, the proposed architecture effectively reduces the time complexity from O(N log N) to linear time, O(N), compared to the best existing works. An FPGA implementation of the proposed architecture for LTV FHE is achieved and demonstrated. An elaborate comparison of the existing methods and the proposed work is presented, which shows the proposed work gains significant speed up over existing works

Scholarship at UWindsor
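
To make the phrase "truncated polynomial ring multiplication" in the abstract above concrete, here is a schoolbook Python sketch. It is a software reference model only and does not reflect the LFSR-based parallel architecture of the thesis. The function name `ring_multiply`, the coefficient-list representation, and the choice of reduction polynomial (x^N - 1 by default, x^N + 1 when `negacyclic=True`) are assumptions, since the abstract does not state them.

```python
def ring_multiply(f, g, q, negacyclic=False):
    """Multiply two polynomials of length N in Z_q[x] / (x^N - 1).

    Coefficients are listed from the constant term upward.
    With negacyclic=True the reduction polynomial is x^N + 1 instead.
    """
    n = len(f)
    if len(g) != n:
        raise ValueError("both polynomials must have N coefficients")
    result = [0] * n
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            k = i + j
            term = fi * gj
            if k >= n:
                # Wrap around: x^N is replaced by +1 (cyclic) or -1 (negacyclic).
                k -= n
                if negacyclic:
                    term = -term
            result[k] = (result[k] + term) % q
    return result
```

This direct form takes on the order of N^2 coefficient multiplications in sequence, which is the kind of cost that hardware parallelism or transform-based methods try to reduce.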

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

Author: Kannan Balaji
Publication venue: Clemson University Libraries
Publication date: 01/12/2009

A 16 bit floating point (FP) Arithmetic Logic Unit (ALU) was designed and implemented in 0.35µm CMOS technology. Typical uses of the 16 bit FP ALU include graphics processors and embedded multimedia applications. The ALUs of modern microprocessors use a fused multiply add (FMA) design technique. An advantage of the FMA is that it removes the need for a comparator, which is required for a normal FP adder. The FMA consists of a multiplier, shifters, adders and a rounding circuit. A fast multiplier based on the Wallace tree configuration was designed. The number of partial products was greatly reduced by the use of the modified Booth encoder. The Wallace tree was chosen to reduce the number of reduction layers of partial products. The multiplier also involved the design of a pass transistor based 4:2 compressor. The average delay of the pass transistor based compressor was 55ps and was found to be 7 times faster than the full adder based 4:2 compressor. The shifters consist of separate left and right shifters using multiplexers. The shift amount is calculated using the exponents of the three operands. The addition operation is implemented using a carry skip adder (CSK). The average delay of the CSK was 1.05ns and was slower than the carry look ahead adder by about 400ps. The advantages of the CSK are reduced power, gate count and area when compared to a similarly sized carry look ahead adder. The adder computes the addition of the multiplier result and the shifted value of the addend. In most modern computers, division is performed in software, thereby eliminating the need for a separate hardware unit. The FMA hardware unit was utilized to perform FP division. The FP divider uses the Newton-Raphson algorithm to solve division by iteration. The initial approximated value with five bit accuracy was assumed to be pre-stored in cache memory and a separate clock cycle for cache read was assumed before the start of the FP division operation. 
In order to significantly reduce the area of the design, only one multiplier was used. The round-to-nearest technique was implemented using an 11 bit variable CSK adder. This is the best rounding technique when compared to other rounding techniques. In both the FMA and division, rounding was performed after the computation of the final result during the last clock cycle of operation. Testability analysis was performed for the multiplier, which is the most complex and critical part of the FP ALU. The specific aim of testability was to ensure the correct operation of the multiplier and thus guarantee the correctness of the FMA circuit at the layout stage. The multiplier's output was tested by identifying the minimal number of input vectors which toggle the inputs of the 4:2 compressors of the multiplier. The test vectors were identified in a semi automated manner using the Perl scripting language. The multiplier was tested with a test set of thirty one vectors. The fault coverage of the multiplier was found to be 90.09%. The layout was implemented using the Mentor Graphics IC Station CAD tool and resulted in a chip area of 1.96 mm². The specifications for basic arithmetic operations were met successfully. The FP division operation was completed within six clock cycles. The other arithmetic operations, namely FMA, FP addition, FP subtraction and FP multiplication, were completed within three clock cycles

Clemson University: TigerPrints
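
The abstract above describes division by Newton-Raphson iteration starting from a pre-stored approximation. A minimal Python sketch of the reciprocal iteration follows; it uses ordinary floating-point arithmetic rather than the thesis's half precision datapath, and the function name `newton_raphson_divide` and the seed argument `x0` are assumptions for illustration.

```python
def newton_raphson_divide(n: float, d: float, x0: float, iterations: int = 2) -> float:
    """Approximate n / d by refining a reciprocal estimate x0 of 1 / d.

    Each step computes x <- x * (2 - d * x), which roughly doubles the
    number of correct bits in x when x0 is already close to 1 / d.
    """
    x = x0
    for _ in range(iterations):
        # Only multiplies and a subtract, so it maps onto multiply-add hardware.
        x = x * (2.0 - d * x)
    return n * x
```

For example, `newton_raphson_divide(3.0, 1.5, 0.65625)` starts from a seed that is off by about 1.6% and lands within one part in a million of 2.0 after two iterations. Because accuracy roughly doubles per step, a seed with about five correct bits would need about two steps to cover an 11 bit significand, though the exact cycle budget depends on details not given in the abstract.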

Internet of Things Based Reconfigurable SIMD Processor for High-Speed End Devices in FPGA

Author: Muthurathinam Kavitha
Ponniah Ramadevi
Saminathan Subathradevi
Somasundaram Karthikeyan
Publication venue: Faculty of Mechanical Engineering in Slavonski Brod; Faculty of Electrical Engineering, Computer Science and Information Technology Osijek; Faculty of Civil Engineering in Osijek
Publication date: 01/01/2023

This research article proposes a reconfigurable Single Instruction Multi Data (SIMD) processor design to speed up the accelerated computing task in IoT operations. Single Instruction Multi Data models leverage the parallel real source to speed up computing accelerated tasks. It proposes the utilization of reconfigurable Kogge Stone-dependent hybrid adder structures, now referred to as KS-CPA, in which reconfiguration occurs during the addition operation. The Least Significant Bits (LSB) are processed using a carry propagate adder, while the Most Significant Bits (MSB) are computed using the Kogge Stone adder. Depending on the data width and device-accessible energy resources, the hybrid configuration of the adder offers 4-bit, 8-bit, and 16-bit addition. The adder form is identified by a shift in the configuration of its Carry Look-ahead and then by a Kogge Stone Adder (KSA). Throughout the activity, the KS-CLA crossbreed configuration is used to attain the fastest speed and low energy usage. The effectiveness of the proposed hybrid adder is evaluated by looking at the speed, energy, and area parameters, including a suitable area use during rapid applications in which both low delay and low power adders are required. Considering these, we are structuring an IoT processor that can be reconfigured to gain from SIMD. We have demonstrated that our hybrid adder-enhanced processor saves up to 13% energy and reduces latency by 27%. The proposed 16 and 32-bit adders will boost time, power, and Area Delay Product (ADP) by almost 18-24% and 13-19% respectively

Hrčak - Portal of scientific journals of Croatia

Null convention logic circuits for asynchronous computer architecture

Author: Kim M
Publication venue: RMIT University

For most of its history, computer architecture has been able to benefit from a rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then to nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average case vs. worst case performance, robustness in the face of process and operating point variability and the ready availability of high performance, fine grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL based asynchronous RISC-V CPU and analyses the problems with applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. 
However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty in controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but there are some distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combinational logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high performance and area-efficient. It was found that the performance of 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32bit x 32bit signed Baugh-Wooley multiplier with Wallace-Tree Carry Save Adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept as it is faster and has fewer pipeline stages compared to the normal array multiplier using Ripple-Carry adders (RCA). It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic for which the delay time depends largely on the number of bits involved in the completion network. 
The average-case performance of ripple-carry adders was explored using random input vectors and it was observed that it offers little advantage on the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles becomes increasingly more important as word size grows. Finally, this research has resulted in the development of the first 32-Bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry standard 32-Bit synchronous RISC-V cores (PicoRV32 and Rocket) that are already fabricated and used in industry. While the NCL implementation is larger than both commercial cores it has similar performance and lower power compared to the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC has achieved similar level of throughput and 43% better power and 34% better energy compared to one of the synchronous cores with the same benchmark test and test condition such as input supply voltage. However, it was shown that area is the biggest drawback for NCL CPU design. The core is roughly 2.5× larger than synchronous designs. On the other hand its area is still 2.9× smaller than previous designs using UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library

RMIT Research Repository
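
The dual-rail topology and completion logic mentioned in the abstract above can be illustrated with a small behavioral model in Python. This sketch follows the generally described behavior of NCL signalling (a NULL state, two DATA states, and threshold gates with hysteresis); it is not derived from the Redback design, and the names `ThGate`, `encode` and `is_data` are hypothetical.

```python
NULL = (0, 0)  # neither rail asserted: no data present yet


def encode(bit: int) -> tuple[int, int]:
    """Dual-rail encode one bit as (rail0, rail1): DATA0 = (1, 0), DATA1 = (0, 1)."""
    return (1, 0) if bit == 0 else (0, 1)


def is_data(signal: tuple[int, int]) -> bool:
    """A signal carries data when exactly one of its two rails is asserted."""
    return signal in ((1, 0), (0, 1))


class ThGate:
    """Threshold gate with hysteresis, in the style of an NCL THmn gate.

    The output rises once at least `threshold` inputs are asserted and
    falls only after every input has returned to 0.
    """

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.output = 0

    def update(self, *inputs: int) -> int:
        asserted = sum(inputs)
        if asserted >= self.threshold:
            self.output = 1
        elif asserted == 0:
            self.output = 0
        # Otherwise the gate holds its previous output (hysteresis).
        return self.output
```

Each logical bit occupies two wires under this encoding, which gives some intuition for the area penalty the abstract attributes to the dual-rail translation. The hold behavior of the gate is what lets a completion network wait until all bits have arrived before signalling the next stage.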

Booth Algorithm with Implementation of UART Module using FPGA

Author: A. Khan Iqbal
Kong Kenneth Wong Fatt
Mahamad Abd Kadir
Saon Sharifah
Sidek Azmi
Publication venue: Penerbit UTHM
Publication date: 13/02/2020

FPGAs give the user a high level of flexibility to rapidly construct and test hardware. They contain a large number of gates, which are used depending on the hardware to be implemented. This project aims to design a Booth multiplier using VHDL for signed multiplication on an FPGA for high speed operation, and to develop and implement the UART module required to enable two-way communication between the DE-2 board and a computer. A GUI was also designed using MATLAB for sending data and displaying the output of the process. The Booth multiplier was implemented for both signed and unsigned numbers, and the input and output of the multiplication were confirmed through simulation. The GUI was implemented and tested, and the UART module performed well in transmitting and receiving 8-bit data. Overall, the objective of the project was achieved, and the results of each component could be tested

Journals of Universiti Tun Hussein Onn Malaysia (UTHM)

International Journal of Integrated Engineering
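
As background for the Booth multiplier named in the abstract above, a radix-2 Booth multiplication can be modeled in Python as shown below. This is a software sketch under the assumption of two's complement operands that fit in `width` bits; the project itself used VHDL, and the function name `booth_multiply` is chosen only for this example.

```python
def booth_multiply(multiplicand: int, multiplier: int, width: int = 8) -> int:
    """Radix-2 Booth multiplication of two signed `width`-bit integers.

    The multiplier is scanned one bit at a time together with the bit
    below it: a 0 -> 1 transition subtracts the shifted multiplicand,
    a 1 -> 0 transition adds it, and runs of equal bits cost nothing.
    """
    mask = (1 << width) - 1
    bits = multiplier & mask  # two's complement bit pattern of the multiplier
    product = 0
    previous = 0              # implicit bit below the least significant bit
    for i in range(width):
        current = (bits >> i) & 1
        if current == 1 and previous == 0:
            product -= multiplicand << i
        elif current == 0 and previous == 1:
            product += multiplicand << i
        previous = current
    return product
```

For instance, `booth_multiply(-3, 5)` returns `-15`. Because the sign bit is handled by the same recoding rule as every other bit, no separate correction step is needed for negative operands.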

Radix-8 Booth Encoded Modulo Multiplier

Author: Anuradha K
Rupesh Kumar Penugonda
Thammishety Narasimharao
Publication date: 24/04/2020

Designing an integrated circuit that is efficient in terms of area, power and speed has become a challenging task in modern VLSI design. The encryption and decryption of PKC algorithms are performed by repeated modulo multiplications; these multiplications differ from those encountered in signal processing and general computing applications. The Residue Number System (RNS) has emerged as a promising alternative number representation for the design of faster and low power multipliers owing to its ability to distribute a long integer multiplication into several shorter and independent modulo multiplications. Multipliers are essential elements of digital signal processing operations such as filtering, convolution, transformations and inner products. RNS has also been successfully employed to design fault tolerant digital circuits. The modulo multiplier is usually the noncritical data path among all modulo multipliers in a high-DR RNS multiplier. This timing slack can be exploited to reduce the system area and power consumption without compromising the system performance. With this precept, a family of radix-8 Booth encoded modulo multipliers, with delay adaptable to the RNS multiplier delay, is proposed. In the proposed multiplier, the hard multiple is implemented using small word-length ripple carry adders (RCAs) operating in parallel. The carry-out bits from the adders are not propagated but treated as partial product bits to be accumulated in the CSA tree. The delay of the modulo multiplier can be directly controlled by the word-length of the RCAs to equal the delay of the critical modulo multiplier of the RNS. Combining the radix-8 Booth encoded modulo multiplier, the CSA and the prefix architecture achieves high speed and low power

CiteSeerX
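
The "hard multiple" in the abstract above arises because radix-8 Booth recoding produces digits from -4 to 4, and the multiple 3X cannot be formed by shifting alone. The Python sketch below shows the recoding and where 3X enters. It performs plain integer multiplication with no modulo reduction and computes 3X with an ordinary addition, so it does not model the paper's parallel RCA scheme; the function names are assumptions.

```python
def radix8_booth_digits(y: int, width: int) -> list[int]:
    """Recode a signed `width`-bit integer into radix-8 Booth digits.

    Digits lie in -4..4 and are returned least significant first.
    Each digit looks at three bits of y plus the top bit of the group below.
    """
    digits = []
    below = 0
    for i in range(0, width, 3):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        b2 = (y >> (i + 2)) & 1  # Python's >> sign-extends negative values
        digits.append(-4 * b2 + 2 * b1 + b0 + below)
        below = b2
    return digits


def radix8_booth_multiply(x: int, y: int, width: int = 8) -> int:
    """Multiply x by a signed `width`-bit y using radix-8 Booth digits."""
    hard_multiple = x + (x << 1)  # 3x needs a real addition, not just a shift
    multiples = {0: 0, 1: x, 2: x << 1, 3: hard_multiple, 4: x << 2}
    product = 0
    for i, digit in enumerate(radix8_booth_digits(y, width)):
        partial = multiples[abs(digit)]
        if digit < 0:
            partial = -partial
        product += partial << (3 * i)  # each digit has weight 8**i
    return product
```

An 8-bit multiplier yields three partial products here instead of the eight produced by bit-by-bit multiplication, at the price of generating 3X once per multiplicand.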