641 research outputs found

    To Develop and Implement Low Power, High Speed VLSI for Processing Signals using Multirate Techniques

    Multirate techniques are necessary for systems with different input and output sampling rates. Recent advances in mobile computing and communication applications demand low-power, high-speed VLSI DSP systems [4]. This paper presents multirate modules used for filtering in wireless communication systems. Many architectures have been developed for low-complexity, bit-parallel Multiple Constant Multiplication (MCM) operations, which dominate the complexity of DSP systems. However, present approaches are either too costly or not efficient enough. MCM combined with digit-serial adders offers an alternative low-complexity design, since digit-serial architectures occupy less area and are independent of the data word length [1][10]. MCM is an efficient way to reduce the number of additions and subtractions in polyphase filter implementations. The multirate design methodology is systematic and applicable to many problems. This paper gives attention to MCM and digit-serial architectures with shift-and-add techniques, which offer lower-complexity operations. It also covers multirate signal processing modules under voltage and technology scaling. Reducing power consumption is important for VLSI systems and has become one of the most critical design parameters. Transistor-level multirate modules, designed full-custom with different circuit topologies and optimization levels, are simulated on the Cadence platform. The modules use AMI 0.6 µm, TSMC 0.35 µm, and TSMC 0.25 µm technologies under different voltage scaling. The methodology provides a systematic way to derive circuit techniques for high-speed operation at a low supply voltage. Multirate polyphase interpolators and decimators are also designed and optimized at the architectural level in order to analyze power consumption, area, and speed. DOI: 10.17762/ijritcc2321-8169.150314
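As a software illustration of the polyphase decimation mentioned in the abstract above, the following minimal Python sketch checks that a polyphase decimator gives the same output as full-rate filtering followed by downsampling. The function names, the example taps, and the decimation factor are assumptions made for illustration; they are not taken from the paper, which targets transistor-level hardware.

```python
def decimate_direct(x, h, m):
    """Filter x with FIR taps h at the full rate, then keep every m-th sample."""
    full = []
    for n in range(len(x)):
        acc = 0
        for k, tap in enumerate(h):
            if n - k >= 0:
                acc += tap * x[n - k]
        full.append(acc)
    return full[::m]


def decimate_polyphase(x, h, m):
    """Same result, but every multiplication happens at the low output rate."""
    n_out = (len(x) + m - 1) // m
    y = [0] * n_out
    for p in range(m):
        sub_filter = h[p::m]  # phase p: taps h[p], h[p + m], h[p + 2m], ...
        for n in range(n_out):
            for j, tap in enumerate(sub_filter):
                idx = (n - j) * m - p  # input sample seen by this tap
                if 0 <= idx < len(x):
                    y[n] += tap * x[idx]
    return y


if __name__ == "__main__":
    samples = list(range(1, 17))  # hypothetical input signal
    taps = [1, 2, 3, 4, 5]        # hypothetical FIR coefficients
    print(decimate_direct(samples, taps, 3) == decimate_polyphase(samples, taps, 3))
```

The direct form computes every full-rate output and discards most of them, while the polyphase form never computes a discarded sample, which is the source of the savings in additions and multiplications.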

    Evolutionary design of digital VLSI hardware

    A high-speed integrated circuit with applications to RSA Cryptography

    Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS). The rapid growth in the use of computers and networks in government, commercial and private communications systems has led to an increasing need for these systems to be secure against unauthorised access and eavesdropping. To this end, modern computer security systems employ public-key ciphers, of which probably the most well known is the RSA ciphersystem, to provide both secrecy and authentication facilities. The basic RSA cryptographic operation is a modular exponentiation where the modulus and exponent are integers typically greater than 500 bits long. Therefore, to obtain reasonable encryption rates using the RSA cipher requires that it be implemented in hardware. This thesis presents the design of a high-performance VLSI device, called the WHiSpER chip, that can perform the modular exponentiations required by the RSA cryptosystem for moduli and exponents up to 506 bits long. The design has an expected throughput in excess of 64 kbit/s, making it attractive for use both as a general RSA processor within the security function provider of a security system, and for direct use on moderate-speed public communication networks such as ISDN. The thesis investigates the low-level techniques used for implementing high-speed arithmetic hardware in general, and reviews the methods used by designers of existing modular multiplication/exponentiation circuits with respect to circuit speed and efficiency. A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic, together with an efficient multiplier architecture, are proposed that remove the speed bottleneck of previous designs. Finally, the implementation of the new algorithm and architecture within the WHiSpER chip is detailed, along with a discussion of the application of the chip to ciphering and key generation.
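The abstract above names modular exponentiation as the basic RSA operation. A minimal Python sketch of the textbook left-to-right square-and-multiply method shows why the cost is dominated by modular multiplications: one squaring per exponent bit plus one extra multiplication per set bit. This is not the algorithm used in the WHiSpER chip; the function name `mod_exp` and its parameters are assumptions for illustration.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute base**exponent mod modulus by binary square-and-multiply."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus          # handles modulus == 1
    base %= modulus
    for bit in bin(exponent)[2:]:  # scan exponent bits, most significant first
        result = (result * result) % modulus      # always square
        if bit == "1":
            result = (result * base) % modulus    # multiply on a set bit
    return result


if __name__ == "__main__":
    # Cross-check against Python's built-in three-argument pow.
    print(mod_exp(4, 13, 497) == pow(4, 13, 497))
```

For a 506-bit exponent this loop performs 506 squarings and up to 506 further multiplications, each on operands of the same width, which is why a fast modular multiplier is the critical hardware block.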

    Design of High-Performance Computing Units for On-Device Convolutional Neural Network Accelerators

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. 김태환.
    Optimizing computing units for an on-device neural network accelerator can reduce energy and latency, increase throughput, and might enable unprecedented new applications. This dissertation studies two specific optimization opportunities for the multiply-accumulate (MAC) unit of an on-device neural network accelerator, both stemming from precision quantization methodology. First, we propose an enhanced MAC processing unit structure that efficiently processes mixed-precision models in which the majority of operations are in low precision. Precisely, the two essential contributions are: (1) a MAC unit structure supporting two precision modes is designed to fully utilize its computation logic when processing lower-precision data, which brings more computation efficiency for mixed-precision models whose major operations are in lower precision; (2) for a set of input CNNs, we formulate the exploration of the size of a single internal multiplier in the MAC unit to derive an economical instance of the MAC unit structure, in terms of computation and energy cost, across all network layers. Experimental results with two well-known CNN models, AlexNet and VGG-16, and two experimental precision settings show that the proposed units reduce computational cost per multiplication by 4.68% to 30.3% and save energy cost by 43.3% on average over conventional units. Second, we propose an acceleration technique for processing multiplication operations using stochastic computing (SC). MUX-FSM based SC, which employs a MUX controlled by an FSM to generate a bit sequence of a binary number to count up for a MAC operation, considerably reduces the hardware cost of implementing MAC operations compared with traditional stochastic number generator (SNG) based SC. Nevertheless, the existing MUX-FSM based SC still does not meet the multiplication processing time required for wide adoption of on-device neural networks in practice, even though it offers a very economical hardware implementation. Conventional enhancements also have limitations, such as sub-maximal cycle reduction and parameter conversion cost. This work proposes a solution to the problem of further speeding up the conventional MUX-FSM based SC. Precisely, we analyze the bit counting pattern produced by the MUX-FSM and replace the redundant counting with shift operations, which significantly reduces the length of the required bit sequence and theoretically speeds up the worst-case multiplication processing time by 2X or more. Experiments show that the enhanced SC technique shortens the average processing time by 38.8% over the conventional MUX-FSM based SC.

    Contents:
    1 Introduction
      1.1 Neural network accelerator and its optimizations
      1.2 Necessity of optimizing computational block of neural network accelerator
      1.3 Contributions of This Dissertation
    2 MAC Design Considering Mixed Precision
      2.1 Motivation
      2.2 Internal Multiplier Size Determination
      2.3 Proposed hardware structure
      2.4 Experiments
        2.4.1 Implementation of Reference MAC units
        2.4.2 Area, Wirelength, Power, Energy, and Performance of MAC units for AlexNet
        2.4.3 Area, Wirelength, Power, Energy, and Performance of MAC units for VGG-16
        2.4.4 Power Saving by Clock Gating
    3 Speeding up MUX-FSM based Stochastic Computing Unit Design
      3.1 Motivations
        3.1.1 MUX-FSM based SC and previous enhancements
      3.2 The Proposed MUX-FSM based SC
        3.2.1 Refined Algorithm for Stochastic Computing
      3.3 The Supporting Hardware Architecture
        3.3.1 Bit Counter with shift operation
        3.3.2 Controller
        3.3.3 Combining with preceding architectures
      3.4 Experiments
        3.4.1 Experiments Setup
        3.4.2 Generating input bit selection pattern
        3.4.3 Performance Comparison
        3.4.4 Hardware Area and Energy Comparison
    4 Conclusions
      4.1 MAC Design Considering Mixed Precision
      4.2 Speeding up MUX-FSM based Stochastic Computing Unit Design
    Abstract (In Korean)

    Versatile Montgomery Multiplier Architectures

    Several algorithms for Public Key Cryptography (PKC), such as RSA, Diffie-Hellman, and Elliptic Curve Cryptography, require modular multiplication of very large operands (sizes from 160 to 4096 bits) as their core arithmetic operation. To perform this operation reasonably fast, general purpose processors are not always the best choice. This is why specialized hardware, in the form of cryptographic co-processors, becomes more attractive. Based upon the analysis of recent publications on hardware design for modular multiplication, this M.S. thesis presents a new architecture that is scalable with respect to word size and pipelining depth. To our knowledge, this is the first time a word-based algorithm for Montgomery's method is realized using high-radix bit-parallel multipliers working with two different types of finite fields (unified architecture for GF(p) and GF(2^n)). Previous approaches have relied mostly on bit-serial multiplication in combination with massive pipelining, or Radix-8 multiplication with the limitation to a single type of finite field. Our approach is centered around the notion that the optimal delay in bit-parallel multipliers grows with logarithmic complexity with respect to the operand size n, O(log_{3/2} n), while the delay of bit-serial implementations grows with linear complexity O(n). Our design has been implemented in VHDL, simulated and synthesized in 0.5 μm CMOS technology. The synthesized net list has been verified in back-annotated timing simulations and analyzed in terms of performance and area consumption.
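For readers unfamiliar with Montgomery's method, the following minimal Python sketch shows the basic reduction step for GF(p) only, operating on whole integers. It is not the word-based, high-radix architecture of the thesis and does not cover GF(2^n); the function names (`montgomery_setup`, `mont_mul`, `mod_mul`) and the choice R = 2^k with k equal to the bit length of the modulus are assumptions for illustration. The modular inverse uses `pow(n, -1, r)`, which requires Python 3.8 or later.

```python
def montgomery_setup(n: int, k: int):
    """Return (R, n_prime) with R = 2**k and n_prime = -n^(-1) mod R; n must be odd."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def mont_mul(a_bar: int, b_bar: int, n: int, k: int, n_prime: int) -> int:
    """Return a_bar * b_bar * R^(-1) mod n using only shifts, masks and multiplies."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask  # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> k               # exact division by R
    return u - n if u >= n else u      # single conditional subtraction


def mod_mul(a: int, b: int, n: int) -> int:
    """Compute a*b mod n by converting to and from the Montgomery domain."""
    k = n.bit_length()                 # guarantees R = 2**k > n
    r, n_prime = montgomery_setup(n, k)
    a_bar = (a * r) % n                # to Montgomery form
    b_bar = (b * r) % n
    c_bar = mont_mul(a_bar, b_bar, n, k, n_prime)
    return mont_mul(c_bar, 1, n, k, n_prime)  # back to ordinary form


if __name__ == "__main__":
    p = 1000000007
    print(mod_mul(123456789, 987654321, p) == (123456789 * 987654321) % p)
```

The point of the method is that the reduction replaces a trial division by the modulus with a multiplication and a shift by k bits, which maps well onto multiplier-based hardware. The conversions in `mod_mul` are only worthwhile when many multiplications are chained, as in a modular exponentiation.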

    Multiplierless CSD techniques for high performance FPGA implementation of digital filters.

    I leverage FastCSD to develop a new, high performance iterative multiplierless structure based on a novel real-time CSD recoding, so that more zero partial products are introduced. Up to 66.7% zero partial products occur compared to 50% in the traditional modified Booth's recoding. Also, this structure reduces the non-zero partial products to a minimum. As a result, the number of arithmetic operations in the carry-save structure is reduced. Thus, an overall speed-up, as well as low power consumption, can be achieved. Furthermore, because the proposed structure involves real-time CSD recoding and does not require a fixed value for the multiplier input to be known a priori, the proposed multiplier can be applied to implement digital filters with non-fixed filter coefficients, such as adaptive filters. My work is based on a dramatic new technique for converting between 2's complement and CSD number systems, and results in high-performance structures that are particularly effective for implementing adaptive systems in reconfigurable logic. My research focus is on two key ideas for improving DSP performance: (1) Develop new high performance, efficient shift-add techniques ("multiplierless") to implement the multiply-add operations without the need for a traditional multiplier structure. (2) There is a growing trend toward design prototyping and even production in FPGAs as opposed to dedicated DSP processors or ASICs; leverage this trend synergistically with the new multiplierless structures to improve performance. Implementation of digital signal processing (DSP) algorithms in hardware, such as field programmable gate arrays (FPGAs), requires a large number of multipliers. Fast, low area multiply-adds have become critical in modern commercial and military DSP applications. In many contemporary real-time DSP and multimedia applications, system performance is severely impacted by the limitations of currently available speed, energy efficiency, and area requirement of an onboard silicon multiplier. I also introduce a new multi-input Canonical Signed Digit (CSD) multiplier unit, which requires fewer shift/add/subtract operations and reduced CSD number conversion overhead compared to existing techniques. This results in reduced power consumption and area requirements in the hardware implementation of DSP algorithms. Furthermore, because all the products are produced simultaneously, the multiplication speed and thus the throughput are improved. The multi-input multiplier unit is applied to implement digital filters with non-fixed filter coefficients, such as adaptive filters. The implementation cost of these digital filters can be further reduced by limiting the wordlength of the input signal with little or no sacrifice to the filter performance, which is confirmed by my simulation results. The proposed multiplier unit can also be applied to other DSP algorithms, such as digital filter banks or matrix and vector multiplications. Finally, the tradeoff between filter order and coefficient length in the design and implementation of high-performance filters in Field Programmable Gate Arrays (FPGAs) is discussed. Non-minimum order FIR filters are designed for implementation using Canonical Signed Digit (CSD) multiplierless implementation techniques. By increasing the filter order, the length of the coefficients can be decreased without reducing the filter performance. Thus, an overall hardware savings can be achieved. Adaptive system implementations require real-time conversion of coefficients to Canonical Signed Digit (CSD) or similar representations to benefit from multiplierless techniques for implementing filters. Multiplierless approaches are used to reduce the hardware and increase the throughput. This dissertation introduces the first non-iterative hardware algorithm to convert 2's complement numbers to their CSD representations (FastCSD) using a fixed number of shift and logic operations. As a result, the power consumption and area requirements for hardware implementation of DSP algorithms in which the coefficients are not known a priori can be greatly reduced. Because all CSD digits are produced simultaneously, the conversion speed and thus the throughput are improved when compared to overlap-and-scan techniques such as Booth's recoding.
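To make the CSD idea concrete, here is a minimal Python sketch that recodes an integer into canonical signed digits and then multiplies using only shifts, additions, and subtractions. It uses the standard digit-by-digit (iterative) recoding, not the non-iterative FastCSD algorithm described in the dissertation; the function names `to_csd` and `csd_multiply` are assumptions for illustration.

```python
from typing import List


def to_csd(value: int) -> List[int]:
    """Return CSD digits of value (each -1, 0 or +1), least significant first."""
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives digit +1; value mod 4 == 3 gives digit -1,
            # which forces the next digit to be zero (no adjacent non-zeros).
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x: int, coefficient: int) -> int:
    """Multiply x by coefficient using one shift-add or shift-subtract per non-zero digit."""
    product = 0
    for shift, digit in enumerate(to_csd(coefficient)):
        if digit == 1:
            product += x << shift
        elif digit == -1:
            product -= x << shift
    return product


if __name__ == "__main__":
    print(to_csd(7))              # 7 = 8 - 1, digits listed LSB first
    print(csd_multiply(13, 119))  # compare with 13 * 119
```

For example, the binary pattern of 7 has three non-zero bits, while its CSD form 8 - 1 has two, so a shift-add multiplier needs one operation fewer. Zero digits correspond to the zero partial products discussed in the abstract.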

    ACCELERATION OF SPARSE MATRIX MULTIPLICATION USING BIT-SERIAL ARITHMETIC

    Machine Learning inference requires the multiplication of large, sparse matrices. We argue that direct spatial implementation of these fixed matrices minimizes the work performed in the computation, and allows for significant reduction in latency and power through constant propagation and logic minimization. Bit-serial arithmetic enables massive static matrices to be implemented. We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization. We have implemented these matrices on a large FPGA and provide a cost model that is simple and extensible. These FPGA implementations, on average, reduce latency by 50x up to 86x versus GPU libraries. Comparing against a recent sparse DNN accelerator, we measure a 4.1x to 47x reduction in latency depending on matrix dimension and sparsity. Throughput of the FPGA solution is also competitive for a wide range of matrix dimensions and batch sizes. Finally, we discuss ways these techniques could be deployed in ASICs, making them applicable for dynamic sparse matrix computations. M.S.

    HIGH-SPEED CO-PROCESSORS BASED ON REDUNDANT NUMBER SYSTEMS

    There is a growing demand for high-speed arithmetic co-processors for use in applications with computationally intensive tasks. For instance, Fast Fourier Transform (FFT) co-processors are used in real-time multimedia services and financial applications use decimal co-processors to perform large amounts of decimal computations. Using redundant number systems to eliminate word-wide carry propagation within interim operations is a well-known technique to increase the speed of arithmetic hardware units. Redundant number systems are mostly useful in applications where many consecutive arithmetic operations are performed prior to the final result, making it advantageous for arithmetic co-processors. This thesis discusses the implementation of two popular arithmetic co-processors based on redundant number systems: namely, the binary FFT co-processor and the decimal arithmetic co-processor. FFT co-processors consist of several consecutive multipliers and adders over complex numbers. FFT architectures are implemented based on fixed-point and floating-point arithmetic. The main advantage of floating-point over fixed-point arithmetic is the wide dynamic range it introduces. Moreover, it avoids numerical issues such as scaling and overflow/underflow concerns at the expense of higher cost. Furthermore, floating-point implementation allows for an FFT co-processor to collaborate with general purpose processors. This offloads computationally intensive tasks from the primary processor. The first part of this thesis, which is devoted to FFT co-processors, proposes a new FFT architecture that uses a new Binary-Signed Digit (BSD) carry-limited adder, a new floating-point BSD multiplier and a new floating-point BSD three-operand adder. Finally, a new unit labeled as Fused-Dot-Product-Add (FDPA) is designed to compute AB+CD+E over floating-point BSD operands. 
The second part of the thesis discusses decimal arithmetic operations implemented in hardware using redundant number systems. These operations are widely used in decimal floating-point co-processors. A new signed-digit decimal adder is proposed, along with a sequential decimal multiplier that uses redundant number systems to increase its operating frequency. New redundant decimal division and square-root units are also proposed. The architectures proposed in this thesis were all implemented in a hardware description language (Verilog) and synthesized using Synopsys Design Compiler. The evaluation results show the speed improvement of the new arithmetic units over previous pertinent works: the FFT and decimal co-processors designed in this thesis run at least 10% faster than those of previous works. These architectures are meant to fulfill the demand for the high-speed co-processors required in various applications such as multimedia services and financial computations.