49 research outputs found

    Review of MAC Unit for Complex Numbers

    Get PDF
    In Digital Communication, Digital Signal Processor (DSP) is an important block which performs several digital signal processing applications such as Convolution, Discrete Cosine Transform (DCT), Fourier Transform, and so on. Every digital signal processor contains MAC unit. The MAC unit performs multiplication and accumulation processes repeatedly in order to perform continuous and complex operations in digital signal processing. MAC unit also contains clock and reset in order to control its operation. Many researchers have been focusing on the design of advance MAC unit architectures for complex numbers so as to achieve minimum resource utilization and delay

    Comparison of reconfigurable structures for flexible word-length multiplication

    Get PDF
    Binary multiplication continues to be one of the essential arithmetic operations in digital circuits. Even though field-programmable gate arrays (FPGAs) are becoming more and more powerful these days, the vendors cannot avoid implementing multiplications with high word-lengths using embedded blocks instead of configurable logic. But on the other hand, the circuit's efficiency decreases if the provided word-length of the hard-wired multipliers exceeds the precision requirements of the algorithm mapped into the FPGA. Thus it is beneficial to use multiplier blocks with configurable word-length, optimized for area, speed and power dissipation, e.g. regarding digital signal processing (DSP) applications. <br><br> In this contribution, we present different approaches and structures for the realization of a multiplication with variable precision and perform an objective comparison. This includes one approach based on a modified Baugh and Wooley algorithm and three structures using Booth's arithmetic operand recoding with different array structures. All modules have the option to compute signed two's complement fix-point numbers either as an individual computing unit or interconnected to a superior array. Therefore, a high throughput at low precision through parallelism, or a high precision through concatenation can be achieved

    Gate and Throughput Optimizations for NULL Convention Self-timed Digital Circuits

    Get PDF
    NULL Convention Logic (NCL) provides an asynchronous design methodology employing dual-rail signals, quad-rail signals, or other Mutually Exclusive Assertion Groups (MEAGs) to incorporate data and control information into one mixed path. In NCL, the control is inherently present with each datum, so there is no need for worse-case delay analysis and control path delay matching. This dissertation focuses on optimization methods for NCL circuits, specifically addressing three related architectural areas of NCL design. First, a design method for optimizing NCL circuits is developed. The method utilizes conventional Boolean minimization followed by table-driven gate substitutions. It is applied to design time and space optimal fundamental logic functions, a time and space optimal full adder, and time, transistor count, and power optimal up-counter circuits. The method is applicable when composing logic functions where each gate is a state-holding element; and can produce delay-insensitive circuits requiring less area and fewer gate delays than alternative gate-level approaches requiring full minterm generation. Second, a pipelining method for producing throughput optimal NCL systems is developed. A relationship between the number of gate delays per stage and the worse-case throughput for a pipeline as a whole is derived. The method then uses this relationship to minimize a pipeline\u27s worse-case throughput by partitioning the NCL combinational circuitry through the addition of asynchronous registers. The method is applied to design a maximum throughput unsigned multiplier, which yields a speedup of 2.25 over the non-pipelined version, while maintaining delay-insensitivity. Third, a technique to mitigate the impact of the NULL cycle is developed. The technique further increases the maximum attainable throughput of a NCL system by reducing inherent overheads associated with an integrated data and control path. This technique is applied to a non-pipelined 4-bit by 4-bit unsigned multiplier to yield a speedup of 1.61 over the standalone version. Finally, these techniques are applied to design a 72+32x32 multiply and accumulate (MAC) unit, which outperforms other delay-insensitive/self-timed MACs in the literature. It also performs conditional rounding, scaling, and saturation of the output, whereas the others do not; thus further distinguishing it from the previous work. The methods developed facilitate speed, transistor count, and power tradeoffs using approaches that are readily automatable

    Generic algorithms and NULL Convention Logic hardware implementation for unsigned and signed quad-rail multiplication

    Get PDF
    This thesis focuses on designing generic quad-rail arithmetic circuits, such as signed and unsigned multipliers and Multiply and Accumulate (MAC) units, using the asynchronous delay-insensitive NULL Convention Logic (NCL) paradigm. This work helps to build a library of reusable components to be used for automated NCL circuit synthesis, which will aid in the integration of asynchronous design paradigms into the semiconductor industry --Abstract, page iii

    Implementation of Area Efficient and High Bit Rate Serial-Serial Multiplier With On-the-Fly Accumulation by Asynchronous Counters

    Get PDF
    Traditional Serial-Serial multiplier addresses the high data sampling rate. It is effectively considered as the entire partial product matrix with n data sampling cycle for n×n multiplication function instead of 2n cycles in the conventional multipliers. This multiplication of partial products by considering two series inputs among which one is starting from LSB the other from MSB. Using this feed sequence and accumulation technique it takes only n cycle to complete the partial products. It achieves high bit sampling rate by replacing conventional full adder and highest 5:3 counters. Here asynchronous 1’s counter is presented. This counter takes critical path is limited to only an AND gate and D flip-flops. Accumulation is integral part of serial multiplier design. 1’s counter is used to count the number of ones at the end of the nth iteration in each counter produces. The implemented multipliers consist of a serial-serial data accumulator module and carry save adder that occupies less silicon area than the full carry save adder. In this paper we implemented model address for the 8bit 2’s complement implementing the Baugh-wooley algorithm and unsigned multiplication implementing the architecture for 8×8 Serial-Serial unsigned multiplicatio

    FPGA Implementation of Fast Binary Multiplication Based on Customized Basic Cells

    Get PDF
    Multiplication is considered one of the most time-consuming and a key operation in wide variety of embedded applications. Speeding up this operation has a significant impact on the overall performance of these applications. A vast number of multiplication approaches are found in the literature where the goal is always to achieve a higher performance. One of these approaches relies on using smaller multiplier blocks which are built based on direct Boolean algebra equations to build large multipliers. In this work, we present a methodology for designing binary multipliers where different sizes customized partial products generation (CPPG) cells are designed and used as smaller building blocks. The sizes of the designed CPPG cells are 2×2, 3×3, 4×4, 5×5, and 6×6. We use these cells to build 8×8, 16×16, 32×32, 64×64, and 128×128 binary multipliers. All of the CPPG cells and the binary multipliers are described using the VHDL language, tested, and implemented using XILINX ISE 14.6 tools targeting different FPGA families. The implementation results show that the best performance is achieved when cell 3×3 is used and Virtex-7 FPGA is targeted. The binary multipliers that are designed using the proposed CPPG cells achieve better performance when compared with the binary multipliers presented in the literature. As an application that utilizes the proposed multiplier, a Multiply-Accumulate (MAC) unit is designed and implemented in Spartan-3E. The implementation results of the MAC unit demonstrate the effectiveness of the proposed multiplier

    Pipeline par vagues d'unités arithmétiques pour la communication à très haut débit

    Get PDF

    HIGH-LEVEL OPTIMIZATION TECHNIQUES FOR LOW-POWER MODIFIED BOOTH MULTIPLIER DESIGN OF FPGA

    Get PDF
    Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work, we focus on optimizing the design of the fused Add-Multiply (FAM) Operator for increasing performance. We investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth (MB) form. We introduce a structured and efficient recoding technique and explore three different schemes by incorporating them in FAM designs. Comparing them with the FAM designs which use existing recoding schemes, the propose technique yields considerable reductions in terms of critical delay, hardware complexity of the FAM unit. The FAM Architecture is implemented by Verilog Hardware Description Language and it is synthesized by Xilinx ISE tool. In proposed, we focus on AM units which implement the operation. The conventional design of the AM operator requires that its inputs and are first driven to an adder and then the input and the sum are driven to a multiplier in order to get. The drawback of using an adder is that it inserts a significant delay in the critical path of the AM. As there are carry signals to be propagated inside the adder, the critical path depends on the bit-width of the inputs. In order to decrease this delay, a SPST adder can be used which, however, the increases the area occupation and the power dissipation. An optimized design of the AM operator is based on the fusion of the adder and the MB encoding unit into a single data path block by direct recoding of the sum to its MB representation. The fused Add-Multiply (FAM) component contains only one adder at the end (final adder of the parallel multiplier). As a result, significant area savings are observed and the critical path delay of the recoding process is reduced and decoupled from the bit-width of its inputs. In this work, we present a new technique for direct recoding of two numbers in the MB representation of their sum

    DESIGN OF THE OPTIMIZED FUSED-ADD MULTIPLY (FAM) OPERATOR BY USING MODIFIED BOOTH

    Get PDF
    Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work, we focus on optimizing the design of the fused Add-Multiply (FAM) Operator for increasing performance. We investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth (MB) form. We introduce a structured and efficient recoding technique and explore three different schemes by incorporating them in FAM designs. Comparing them with the FAM designs which use existing recoding schemes, the propose technique yields considerable reductions in terms of critical delay, hardware complexity of the FAM unit. The FAM Architecture is implemented by Verilog Hardware Description Language and it is synthesized by Xilinx ISE tool. In proposed, we focus on AM units which implement the operation. The conventional design of the AM operator requires that its inputs and are first driven to an adder and then the input and the sum are driven to a multiplier in order to get. The drawback of using an adder is that it inserts a significant delay in the critical path of the AM. As there are carry signals to be propagated inside the adder, the critical path depends on the bit-width of the inputs. In order to decrease this delay, a SPST adder can be used which, however, the increases the area occupation and the power dissipation. An optimized design of the AM operator is based on the fusion of the adder and the MB encoding unit into a single data path block by direct recoding of the sum to its MB representation. The fused Add-Multiply (FAM) component contains only one adder at the end (final adder of the parallel multiplier). As a result, significant area savings are observed and the critical path delay of the recoding process is reduced and decoupled from the bit-width of its inputs. In this work, we present a new technique for direct recoding of two numbers in the MB representation of their sum
    corecore