## International Journal of Science Engineering and Advance Technology

# Low Latency MAC Design for Low Power DSP Applications

Kadali Swathi$^{*1}$, M Srihari$^{2}$<br>M.Tech Scholar, Department of ECE.$^{1}$<br>Asst. Prof., Department of ECE, Kakinada Institute of Engineering \& Technology, East Godavari Dist., AP, India.$^{2}$


#### Abstract

In this work, a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports both signed and unsigned numbers is proposed. An efficient MAC design using 4:2 compressors is presented. A low-power, high-speed 4:2 compressor circuit is proposed for fast digital arithmetic integrated circuits. The MAC consists of multiplier and adder units. The 4:2 compressor has been widely used in multiplier realizations. The multiplier uses a new partial product reduction stage, which consequently reduces the maximum output delay. The design is further enhanced by using a Baugh-Wooley multiplier to improve latency. The Baugh-Wooley multiplier performs multiplication in two's complement form.


Keywords: multiply-accumulate (MAC), compressor, multiplier, Baugh-Wooley, low power, low latency.

## I. INTRODUCTION

The multiply-accumulate (MAC) unit is a common digital block used extensively in microprocessors and digital signal processors for data-intensive applications. For example, many filters, orthogonal frequency-division multiplexing algorithms, and channel estimators require FIR or FFT/IFFT computations that MAC units can accelerate efficiently. Modern consumer electronics make extensive use of Digital Signal Processing (DSP), providing custom accelerators for domains such as multimedia and communications. Typical DSP applications carry out a large number of arithmetic operations, as their implementation is based on computationally intensive kernels such as the Fast Fourier Transform (FFT), the Discrete Cosine Transform (DCT), Finite Impulse Response (FIR) filters and signal convolution. As expected, the performance of DSP systems is inherently affected by design decisions regarding the allocation and the architecture of arithmetic units.

Recent research in the field of arithmetic optimization [1], [2] has shown that designing arithmetic components that combine operations sharing data can lead to significant performance improvements. Based on the observation that an addition often follows a multiplication (e.g., in symmetric FIR filters), the Multiply-Accumulator (MAC) and Multiply-Add (MAD) units were introduced [3], leading to more efficient implementations of DSP algorithms than conventional ones, which use only primitive resources [4]. Several architectures have been proposed to optimize the performance of the MAC operation in terms of area occupation, critical path delay or power consumption [5]-[7]. As noted in [8], MAC components increase the flexibility of DSP datapath synthesis, as a large set of arithmetic operations can be efficiently mapped onto them. Besides the MAC/MAD operations, many DSP applications are based on Add-Multiply (AM) operations (e.g., the FFT algorithm [9]).

Over the last decade, the semiconductor industry has experienced exponential growth in the integration of sophisticated multimedia applications into portable gadgets. The major concern for portable gadgets is battery life, which affects real-time processing applications and the dynamic range of their input signals for additional features. It is high time to explore the challenging criteria of these emerging low-power, low-area and high-performance digital signal processing chips [1]. In digital VLSI circuits, computation is the critical part, and it determines the power consumption and operating speed of the designs. Arithmetic circuits for computation include adders and multipliers, which are the most abundantly used components. Digital signal processors performing filtering, convolution and similar tasks rely on the efficient implementation of these adder, multiplier and MAC arithmetic units.
As the criticality of multipliers determines the power consumption and operating speed of digital circuits, there is potential at the circuit design level to improve the power and delay constraints. Many researchers have developed and demonstrated architectures to improve the efficiency of multipliers. Booth encoders and their modifications were developed to reduce delay by reducing the number of rows in the partial product generation stage. Compressors were used in the partial product reduction stage to increase the speed of the multiplication operation [3-5]. A complementary pass-transistor logic based adiabatic 8-bit multiplier is designed in [6] to reduce the delay and power consumption of the multiplier architecture. Vedic sutras were also used in the multiplier architecture to increase the speed of MAC architectures [7]. To reduce the delay further in MAC architectures, the carry-propagate addition stage of the multiplier and the adder stage of the accumulator are merged using compressors in this work.

The direct design of the AM unit, by first allocating an adder and then driving its output to the input of a multiplier, significantly increases both the area and the critical path delay of the circuit. Targeting an optimized design of AM operators, fusion techniques [10]-[13], [23] are employed based on the direct recoding of the sum of two numbers (equivalently, a number in carry-save representation [14]) in its Modified Booth (MB) form [15]. Thus, the carry-propagate (or carry-look-ahead) adder [16] of the conventional AM design is eliminated, resulting in considerable gains in performance. Lyu and Matula [10] presented a signed-bit MB recoder which transforms redundant binary inputs to their MB recoding form. A special expansion of the preprocessing step of the recoder is needed in order to handle operands in carry-save representation. In [12], the author proposes a two-stage recoder which converts a number in carry-save form to its MB representation.
The first stage transforms the carry-save form of the input number into signed-digit form, which is then recoded in the second stage so that it matches the form that the MB digits require. Recently, the technique of [12] has been used for the design of high-performance flexible coprocessor architectures targeting computationally intensive DSP applications [17]. Zimmermann and Tran [13] present an optimized design of [10] which results in improvements in both area and critical path. In [23], the authors propose recoding a redundant input from its carry-save form to the corresponding borrow-save form while keeping the critical path of the multiplication operation fixed.

Although the direct recoding of the sum of two numbers in its MB form leads to a more efficient implementation of the fused Add-Multiply (FAM) unit than the conventional one, existing recoding schemes are based on complex bit-level manipulations, which are implemented by dedicated circuits at the gate level. This work focuses on the efficient design of FAM operators, targeting the optimization of the recoding scheme for direct shaping of the MB form of the sum of two numbers (Sum to MB, S-MB). More specifically, we propose a new recoding technique which decreases the critical path delay and reduces area and power consumption. The proposed S-MB algorithm is structured.

## I. MULTIPLICATION ALGORITHM:

The multiplication algorithm for an N-bit multiplicand by an N-bit multiplier is shown below:

$$\begin{array}{ll} Y = Y_{n-1}\, Y_{n-2} \ldots Y_2\, Y_1\, Y_0 & \text{Multiplicand} \\ X = X_{n-1}\, X_{n-2} \ldots X_2\, X_1\, X_0 & \text{Multiplier} \end{array}$$








Figure 1: generalized multiplication
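
As an illustration of the generalized multiplication above, the following is a minimal Python sketch, assuming unsigned operands. Each multiplier bit $X_i$ selects a copy of the multiplicand $Y$ shifted by $i$ positions, and the rows are then summed. The function names `partial_products` and `multiply` are hypothetical and are not taken from the paper.

```python
def partial_products(y: int, x: int, n: int) -> list:
    """Return the n shifted partial-product rows of an unsigned n-bit multiply.

    Assumption: y and x are unsigned n-bit integers.
    """
    rows = []
    for i in range(n):
        x_i = (x >> i) & 1            # multiplier bit X_i
        rows.append((y * x_i) << i)   # row i is (Y AND X_i) weighted by 2^i
    return rows


def multiply(y: int, x: int, n: int) -> int:
    """Sum all partial-product rows to obtain the product."""
    return sum(partial_products(y, x, n))


print(partial_products(0b1011, 0b0110, 4))  # [0, 22, 44, 0]
print(multiply(0b1011, 0b0110, 4))          # 66
```

In hardware the rows are not summed one after another as in this sketch; the reduction and final addition stages are discussed in the following sections.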

## II. LOW POWER COMPRESSORS:

Compressors are digital circuits that can add five, six or seven bits at a time, and are hence called column compressors. A typical five-input compressor is illustrated in this brief. It takes four regular inputs and one intermediate carry-in input, and generates one sum bit, one carry-out bit and one intermediate carry bit. The intermediate carry bits are the carry-in from the previous-stage compressor and the carry-out to the next-stage compressor (called horizontal carry propagation). The carry-out bit (also called the vertical carry) is the final carry generated along with the sum bit. Since compressors form the basic and critical components of multipliers and large-input adders, several compressor architectures have been developed in the past to address various constraints. Some of the compressor architectures described in the past are shown in the figure below.


Figure 2: Full Adder based Compressor

The compressor architecture shown in the figure above is built using full adders. This architecture has only two cells and therefore minimal interconnect, but each cell must generate both a sum path and a carry path, and one path depends on the other. Driving a chain of such compressors therefore requires larger drive strength, which increases power consumption. The higher drive strength does, however, significantly reduce the delay.
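
The behaviour of a full-adder based compressor can be sketched in Python as follows. This is a minimal bit-level model, assuming the common arrangement of two cascaded full adders; the exact wiring of the figure is not recoverable from the text, and the names `full_adder` and `compressor_4_2` are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c: int):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int):
    """Five-input compressor built from two full adders (assumed wiring).

    Returns (sum, carry, cout), where cout is the horizontal carry to the
    next-stage compressor and carry is the vertical carry.
    """
    s1, cout = full_adder(x1, x2, x3)    # cout does not depend on cin
    s, carry = full_adder(s1, x4, cin)   # second cell absorbs x4 and cin
    return s, carry, cout


# Check the defining identity for all 32 input combinations:
# x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout)
for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```

Because `cout` is computed without `cin`, the horizontal carry does not ripple through a row of compressors, while the sum path of the second cell depends on the first, which is the path dependence noted above.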

## III. MULTIPLIERS USING LOW POWER COMPRESSORS:

Multipliers are implemented in three stages: partial product generation, partial product reduction and carry-propagate addition. Regular architectures use half adders and full adders in the partial product stages, but because of their performance limitations, compressor cells are used instead. Some past architectures reduced the number of reduction steps in the partial product reduction stage by introducing Booth encoding in the partial product generation stage, in order to reduce overall delay [3-5]. Using compressors in the multiplier reduces the number of gates needed for implementation, which in turn reduces the number of interconnects. This reduces interconnect delay and the associated glitches, yielding a low-power design. An efficient multiplier thus improves the efficiency of the MAC unit. Circuit-level designs tailored to a particular constraint are more efficient in ASIC designs. For example, the proposed low-power compressor architecture improves power efficiency and suits low-power applications. To demonstrate the impact of the compressor architecture, a MAC unit architecture that contains a large number of compressors is chosen from [2]. In [2], the authors use compressors in the partial product reduction stage of the multiplier and in the accumulation stage of the MAC unit, where the carry-propagate stage of the multiplier is merged with the input of the accumulate-add stage.
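
The idea of merging accumulation into the reduction stage can be sketched at word level in Python. This is a simplified model under stated assumptions: unsigned operands, 3:2 carry-save adders in place of the 4:2 compressors, and a linear chain instead of a reduction tree. It is not the architecture of [2]; the names `carry_save_add` and `mac` are hypothetical.

```python
def carry_save_add(a: int, b: int, c: int):
    """Word-level 3:2 carry-save adder: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry


def mac(acc: int, y: int, x: int, n: int) -> int:
    """Compute acc + y * x for unsigned n-bit y and x (illustrative sketch)."""
    # Stage 1: partial product generation.
    rows = [(y if (x >> i) & 1 else 0) << i for i in range(n)]
    # The accumulator joins the reduction as one more row, so no separate
    # carry-propagate addition is needed for the accumulate step.
    rows.append(acc)
    # Stage 2: reduce the rows to two without propagating carries.
    while len(rows) > 2:
        s, carry = carry_save_add(rows[0], rows[1], rows[2])
        rows = [s, carry] + rows[3:]
    # Stage 3: a single carry-propagate addition produces the result.
    return sum(rows)


print(mac(10, 11, 6, 4))  # 76
```

Only the last step propagates carries across the full word, which is why merging the accumulator into the reduction stage removes one carry-propagate adder from the critical path.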

## IV. PROPOSED BAUGH-WOOLEY ARCHITECTURE:

## a) Baugh-Wooley multiplier based MAC unit:

Two's complement is the most popular method of representing signed integers in computer science. It is also an operation of negation (converting positive numbers to negative or vice versa) in computers which represent negative numbers using two's complement. It is so widely used today because it does not require the addition and subtraction circuitry to examine the signs of the operands to determine whether to add or subtract. Two's complement and one's complement representations are commonly used because they make arithmetic units simpler to design. The figure below shows the two's complement and one's complement representations.

Baugh-Wooley two's complement signed multiplication: The Baugh-Wooley two's complement signed multiplier is the best-known algorithm for signed multiplication because it maximizes the regularity of the multiplier and allows all the partial products to have positive sign bits. The Baugh-Wooley technique was developed to design direct multipliers for two's complement numbers. When two's complement numbers are multiplied directly, each of the partial products to be added is a signed number.
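
The following Python sketch illustrates how the Baugh-Wooley approach makes every partial-product bit positive. It uses the commonly taught modified Baugh-Wooley formulation, which is an assumption here since the paper does not give its equations: partial-product bits involving exactly one sign bit are complemented, and the constants $2^{n}$ and $2^{2n-1}$ are added, with the result taken modulo $2^{2n}$. The function name `baugh_wooley_multiply` is hypothetical.

```python
def baugh_wooley_multiply(a: int, b: int, n: int = 4) -> int:
    """Multiply two n-bit two's complement integers (illustrative sketch).

    Assumption: a and b lie in the range [-2**(n-1), 2**(n-1) - 1].
    """
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            pp = a_bits[i] & b_bits[j]
            # Complement the bits that involve exactly one sign bit.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            total += pp << (i + j)

    # Correction constants of the modified Baugh-Wooley scheme.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1          # keep 2n result bits

    # Reinterpret the 2n-bit pattern as a signed value.
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


print(baugh_wooley_multiply(-8, -8))  # 64
print(baugh_wooley_multiply(-1, 1))   # -1
```

Since all rows now contain only non-negative bits, the same adder or compressor array used for unsigned multiplication can reduce them without sign extension.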


Figure 3: Baugh-Wooley multiplication architecture


Figure 4: Block diagram for $4 \times 4$ Baugh-Wooley multiplier


Figure 5: RTL view of Baugh-Wooley with decomposition logic

The implementation of a digital multiplier with decomposition logic is presented here. In this technique, the multiplication process is split into smaller sub-units (smaller multipliers) and their outputs are combined to obtain the final result. The decomposition logic requires extra circuitry to perform the final addition of the outputs obtained from the smaller multipliers [7]. However, due to parallel processing, a noticeable improvement in speed is achieved.

To check the performance of the multiplier structure, an $8 \times 8$ multiplier is designed using the Baugh-Wooley algorithm and the decomposition logic. Fig. 2 [7] shows an $8 \times 8$ multiplier implemented using the decomposition logic. In the first stage, four $4 \times 4$ multipliers are used to compute the partial products; the outputs of these $4 \times 4$ multipliers are then combined in a tree-like fashion to obtain the final result [7]. The $4 \times 4$ multiplier is implemented using the Baugh-Wooley method. For $16 \times 16$ multiplication, three decomposition structures can be implemented: the first using $4 \times 4$ Baugh-Wooley multipliers, the second using $8 \times 8$ Baugh-Wooley multipliers, and the third using the $8 \times 8$ decomposition structure.
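
The decomposition of an $8 \times 8$ multiplication into four $4 \times 4$ multiplications can be sketched in Python as follows. For simplicity the sketch assumes unsigned operands, whereas the paper uses Baugh-Wooley blocks for signed numbers; `mul4x4` is a hypothetical stand-in for one $4 \times 4$ multiplier block, and `mul8x8_decomposed` is likewise a hypothetical name.

```python
def mul4x4(a: int, b: int) -> int:
    """Stand-in for one 4x4 multiplier block (unsigned in this sketch)."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mul8x8_decomposed(a: int, b: int) -> int:
    """Unsigned 8x8 multiply built from four 4x4 multiplies."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF

    # First stage: the four sub-products are independent of each other,
    # so hardware can compute them in parallel.
    ll = mul4x4(a_lo, b_lo)
    lh = mul4x4(a_lo, b_hi)
    hl = mul4x4(a_hi, b_lo)
    hh = mul4x4(a_hi, b_hi)

    # Second stage: the extra adder circuitry combines the weighted outputs.
    return ll + ((lh + hl) << 4) + (hh << 8)


print(mul8x8_decomposed(200, 123))  # 24600
```

The same splitting applies one level up: a $16 \times 16$ multiplier can be composed from four $8 \times 8$ blocks, each of which may itself be a direct or a decomposed multiplier.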


Figure 6: Decomposition structure for $8 \times 8$ multiplication

## V. RESULTS


Figure 7: Experimental Result

## VI. CONCLUSION

This paper focuses on optimizing the design of the multiply-accumulate (MAC) unit. A design- and domain-specific low-power, low-latency compressor-based MAC architecture has been demonstrated, and the importance of circuit-level design and its impact on DSP applications has been addressed. The proposed architectures yield better efficiency than existing architectures. A more efficient Baugh-Wooley algorithm is presented and applied to the MAC unit for improved architectures.

## VII. REFERENCES

[1] Chang, Chip-Hong, Jiangmin Gu, and Mingyan Zhang. "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits." Circuits and Systems I: Regular Papers, IEEE Transactions on 51.10 (2004): 1985-1997.
[2] Tung Thanh Hoang; Sjalander, M.; Larsson-Edefors, P., "A High-Speed, Energy-Efficient Two-Cycle Multiply Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit," Circuits and Systems I: Regular Papers, IEEE Transactions on , vol.57, no.12, pp.3073,3081, Dec. 2010.
[3] Chen Ping-hua; Zhao Juan, "High-speed Parallel 32×32-b Multiplier Using a Radix-16 Booth Encoder," Intelligent Information Technology Application Workshops, 2009. IITAW '09. Third International Symposium on, vol., no., pp.406,409, 21-22 Nov. 2009
[4] Kiwon Choi; Minkyu Song, "Design of a high performance $32 \times 32$-bit multiplier with a novel sign select Booth encoder," Circuits and Systems, 2001. ISCAS 2001. The 2001 IEEE International Symposium on , vol.2, no., pp.701,704 vol. 2, 6-9 May 2001.
[5] Rajput, R.P.; Swamy, M.N.S., "High Speed Modified Booth Encoder Multiplier for Signed and Unsigned Numbers," Computer Modelling and Simulation (UKSim), 2012 UKSim 14th International Conference on , vol., no., pp.649,654, 28-30 March 2012.
[6] Yangbo Wu ; Weijiang Zhang; Jianping Hu , "Adiabatic 4-2 compressors for low-power multiplier," Circuits and Systems, 2005. 48th Midwest Symposium on , vol., no., pp.1473,1476 Vol. 2, 7-10 Aug. 2005.
[7] Jaina, D.; Sethi, K.; Panda, R., "Vedic Mathematics Based Multiply Accumulate Unit," Computational Intelligence and Communication Networks (CICN), 2011 International Conference on, vol., no., pp.754,757, 7-9 Oct. 2011.
[8] Aliparast, Peiman, Ziaadin D. Koozehkanani, and Farhad Nazari. "An Ultra High Speed Digital 4-2 Compressor in 65-nm CMOS." International Journal of Computer Theory \& Engineering 5.4 (2013).
[9] N. Weste and David Harris, "CMOS VLSI Design: A Circuits \& System Perspective", Pearson Education, 2008.
[10] ChandraMohan U, "Low Power Area Efficient Digital Counters", Proceedings of the 7th VLSI Design and Test Workshops, VDAT, August 2003.
[11] Narendra C P \& Ravi K M Kumar, "Efficient Comparator based Sum of Absolute Differences Architecture for Digital Image Processing Applications", Foundation of Computer Science, New York, USA, International Journal of Computer Applications, 96(4):17-24, June 2014.
[12] W.-C. Yeh, "Arithmetic Module Design and its Application to FFT," Ph.D. dissertation, Dept. Electron. Eng., National Chiao-Tung University, Chiao-Tung, 2001.
[13] R. Zimmermann and D. Q. Tran, "Optimized synthesis of sum-of-products," in Proc. Asilomar Conf. Signals, Syst. Comput., Pacific Grove, Washington, DC, 2003, pp. 867-872.
[14] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. Oxford: Oxford Univ. Press, 2000.
[15] O. L. Macsorley, "High-speed arithmetic in binary computers," Proc. IRE, vol. 49, no. 1, pp. 67-91, Jan. 1961.
[16] N. H. E. Weste and D. M. Harris, "Datapath subsystems," in CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. Readington: Addison-Wesley, 2010, ch. 11.

