ABSTRACT
1.INTRODUCTION
Developing low power, high speed and area efficient portable electronic designs is a challenging problem for hardware designers [1]. Mobile phones, smart cards, PDAs and assistive listening devices such as hearing aids are examples of portable consumer electronic products [2, 3]. The main concerns for these products are to extend the operating hours of the battery and to provide large computational capacity. Low power design can be pursued at the system, technology, architecture and circuit levels. The largest power savings are obtained when low power design is achieved at the system level. A significant amount of power can also be saved at the architecture level, but at the cost of a delay penalty and area overhead. At the architecture and system levels, parallelism and pipelining are the two main techniques used to reduce power and propagation delay [4]. At the technology level, power consumption scales down as the technology shrinks. Power can therefore be saved through improvements in the fabrication process, such as smaller feature sizes, very low supply voltages, and interconnects and insulators with low dielectric constants. At the circuit level, voltage scaling, threshold voltage selection, transistor sizing, network restructuring, power down strategies and the choice of logic style are used to achieve low power [5]. These circuit level techniques also help to reduce propagation delay and area occupancy.
Digital signal processing (DSP) is an important function in electronic devices. Digital signal processors (DSPs) perform common operations such as video processing, filtering and the fast Fourier transform (FFT). Such modules execute extensive sequences of multiply and accumulate computations. Multiplication is the most fundamental operation in digital computer systems and digital signal processors [6] [7] [8] [9]. A large number of transistors with high switching activity are used to perform multiplication. In a 64 point radix-4 pipelined FFT processor, the multiplier consumes 30% of the power and occupies 46% of the chip area. The multiplier is therefore the most critical, power hungry arithmetic unit, requiring more area and computation time than other units [10] [11] [12] [13] [14] [15]. Various external and internal techniques have been applied in the past to achieve energy efficient multiplier designs. External techniques relate to the characteristics of the input data, whereas internal techniques deal with the system, technology, architecture and circuit levels [6]. Tree based multipliers (Wallace and Dadda) and array based multipliers are discussed extensively in the literature [16] [17] [18] [19]. Array based multipliers consume less power than Wallace tree multipliers. In a tree based multiplier, additional hardware is required to improve performance, at the cost of a larger layout and increased parasitics. An array multiplier, on the other hand, has a smaller and more regular layout. The array multiplier is therefore a better choice because of its lower power consumption, smaller layout and relatively good performance [20] [21] [22] [23]. The adder is the fundamental unit of the multiplier, so it has a significant impact on the overall performance of the system in terms of power dissipation, delay and area occupancy.
In this paper, an array multiplier is proposed that achieves low power, high speed multiplication at a lower hardware cost. The multiplier adopts an improved column bypassing scheme and a new adder architecture for better overall performance. The proposed adder architecture is optimized to use less hardware, since a smaller area leads to fewer switching transitions.
The rest of the paper is organized as follows. Section 2 presents a short introduction to the sources of power dissipation. Section 3 reviews the various multipliers. Section 4 describes the proposed multiplier. Results and analysis of the entire work are presented in Section 5. Finally, Section 6 concludes the paper.
2.POWER DISSIPATION
The sources of power consumption in digital CMOS circuits are static power dissipation and dynamic power dissipation [24] [25] [26]. Eq. (1) gives the power consumption of digital CMOS circuits [27].
$$P_{total} = \alpha C_L V_{DD}^{2} f + I_{sc} V_{DD} + I_{leakage} V_{DD} \qquad (1)$$
where $\alpha$ is the switching transition of a clock cycle, $C_L$ is the output capacitance, $V_{DD}$ is the supply voltage, $f$ is the switching frequency, which is fixed in many DSP and dedicated applications, $I_{sc}$ is the short circuit current and $I_{leakage}$ is the leakage current [28]. In submicron technology, leakage current is a significant contributor to power consumption. Circuit and technology level techniques use dual-$V_t$ partitioning, multi-threshold CMOS and power gating to reduce leakage power. Some leakage reduction methods have been presented in [29, 30]. Power gating can reduce both components of power dissipation to a considerable extent [31] [32] [33]. In this paper, power gating is considered for reducing the power of the proposed multiplier. This technique uses sleep transistors to shut down the supply of selected logic blocks that are not functional during the bypassing operation. A PMOS transistor acts as a header switch that connects the supply voltage $V_{DD}$ to the logic block, and an NMOS transistor acts as a footer switch that connects the logic block to ground [34], as shown in figure 1. The proposed multiplier uses power gating with a PMOS header switch in place of the tri-state buffers used by the existing bypassing multipliers reviewed in Section 3.
In an active circuit, dynamic power dissipation is the major source of power dissipation and leakage power is comparatively small. Dynamic power can be lowered by reducing the switching transitions of the design without affecting its functionality. Most of the power reduction techniques applied to multipliers target the parameters involved in Eq. (1). Logic style and optimized architectures are also used to reduce power consumption. In a multiplier, dynamic power can be reduced quadratically by reducing the supply voltage, but this makes the module sluggish [35]. It also reduces the throughput, since the delay increases as the supply voltage is lowered, as shown in Eq. (2).
$$T_{d} = \frac{C_L V_{DD}}{k \left( V_{DD} - V_t \right)^{\beta}} \qquad (2)$$
where $V_t$ is the threshold voltage of the transistor, $C_L$ is the load capacitance and $k$ is a technology dependent constant. For a short channel device, the value of the exponent $\beta$ is 1.3. It varies with the technology and is assumed to be fixed. In this paper, we focus on reducing both dynamic and static power dissipation through architecture level and circuit level modifications.
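As a rough numerical illustration of the voltage scaling trade-off discussed in this section, the sketch below evaluates the usual first-order CMOS power model (switching, short circuit and leakage terms) together with a power-law delay model. The function names and every numeric value (activity factor, capacitance, frequency, threshold voltage, constant `k`) are illustrative assumptions and not figures from this work; only the exponent 1.3 is taken from the text.

```python
def total_power(activity, c_load, vdd, freq, i_sc, i_leak):
    """First-order CMOS power in watts: switching + short-circuit + leakage."""
    dynamic = activity * c_load * vdd ** 2 * freq
    short_circuit = i_sc * vdd
    leakage = i_leak * vdd
    return dynamic + short_circuit + leakage


def gate_delay(c_load, vdd, v_t, k=1.0e-4, exponent=1.3):
    """Delay model C_L * V_DD / (k * (V_DD - V_t) ** exponent), in seconds.

    k is a hypothetical technology constant chosen only for illustration.
    """
    if vdd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_load * vdd / (k * (vdd - v_t) ** exponent)


if __name__ == "__main__":
    # Hypothetical operating points: compare full and halved supply voltage.
    for vdd in (1.0, 0.5):
        p_dyn = total_power(0.5, 10e-15, vdd, 50e6, 0.0, 0.0)
        t_d = gate_delay(10e-15, vdd, v_t=0.3)
        print(f"VDD={vdd:.1f} V  dynamic power={p_dyn:.3e} W  delay={t_d:.3e} s")
```

With the short circuit and leakage currents set to zero, halving the supply voltage cuts the modelled dynamic power to one quarter while the modelled delay increases, which is the throughput penalty noted above.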
3.REVIEW ON MULTIPLIER ARCHITECTURES
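Multiplication is the basic operation performed by many common DSP functional units such as FIR filters and FFT modules. Reducing the power consumption of the multiplier can save a significant portion of the power of the overall digital system [36]. The multiplication of two $n$ bit wide unsigned numbers A and B is defined as follows [37].
$$P = A \times B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j} \qquad (3)$$
where $a_i$ and $b_j$ are the bits of the multiplicand A and the multiplier B respectively.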
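The product of two unsigned n bit numbers can be written as the sum of the weighted partial products $a_i b_j 2^{i+j}$, which is what the AND gates and adder cells of an array multiplier compute. The short sketch below evaluates that double sum directly; the function name is hypothetical and the code is only a behavioural check, not a hardware description.

```python
def multiply_by_partial_products(a, b, n):
    """Return A*B for unsigned n-bit A and B as the sum of a_i * b_j * 2**(i+j)."""
    if not (0 <= a < 2 ** n and 0 <= b < 2 ** n):
        raise ValueError("operands must fit in n bits")
    product = 0
    for i in range(n):
        a_i = (a >> i) & 1          # bit i of the multiplicand
        for j in range(n):
            b_j = (b >> j) & 1      # bit j of the multiplier
            product += (a_i & b_j) << (i + j)  # AND gate output, weighted
    return product
```

For example, `multiply_by_partial_products(12, 11, 4)` returns 132, which is 10000100 in binary.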
3.1.Conventional Array Multiplier
The array multiplier is a better choice in DSP applications because of its smaller layout and high throughput. It is based on standard add and shift operations, and its structure is organized as several stages of AND gates and full adder cells. It may be built from either ripple carry adders (RCAs) or carry save adders (CSAs) [38]. For an N×N multiplication, an RCA based multiplier needs 3N adders and has a worst case delay of 2N+1 adder delays. A CSA based multiplier also needs 3N adders but has a worst case delay of only N+2 adder delays. In a CSA based multiplier, the carry has to propagate from the $(j-1)^{th}$ row to the $j^{th}$ row and then to the $(j+1)^{th}$ row. The CSA based parallel array multiplier is also known as the Braun multiplier [39]. The limitation of the Braun multiplier is its logical architecture, which leads to higher power consumption and hardware cost. Power can be reduced through architectural modifications such as row bypassing, column bypassing, and combined row and column bypassing, as well as through circuit level modifications. Based on the concept of improved column bypassing with a new adder architecture, a low power, high speed multiplier with lower hardware cost is proposed.
3.2.Column Bypassing Multiplier
The column bypassing multiplier eliminates the extra correcting circuit needed to skip the full adder cell, and it consumes less power than the Braun multiplier at higher operating frequencies. This multiplier consists of rows of carry save adders. Its main aim is to reduce the switching transitions required to perform the computation. The adder cell is shown in figure 4. Tri-state buffers are inserted at the inputs of the adder cells to reduce switching transitions when the cells are bypassed, and a multiplexer selects the sum output for both the bypassing and the non-bypassing case, as shown in figure 5. The addition operation in the $(i-1)^{th}$ column can be bypassed to the $i^{th}$ column if the corresponding bit of the multiplicand is zero. This is done by disabling the adder with the buffers under the control of the multiplicand bit $a_i$ [40, 41]. The main limitations of this multiplier are its extra hardware cost and power consumption, caused by the buffers, the full adder cells and the additional AND gates inserted in the last row of adder cells. In simulation, it was observed that at lower operating frequencies this multiplier dissipates more power than the conventional array multiplier because of the buffers.
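The column bypassing idea can be checked with a small behavioural model. The sketch below assumes an unsigned carry save (Braun style) array with n-1 rows of n-1 adder cells followed by a final addition, and it skips every cell in column i when $a_i$ is 0 by passing the incoming sum straight through. It is a functional illustration under these structural assumptions, not a model of the transistor level cells, tri-state buffers or multiplexers; the function name and indexing are hypothetical, and the carry of a bypassed cell is simply treated as 0.

```python
def column_bypass_multiply(a, b, n):
    """Behavioural carry-save array multiplier with column bypassing.

    Returns (product, bypassed_cells) for unsigned n-bit operands.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]

    # Row 0 holds only AND gates: s[i] = a_i * b_0, no carries yet.
    s = [a_bits[i] & b_bits[0] for i in range(n)]
    c = [0] * (n - 1)
    product = s[0]              # product bit 0
    bypassed = 0

    for j in range(1, n):
        new_s = [0] * n
        new_c = [0] * (n - 1)
        for i in range(n - 1):
            sum_in = s[i + 1]   # sum from the previous row, same bit weight
            if a_bits[i] == 0:
                # Column i is bypassed: partial product and carry-in are 0.
                new_s[i] = sum_in
                new_c[i] = 0
                bypassed += 1
            else:
                total = (a_bits[i] & b_bits[j]) + sum_in + c[i]
                new_s[i] = total & 1
                new_c[i] = total >> 1
        new_s[n - 1] = a_bits[n - 1] & b_bits[j]  # top partial product, no adder
        s, c = new_s, new_c
        product |= s[0] << j    # product bit j

    # Final addition of the remaining sum and carry vectors.
    upper_sum = sum(s[i] << (i - 1) for i in range(1, n))
    upper_carry = sum(c[i] << i for i in range(n - 1))
    product += (upper_sum + upper_carry) << n
    return product, bypassed
```

For A = 12 (1100) and B = 11 (1011) with n = 4, the function returns the product 132 and reports that 6 of the 9 adder cells are bypassed, because the two least significant multiplicand bits are zero.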
3.3.Row Bypassing Multiplier
The row bypassing multiplier consumes less power than the Braun multiplier at higher operating frequencies. It consists of rows of ripple carry based full adder cells. The adder cell is shown in figure 5. Tri-state buffers are inserted at the inputs of the adder cells to reduce switching transitions when the cells are bypassed, and a multiplexer selects the sum output for both the bypassing and the non-bypassing case, as shown in figure 7. The $(j-1)^{th}$ row of adders is bypassed to the $j^{th}$ row if the corresponding bit of the multiplier is zero. This is done by disabling the adders with tri-state buffers under the control of the multiplier bit $b_j$. The buffers and multiplexers are designed with transmission gates [12]. The limitation of this approach is that it consumes more power and needs extra hardware because of the buffers and full adder cells. In simulation, it was also observed that at lower operating frequencies this multiplier dissipates more power than the conventional array multiplier because of the buffers.
3.4.Row and Column Bypassing Multiplier
This multiplier consumes less power and uses less hardware than the previously discussed multipliers. The $(j-1)^{th}$ row of the multiplier is bypassed under the control of the AND gate output $a_i b_j$. When $a_i b_j$ is 1, the addition is performed by an inverter and the carry output equals the input of the inverter, as shown in figure 8(a). If $a_i b_j$ is 0, the inverter is disabled with a buffer and its input is bypassed to the sum output. The carry out is zero because both the remaining operand and the carry-in are zero. This operation applies to the first row of CSA based adder cells, where $C_{in}$ is always zero. The $j^{th}$ row of this multiplier is bypassed if the OR of the previous carry-in $C_{i,j-1}$ and the operand $a_i b_j$ is 0. When the output of the OR gate is 1, the addition is performed by a half adder cell computing A+B+1 [42]. The main limitation of this multiplier is that it does not use the bypassing approach and therefore consumes more power. In addition, the system is more complex and consumes a larger amount of power at low frequencies because of the large number of buffers.
4.PROPOSED MULTIPLIER
An array multiplier consists of rows of adder cells. The sum and carry signals generated by each row are fed into the next row. The adders are therefore the major power and area consuming units of the multiplier. The power consumption of a multiplier can be lowered by reducing the switching transitions and the hardware cost of the adder cells.
Switching transitions at the adder cells of the proposed multiplier are lowered using a new improved column bypassing scheme (ICBS) implemented with power gating. The proposed multiplier selects the ICBS only if the multiplicand bit $a_i$ is zero, as shown in figure 11. Power gating saves power by temporarily disconnecting the supply voltage ($V_{DD}$) from the selected blocks that are not functional during that period. This approach therefore consumes less power and area than the buffers used by the previous designs. In addition, buffers perform very poorly at low frequencies, so they may not be a good choice for low power, low frequency applications.
The occurrence probability of zero in a multiplier can be described by the following equation.
( ) ( )
where $n$ is the number of bits in the multiplicand A and the multiplier B, $D_i$ is the effective data and prob is the probability of the specified effective data. Based on Eq. (4), the probability of zero in actual multiplier implementations such as adaptive differential pulse code, G723.1 speech code and wavelet based image coder is over 65%, which is higher than the probability under a uniform distribution [10]. This indicates that bypassing is an effective power saving scheme for multipliers, and the ICBS has therefore been used to design the proposed multiplier. Tests show that the proposed multiplier performs best in low frequency applications (≤ 50 MHz) such as assistive listening technology. It also performs better than the designs available in the literature at higher frequencies (≤ 333.3 MHz).
The proposed adder cell shown in figure 10 has been designed using fewer hardware components. This adder has lower area, propagation delay and power requirements than those of the previously discussed multipliers. If the carry-in is 0, the addition is performed by the half adder, and its sum and carry outputs are selected by the multiplexer. If the carry-in is 1, the multiplexer selects the inverted sum output of the half adder cell as the sum and the OR of the half adder inputs as the carry. The functionality of a full adder cell is thus obtained with a reduced number of transistors in the proposed adder cell.
The operation of the proposed multiplier is illustrated with an example in figure 12. In the $(j-1)^{th}$ row, the initial carry is fixed to zero, so the full adder cells are replaced with half adder cells as shown in figure 11. In this row, the multiplicand bits $a_0$ and $a_1$ are 0, so the vectors ($a_1 b_0$) and ($a_2 b_0$) are bypassed to the $j^{th}$ row by shutting down their respective adder cells with the ICBS. In adder cell 3 of the $(j-1)^{th}$ row, the addition is performed by the half adder because the multiplicand bit $a_2$ is 1. It generates sum and carry outputs of 1 and 0 respectively, and these outputs are passed to the $j^{th}$ row adder cell as shown in figure 12.
In the subsequent rows, the carry input may be zero or one, so the logic applied to the $(j-1)^{th}$ row is not applicable. In the $j^{th}$ row, the carry-in $C_{i,j-1}$ propagating from the $(j-1)^{th}$ row controls the addition when the bypassing scheme is not selected. In this row, $a_0$, $a_1$ and the carry-in are 0 for adder cell 1 and adder cell 2, so the bypassing operation is performed with the ICBS and the sum outputs propagating from the $(j-1)^{th}$ row are selected by the multiplexers at the output. For adder cell 3 of the $j^{th}$ row, bypassing is not selected and the addition is performed by the proposed adder cell because the multiplicand bit $a_2$ is 1. The carry-in of this adder cell is 1, and it controls the addition. The multiplexer of the proposed adder cell therefore selects the inverter output and the OR of the half adder inputs as the final sum (0) and carry (1) outputs. The same operation is repeated in all successive rows of the multiplier. Finally, 10000100 is obtained as the output vector.
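The adder cell behaviour described in this section (half adder outputs when the carry-in is 0, inverted half adder sum and OR of the inputs when the carry-in is 1) can be checked against an ordinary full adder. The sketch below is a behavioural truth table check only; the function names are hypothetical and nothing here reflects the transistor level structure or its power figures.

```python
from itertools import product


def half_adder(a, b):
    """Return (sum, carry) of two bits."""
    return a ^ b, a & b


def proposed_adder_cell(a, b, c_in):
    """Behaviour of the adder cell as described in the text.

    c_in == 0: the multiplexer selects the half adder sum and carry.
    c_in == 1: it selects the inverted half adder sum and the OR of the inputs.
    """
    ha_sum, ha_carry = half_adder(a, b)
    if c_in == 0:
        return ha_sum, ha_carry
    return ha_sum ^ 1, a | b


def reference_full_adder(a, b, c_in):
    """Arithmetic definition of a full adder, used only for comparison."""
    total = a + b + c_in
    return total & 1, total >> 1


if __name__ == "__main__":
    for a, b, c_in in product((0, 1), repeat=3):
        cell = proposed_adder_cell(a, b, c_in)
        assert cell == reference_full_adder(a, b, c_in)
        print(a, b, c_in, "->", cell)
```

All eight input combinations agree with the reference full adder, which is consistent with the statement that full adder functionality is obtained from a half adder, an inverter, an OR gate and a multiplexer.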
5.RESULTS AND ANALYSIS
This section compares the performance of the proposed multiplier with that of the existing multipliers in terms of power dissipation, worst case delay, power delay product and area overhead. The proposed multiplier and the existing multipliers are implemented in UMC (United Microelectronics Corporation) 90 nm CMOS technology using the Cadence Virtuoso tool. The Cadence Spectre simulator is used to estimate the power consumption and worst case delay. Results for the proposed multiplier are compared with the Braun, row bypassing, column bypassing, and row and column bypassing multipliers. All the multipliers have been designed for 16 bit and 8 bit multiplication. The comparison of power consumption at operating frequencies ranging from 1 MHz to 333.3 MHz is shown in Table 1. The propagation delay has been calculated for frequencies ranging from 1 MHz to 333.3 MHz; the delay of the different multipliers at an operating frequency of 250 MHz is shown in Table 2. The delay is measured from the 50% voltage level of the input to the 50% voltage level of the resulting output for all rise and fall transitions. Similarly, comparisons of the power delay product for frequencies ranging from 1 MHz to 333.3 MHz and of the area overheads of these multipliers are shown in Table 3 and Table 4. For the power delay product, the worst case delay is taken as the largest delay among all outputs. In this work, the input test patterns are chosen randomly with equal occurrence probability of zeros and ones, i.e. the probabilities of 0 and 1 are both 50%. For the 16×16 bit multiplier, the proposed design achieves reductions of 73 percent in power consumption and 27 percent in worst case delay at an operating frequency of 250 MHz compared with the Braun multiplier. Similarly, the proposed 8×8 bit multiplier achieves reductions of 59 percent in power consumption and 40 percent in worst case delay at an operating frequency of 250 MHz compared with the Braun multiplier.
The improvement in power consumption of the proposed multiplier over the existing multipliers at different operating frequencies is shown graphically in figure 13 and figure 14. In addition, for the 16×16 bit multiplier, the proposed design achieves an 80 percent reduction in power delay product at an operating frequency of 250 MHz and a 17.1 percent reduction in area overhead compared with the Braun multiplier. Similarly, the proposed 8×8 bit multiplier achieves a 75 percent reduction in power delay product at an operating frequency of 250 MHz and a 17.8 percent reduction in area overhead compared with the Braun multiplier. The plots in figure 15 and figure 16 show that the power delay product of the proposed multiplier is much better than that of the existing multipliers at low frequencies (≤ 50 MHz) and also better at high frequencies (≤ 333.3 MHz). The proposed multiplier is therefore a better choice for low frequency applications such as digital hearing aids, and also for high frequency applications (≤ 333.3 MHz).
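Because the power delay product is the product of power and worst case delay, the reported reductions can be cross-checked against each other. The sketch below assumes that the power delay product is formed from the same power and worst case delay values whose reductions are quoted at 250 MHz; the function name is illustrative.

```python
def pdp_reduction(power_reduction, delay_reduction):
    """Fractional reduction in power-delay product.

    Both arguments are fractional reductions relative to the baseline,
    e.g. 0.73 for a 73 percent reduction.
    """
    remaining = (1.0 - power_reduction) * (1.0 - delay_reduction)
    return 1.0 - remaining


if __name__ == "__main__":
    # Reductions relative to the Braun multiplier at 250 MHz, as reported above.
    for label, p_red, d_red in (("16x16", 0.73, 0.27), ("8x8", 0.59, 0.40)):
        print(f"{label}: PDP reduction = {pdp_reduction(p_red, d_red):.1%}")
```

Under this assumption the 16×16 figures give a reduction of about 80 percent and the 8×8 figures about 75 percent, in line with the reported power delay product improvements.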
6.CONCLUSION
A low power, high speed multiplier architecture with an improved column bypassing scheme has been presented in this work. A new adder with optimized hardware is also proposed. The architecture of this adder reduces the power consumption and propagation delay when the ICBS is not in use. Simulation results show that the proposed multiplier architecture reduces switching transitions and leakage power. It is also better in terms of area occupancy and propagation delay. During testing, the input test patterns were chosen randomly with equal occurrence probability of zeros and ones. The proposed multiplier can achieve greater power savings if the input test pattern contains more zeros than ones. The results show that the proposed multiplier outperforms the previously designed multipliers at all tested frequencies, with the largest advantage in low frequency applications. The proposed multiplier can therefore be a better choice for assistive listening technology such as hearing aids.
