I. INTRODUCTION
In recent years, power consumption has become a critical design concern for many VLSI systems. It is an especially important bottleneck in portable battery-operated applications, where power consumption may be more important than speed and area. In CMOS technology, a great deal of power dissipation is caused by the charging and discharging of load capacitances. Therefore, it is crucial to minimize the number of signal transitions in a circuit for a low-power design [1].
Because of the frequent use of arithmetic units such as multipliers and adders, and their high power consumption, many low-power techniques have been proposed to optimize these functional units in terms of power consumption (see, e.g., [2], [3], [4], [5]). Among other computing systems, DSP applications make extensive use of multiply-and-accumulate computations. Therefore, the design and implementation of power-efficient arithmetic units, especially multipliers, is essential for the design of low-power DSP hardware [6].
Several power reduction techniques, at different levels of abstraction (from the system and architecture levels to the logic and circuit levels), have been proposed in the literature. Some of these approaches, such as asynchronous multiplier [...]
From one point of view, multipliers can be categorized into sequential and combinational ones. Sequential multipliers are attractive for their low area requirements. They, however, take more time to complete a multiplication operation than combinational ones. In this work, we propose a pre-computation based technique to lower the power consumption of sequential multipliers. The paper is organized as follows: We describe the proposed multiplier architecture in Section II and the benchmark circuits in Section III. The results and discussion are presented in Section IV, while the summary and conclusion are given in Section V.
II. PROPOSED TECHNIQUE
In a sequential multiplier, the multiplication process is divided into a number of sequential steps. In each step, some partial products are generated and added to an accumulated partial sum, and the partial sum is shifted (towards the left or right, depending on the scheme in use) to align the accumulated sum with the partial products of the next steps [7]. Therefore, each step of a sequential multiplication consists of three different operations: generating partial products, adding the generated partial products to the accumulated partial sum, and shifting the partial sum. Fig. 1 [...]
Figure 1. Row-by-row addition in a sequential multiplier
In what follows, the terms multiplicand and multiplier refer to the first and second operands of a given multiplication, respectively. In each multiplication step, one or more multiples of the multiplicand are generated and added to the partial sum through a two- or multi-operand addition operation.
To reduce the power consumption of the multiplier, we make use of the fact that in each sequential circuit, a great deal of power is consumed by bistable elements such as flip-flops, whose dissipated power is proportional to the clock tick count. Hence, reducing the clock tick count required for the completion of a multiplication can lower the power consumption of the flip-flops and increase the speed of the circuit. This will reduce the power delay product (PDP) of the circuit.
The proposed technique is based on the observation that in each sequential multiplication scheme, many of the generated partial products are trivial values (i.e., 0), and adding them to the partial sum has no effect on the partial sum. This phenomenon is more common in low-radix multipliers (Radix-2 or Radix-4) than in high-radix multipliers.
Considering these facts, one can lower the power consumption and increase the speed by eliminating all multiplication steps associated with the generation and accumulation of trivial partial products. Consider a group of such multiplication steps at the end of the multiplication. Because all the generated partial products are 0, adding them to the accumulated partial sum has no effect on the accumulated sum. One can skip over all such additions to save power and time. This is achieved by skipping over the whole corresponding multiplication step, including the shift operations. At the end of the multiplication process, the result should be corrected through a simple shift operation.
[...] All other higher bits will be predictable to be 0 or 1, based on the sign of the result. Therefore, one can avoid computing those higher bits and let the final correction phase produce them. As we will see later, this final correction can be just a simple arithmetic or logical shift in most cases.
Based on the above discussion, in each multiplier implemented using the proposed technique, there are two additional phases besides the normal multiplication steps. These phases are the pre-computation phase and the final correction phase.
A. Pre-computation Phase
In order to apply the technique to a sequential multiplication scheme, an initial pre-computation should be performed on the operands. This pre-computation determines the actual number of multiplication steps needed to compute the final result. If the calculated number is smaller than the maximum number of steps, it will lead to a reduction in the required multiplication steps.
The pre-computation phase, which basically counts the number of leading zeros/ones in the operands, may be implemented through a priority encoder circuit [8]. The nature of the pre-computation phase, though similar for different multiplication schemes, depends somewhat on the multiplication scheme in use. One should note that, to reduce the required clock count, the multiplier should be selected as the one of the two operands with the lower MSP.
In the multiplier implementations presented in this work, we have used priority encoder circuits, tailored to the specific multiplication scheme in use, to do the necessary pre-computation. Each pre-computation circuit calculates the necessary clock count for each multiplication using the MSPs of the operands.
B. Final Correction Phase
The exact operation of the final correction phase depends on the effect of the omitted shift operations on the final result. In many cases, this final correction is just a simple [...]
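As an illustration of the two additional phases, the following behavioral sketch models an unsigned right-shift shift-and-add multiplier in Python. It is not the authors' circuit: the function names `msp` and `sham_multiply`, the reading of MSP as the position of the most significant 1 bit, and the right-shift accumulation scheme are all assumptions made for this example.

```python
def msp(value: int) -> int:
    """Assumed meaning of MSP: number of significant bits of an unsigned value.

    In hardware this would come from a priority encoder; here Python's
    int.bit_length() plays that role.
    """
    return value.bit_length()


def sham_multiply(a: int, b: int, n: int = 16) -> tuple[int, int]:
    """Hypothetical model of an n x n unsigned shift-and-add multiplier.

    Returns (product, executed_steps).
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")

    # Pre-computation phase: use the operand with the lower MSP as the
    # multiplier, and run only as many steps as it has significant bits.
    if msp(a) < msp(b):
        a, b = b, a
    steps = msp(b)

    # Normal multiplication steps (add, then shift right by one position).
    acc = 0
    for i in range(steps):
        if (b >> i) & 1:
            acc += a << n  # add the multiplicand into the upper half
        acc >>= 1

    # Final correction phase: perform the skipped shifts in one
    # logical right shift.
    acc >>= n - steps
    return acc, steps
```

Under these assumptions, `sham_multiply(1000, 3)` executes 2 of the 16 possible steps and the final correction shifts the accumulator right by the remaining 14 positions. Skipping is exact here because every skipped step would have added 0 and shifted once.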
III. BENCHMARK CIRCUITS
We have applied the proposed technique to three 16 × 16 sequential multipliers: a signed modified Booth multiplier (MBM), a signed modified Booth multiplier with carry-save addition (CSA MBM), and an unsigned shift-and-add multiplier (SHAM).
The general structures of the multiplier with and without the proposed technique are shown in Fig. 2. Table I shows the calculated number of required multiplication steps for multipliers with different MSPs.
For the two signed multiplication schemes, the final correction phase is an arithmetic right-shift operation, and for the unsigned scheme, it is a logical right-shift operation. The shift count is equal to the number of shifts that had to be [...]
To generate the inputs of the filters, we have selected "ringin.wav" from the media files of MS Windows™ and applied the filters to this file and its scaled-up and scaled-down versions. The maximum amplitudes for the file, its scaled-up version, and its scaled-down version were 0.7, 1.0, and 0.2, respectively. We have also applied a white noise input with an amplitude of 1.0 and a power of 1.5 watts. Therefore, we have 8 sets of data for each multiplier. Fig. 3 shows the MSP distribution of the benchmark data.
The percentages of switching activity reduction and clock tick count reduction for each circuit are given in Tables II and III, respectively. In these tables, BP and LP refer to data from the band-pass and low-pass filters, respectively. Also, NS, US and SD refer to the data derived from the non-scaled, scaled-up and scaled-down [...]
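The arithmetic right-shift correction used for the signed schemes in this section can be illustrated with a behavioral radix-4 modified Booth model. This is a sketch under stated assumptions, not the benchmark circuits themselves: the names `signed_msp`, `BOOTH_DIGIT` and `booth_multiply`, the step-count rule (half the minimal two's-complement width, rounded up), and the shift-right-by-two accumulation are hypothetical choices for this example, and `n` is assumed to be even.

```python
# Radix-4 Booth recoding table, indexed by the bit triple
# (b[2i+1], b[2i], b[2i-1]) read as a 3-bit number.
BOOTH_DIGIT = (0, 1, 1, 2, -2, -1, -1, 0)


def signed_msp(value: int) -> int:
    """Assumed signed MSP: minimal two's-complement width of the value.

    Equivalent to stripping redundant leading sign bits (leading
    zeros for non-negative values, leading ones for negative values).
    """
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def booth_multiply(a: int, b: int, n: int = 16) -> tuple[int, int]:
    """Hypothetical model of an n x n signed radix-4 Booth multiplier.

    Returns (product, executed_steps); n is assumed to be even.
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must be signed n-bit values")

    # Pre-computation phase: the operand with the lower MSP becomes the
    # multiplier; each radix-4 step consumes two multiplier bits.
    if signed_msp(a) < signed_msp(b):
        a, b = b, a
    steps = (signed_msp(b) + 1) // 2

    acc = 0
    for i in range(steps):
        triple = ((b << 1) >> (2 * i)) & 7  # implicit b[-1] = 0
        acc += BOOTH_DIGIT[triple] * (a << n)
        acc >>= 2  # Python's >> is an arithmetic shift for negatives

    # Final correction phase: one arithmetic right shift replaces the
    # shifts of all skipped steps.
    acc >>= n - 2 * steps
    return acc, steps
```

In this model the skipped steps correspond to Booth digits that are all 0, because the remaining multiplier bits merely repeat the sign bit. The arithmetic shift keeps the sign of the accumulated sum, which is why a logical shift would only be correct for the unsigned scheme.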
In this paper, we proposed a pre-computation based technique to lower the power consumption and increase the multiplication speed of sequential multipliers. The proposed technique is based on the fact that all steps at the end of a multiplication that are associated with the generation and accumulation of trivial partial products can be eliminated. This reduces the required clock tick count and the switching activity of the multiplier.
