A highly regular parallel multiplier architecture is presented, along with novel low-power, high-performance CMOS implementation circuits. The superiority is achieved by combining a unique scheme for recursive decomposition of partial product matrices with a recently proposed non-binary arithmetic logic and complementary shift switch logic circuits.
INTRODUCTION
The traditional designs of parallel (array) multipliers [1, 3-6, 17] mainly rely on fast (3, 2) and (4, 2) parallel counter circuits for high speed. However, these approaches have the following problems, which hinder high overall VLSI performance in larger (say 64 × 64-b) high-speed multipliers: (1) the design irregularity inherited from the bit reduction of a large partial product matrix (even with Booth recoding); (2) the load/wire imbalance caused by the unbalanced column heights of the partial product matrices generated in many (5 to 10) reduction stages; (3) quite large power dissipation.
* The work was supported, in part, by National Science Foundation under grant CCR-0073469.
In this paper we propose a highly regular parallel multiplier design based on a recently proposed unique decomposition approach for partial product matrix reductions [10]. The proposed 64 × 64-b parallel multiplier shows the following distinct features: (1) Distributing input bits to 64 locations using a full 4-branch tree structure, then generating an 8 × 8-b partial product matrix at each location, instead of a single large one as commonly adopted by existing designs (including those with Booth recoding). (2) Comprising only four stages of bit reductions (each corresponding to a sub-multiplication module):
First, by 64 identical 8 × 8-b small parallel multipliers. Second, by 16 identical arrays of (6, 2) shift switch parallel counters. And for the remaining two stages, by 4
Though the novel multiplier may be implemented using any existing small (say 8 × 8-b) multipliers and small parallel counters (say traditional half and full adders and (4, 2) counters [4, 5]), a family of shift switch counters and variants, including non-binary 4-bit signal based (6, 3)* and complementary (k, 2), 2 ≤ k ≤ 8, counters (both will be defined shortly below), is adopted to achieve low power dissipation while keeping high VLSI performance in speed and area. The recently proposed shift switch logic circuits [8-12]
(3, 2) counters in parallel to result in two numbers (note that the two numbers are not added until the final stage) as the virtual product of the 8 × 8 multiplier (Fig. 1d).
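The decomposition outlined in the introduction, in which the input bits are distributed through a full 4-branch tree to 64 small 8 × 8-b multipliers whose results are then merged in a few stages, can be illustrated with a minimal behavioural sketch. This is not the circuit of the paper: the function names `tree_multiply` and `multiply_64` are hypothetical, ordinary integer multiplication stands in for each 8 × 8-b module, and ordinary addition stands in for the counter arrays (so the two-number form that is kept unadded until the final stage is not modelled).

```python
LEAF_BITS = 8  # width of the small sub-multipliers (8 x 8-b modules)


def tree_multiply(x, y, n, stats):
    """Multiply two n-bit integers by recursive 4-branch decomposition.

    Illustrative model only: each leaf uses Python's '*' in place of a
    small 8 x 8-b multiplier, and '+' replaces the counter arrays.
    """
    if n == LEAF_BITS:
        stats["leaves"] += 1
        return x * y
    h = n // 2
    mask = (1 << h) - 1
    x_lo, x_hi = x & mask, x >> h
    y_lo, y_hi = y & mask, y >> h
    # Four branches: one sub-product per quadrant of the partial product matrix.
    ll = tree_multiply(x_lo, y_lo, h, stats)
    lh = tree_multiply(x_lo, y_hi, h, stats)
    hl = tree_multiply(x_hi, y_lo, h, stats)
    hh = tree_multiply(x_hi, y_hi, h, stats)
    # Record how many merge modules exist at this operand width.
    stats["merges"][n] = stats["merges"].get(n, 0) + 1
    return ll + ((lh + hl) << h) + (hh << n)


def multiply_64(x, y):
    """Return (product, number of leaf multipliers, merges per width)."""
    stats = {"leaves": 0, "merges": {}}
    product = tree_multiply(x, y, 64, stats)
    return product, stats["leaves"], stats["merges"]


if __name__ == "__main__":
    p, leaves, merges = multiply_64(0xFEDCBA9876543210, 0x0123456789ABCDEF)
    print(hex(p), leaves, merges)
```

In this model the recursion reaches 64 leaves and performs 16, 4 and 1 merges at operand widths 16, 32 and 64, which is consistent with the four-stage structure described in the text, although the mapping of merges to the actual counter arrays is an assumption of the sketch.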
To simplify the summation as illustrated in Figure 2 represents a useful order that we call square order.
It is also easy to verify that the process implements the right part of the following algebraic equation: .. s12 s12" s11 s11" s10 s10" s9 s9" s8 s8" s7 s7" s6 s6" s5 s5" s4 s3 .. s0
We apply the re-positioning recursively onto a larger partial product matrix as shown in Figure 3. In Figure 3a the original partial product matrix A", produced by two 16-b numbers X (plain) and Y (bold), is decomposed into two levels of square sub-matrices. In Figure 3b
The full 4-branch complete binary tree of inputs (3 levels are shown).
sub-matrices of the decomposed partial product matrix is a full 4-branch tree of 2 levels with better load/wire balance compared to traditional approaches. Figure 4 illustrates the full 4-branch tree distribution of two 64-bit inputs X and Y to the partial product matrices in 4 levels (levels 1, 2 and 4 are shown).
Figure 7b, not the one in Figure 7a. Note that other forms of 4-bit shift switch parallel counters may be obtained through slight modification of the two proposed circuits for other specific purposes (refer to [9-12]).
Figure 9 shows (k, 2) complementary counters for k = 3, 4 and 6. The (6, 2) parallel counter of Figure 9d includes
An alternative scheme is to use a non-binary parallel counter (6, 3)* of Figure 7, plus a complementary (3, 2) counter of Figure 9a or its variants, in each column to reduce 6 input bits into 2.
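The alternative scheme above, a (6, 3) counter followed by a (3, 2) counter in each column, can be checked arithmetically with a small behavioural sketch. It is a plain binary model under stated assumptions: the 4-bit state signals and the complementary circuits of Figures 7 and 9 are not modelled, the function names are hypothetical, and the routing of the three (6, 3) output bits to columns j, j+1 and j+2 follows the usual carry-save convention rather than any wiring given in the paper.

```python
def counter_6_3(bits):
    """(6, 3) counter: encode the number of 1s among six bits as 3 binary bits."""
    c = sum(bits)  # 0..6
    return c & 1, (c >> 1) & 1, (c >> 2) & 1


def counter_3_2(a, b, c):
    """(3, 2) counter (full adder): returns (sum bit, carry bit)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def reduce_six_rows(rows, width):
    """Reduce six width-bit rows to two rows whose sum equals the rows' sum."""
    assert len(rows) == 6
    # Stage 1: one (6, 3) counter per column; its outputs have weights
    # 2^j, 2^(j+1) and 2^(j+2), forming three intermediate rows.
    mid = [0, 0, 0]
    for j in range(width):
        column = [(r >> j) & 1 for r in rows]
        c0, c1, c2 = counter_6_3(column)
        mid[0] |= c0 << j
        mid[1] |= c1 << (j + 1)
        mid[2] |= c2 << (j + 2)
    # Stage 2: one (3, 2) counter per column of the three intermediate rows.
    sum_row = carry_row = 0
    for j in range(width + 3):
        a, b, c = ((m >> j) & 1 for m in mid)
        s, carry = counter_3_2(a, b, c)
        sum_row |= s << j
        carry_row |= carry << (j + 1)
    return sum_row, carry_row


if __name__ == "__main__":
    rows = [0b101101, 0b111111, 0b000001, 0b110011, 0b011110, 0b100000]
    s, c = reduce_six_rows(rows, 6)
    print(s + c == sum(rows))
```

The two returned rows are left unadded, in the spirit of the text's remark that the two numbers are not summed until the final stage.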
A 4-bit state signal as shown in Figure 7 represents a decoded form of a binary number with an integer value between 0 and 3. In Figure 7, the initial 4-bit state signal X, formed by bits x0, x1, x2, x3 of the (6, 3)* counter, has a value equal to i1 + i2 + i3; note that the unique bit of X is a level-swing signal and will be restored later.
In this section we characterize the low-power nature of the proposed non-binary arithmetic circuits. Since the logical superiority of the circuits for low power dissipation may be best captured by the typical (6, 3)* parallel counter illustrated in Figure 7b, we redraw the circuit in Figure 10, focusing on the power dissipation activity occurring along the signal paths.
As addressed in [2], the four sources of power dissipation in digital CMOS circuits can be summarized as: (1)
Figure 7b is shown in Figure 10.
[9, 11, 13]) and possess several advantages for low power dissipation as described above. Tables I and II show the circuit simulation results for the critical paths of the 8 × 8 multipliers and the related parallel counters, respectively.
Figure 7b, column B is for the path using the complementary (k, 2) parallel counter of Figure 9, and column C is for the path using the (4, 2) parallel counter of [5]. (2)
