In this paper, a new design technique for column-compression (CC) multipliers is presented.
Introduction
The speed of a computer or signal processor ALU depends to a large extent on the speed of the multiplier, and, over the last few decades, many high performance multiplier algorithms and architectures have been proposed [1-10]. Multiplier architectures can be classified into two categories: (1) linear parallel (LP) multipliers [1-6]; and (2) column-compression (CC) multipliers [7-13]. The carry-save adder (CSA) multiplier [1] is a typical LP multiplier. Its simple and regular structure is easily implemented in VLSI technology and it has become the most popular high performance multiplier architecture. The time delay of the LP multiplier is a linear function of n, the size of the multiplier, and, for large n, the speed of the multiplier is thought to be too slow for advanced signal processors. Recently, several new architectures have been proposed for the LP multiplier [3-6] which can almost double the speed and maintain, to a certain extent, the simple structure.
The principles for the CC multiplier were established by the early work of Ofman [7], Wallace [8], and Dadda [9]. It has been shown that the delay of the CC multiplier is proportional to $\log_{1.5} n$ [10], and the CC architecture is widely accepted as time optimal. The irregularity and complicated interconnections of the CC multiplier do not, however, readily allow efficient VLSI implementations, particularly for large n. A CMOS implementation of an 8 × 8 Dadda multiplier was reported in 1988 [11], and multipliers using 4-2 compressors [12, 13] and (7,3) counters [14], to improve the speed and/or regularity, were recently reported.
In the original work of Dadda [9] , many different size counters were proposed to compress the columns; however, the concentration was on column compression by (3, 2) counters (full adders) and (2,2) counters (half adders). Dadda showed that different schemes, including the scheme proposed by Wallace [8] , require different numbers of cells (counters); the optimal scheme, which requires the least number of cells, is to compress the column size so that the height of each column follows the recursion of eqn. (1) :
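$$\sigma(k) = \left\lfloor \tfrac{3}{2}\,\sigma(k-1) \right\rfloor, \qquad \sigma(0) = 2, \tag{1}$$

where $\lfloor \cdot \rfloor$ represents the floor function. The series so generated is shown in (2):

k      0  1  2  3  4  5   6   7   8   9   ...
σ(k)   2  3  4  6  9  13  19  28  42  63  ...      (2)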
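The series in (2) can be regenerated with a few lines of code. The following is a minimal sketch, assuming the recursion starts from σ(0) = 2 and multiplies by 1.5 with rounding down, which reproduces the values listed above; the function name `dadda_heights` is illustrative only.

```python
def dadda_heights(count):
    """Return the first `count` terms sigma(0), sigma(1), ... of the series."""
    heights = []
    value = 2  # assumed starting value sigma(0) = 2
    for _ in range(count):
        heights.append(value)
        value = value * 3 // 2  # floor(1.5 * sigma(k-1)) in integer arithmetic
    return heights


print(dadda_heights(10))  # [2, 3, 4, 6, 9, 13, 19, 28, 42, 63]
```

Integer arithmetic is used so that the floor is exact for any column height.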
A multiplier using such a compression scheme is normally referred to as a Dadda multiplier [10, 11]. In the compression part of the CC multiplier, adders are partitioned into stages. Except for cross-stage interconnections, the outputs of adders in one stage feed directly to the next lower stage. A natural VLSI layout of such an architecture arranges the cells in a pattern that keeps the interconnections as short as possible. For example, an 8 × 8 bit Dadda multiplier requires four stages of adders for its CC part [11]. The distribution of adders is, starting from the bottom, {12, 10, 14, 6}. If we move some adders from stage 3 to stage 4, to make the distribution of adders more even, the interconnections of those adders to stage 2, and the number of cross-stage interconnections, have to be increased. This paper discusses techniques for redistributing adder cells so that local connectivity is not sacrificed.
In order to provide a figure of merit for our designs, we introduce a metric that provides an indication of the efficiency of the silicon area for CC multiplier architecture designs. Let us treat the silicon area of an adder as a unit. Since the width of the silicon area depends on the stage which contains the largest number of adders, we define the area efficiency of the CC part of the multiplier as:
$$\text{area efficiency} = \frac{N}{K \cdot \max_{1 \le k \le K} N(k)}, \tag{3}$$

where N is the total number of (full and half) adders in the CC part of the multiplier, K is the required number of stages, and N(k) is the number of adders in stage k. With the definition given by (3), the area efficiency of an 8 × 8 bit Dadda multiplier is 42/56 = 75%, and that of a 12 × 12 bit Dadda multiplier is 73.3%.
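As a quick check of the 75% figure, the metric can be evaluated directly from a per-stage adder distribution. This is a sketch under the assumption that the occupied area is a rectangle of K rows whose width is set by the fullest stage, as described above; `area_efficiency` is a hypothetical helper name.

```python
def area_efficiency(adders_per_stage):
    """Total adders divided by (number of stages x widest stage)."""
    total = sum(adders_per_stage)
    stages = len(adders_per_stage)
    widest = max(adders_per_stage)
    return total / (stages * widest)


dadda_8x8 = [12, 10, 14, 6]  # distribution quoted for the 8 x 8 Dadda multiplier
print(f"{area_efficiency(dadda_8x8):.1%}")  # 75.0%
```

A perfectly even distribution gives an efficiency of 100%, which is why the later sections try to level the number of adders per stage.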
The CC multiplier, using the 4-2 compressor or (7,3) counter as a building cell, may improve the speed and/or structural regularity [9, 12-14]; however, problems associated with long wiring and area efficiency are exacerbated, as can be seen in Fig. 1. Fig. 1(a) shows part of a 9 × 9 bit Dadda multiplier using adders, Fig. 1(b) part of a 16 × 16 bit 4-2 compressor multiplier, and Fig. 1(c) part of a 32 × 32 bit multiplier using (7,3) counters. The number marked on each cell in Fig. 1 indicates the binary weight of that cell. The area efficiencies of the adder, 4-2 compressor and (7,3) counter multipliers are 58%, 58% and 53%, respectively.
Both the Dadda and (7,3) counter multipliers have cross-stage interconnections, but they do not have interconnections within one stage; on the other hand, the 4-2 compressor CC multiplier avoids cross-stage interconnections, but at the expense of long-wiring interconnections within each stage. The longest interconnections for the three cases are 2, 4, and 7 cell widths, respectively. Since a cell of either the 4-2 compressor or the (7,3) counter is wider than the adder cell, the actual maximum length of wiring for these two multiplier types is longer than the width ratios suggest. From Fig. 1 we may conclude that if the length of wiring and area efficiency are prominent cost function elements, then the Dadda multiplier is the most efficient of the three.
A Lower Bound on the Total Number of Adders
Dadda showed that different schemes for column compression require different numbers of adders [9]. This section determines the lower bound on that number.
For an n × n bit multiplier, the partial products of two n-bit numbers, a_i b_j for all i, j ∈ {0, 1, ..., n − 1}, form a matrix of n rows (an example for n = 8 is shown in Fig. 2a). Each row contains n elements with a shift of one element to the left from the preceding row. Thus the matrix contains n rows and 2n − 1 columns and can be alternatively represented by Fig. 2b. We assign an ascending integral index, starting from 0, to the columns from right to left. Thus the elements in the jth column possess a weight of 2^j.

All existing multipliers deal with methods to sum up the partial product matrix to obtain the final result; that is, a matrix with just one row of 2n elements. Let p(j) be the number of partial products in the jth column. From Fig. 2 it is easy to determine that:

$$p(j) = \begin{cases} j + 1, & 0 \le j \le n - 1 \\ 2n - 1 - j, & n \le j \le 2n - 2 \end{cases} \tag{4}$$

Our purpose is to reduce the size of each column to one by using full or half adders. A full adder in column j can absorb three bits in that column and a half adder absorbs two. For both adders a sum bit goes to column j and a carry bit to column j + 1. A full adder in column j will reduce the column size by two and a half adder by one, and both adders increase the size of column j + 1 by one. Clearly, full adders have priority over half adders in column reduction.
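Let e(j) be the expanded size (number of entries) of column j, which includes the partial products in column j and the carries from column j − 1:

$$e(j) = p(j) + q_0(j-1), \tag{5}$$

where q_0(j), the number of adders required to reduce column j to a single entry, is

$$q_0(j) = \left\lfloor \frac{e(j)}{2} \right\rfloor. \tag{6}$$

Column 0 contains one partial product, a_0 b_0, and therefore no adders are required for that column. Thus, q_0(0) = 0. With this initial condition, it is easy to obtain:

$$q_0(j) = \begin{cases} j, & 0 \le j \le n - 1 \\ 2n - 1 - j, & n \le j \le 2n - 1 \end{cases} \tag{7}$$

The total (necessary and sufficient) number of full and half adders to reduce the partial product matrix to a matrix with just one row is given by

$$\sum_{j=0}^{2n-1} q_0(j) = n(n-1). \tag{8}$$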
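Among these n(n − 1) adders, n are half adders, because there are n columns with an even number of entries. The final fast adder performs the last step of the summation, so the CC part only has to reduce each column to two entries and needs one adder fewer in each of columns 1 to 2n − 2. The number of adders required by the CC part is therefore given by:

$$N = n(n-1) - (2n-2) = (n-1)(n-2). \tag{9}$$

Among them, n − 1 are half adders, distributed from column 2 to n. For example, for a 12 × 12 bit CC multiplier, the required number of adders is N = 11 × 10 = 110, and 11 of them are half adders. Representing the required number of adders in column j for the CC part by q(j), clearly we have q(j) = q_0(j) − 1. Thus:

$$q(j) = \begin{cases} j - 1, & 1 \le j \le n - 1 \\ 2n - 2 - j, & n \le j \le 2n - 2 \end{cases} \tag{10}$$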
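The counts quoted in this section can be checked numerically by walking the columns from right to left. The sketch below assumes that a column with e entries is reduced to one entry by ⌊e/2⌋ adders (full adders first, plus one half adder when e is even), and that every adder sends one carry to the next column; `partial_products` and `count_adders` are illustrative names.

```python
def partial_products(n, j):
    """Number of partial products in column j of an n x n unsigned matrix."""
    if 0 <= j <= n - 1:
        return j + 1
    if n <= j <= 2 * n - 2:
        return 2 * n - 1 - j
    return 0


def count_adders(n):
    """Reduce every column to a single entry; return (total adders, half adders)."""
    total = 0
    half = 0
    carries = 0  # carries arriving from the previous column
    for j in range(2 * n):
        entries = partial_products(n, j) + carries
        adders = entries // 2  # full adders, plus one half adder if entries is even
        if entries >= 2 and entries % 2 == 0:
            half += 1
        total += adders
        carries = adders  # every adder produces one carry into column j + 1
    return total, half


for n in (8, 12):
    total, half = count_adders(n)
    cc_part = total - (2 * n - 2)  # one adder fewer in each of columns 1 .. 2n-2
    print(n, total, half, cc_part)
```

For n = 8 this gives 56 adders in total, 8 of them half adders, and 42 adders for the CC part; for n = 12 the CC part needs 110 adders, matching the example in the text.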
Constraints For Adder Allocation
As will be shown later, the restriction on column size in each stage [9] (according to the series in eqn. (2)) may be violated when we allocate adders to different stages. However, eqn. (2) is valid in determining the number of the required stages for the unsigned CC multiplier implemented by full and half adders. If the size of the multiplier, n , satisfies:
$$\sigma(K-1) < n \le \sigma(K), \tag{11}$$

then the CC part of the CC multiplier requires K stages.
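The number of stages can be found by stepping through the series until it reaches the multiplier size. This is a minimal sketch, assuming K is the smallest index for which the series value is at least n, with the series starting at 2 and growing by a factor of 1.5 rounded down; `cc_stages` is a hypothetical name.

```python
def cc_stages(n):
    """Smallest K such that the K-th term of the height series is >= n."""
    height = 2  # term 0 of the series
    stages = 0
    while height < n:
        height = height * 3 // 2  # next term: floor(1.5 * previous term)
        stages += 1
    return stages


for n in (8, 9, 12):
    print(n, cc_stages(n))
```

This yields four stages for the 8 × 8 and 9 × 9 multipliers and five stages for the 12 × 12 multiplier, in agreement with the Dadda examples quoted earlier.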
After N and K are determined, there are many ways to allocate N adders to K stages. The distribution of the N adders over the K stages strongly influences the area efficiency and the maximum length of interconnect. Since many distributions use the same number of cells, it is of great interest to find one which provides the highest area efficiency and yet minimizes the length of the in-stage interconnects and the number of cross-stage interconnects. Suppose N(k) adders are assigned to the kth stage, with q_k(j) adders being assigned to column j. Obviously:

$$N(k) = \sum_{j} q_k(j). \tag{12}$$

In each column, we assign exactly the number required by that column. Thus

$$\sum_{k=1}^{K} q_k(j) = q(j). \tag{13}$$
Each adder at column j and j − 1 in stage k produces an output in column j. The number of outputs at column j produced by adders in stage k, given by q_k(j) + q_k(j − 1), should not exceed the number of slots (bit positions which can accept inputs) at column j provided by the lower stages, which is given by:

$$s_{k-1}(j) = 2 + 3\sum_{i=1}^{k-1} q_i(j) - \sum_{i=1}^{k-1} \left[ q_i(j) + q_i(j-1) \right]. \tag{14}$$

The first term on the right hand side is provided by the fast adder; the second term is produced by all adders in the lower stages of column j; the last term is the number of slots which have been occupied by the adders already allocated at column j, and by the carries produced by adders at column j − 1, in the lower stages. Thus, we arrive at the constraint:

$$q_k(j) + q_k(j-1) \le s_{k-1}(j). \tag{15}$$

The purpose of allocating adders in each stage is to provide slots to accept inputs, provided that the total number of adders required by that column has not been used up by the lower stages. Therefore, except for those columns where no more adders are needed, the number of slots provided by stage k in column j should not be less than the number of slots provided by stage k − 1 in the same column; that is, s_{k−1}(j) ≤ s_k(j). Using eqn. (14) we find:

$$q_k(j-1) \le 2\,q_k(j). \tag{16}$$

If q_k(j) = 1, from constraint (16) we know that we have to allocate at least one adder to each column up to column J_k, the highest index the adders to be allocated possess.
Under the constraints (12), (13), (15) and (16), together with condition (10), there are many schemes available to allocate adders to different stages using the minimum number of adders; Dadda's architecture is one such scheme. In considering the area efficiency, the best choice is to allocate adders to each stage as evenly as possible. There is, however, an upper bound for the number of adders in each stage, and this is derived in the following section.
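The slot-counting argument of this section can be sketched in code. The sketch rests on the following assumptions, taken from the prose rather than from a confirmed formula: the final fast adder offers two slots per column, every allocated adder offers three input slots in its own column, and every adder occupies one slot in its own column (sum) and one in the next column (carry). The extra carry-in slot of column 1 is ignored, and `free_slots` and `stage_fits` are hypothetical helper names.

```python
def free_slots(lower_stages, j):
    """Slots still open at column j above the given lower stages."""
    slots = 2  # assumed: two inputs per column of the final fast adder
    for q in lower_stages:
        slots += 3 * q.get(j, 0)                # inputs offered by adders in column j
        slots -= q.get(j, 0) + q.get(j - 1, 0)  # sums and carries landing in column j
    return slots


def stage_fits(lower_stages, stage, columns):
    """True if the outputs of `stage` fit into the open slots of every column."""
    return all(
        stage.get(j, 0) + stage.get(j - 1, 0) <= free_slots(lower_stages, j)
        for j in columns
    )


n = 8
columns = range(1, 2 * n - 1)
stage1 = {j: 1 for j in range(2, 2 * n - 2)}  # one adder in each of columns 2 .. 2n-3
print(sum(stage1.values()), stage_fits([], stage1, columns))

crowded = dict(stage1)
crowded[5] = 2  # a second adder in column 5 overfills the fast adder
print(stage_fits([], crowded, columns))
```

For n = 8 the first stage holds 12 adders and satisfies the slot constraint, while adding a second adder to any interior column violates it.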
An Upper Bound on Adders in Each Stage
The number of adders in a specific stage reaches its maximum if no remaining adders can be put in any column. The maximum number of adders in stage k, N_max(k), can be derived from the constraints in section 3. We start from stage 1.

The final fast adder provides two slots to each column ranging from 2 to 2n − 2, and three slots, including the carry-in, for column 1. We have already shown that the adders for the CC part start from column 2. Constraint (15) tells us that if we allocate an adder to column 2, we have to allocate adders to every other column. The only choice, then, is to allocate one adder to each column. In this case, the slots on columns 3 to 2n − 3 of the final fast adder will be fully occupied, and thus the number of adders in stage 1 reaches the maximum, N_max(1) = 2n − 4.

The adders in stage 1 provide three slots for columns ranging from 2 to 2n − 2; therefore, we can allocate three adders to each pair of two adjacent columns. However, there is only one adder remaining not allocated on columns 3 and 2n − 3, and zero adders remain in columns 2 and 2n − 2. Therefore, we can only allocate two adders to every other column ranging from column 4 to 2n − 4. For the other columns we can allocate only one adder to each column. In this way the number of adders at stage 2 reaches the maximum, N_max(2) = 3n − 10.
The number of adders that we can allocate to stage 3 relates to the number of adders we have already allocated to the lower stages. In the same manner, we can determine N_max(k) for the other stages. Table 1 shows the results.
For comparison, Table 1 also shows the number of adders for each stage of Dadda's scheme.
The fourth column, ∑, is the number of adders contained in the first k stages that allows N_max(k + 1) adders to be allocated to stage k + 1. In the fifth column of Table 1, we show the condition for calculating the numbers of adders in columns 2 and 3 of the table.

Table 1 The Maximum Number of Adders for Each Stage

... is given by the fourth column of Table 1; we assume N(0) = 0. ... adders, we go to step 3 and repeat steps 3-5, until an appropriate N_k is obtained.
Since we allocate to each stage either the maximum number of adders it can contain, or a number of adders which do not exceed the appropriate average number, the above algorithm is guaranteed to maximize the area efficiency.
An 8x8 Multiplier Example
Let us use an 8 × 8 multiplier as an example. From the series (2) and equation (9), K = 4 and N = 42, and the average number of adders for each stage:
$$N_0 = \left\lceil \frac{N}{K} \right\rceil = \left\lceil \frac{42}{4} \right\rceil = 11.$$

Since N_max(1) = 12 > N_0, we thus allocate at most 11 adders to each stage.
Next, we distribute the adders of the CC part to the different stages; this is a heuristic procedure. For our case, let the distribution be {11, 11, 11, 9} for stages 1 through 4. Although this is a somewhat arbitrary distribution, we find that there are not many other choices. For example, it is extremely difficult to allocate fewer than 11 adders to each of the first two stages. As soon as we assign 22 adders to the first two stages, there are only three possibilities to assign the remaining 20 adders to the last two stages while limiting the maximum number of adders in each stage to 11. The three possible distributions for the last two stages are: {11, 9}, {10, 10}, and {9, 11}.

The allocation of adders to different stages and columns is facilitated by constructing a table such as that shown in Table 2. At the bottom of Table 2 ... In order to make the number of inputs to each column an ascending sequence, we need to allocate at least one adder to each column within constraint (15). The allocation of adders to the first two stages, shown in Table 2, is the only possible choice, based on non-violation of this constraint.

Table 2 An example distribution of adders for an 8 × 8 bit CC multiplier (column index runs from 14 down to 1)
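The three candidate distributions for the last two stages can be enumerated mechanically. This is a small sketch, assuming the per-stage cap is the average N/K rounded up (which gives the 11 used in the text); `average_adders` and `split_remaining` are illustrative names, and the enumeration ignores the column-level constraints.

```python
import math


def average_adders(total, stages):
    """Per-stage cap, assumed here to be total / stages rounded up."""
    return math.ceil(total / stages)


def split_remaining(remaining, stages, cap):
    """All ordered ways to place `remaining` adders in `stages` stages of at most `cap`."""
    if stages == 1:
        return [(remaining,)] if 0 <= remaining <= cap else []
    return [
        (first,) + rest
        for first in range(cap, -1, -1)
        for rest in split_remaining(remaining - first, stages - 1, cap)
    ]


cap = average_adders(42, 4)  # 11 for the 8 x 8 multiplier
print(split_remaining(42 - 2 * cap, 2, cap))  # [(11, 9), (10, 10), (9, 11)]
```

Only the stage totals are enumerated here; whether a given total can actually be placed column by column still has to be checked against the allocation constraints.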
We may also determine the number of cross-stage interconnections from Table 2. Stage 3 produces 4 outputs in each of columns 8 and 9, and the single adder in each of these two columns of stage 2 can only absorb three of them. This results in two cross-stage interconnections in these columns. In addition, there is no adder with index 2 on the first stage. The sum output of the adder in column 2 of stage 2 will be connected to the fast adder, crossing stage 1, and increasing the number of cross-stage interconnections to 3 for this multiplier.
Next, we construct a table with K rows and N 0 columns to represent the layout pattern of the CC part of the multiplier; each block represents a unit area of silicon which can physically accommodate an adder, and the blocks contain the binary weight of the adders (Fig. 4) . In order to measure the length of interconnections, we refer to each adder position by its table coordinate; for instance, adder (11,2) has weight 2. We use the cell width as the length unit.
We assume the layout of an adder is a square with inputs into the top and outputs out of the bottom. The connection of two adjacent adders (e.g., adder (3,3), weighted 9, to adder (3,2), weighted 10) will be counted as a length 0 connection. The length of the connection between adders (i, j) and (k, l) is thus given by |i − k| + |j − l| − 1 cell widths. The longest interconnection is 3 cell widths, from the carry of (11,3) to the input of (8,2). The area efficiency is 42/44 = 95.5%, which is the maximum that we can reach. There are 3 cross-stage interconnections. The architecture given in Fig. 5 is therefore more efficient than the Dadda multiplier of reference [11], both in terms of the area efficiency measure and in the maximum length of interconnection. Of the many designs that we have tried, the architecture given by Fig. 5 has the highest area efficiency, based on our cost measures.

Footnote 4: We have found several errors in this figure.
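The interconnection length measure can be written as a one-line function. The sketch assumes the distance is the sum of the absolute coordinate differences minus one, which reproduces both examples in the text (length 0 for adjacent adders and length 3 for the longest connection); `connection_length` is a hypothetical name.

```python
def connection_length(source, target):
    """Wire length in cell widths between two adders given as (column, row) coordinates."""
    (i, j), (k, l) = source, target
    return abs(i - k) + abs(j - l) - 1


print(connection_length((3, 3), (3, 2)))   # adjacent adders: 0
print(connection_length((11, 3), (8, 2)))  # longest connection in the example: 3
```

Applying this function to every connection of a candidate layout and taking the maximum gives the maximum interconnection length used to compare designs.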
An improved architecture
We can trade area efficiency for maximum interconnection length. Fig. 6 shows an improved architecture in this regard. The area efficiency of this architecture is 42/48 = 87.5%, 8 percentage points lower than that of the architecture given by Fig. 5; however, all cross-stage interconnections have been eliminated, and all interconnections are either to the nearest or to the next nearest neighbor. Thus, the maximum length of interconnection reduces from 3 to 1 cell widths. We feel that these advantages are well worth the reduction in area efficiency.

For either Dadda's scheme or the scheme given in the last section, the length of the final fast adder is 2(n − 1), which is always larger than the maximum number of adders for the first stage of the CC part, 2(n − 2). We can shorten the final fast adder to raise the area efficiency, since each stage can reduce the length of the final adder by one bit, with a total reduction of, at most, K bits. We will refer to the extreme case, i.e., using an adder of length 2n − 2 − K for the last adder, as Approach II. The maximum number of adders and N_m(k) for Approach II are listed in Table 4. The number of adders that the CC part contains is
The allocation procedure is the same as described in the previous section. In some cases, it may be more advantageous to reduce the length of the final adder by a number less than K ; examples of these cases will be shown later.
As shown in Table 4 , if K ≥ 2 , the maximum number of adders for the first two stages of the CC part of the multiplier for Approach II is less than that of Approach I. Thus a large number of adders have to be allocated to the higher stages and so the area efficiency will be reduced.
We have found that Approach II is very suitable for multipliers whose length is n ≤ 11.
Except for 9 × 9 bit and 11 × 11 bit multipliers, we have found that all cross-stage interconnections can be eliminated by both approaches. Table 3 lists the maximum number of adders for each stage using Approach II for short length multipliers. Fig. 7 shows the design of the 8 × 8 multiplier using Approach II.

Table 3 The Maximum Number of Adders for Each Stage for Approach II

Fig. 7 8 × 8 bit CC multiplier designed using Approach II

As with the design in Fig. 6, all interconnections are either to the nearest or to the next nearest neighbor and there are no cross-stage interconnections. The new design raises the area efficiency from 87.5% to 97.5%. In addition, the length of the final adder has been reduced from 14 to 10.
Design of Two's Complement Multipliers
The technique developed in the previous sections can also be applied to the design of two's complement multipliers. The two's complement multiplication algorithm, developed by Baugh and Wooley [15], and improved by Blankenship [16], is adopted, and we limit our discussion to n × n multipliers. The principles can be readily applied to non-square multiplier designs.
The partial product matrix for an 8 × 8 two's complement multiplier, according to [15] and [16], is shown in Fig. 8. Comparing Fig. 2 with this matrix, we find that three more partial products are required by the two's complement multiplier. One of them is located at column 2n − 1, which is identical to the partial product at column 2n − 2, and is generated by the INCLUSIVE OR [16]. The other two extra partial products, located at column n − 1, are generated by copying the most significant bits of the multiplicand and the multiplier [15]. Except for these four partial products, all other partial products are generated by AND gates.
The maximum column size of an n × n two's complement multiplier is n + 2 , two more than the maximum column size of an unsigned multiplier. If we strictly follow series (2) , then the size of the multiplier, n, is related to the required number of stages, K, by:
$$\sigma(K-1) < n + 2 \le \sigma(K). \tag{18}$$

For example, the maximum column size of an 8 × 8 two's complement multiplier is 10, and following (2), the CC part of the multiplier would require five stages.
Since the size of the largest column is three larger than the size of its two neighboring columns, we can replace (18) by:
For example, the column part of an 8 × 8 two's complement multiplier can be implemented in four stages, as will be shown later.
Following the same procedure given in section 2, an n × n bit two's complement multiplier requires n^2 − n + 1 adders, one more than the conventional unsigned multiplier, together with an EXCLUSIVE OR (XOR), which is located at column 2n − 1. Since at least one of the two entries in column 2n − 1 is zero, there is no carry to propagate to column 2n, and an XOR suffices to produce the most significant bit at column 2n − 1. Among the n^2 − n + 1 adders, there are n − 1 half adders from column 1 to column n − 1. The number of adders required by column j is given by
We now allocate adders to different stages observing the constraints. The XOR in column 2n − 1 can be included in the final fast adder. In the CC part of the two's complement multiplier only one more adder, at column n − 1, needs to be allocated compared to the unsigned multiplier. One simple way is to start from the architecture of the unsigned multiplier and include the extra adder in column n − 1. For example, starting from Fig. 6 (Approach I), we obtain the 8 × 8 bit two's complement multiplier architecture shown in Fig. 9, where the underlined number indicates the complement. For example, 07 stands for a_0 b_7, 73 stands for a_7 b_3, etc. Because of the additional XOR, the length of the final adder is one bit longer than that for the unsigned multiplier. Most features of the unsigned multiplier design of Fig. 6 are retained with the addition of three intra-stage interconnections that are not to the nearest or next nearest neighbor.

Fig. 9 Design of 8 × 8 bit two's complement CC multiplier with Approach I

In designing the 8 × 8 two's complement CC multiplier using Approach II, the additional adder in column 7 can considerably reduce the area efficiency if the length of the final adder is only reduced by 4 (the number of CC stages). In order to achieve a higher area efficiency, we can reduce the final adder by an extra bit, as shown by Fig. 10. The area efficiency for the CC part of this multiplier is 46/48 = 95.8%, while the length of the final adder is only 3 bits shorter than the final adder of Fig. 9. Again, except for three intra-stage interconnections that are not to the nearest or next nearest neighbor, all features of the multiplier given by Fig. 7 are retained.
Conclusions
In conclusion, we have presented two approaches to the design of CC multipliers for both unsigned and two's complement multiplication. Our approaches do not use classical restrictions on column height sequence, and provide flexibility for distribution of adders to different stages. The number of adders used for our first approach is identical to that obtained using classical design techniques, but our technique allows more flexibility in adder placement and yields higher area efficiency. The second approach yields, in most cases for short length multipliers, higher area efficiency by reducing the length of the final adder, while maintaining the major characteristics of the first approach. In comparison with conventional design techniques, both of our design approaches increase the regularity and reduce interconnection lengths of the multiplier layout.
Acknowledgments:
Glossary of Terms

Table Captions

Table 1 The Maximum Number of Adders for Each Stage
Table 2 An example distribution of adders for an 8 × 8 bit CC multiplier
Table 3 The Maximum Number of Adders for Each Stage for Approach II
