There has been recent awareness of the drastic effects of interconnect delay in VLSI implementations, and several investigations focused on this problem have been linked directly to multiplier structures. The tree, or column compression techniques, used for partial product reduction have the severe impediment of highly irregular interconnections. A digital multiplier architecture will be presented in this paper that alleviates some of the problems associated with interconnect scaling, in addition to allowing for simple variable precision reconfiguration.
INTRODUCTION
The recent growth in microprocessor performance has been a direct result of designers exploiting decreasing device feature sizes while at the same time deepening pipelines. As transistor sizes continue to shrink, the traditional gains associated with smaller feature sizes will be degraded by the adverse effects of wire scaling. There has been recent awareness of the drastic effects of interconnect delay in VLSI implementations [1-4], and several investigations focused on this problem have been linked directly to multiplier structures.
The tree, or column compression, techniques used for partial product reduction, as described by Wallace and Dadda [6, 7], have the severe impediment of highly irregular interconnections. This arises from the unique union of the counter blocks in each column corresponding to each partial product weight. The architecture proposed in this paper is centred on a recursive multiplication algorithm by Danysh and Swartzlander [11]. The authors present a multiplication algorithm based on divide-and-conquer methodologies, similar to the Karatsuba-Ofman algorithm, that introduces greater regularity in design than standard column compression multipliers while avoiding the linear latency of array multipliers.
Recent studies [2-4] have examined the consequences of technology scaling on arithmetic circuitry. These investigations strongly support the need to consider interconnect layout as an integral part of future arithmetic circuitry. The predominant advantage offered by the recursive multiplication scheme [11] is the use of smaller multipliers to implement a larger operation, which is directly in line with the results of these studies.
This structure promotes the notion of exploiting locally optimized arrays for reduced interconnect power through shorter local interconnects, and a more regular integration of the sub-components on a larger scale.
In the subsequent sections of this paper, we reintroduce the recursive multiplier algorithm; its implementation in a reconfigurable architecture concludes the discussion.
RECURSIVE MULTIPLIER
The recursive multiplier scheme works by executing an n-bit multiplication using 4 n/2-bit multipliers in parallel and adding up the results. The n/2-bit multipliers may be further reduced, where each sub-multiplier carries out 4 parallel n/4-bit multiplications, and so forth. In this manner a large multiplication is carried out using recursions of simpler base multiplier modules.
Mathematically, the recursive algorithm may be proved by first considering two unsigned n-bit operands, the multiplier X and multiplicand A:
By dividing each of the two operands into 2 m-bit values, where m = n/2, we obtain:
X and A may now be redefined as:
The overall multiplication of A and X is given by:
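Written out in standard positional notation, the four steps above take the following form. The bit symbols x_i and a_i and the subscripts H and L for the high-order and low-order halves are notational assumptions introduced here, not necessarily the symbols of the original derivation.

```latex
\begin{align}
X &= \sum_{i=0}^{n-1} x_i 2^i, & A &= \sum_{i=0}^{n-1} a_i 2^i \\
X_H &= \sum_{i=0}^{m-1} x_{i+m} 2^i, & X_L &= \sum_{i=0}^{m-1} x_i 2^i
  \qquad (\text{and likewise } A_H,\ A_L) \\
X &= X_H 2^m + X_L, & A &= A_H 2^m + A_L \\
A \cdot X &= A_H X_H 2^{2m} + (A_H X_L + A_L X_H)\, 2^m + A_L X_L
\end{align}
```

Since m = n/2, the weight 2^{2m} of the high-order product equals 2^n.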
Therefore, the overall multiplication may be reduced to four smaller multiplications, and this process may be repeated using even smaller base multipliers. In order to minimize the delay introduced by subdividing the process, the results of the base multipliers, or the intermediary products, will be kept in carry save form; hence only one final fast adder will be required to yield the final product.
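The reduction to four smaller multiplications can be sketched behaviourally in software. This is only an arithmetic model under stated assumptions: the function name `recursive_multiply` and the parameter `base_bits` are hypothetical, the built-in `*` stands in for a hardware base multiplier, and ordinary addition stands in for the carry save accumulation and final fast adder.

```python
def recursive_multiply(a: int, x: int, n: int, base_bits: int) -> int:
    """Multiply two unsigned n-bit integers by recursive halving.

    Assumes n and base_bits are powers of two with base_bits <= n.
    """
    if n <= base_bits:
        # Stand-in for a hardware base multiplier (e.g. a column compression tree).
        return a * x

    m = n // 2
    mask = (1 << m) - 1
    a_hi, a_lo = a >> m, a & mask
    x_hi, x_lo = x >> m, x & mask

    # Four half-width products; they are independent of one another,
    # so hardware can form them in parallel.
    hh = recursive_multiply(a_hi, x_hi, m, base_bits)
    hl = recursive_multiply(a_hi, x_lo, m, base_bits)
    lh = recursive_multiply(a_lo, x_hi, m, base_bits)
    ll = recursive_multiply(a_lo, x_lo, m, base_bits)

    # Weighted sum of the intermediary products.
    return (hh << (2 * m)) + ((hl + lh) << m) + ll
```

Calling `recursive_multiply(a, x, 64, 32)` models a single level of recursion, while `base_bits=16` models two levels.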
Fig. 1. Single level recursive multiplication
Each of the 4 n-bit intermediary products in carry save format will occupy a given series of bit positions as outlined in Fig. 1. It becomes apparent that there will be 3 intermediary products that overlap from bit (n/2 - 1) to (3n/2 - 1). Since each product is held as a sum and a carry vector, this leads to 6 bits per column that must be reduced to 2 to provide one final product in carry save form. A 6:2 reduction scheme has been proposed for the recursive multiplier [11, 12], which introduces at most a delay equivalent to three full adders. The reduction circuit is formed by an interconnection of reduction sub-blocks, each composed of a chain of full adders generating a two-bit output value, along with inter-block carry signals that propagate laterally along the reduction sub-block array.
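The bound of three full adder delays can be illustrated with a generic tree of 3:2 carry save adders. This is a minimal sketch, not the sub-block circuit of [11, 12]: the function names `csa` and `reduce_6_to_2` are hypothetical, and Python integers model whole bit vectors, so each bitwise operation corresponds to one full adder per column.

```python
from typing import Sequence, Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """3:2 carry save adder: one full adder per bit, no carry propagation."""
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1
    return sum_bits, carry_bits


def reduce_6_to_2(operands: Sequence[int]) -> Tuple[int, int]:
    """Reduce six vectors to a (sum, carry) pair in three full adder levels."""
    a, b, c, d, e, f = operands
    s1, c1 = csa(a, b, c)        # level 1
    s2, c2 = csa(d, e, f)        # level 1, in parallel with the line above
    s3, c3 = csa(s1, c1, s2)     # level 2: four vectors become three
    return csa(s3, c3, c2)       # level 3: three vectors become two
```

The returned pair sums to the same value as the six inputs, and no signal passes through more than three full adders on its way to the output.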
A modified and fully optimized version of the reduction scheme is introduced in Fig. 2. The 6-input sub-blocks exist from bit position (n/2 - 1) to (3n/2 - 1), followed by a two-input reduction cell, a full adder, and a series of half adders for the remaining bits.
To examine the optimal size of the base multiplier required, and consequently the number of recursions, it is imperative to examine the relationship between base multiplier size and the associated delay. Assuming that each full adder has a gate delay of 3 [11], that 1 gate delay is required for the initial partial product generation, and that there are $2(\log_2 b - 1)$ stages required for a Dadda multiplier up to 64 bits, we will then expect a delay of:
If n is the number of bits of the overall multiplier, and b the number of bits of the base sub-multiplier, then the number of recursions required is:
Since each recursion will require a 6:2 reduction stage, corresponding to 9 gate delays, the delay relationship for the overall scheme will be:
$$D_{\text{recursive}} = 9 \log_2\left(\frac{n}{b}\right)$$
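Under the assumptions stated above, the delay relations can be written out as follows. The symbols D_Dadda, r, and D_total are labels assumed here, and treating the total as the base multiplier delay plus the reduction overhead is likewise an assumption, although it is consistent with the comparison against a non-recursive Dadda tree made in this section.

```latex
\begin{align}
D_{\mathrm{Dadda}}(b) &= 1 + 3 \cdot 2(\log_2 b - 1) = 6 \log_2 b - 5 \\
r &= \log_2\!\left(\frac{n}{b}\right) \\
D_{\mathrm{recursive}} &= 9\, r = 9 \log_2\!\left(\frac{n}{b}\right) \\
D_{\mathrm{total}} &= D_{\mathrm{Dadda}}(b) + D_{\mathrm{recursive}}
\end{align}
```

Each halving of b removes 6 gate delays from the base multiplier but adds 9 for the extra reduction stage, so every additional level of recursion costs a net 3 gate delays.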
Dadda multipliers as a function of overall delay
A three-dimensional plot depicting overall delay as a function of both base and recursive multiplier size is given in Fig. 3. It becomes evident that the most efficient recursive multiplier corresponds to only one level of recursion, independent of overall multiplier size. This in turn corresponds to adding at most one full adder delay to that of a typical non-recursive Dadda tree. Furthermore, the recursive architecture using column compression base multipliers demonstrates considerable performance gains over an array multiplier.
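The claim that one level of recursion is the most efficient can be checked numerically from the stated gate delay assumptions. The sketch below is a simple model, not the data behind Fig. 3: the function names are hypothetical, the final fast adder is excluded, and the total is assumed to be the base Dadda delay plus 9 gate delays per recursion.

```python
FULL_ADDER_DELAY = 3      # gate delays per full adder, as assumed in the text
PARTIAL_PRODUCT_DELAY = 1 # gate delay for partial product generation
REDUCTION_DELAY = 9       # one 6:2 reduction stage (three full adders)


def log2_exact(value: int) -> int:
    """Base-2 logarithm of a power of two."""
    return value.bit_length() - 1


def dadda_delay(b: int) -> int:
    """Gate delay of a b-bit Dadda multiplier with 2(log2(b) - 1) stages."""
    stages = 2 * (log2_exact(b) - 1)
    return PARTIAL_PRODUCT_DELAY + FULL_ADDER_DELAY * stages


def recursive_delay(n: int, b: int) -> int:
    """Gate delay of an n-bit recursive multiplier built from b-bit bases."""
    levels = log2_exact(n // b)
    return dadda_delay(b) + REDUCTION_DELAY * levels


for base in (64, 32, 16, 8):
    print(base, recursive_delay(64, base))
# prints: 64 31, 32 34, 16 37, 8 40 (one pair per line)
```

For a 64-bit multiplier, the single-level design with 32-bit bases comes out 3 gate delays (one full adder) slower than the flat Dadda tree, and each further level adds another 3.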
RECONFIGURABLE MULTIPLIER ARCHITECTURE
The intent of the reconfigurable architecture is to provide a means by which the performance of arithmetic hardware may be enhanced according to the desired function. For example, many modern DSP chips offer variable precision [13, 14] or fault-tolerant arithmetic implemented using software [15]. The operation of these devices may be improved if such functions were executed directly in hardware. Since multiplication is considered to be the dominant computation in most DSP algorithms, the formulation of double-precision multiplication using iterative single-precision operations is an inefficient compromise to variable precision arithmetic. The multiplication architecture presented encompasses the concepts of fault-tolerant computing, low-power design, and high-throughput arithmetic in one design. Although variable precision arithmetic has been suggested in the past [5], the focus for the most part has been on FPGA implementations. Such techniques do indeed offer a considerable performance edge without the need for additional software overhead; however, dedicated application-specific hardware offers the potential for increased savings in resources, power and latency. A reconfigurable fixed hardware parallel inner processor has been suggested [16]. However, the complexity associated with the multiple levels recommended, as a result of the small base multiplier sizes, is in direct contrast to the derivation of optimum delay for recursive structures presented above.
Fig. 4. The Reconfigurable Multiplier Architecture
The proposed design offers four modes of operation, without the necessity to completely reconfigure the internal layout of a programmable device. The recursive multiplier architecture with one level of recursion will be used as the foundation for the reconfigurable architecture (Fig. 4). In addition to the basic multiplier, a series of multiplexers will be used to guide the signal flow through the device. Since all of the necessary components for each mode of operation are present in the design, there will be no reconfiguration time required, enabling the device to switch between modes of operation in real time.
The proposed scheme utilizes a 2-bit control signal to select one of four modes of operation:
DOUBLE PRECISION MODE:
The default double precision mode is simply a recursive multiplier with one level of recursion. Having a single recursion stage makes this multiplier the most efficient recursive multiplier scheme. The regularity of the design is increased, and the interconnect complexity compared to a typical tree multiplier of the same size is substantially reduced.
SINGLE PRECISION MODE:
Single precision mode uses gating techniques to shut down three of the base multipliers, effectively disabling over 75% of the circuit, including the reduction circuitry and the majority voter. The effect of this mode of operation is similar to that of clock gating techniques employed in low power design, which are used to black out idle portions of a circuit. Moreover, the overall latency now becomes that of the base multiplier, allowing faster operation in single precision mode than would be possible if the entire circuit were active.
DUAL SINGLE PRECISION MODE:
The input bus may have two sets of operands entering the circuit concurrently (SIMD), according to the high and low order bits. With two of the base multipliers inactive, the remaining two multipliers may operate in parallel on two different sets of operands, effectively doubling the system throughput, with the latency of a single precision multiplier.
SINGLE PRECISION FAULT-TOLERANT MODE:
Although there are numerous methods of implementing fault tolerance in digital systems, one of the most basic methods is majority voting between three duplicate values. Since this scheme is composed of four identical base multipliers, three of those may be used in conjunction with an array of 64 XOR gates and 2:1 MUX cells to form a single precision fault-tolerant multiplier.
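The four modes can be summarised in a small behavioural model. Everything specific in this sketch is an assumption made for illustration: the 2-bit encoding in `Mode`, the choice of which operand halves feed which base multiplier, the packing of the two dual-mode products into one output word, and the function names. The voter uses one common way to build a majority function from an XOR and a 2:1 multiplexer per bit, which may differ from the actual circuit.

```python
from enum import IntEnum

N = 64                   # overall operand width, matching the 64-bit design
M = N // 2               # operand width of one base multiplier
LOW = (1 << M) - 1       # mask selecting one M-bit operand
PRODUCT = (1 << N) - 1   # mask selecting one N-bit base product


class Mode(IntEnum):
    """Hypothetical encoding of the 2-bit control signal."""
    DOUBLE = 0b00
    SINGLE = 0b01
    DUAL_SINGLE = 0b10
    FAULT_TOLERANT = 0b11


def base_multiply(a: int, x: int) -> int:
    """Stand-in for one M-bit base multiplier."""
    return (a & LOW) * (x & LOW)


def vote(p0: int, p1: int, p2: int) -> int:
    """Bitwise majority of three products: one XOR and one 2:1 mux per bit."""
    differ = p0 ^ p1  # 1 wherever the first two copies disagree
    # Where they agree, pass p0; where they differ, p2 decides the majority.
    return ((p0 & ~differ) | (p2 & differ)) & PRODUCT


def multiply(mode: Mode, a: int, x: int) -> int:
    a_hi, a_lo = a >> M, a & LOW
    x_hi, x_lo = x >> M, x & LOW
    if mode == Mode.DOUBLE:
        # All four base multipliers active, one level of recursion.
        return ((base_multiply(a_hi, x_hi) << N)
                + ((base_multiply(a_hi, x_lo) + base_multiply(a_lo, x_hi)) << M)
                + base_multiply(a_lo, x_lo))
    if mode == Mode.SINGLE:
        # One base multiplier active; the rest of the circuit is gated off.
        return base_multiply(a_lo, x_lo)
    if mode == Mode.DUAL_SINGLE:
        # Two independent products, packed high word then low word.
        return (base_multiply(a_hi, x_hi) << N) | base_multiply(a_lo, x_lo)
    if mode == Mode.FAULT_TOLERANT:
        # Three redundant copies of the same product, then a majority vote.
        copies = [base_multiply(a_lo, x_lo) for _ in range(3)]
        return vote(*copies)
    raise ValueError(f"unknown mode: {mode!r}")
```

In this model a fault in any single copy is masked by `vote`, for example `vote(p, p ^ 0x10, p)` returns `p`.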
IMPLEMENTATION RESULTS
A 64-bit multiplier has been designed using TSMC 0.18 µm, 6 metal layer technology. The design features four 32-bit Booth-recoded Wallace tree base multipliers and two 64-bit carry-lookahead adders. Additionally, a standard 64-bit multiplier has been developed as a benchmark comparison against the reconfigurable structure.
CONCLUSIONS
The framework for a reconfigurable multiplication architecture has been provided, along with the performance results of a 64-bit implementation. Highlights of the proposal include decreases of 25% and 18% in average net length and total wire length, respectively, resulting in a 60% reduction in crosstalk-induced delay. Although there is a 27% increase in the number of components, resulting in a 22% increase in cell area, the regularity of the cells yields a 9% decrease in overall area.
It has been shown that a multiplier with a single level of recursion demonstrates similar if not superior results in terms of latency, size and power to a standard column compression multiplier of equivalent size. Furthermore, the recursive structure is less prone to interconnect coupling effects due to the smaller base multiplier structures. In addition, the proposed structure lends itself to advanced applications such as truncated multiplication, and low power techniques for highly correlated operands.
Multi-issue architectures are slowly beginning to dominate the field of new high-performance processors, with new emphasis on compatibility. With general-purpose processors providing competition for dedicated DSP processors, there is optimistic potential for the future of reconfigurable architectures for DSP processors.
