This paper presents a non-monolithic top-down reconfigurable multiplier suitable for embedding in an FPGA structure. It is constructed of four individual partitions that can operate as separate multipliers but also concatenate to form a superior multiplier with increased precision and sign handling ability. The number of possible operation modes is limited in order to keep the reconfiguration overhead low. A small set of control signals determines behavior and mode selection. Inactive partitions are disconnected from the supply to save power. EMMA (Embedded Multi-precision Multiplier Array) can compute signed two's complement numbers at up to 32 × 16-bit precision when all partitions are active and concatenated, or up to four separate 16 × 8-bit multiplications running simultaneously.
INTRODUCTION
The potential for reconfigurable arithmetic augmentation becomes evident by looking at current trends in DSP design, which increasingly relies on FPGAs. Implementing the multiplication in FPGAs can create a bottleneck in arithmetic-intense applications, since e.g. most filter designs and FFT implementations rely on efficient multiply-accumulate tasks [1]. An FPGA multiplication fully based on look-up tables (LUTs) is very area-inefficient. Also, workarounds such as programming RAMs using a partial reconfiguration scheme [2] are rather difficult to handle, while multipliers with a digit-serial/parallel scheme [3] are only suitable for moderate word-lengths. On the other hand, having optimized arithmetic elements available side-by-side with configurable logic and routing in the FPGA enables a DSP system designer to use it like a construction kit for filter structures. This is why e.g. Xilinx provides its FPGA devices from the Virtex-II Pro generation onwards with hard-wired multiplier elements in the fabric, and the newer generations Virtex 4 and 5 come with specialized DSP slices incorporating further arithmetic extensions. The size of the multiplier in the Virtex-II Pro was 18×18 bits, probably for reasons of floorplanning, since a column-wise placement of multiplier blocks replaced embedded block RAMs. In the following generation this size was kept, possibly for the sake of leaving major parts of the synthesis software and mapping tools unchanged.
The downside of the 18×18-bit multiplication is the high granularity, which may imply hardware overhead for some scenarios. Due to the sign handling circuitry, calculating the result of A × B with both operands at full 18-bit precision is not possible, and the mapping tool is forced to cascade extra logic to complete the structure. Synthesis experiments show that the cost functions used in Xilinx place & route tools opt for embedded multipliers instead of LUT mapping, starting from an operand size of 3×4 bits. That means that at this size and above, the embedded multiplication is more efficient than the standard LUT-based implementation. This leads to the insight that an approach with lower multiplication granularity and an element size of 8, 12 or 16 bits is equally beneficial. Thus, we have designed a top-down reconfigurable multiplier array that is capable of multi-precision and suitable for enhancing the structure of FPGAs. Important applications that can benefit from multi-precision arithmetic include e.g. neural network-based classifiers [4] and the evaluation of nonlinear functions [5].
RELATED AND PREVIOUS WORK
High-performance multipliers for DSP operations such as filters, Fast Fourier or Wavelet transforms, convolution, Euclidean distance and many more applications often have low area and low power usage as design optimization goals. Ongoing research targets fixed-width multipliers with increased area efficiency due to reduced precision. The principle is to implement only a part of a shift-and-add-based partial product addition array and accept a 50% precision loss against a full A × B = P multiplication where A and B have n-bit and P has 2n-bit precision. This is achieved through truncation and also adaptive error compensation with multiple degrees of freedom for input correction vectors [6, 7]. In general, the fewer bit positions in the addition scheme are sacrificed, the more precise and less error-prone the multiplication result is. However, if one wants to avoid any truncation error, these single-precision fixed-width multipliers are not suitable. Another multiplier design [8] uses multiplexer circuitry implanted into the partial product addition scheme in order to bypass complete rows, in case the respective multiplier bits are zero. This design computes unsigned numbers and claims to achieve better power reduction with lower hardware overhead compared to row-bypassing. However, the area overhead for the bypassing circuit is still about 20%.
Configurable Word-length Multiplication
In contrast to those designs looking at fixed-length multiplication, there are approaches concerned with creating a reconfigurable multiplier for input operands of variable size. Some multipliers such as [9] apply a power-aware scalability to a multiplier for low power consumption in frequently changing environments. The input word-length is detected and unused parts of the multiplier circuit are turned off via corresponding control signals. Other proposals such as [10] and also our own previous work, as briefly reviewed in the following section, specifically target embedding reconfigurable multiplier blocks into FPGA structures, in order to create multiplier arrays of variable size or precision, respectively.
Our Previous Work - Bottom-up Reconfiguration
The principal idea for our bottom-up reconfigurable multiplier array [11] was to use separate blocks that are fully functional multipliers themselves. Through the concatenation of m × m uniform building blocks, each of size n × n bits, a superior (m × n) × (m × n) multiplier can be assembled. This approach can also be applied to non-square arrays, thus realizing arbitrary operand scheme sizes. There can either be a high parallelism when many of the elements are working separately and concurrently, or a higher precision when several (up to all) of the elements are concatenated to form a larger multiplier array. In every constellation, the number system (unsigned or two's complement) can be chosen individually. In order to achieve that, the multiplier modules are equipped with data exchange interfaces as well as multiplexers accepting control signals for configuration [11]. Intermediate data transport is realized through the island-style FPGA architecture's interconnect and routing scheme.
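The concatenation principle can be illustrated with a short behavioural sketch in Python. It covers only the unsigned case and models each n × n block as an ordinary integer multiplication; the function names `block_multiply` and `composed_multiply` are illustrative assumptions and do not correspond to the hardware interfaces of [11].

```python
def block_multiply(a: int, b: int, n: int) -> int:
    """Stand-in for one n x n unsigned multiplier block (hypothetical model)."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n)
    return a * b


def composed_multiply(a: int, b: int, n: int, m: int) -> int:
    """Multiply two (m*n)-bit unsigned operands using m*m blocks of n x n bits."""
    mask = (1 << n) - 1
    # Split both operands into m sub-words of n bits, least significant first.
    a_words = [(a >> (n * i)) & mask for i in range(m)]
    b_words = [(b >> (n * j)) & mask for j in range(m)]
    product = 0
    for i, a_word in enumerate(a_words):
        for j, b_word in enumerate(b_words):
            # Each block result is weighted by the position of its sub-words.
            product += block_multiply(a_word, b_word, n) << (n * (i + j))
    return product
```

For instance, `composed_multiply(0xABCD, 0x1234, n=4, m=4)` assembles a 16 × 16-bit product from sixteen 4 × 4-bit block products.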
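For two input operands A and B expressed in two's complement number format,

$$A = -a_{n_1-1}\,2^{n_1-1} + \sum_{i=0}^{n_1-2} a_i 2^i, \qquad B = -b_{n_2-1}\,2^{n_2-1} + \sum_{j=0}^{n_2-2} b_j 2^j,$$

the product is written as

$$P = A \cdot B = a_{n_1-1} b_{n_2-1}\,2^{n_1+n_2-2} + \sum_{i=0}^{n_1-2}\sum_{j=0}^{n_2-2} a_i b_j 2^{i+j} - 2^{n_1-1}\sum_{j=0}^{n_2-2} a_{n_1-1} b_j 2^j - 2^{n_2-1}\sum_{i=0}^{n_1-2} a_i b_{n_2-1} 2^i. \tag{1}$$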
By rewriting (1), we get an alternative expression for the product as follows:
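The rewriting step can be sketched with the standard textbook derivation. It assumes that A has n_1 bits a_i and B has n_2 bits b_j, and that both are two's complement numbers; the arrangement of terms used in [12] may differ from the one shown here.

```latex
% Each negative cross term of the signed product is replaced using
% -x = \bar{x} - 1, which holds for any single bit x:
\begin{align*}
-2^{n_1-1}\sum_{j=0}^{n_2-2} a_{n_1-1} b_j 2^j
  &= 2^{n_1-1}\sum_{j=0}^{n_2-2} \overline{a_{n_1-1} b_j}\, 2^j
     - 2^{n_1+n_2-2} + 2^{n_1-1}, \\
-2^{n_2-1}\sum_{i=0}^{n_1-2} a_i b_{n_2-1} 2^i
  &= 2^{n_2-1}\sum_{i=0}^{n_1-2} \overline{a_i b_{n_2-1}}\, 2^i
     - 2^{n_1+n_2-2} + 2^{n_2-1}.
\end{align*}
% Substituting both terms yields inverted partial products plus a constant:
\begin{align*}
P &= a_{n_1-1} b_{n_2-1}\, 2^{n_1+n_2-2}
     + \sum_{i=0}^{n_1-2}\sum_{j=0}^{n_2-2} a_i b_j 2^{i+j} \\
  &\quad + 2^{n_1-1}\sum_{j=0}^{n_2-2} \overline{a_{n_1-1} b_j}\, 2^j
     + 2^{n_2-1}\sum_{i=0}^{n_1-2} \overline{a_i b_{n_2-1}}\, 2^i \\
  &\quad + 2^{n_1-1} + 2^{n_2-1} - 2^{n_1+n_2-1}.
\end{align*}
```

Since P is evaluated modulo 2^(n_1+n_2), the constant -2^(n_1+n_2-1) can equally be added as +2^(n_1+n_2-1), i.e. as a single '1' in the most significant column.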
This rewritten expression for P is the base of a modified Baugh-Wooley operand scheme [12]. The corresponding partial products array features a correction term and partial product inversion and is given in Fig. 1 on the poster. Fig. 2 shows the multiplier structure, in which the overhead compared to a conventional multiplier is highlighted. Our implementation can be configured via control signals to accept unsigned or two's complement numbers, or to act as a sub-block of a superior multiplier, with its sign handling functionality set according to the position in the array. Correction terms according to the Baugh-Wooley algorithm and partial product inversion at the required positions are integrated into the multiplier element.
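The effect of partial product inversion and the correction term can be checked with a small bit-level model. The Python sketch below implements the commonly used form of the modified Baugh-Wooley scheme, not necessarily the exact cell arrangement of our array; the names `baugh_wooley` and `to_signed` are illustrative assumptions.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the lowest `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley(a: int, b: int, n1: int, n2: int) -> int:
    """Signed n1 x n2 multiplication using only non-negative partial products."""
    a_bits = [(a >> i) & 1 for i in range(n1)]
    b_bits = [(b >> j) & 1 for j in range(n2)]
    acc = 0
    for i in range(n1):
        for j in range(n2):
            pp = a_bits[i] & b_bits[j]
            # Invert partial products that involve exactly one sign bit.
            if (i == n1 - 1) != (j == n2 - 1):
                pp ^= 1
            acc += pp << (i + j)
    # Correction term; the top-column '1' equals -2^(n1+n2-1) modulo 2^(n1+n2).
    acc += (1 << (n1 - 1)) + (1 << (n2 - 1)) + (1 << (n1 + n2 - 1))
    return to_signed(acc, n1 + n2)
```

In this model all partial products are added with positive weight, so the same adder array serves the signed and the unsigned case; only the inversion positions and the correction constant depend on the number format.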
EMMA - AN APPROACH WITH TOP-DOWN RECONFIGURATION
Instead of performing a bottom-up reconfiguration to realize a multi-precision multiplier, we apply a top-down approach in this work: we design a large multiplier that is not monolithic, but introduces data exchange interfaces for multi-precision multiplication. In contrast to the parallel/parallel 64-bit design presented in [13] that can execute one 64-bit, two 32-bit, four 16-bit or eight 8-bit unsigned multiplications, we propose to divide the multiplier block into four separate non-identical partitions, each one being a separate multiplier element itself, but with different features such as special sign handling cells or input word selection. The four partitions can compute in parallel and/or form various combinations for higher precision. This idea aims for medium granularity with limited reconfiguration options to keep the overhead and thus the effort for the control circuitry low [11]. In addition, not all possible concatenations are supported, so that the original size of the multiplier must be chosen to match the maximum word-length required in a certain target application. The partitions vary in overhead (partition 00 at 5%, partition 11 at 15% relative to the unsigned case); word-lengths are shown exemplarily for partition 01.
Our new approach EMMA (Embedded Multi-precision Multiplier Array) suggests a 32×16-bit non-square array multiplier with four partitions of 16×8 bits each, as depicted in Figure 1. These are positioned next to each other to avoid routing delay. If the application's precision requirements exceed 32×16 bits, the surrounding FPGA configurable logic blocks must accommodate the extra bits. In order to control and reconfigure the data flow, carry and input signal paths are opened and multiplexer circuitry is implanted as if the original multiplier were divided by two long cuts in the middle, indicated by the dashed lines. Special cells in the parallel/parallel array (represented by the shaded areas in the partition blocks) realize the Baugh-Wooley sign handling scheme described in [11]. These are only required in the perimeter positions of the structure to compute signed two's complement numbers. Registers store the computation results, and a multiplexer is used to choose between the direct and the registered output used for pipelining. A combination of sign handling and pipelining is also possible. Data transport to and from EMMA is realized through the island-style FPGA routing resources. The input operands A and B are separated into their most significant (subscript M) and least significant (subscript L) sub-words with word-lengths as follows:

$$A = (A_M, A_L) \text{ with 16 bits each}, \qquad B = (B_M, B_L) \text{ with 8 bits each}. \tag{3}$$

Fig. 3 on the poster shows a schematic view of EMMA with actual and alternative inputs A, B and Ã, B̃, respectively, and the four partitions' outputs P 00 through P 11 .
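How four 16×8-bit sub-products combine into one signed 32×16-bit result can be sketched behaviourally in Python. The sketch assumes that A is split into two 16-bit sub-words and B into two 8-bit sub-words, that the most significant sub-words carry the sign, and that the least significant sub-words are treated as unsigned. The function names are illustrative, and the sub-products are deliberately not mapped to the partition labels 00 through 11.

```python
def split_operands(a: int, b: int):
    """Split a signed 32-bit A and a signed 16-bit B into M and L sub-words."""
    a_m, a_l = a >> 16, a & 0xFFFF   # A_M keeps the sign, A_L is unsigned
    b_m, b_l = b >> 8, b & 0xFF      # B_M keeps the sign, B_L is unsigned
    return a_m, a_l, b_m, b_l


def multiply_32x16(a: int, b: int) -> int:
    """Signed 32x16-bit product assembled from four 16x8-bit sub-products."""
    assert -(1 << 31) <= a < (1 << 31)
    assert -(1 << 15) <= b < (1 << 15)
    a_m, a_l, b_m, b_l = split_operands(a, b)
    return (
        ((a_m * b_m) << 24)    # signed x signed
        + ((a_m * b_l) << 16)  # signed x unsigned
        + ((a_l * b_m) << 8)   # unsigned x signed
        + (a_l * b_l)          # unsigned x unsigned
    )
```

The four sub-products need different sign treatment, which is one way to see why the partitions are non-identical and why sign handling cells are only needed at the perimeter of the concatenated array.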
The modes A-I mentioned in Table 1 on the poster can operate in parallel, as long as there is no overlap in partition usage. The following Boolean function (4) accounts for that, where ⊕ stands for exclusive OR and · for AND.
For example, the combination B·F·G would indicate that the lower half (B) of the multiplier operates as a concatenation of partitions 10 and 11 at a precision of 32×8-bit (signed, with registers), while the upper two partitions 00 and 01 work simultaneously: 01 (F) and 00 (G) both operate at 16×8-bit precision, and both result words can be registered.
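The no-overlap rule behind such combinations can be sketched with partition bitmasks in Python. This is not function (4) itself. Only the partition usage of modes B, F and G is taken from the example above; the `FULL` entry, the bit assignment of the masks and the function names are hypothetical.

```python
# One bit per partition; the bit assignment is an arbitrary assumption.
PARTITION_BITS = {"00": 0b0001, "01": 0b0010, "10": 0b0100, "11": 0b1000}

# Partition usage per mode. B, F and G follow the example in the text;
# FULL is a hypothetical stand-in for a mode that uses all four partitions.
MODE_PARTITIONS = {
    "B": ("10", "11"),
    "F": ("01",),
    "G": ("00",),
    "FULL": ("00", "01", "10", "11"),
}


def mode_mask(mode: str) -> int:
    """Return the bitmask of partitions occupied by a mode."""
    mask = 0
    for partition in MODE_PARTITIONS[mode]:
        mask |= PARTITION_BITS[partition]
    return mask


def compatible(*modes: str) -> bool:
    """True if no partition is claimed by more than one of the given modes."""
    used = 0
    for mode in modes:
        mask = mode_mask(mode)
        if used & mask:
            return False
        used |= mask
    return True
```

With these assumptions, `compatible("B", "F", "G")` holds, whereas any combination involving `FULL` and a second mode is rejected.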
Power Considerations
According to [9], it is very important not to "underchallenge" a multiplier, because for example an 8-bit multiplication on a 16-bit multiplier can lead to significant inefficiencies in power due to unnecessary signal activity and switching. Typically, as described in [6] and [7], the general approach to make fixed-width multipliers in integrated circuits more power efficient is to sacrifice precision through truncation, e.g. to compute an n-bit instead of a 2n-bit product from multiplying two n-bit numbers and to utilize a special circuit that replaces the truncated n-bit LSB part of the result. In contrast to that, our multi-precision multiplier can help to save power while still being able to provide the full word-length output. When it is not operating at its full extent, the supply voltage can be switched on and off for each partition individually, provided an appropriate switching transistor is used. If unused partitions (marked with a • in the second column of Table 1 on the poster) can be cut off from the supply network and thus be deactivated, the number of available modes for each partition increases, and with it the number of control signals. If all possible options for each of the four partitions are considered, including shut-off, sign handling, registering and combinations of these, the total number of different configurations for the multiplier is 146. Of course, the choice of providing these options depends on the target application(s) and the system design.
Upgrades and Special Applications
We have shown in [14] that a Booth recoding scheme is beneficial for bottom-up reconfigurable signed multipliers, both in terms of multiplication speed and hardware complexity. Thus, the next step would be a Booth-powered variant of EMMA and an evaluation of the differences in configuration overhead. In general, with a reconfigurable multiplier array such as EMMA available, some concrete utilization ideas and promising extensions emerge. A left shift is identical to multiplying the data word by a power of two. Thus, a shift operation can be realized without dedicated hardware by using a multiplier and manipulating one operand accordingly. In particular, when the multiplicand is to be shifted by k positions, the multiplier operand must have a single '1' at bit position k (counted from zero). A one-hot encoder translates the desired shift step number into the corresponding bit position of the multiplier operand. For example, if a bit shift of 5 positions is required, the value 5 is mapped to 0010 0000, which scales the multiplicand by the factor 2^5 = 32. This encoder can be implemented in a look-up table [15]. In order to realize this shifter functionality, EMMA can be expanded to have the one-hot encoder on board that replaces one of the input operands with the corresponding shift factor.
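The shifter idea can be written down as a minimal Python sketch. It assumes an 8-bit multiplier operand and bit positions counted from zero; `one_hot` and `shift_left_via_multiply` are illustrative names and do not describe an existing EMMA interface.

```python
def one_hot(k: int, width: int = 8) -> int:
    """Map a shift amount k to a word with a single '1' at bit position k."""
    if not 0 <= k < width:
        raise ValueError("shift amount does not fit the operand width")
    return 1 << k


def shift_left_via_multiply(x: int, k: int, width: int = 8) -> int:
    """Shift x left by k positions using a multiplication only."""
    return x * one_hot(k, width)
```

For a shift by 5 positions, `one_hot(5)` yields the bit pattern 0010 0000 (decimal 32), and `shift_left_via_multiply(3, 5)` returns 96, the same value as `3 << 5`.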
CONCLUSION
In this paper, we have discussed a new contribution to the research area of reconfigurable hardwired multiplier blocks to be embedded in FPGAs. Compared to previous bottom-up reconfigurable multiplier arrays, our new approach called EMMA is a medium-grain multiplier block that can be partitioned into four separate sub-blocks that are fully functional multipliers themselves. These partitions form various superior multipliers in different configurations using a small set of control signals, providing two's complement sign handling and output buffering. Thus, EMMA can help to realize multi-precision multiplication with parallel operation, and power can be saved by turning off unused partitions. Potential applications that can benefit from this reconfigurable embedded multi-precision multiplier array include complex multiplication.
