The introduction of asymmetric embedded multiplier blocks in recent Xilinx FPGAs complicates the design of larger multiplier sizes. The two different input bitwidths of the embedded multipliers lead to two different shifting factors for the partial product outputs. This makes even the most straightforward multiplier design less intuitive. In this paper, we present a methodology and set of equations to automatically generate the Verilog for a multiplier using asymmetric embedded multiplier cores. The presented technique also uses intelligent rearrangement of the multiplier block outputs into partial product terms to reduce the overall delay of the circuit. Multipliers created with our generator are faster and use fewer DSP blocks than those created using Xilinx Core Generator or by simply using the '*' operator in Verilog. They also use fewer LUTs than those created using the '*' operator. Finally, the presented generator can create multipliers larger than possible with Core Generator, and is limited only by the number of available embedded multipliers.
INTRODUCTION
Multiplication is an important arithmetic function for many applications [1] [2]. For some time, FPGAs have included hard embedded multiplier blocks to improve multiplier performance [3] [4] [5] [6] [7]. However, the multiplier size required for a particular calculation may not match the size of the embedded multiplier blocks. In this case, we "compose" a larger multiplier out of the smaller embedded blocks [8] [9]. This process is similar to "long multiplication", the method generally used to perform multiplication by hand (Figure 1). Each decimal-digit of one operand is multiplied by each decimal-digit of the other, but only one digit-product can be computed at a time. Digit-products for a particular decimal-digit of the lower operand are aggregated as they are computed; the "tens" digit of the digit-product is carried to the next position, where it is summed with the "ones" digit of the next digit-product. The complete result of processing one decimal-digit of the lower operand against all decimal-digits of the upper operand is a single partial-product. The definitions of these terms as used in this paper are given in Figure 2. For the example of Figure 1, which has three digits in the upper operand and three in the lower, we have three partial-products, P_0, P_1, and P_2, each created from three digit-products.
Decimal-digit: One decimal digit [0-9].
Digit: A grouping of binary bits, equal in bit-width to one of the inputs of the embedded multiplier block.
Digit-product: The result of multiplying two digits or two decimal-digits together.
Partial-product: Grouping of one or more digit-products into a partial result. For hand-multiplication, partial-products generally group the digit-products created by the same lower-operand digit.
Symmetric multiplier block: An n × n multiplier, where the inputs have equal bitwidth. The output is 2n bits.
As in long multiplication, we can divide our large binary operands into sets of digits that we can multiply using the smaller embedded multiplier blocks. If several multiplier blocks can be used to generate digit-products, waiting for the carry-out of the previous digit-product before combining it with the lower part of the next digit-product to compute the next digit of the partial-product (as in long multiplication) creates a long critical path. Instead, with enough multiplier blocks, all digits from the top operand can be simultaneously multiplied with all digits from the bottom operand, creating a set of digit-products equal in number to the product of the digit counts of the operands.
The "divide and conquer approach" to construct large multipliers out of a set of small, symmetric multiplier blocks is a well-known technique [8] [9] [10] . Each digit is equal in bit-width to the input size of the multiplier blocks. For example, if the operand size for the needed large multiplier is 48 bits, with 8×8 multiplier blocks the operands would each contain 48/8 = 6 digits, and each digit would be 8 bits wide. If the operand size is not an exact multiple of the digit-size, it is zero-padded to fill the most-significant digit. Next, each digit from one operand is multiplied with each digit from the other operand, creating the set of digit-products. We will use this term throughout the paper to refer to the outputs of the embedded multipliers. These digit-products are shifted to the correct position depending on the position of the source digits in the input operands. The digit-products are then summed, generally using a tree of adders. When the multiplier blocks are symmetric, building such an adder tree is straightforward. Section 2 discusses this issue in more depth and describes previous work to develop alternate summation methods that further improve performance.
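The divide-and-conquer steps above can be sketched in a few lines of Python. This is only an illustrative software model, not the generated hardware: the helper names `split_digits` and `compose_symmetric` are hypothetical, and the sequential loop stands in for multiplier blocks that would operate in parallel.

```python
def split_digits(value, width, digit_bits):
    """Split an unsigned integer into little-endian digits of digit_bits each."""
    count = -(-width // digit_bits)  # ceiling division; the top digit is zero-padded
    mask = (1 << digit_bits) - 1
    return [(value >> (i * digit_bits)) & mask for i in range(count)]


def compose_symmetric(x, y, width, n):
    """Multiply two width-bit operands using only n x n digit multiplications."""
    x_digits = split_digits(x, width, n)
    y_digits = split_digits(y, width, n)
    total = 0
    for b, y_digit in enumerate(y_digits):
        for a, x_digit in enumerate(x_digits):
            digit_product = x_digit * y_digit          # one n x n multiplier block
            total += digit_product << ((a + b) * n)    # shift by source digit positions
    return total, len(x_digits) * len(y_digits)


# 48-bit operands with 8x8 blocks: 6 digits per operand, 36 digit-products
x, y = 0xABCDEF012345, 0x13579BDF2468
product, blocks = compose_symmetric(x, y, width=48, n=8)
```

The running sum here hides the adder tree; how the digit-products are actually combined is the subject of the following sections.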
However, some FPGAs now have asymmetric multiplier blocks [5] [6], where the multiplier block inputs (and thus the digit-sizes used for each operand) differ in bit-width. This complicates composable multiplier design; the shifts of digit-products that make up each partial product differ from the shifts of the partial products relative to one another. Furthermore, the bitwidth of the digit-products is not necessarily a multiple of either of these shift values. Section 3 presents examples and a discussion of the problems of composing large multipliers from asymmetric embedded multipliers. We also present an automated method to generate HDL for such multipliers. Section 4 compares the quality of the multipliers created by our generator to those created by the Xilinx tools, and demonstrates that our multipliers are faster and use fewer of the expensive embedded multiplier blocks.
COMPOSABLE MULTIPLIERS WITH SYMMETRIC MULTIPLIER BLOCKS
Figure 3 illustrates creating a larger multiplier from nine parallel (smaller) symmetric n × n multiplier blocks [8] [9] [10] [11] [12]. The bitwidth of each operand (X and Y) is three times the input bitwidth of the multiplier blocks. This could happen, for example, if each operand were 24 bits wide and we used 8×8 multipliers, or if each operand were 6 bits wide and we used 2×2 multipliers. The number of required multipliers is the product of the digit counts of the two operands, which in this case is 3×3 = 9.
As in long multiplication, we show the digit-products for digit Y_0 above the digit-products for digit Y_1, which in turn are above those for digit Y_2. Within the set of digit-products for a given lower-operand digit, each successive term is shifted by one digit position (n bits) to the left. Also, each group of digit-products for a given lower-operand digit is shifted by one digit position (n bits) relative to the digit-product group for the previous lower-operand digit. These shifts are based on the relative locations of the digits in the input operands. Note that because the embedded multiplier blocks are symmetric, the shift of digit-products within a group is equal to the relative shift between groups. The digit-products within a group are shifted by the digit size of X (n bits), and the groups are shifted by the digit size of Y (n bits). Finally, note that within a partial-product grouping, the upper half of one digit-product overlaps exactly with the lower half of the next digit-product.
After we create these digit-products, we can treat each digit-product as a separate partial product. The partial products are combined using an adder tree to produce the final result (labeled Z in the figure). The early stages of the tree may first combine digit-products from the same group and then sum the results for each group. However, the depth of the complete tree is still no less than ⌈log2 D⌉, where D is the number of digit-products.
The adder tree depth can be reduced by grouping adjacent but non-overlapping digit-products [8] [9] [10]. This grouping requires only concatenation, not addition, and thus adds no computation latency. Figure 4 shows the digit-products from Figure 3 rearranged in this manner, with the widest groupings listed towards the top. This creates a total of five partial-products, compared to the nine in Figure 3. Some digit-products cannot be grouped with any others because of overlaps. An adder tree that processes the nine individual digit-products shown in Figure 3 requires four levels; an adder tree that processes the five combined partial-products from Figure 4 requires only three levels. The size of the adders needed for the tree does not increase overall, since the original adder tree of Figure 3 would have required adders just as wide in its later levels. The benefit is that some of the digit-products are "summed" by concatenation instead of actual addition, avoiding adder levels and carry chains in those cases.
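A small Python check makes the counts above concrete. It assumes one particular grouping, namely collecting the digit-products X_a·Y_b that share the same value of a − b; the text does not state exactly which digit-products Figure 4 groups together, so treat this as one grouping that happens to yield five partial-products. The function name `diagonal_groups` is hypothetical.

```python
from math import ceil, log2


def diagonal_groups(digits, n):
    """Group digit-products X_a*Y_b of a digits x digits symmetric multiplier by a - b."""
    groups = {}
    for a in range(digits):
        for b in range(digits):
            offset = (a + b) * n  # bit position of the digit-product's LSB
            groups.setdefault(a - b, []).append(offset)
    return groups


groups = diagonal_groups(digits=3, n=8)
for offsets in groups.values():
    offsets.sort()
    # each digit-product is 2n = 16 bits wide, so neighbours in a group must
    # start exactly 16 bits apart to be joined by concatenation alone
    assert all(hi - lo == 16 for lo, hi in zip(offsets, offsets[1:]))

levels_before = ceil(log2(9))            # nine separate digit-products
levels_after = ceil(log2(len(groups)))   # after grouping by concatenation
```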
COMPOSABLE MULTIPLIERS WITH ASYMMETRIC MULTIPLIER BLOCKS
The above prior work applies the divide-and-conquer strategy only to problems with symmetric multiplier blocks. Some suggest expanding the technique to asymmetric multiplier blocks where one of the two multiplier block inputs is a multiple of the other by using multiple blocks to form a "square" multiplier, and then building the complete multiplier from a set of (pre-composed) square multipliers [9]. Elsewhere, the use of asymmetric multiplier blocks is suggested, but not discussed [8]. However, the embedded multiplier blocks in FPGAs may not have one input size that is a multiple of the other; in fact, the current FPGAs that use asymmetric multipliers use a 24×17 multiplier size [5] [6]. Thus a more general technique is needed to build composable multipliers from asymmetric multiplier blocks. In this section we present our proposed technique, which we then evaluate in Section 4.
Problem Description and Example
Figure 5 illustrates applying the divide-and-conquer strategy using asymmetric embedded multiplier blocks. In this example, the X operand is divided into two n-bit digits, and the Y operand into three m-bit digits. For instance, X and Y may each be 6 bits wide, with X divided into two 3-bit digits, and Y divided into three 2-bit digits, for use with a 2×3 embedded multiplier block. As before, the number of required multiplier blocks for parallel multiplication of the operand digits is the product of the digit counts of the two operands. In Figure 5, 2×3 = 6 embedded multiplier blocks are required to produce the six digit-products. Unlike in the symmetric case, X_1 Y_0 and X_0 Y_1 do not exactly overlap positions; X_1 Y_0 is offset by n bits, whereas X_0 Y_1 is offset by m bits. More specifically, the digit-products within a group of digit-products produced for the same lower-operand digit are offset from one another by n bits (the digit size of the upper operand), but the inter-group offset is m bits (the digit size of the lower operand).
To reduce the depth of the partial-product adder tree for the asymmetric multiplier block case, we can use a digit-product concatenation technique similar to what was used for symmetric multipliers. Applying this technique to the problem shown in Figure 5 gives the set of partial products shown in Figure 6. Although initially this shape may appear identical to Figure 4, the two different shift factors are an important feature. In the symmetric case there may be multiple ways to choose which digit-products to group into partial products because of the uniform shifting; in the asymmetric case the choices are more constrained, and digit-products within adjacent partial-products in the figure partially overlap. The original set of six partial products given in Figure 5 would require a three-level adder tree; the rearranged set given in Figure 6 requires only a two-level adder tree.
Automatic Generation Method
In this section we present our method for decomposing a large multiplication based on the sizes of the available asymmetric embedded blocks. The embedded multipliers are n×m in size, where n and m represent the bitwidths of the two multiplier inputs, and n≠m. The multiplier that we create from the embedded cores implements Z = X×Y, where operands X and Y may or may not have equal bitwidths. We separate the process of implementing the X×Y multiplier into three steps: operand decomposition, partial product generation (which includes the concatenation step), and partial product summation. Note that the presented operand decomposition step defines new variables that will be used in the later steps to simplify the equations.
Operand Decomposition
The first step is to decompose the input operands into digits that match the input sizes of the n × m embedded multiplier blocks. Unlike when using symmetric embedded multipliers, we have two possible options to consider for decomposition: we can decompose X by n and Y by m, or X by m and Y by n. To minimize the number of multiplier blocks used, we should choose the decomposition that minimizes the number of digit-products. The number of digit-products is the product of the number of digits in the two operands (C_X × C_Y). Thus we compare the two options given in Figure 7, and choose the C_X and C_Y pair that gives the smallest C_X × C_Y product. If both options result in an equal product, we choose the option with the smallest C_X + C_Y sum, because the number of partial products that must be summed will be C_X + C_Y − 1. Next, based on the chosen option, we choose our "upper" operand A and "lower" operand B for generating the partial products, such that B has as many or more digits than A. This simplifies the notation in the generation algorithm. Finally, we set j to be the digit size of A, and k to be the digit size of B (where {k = n, j = m} or {k = m, j = n}, depending on the decomposition chosen). The algorithm for this process is given in Figure 7.
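The decomposition rule described above can be sketched as follows. Figure 7 is not reproduced in the text, so this is a reading of the prose rather than the authors' exact algorithm; the function name `decompose`, the dictionary layout, and the choice to keep the first option when both the product and the sum tie are assumptions.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def decompose(x_bits, y_bits, n, m):
    """Choose digit sizes for X and Y, then name the upper (A) and lower (B) operands."""
    options = [
        (ceil_div(x_bits, n), ceil_div(y_bits, m), n, m),  # X by n, Y by m
        (ceil_div(x_bits, m), ceil_div(y_bits, n), m, n),  # X by m, Y by n
    ]
    # Fewest digit-products first, then the smallest digit-count sum.
    # On a full tie, min() keeps the first option (an assumption of this sketch).
    c_x, c_y, x_digit, y_digit = min(options, key=lambda o: (o[0] * o[1], o[0] + o[1]))
    if c_x <= c_y:  # A must have no more digits than B
        return {"A": "X", "C_A": c_x, "C_B": c_y, "j": x_digit, "k": y_digit}
    return {"A": "Y", "C_A": c_y, "C_B": c_x, "j": y_digit, "k": x_digit}


choice = decompose(x_bits=64, y_bits=128, n=24, m=17)
```

For a 64×128 multiplier built from 24×17 blocks, both options need 24 digit-products, so the tie-breaker selects X by m and Y by n (4 + 6 = 10 digits rather than 3 + 8 = 11).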
Partial Product Generation
The multiplier output Z is calculated as shown in Equation 1. The digit-products of this equation can also be represented as a grid, as shown in Figure 8. Digit-products in the grid are not shown shifted to their exact positions as they are in Figure 6. The grid highlights different ways of grouping the digit-products. Figure 5 represents a "horizontal" grouping, and Figure 6 represents a "diagonal" grouping. Digit-products along the same line (horizontal, vertical, or diagonal) are grouped into the same partial product. The horizontal grouping is the one that most closely mimics long multiplication. However, the diagonal lines allow the grouping to be done with concatenation of grouped digit-products instead of addition; both horizontal and vertical groupings require addition of the grouped digit-products. Figure 9 shows the diagonal grouping for the general case. We separate the partial products into three regions: "Upper", "Middle", and "Lower".
The arrangement of digit-products shown in Figure 9 creates three partial product "regions". The Middle region contains all of the "widest" partial-products (those with the maximum number of concatenated digit-products). The Upper region contains digit-products on diagonal lines up and to the left of the Middle lines. The Lower region contains the digit-products on diagonal lines down and to the right of the Middle lines. The bitwidths of the Middle partial-products are identical. The bitwidth of partial-products in the Upper and Lower regions grows (shrinks) linearly. The first Middle partial-product is not shifted; its least-significant bit is position zero. Successive Middle partial-products are each shifted by k bits to the left with respect to the previous Middle partial-product. In the Upper region, the widest partial-product begins at position j; each partial-product above it is shifted j bits further to the left than the one below. In the Lower region, the least-significant bit of the widest partial-product is shifted k bits to the left of the least-significant bit of the last partial-product in the Middle region. Each successive Lower partial-product is shifted k more bits to the left.
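The diagonal grouping and its shifts can be modelled in Python to check that the regions line up as described. This is a sketch under stated assumptions: digit-product A_a·B_b is taken to sit at bit offset a·j + b·k, each diagonal is indexed by d = b − a, and the names `diagonal_partial_products` and `split_digits` are hypothetical. It models the arithmetic only, not the generated Verilog.

```python
def split_digits(value, count, bits):
    """Return `count` little-endian digits of `bits` bits each."""
    mask = (1 << bits) - 1
    return [(value >> (i * bits)) & mask for i in range(count)]


def diagonal_partial_products(a_val, b_val, c_a, c_b, j, k):
    """Return (region, shift, value) per diagonal; assumes c_a <= c_b."""
    a = split_digits(a_val, c_a, j)
    b = split_digits(b_val, c_b, k)
    terms = []
    for d in range(-(c_a - 1), c_b):  # d = (B digit index) - (A digit index)
        start_a, start_b = (-d, 0) if d < 0 else (0, d)
        length = min(c_a - start_a, c_b - start_b)
        value = 0
        for t in range(length):
            digit_product = a[start_a + t] * b[start_b + t]  # fits in j + k bits
            value |= digit_product << (t * (j + k))          # concatenation, no adder
        shift = start_a * j + start_b * k
        if d < 0:
            region = "Upper"
        elif d <= c_b - c_a:
            region = "Middle"
        else:
            region = "Lower"
        terms.append((region, shift, value))
    return terms


# The 6-bit example: A = X with two 3-bit digits, B = Y with three 2-bit digits
terms = diagonal_partial_products(0b101101, 0b110110, c_a=2, c_b=3, j=3, k=2)
result = sum(value << shift for _, shift, value in terms)
```

In this small case the single Upper term starts at position j = 3, the two Middle terms start at 0 and k = 2, and the single Lower term starts at 4, matching the shifts described in the text.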
To determine how many partial-product terms should lie in the Upper, Middle, and Lower regions, we use the calculations given in Table I. These equations are based on the calculations performed in Figure 7; for example, they require that C_A ≤ C_B.
Partial Product Addition
We first sum all partial products in each region together, then sum the results. The number of terms in each region is related to the sizes of the inputs of the asymmetric multiplier and the resulting number of digits in our operands (Table I). The total number of partial products is C_A + C_B − 1, which is controlled by the number of digits in operands A and B. The number of partial products in the Upper and Lower regions is controlled by the number of digits of the operand with the smallest digit count (A). The number in the Middle region is then calculated by subtracting the number of Upper and Lower terms from the total number of partial products. The adder tree depth for each of the Upper and Lower regions is ⌈log2(C_A − 1)⌉. The adder tree depth for the Middle region is ⌈log2(C_B − C_A + 1)⌉. Then, two more stages are required to merge the regions. It may be possible to further reduce the number of stages by one or two, but this would complicate the generation of the adder trees because of the different shift amounts. We leave this optimization for future work.
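The bookkeeping in this step can be illustrated with a short Python sketch. Table I is not reproduced in the text, so the region counts below are an assumption inferred from the description: C_A − 1 terms in each of the Upper and Lower regions, with the Middle region taking the remainder of the C_A + C_B − 1 total. The depth of each region's tree is modelled as a balanced pairwise tree, and all function names are hypothetical.

```python
from math import ceil, log2


def region_counts(c_a, c_b):
    """Assumed partial-product counts per region; requires c_a <= c_b."""
    upper = lower = c_a - 1
    middle = (c_a + c_b - 1) - upper - lower
    return upper, middle, lower


def tree_depth(terms):
    """Levels of a balanced pairwise adder tree over `terms` inputs."""
    return ceil(log2(terms)) if terms > 1 else 0


def total_depth(c_a, c_b):
    """Deepest region tree plus the two stages that merge the three regions."""
    upper, middle, lower = region_counts(c_a, c_b)
    return max(tree_depth(upper), tree_depth(middle), tree_depth(lower)) + 2


counts = region_counts(4, 6)   # e.g. 64x128 decomposed into 4 and 6 digits
depth = total_depth(4, 6)
```

The two merge stages are always counted here, which is the simple scheme the text describes; the possible saving of one or two stages that the text leaves to future work is not modelled.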
When summing two partial products within a region (any region), we take advantage of the fact that their least-significant positions do not align. The sum of two partial products is thus partly a concatenation and partly a sum. In the Middle region, partial products partially overlap (Figure 10, left); in the Upper and Lower regions, one will completely overlap another (Figure 10, right). In either case, the lower portion of the sum that is not overlapped does not need to enter an adder; it can be concatenated with the sum of the overlapping regions. The non-overlapping part at the left of both examples in Figure 10 is part of the addition because of a potential carry-out from the overlapping region. Future work will investigate the use of alternate adder types, such as carry-save adders, to further reduce the critical path of our adder trees. The order of processing partial-products depends on the region. In the Middle region, the first stage of the adder tree sums adjacent partial-products. When the number of partial-products is not a power of two, we propagate unpaired partial-products to the next adder stage until they can merge into the tree. For the Upper and Lower regions, we process partial-products from the "outside in", combining the top Upper partial-product with the bottom Upper partial-product, and so on. This emphasizes concatenation instead of addition to reduce our critical path [13]. If at any stage there is an unpaired partial-product in the middle of the region, it is passed to the next stage unchanged until it can merge into the tree. The Lower region is processed in the same manner as the Upper region. If the Upper/Lower regions have a shorter critical path, the Upper sum is added to the Lower sum before adding to the Middle. Otherwise, the Upper is added to the Middle, then that result is added to the Lower.
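The "partly a concatenation and partly a sum" idea can be shown with a minimal Python model. The function name `add_offset_terms` is hypothetical, and the sketch only demonstrates the arithmetic identity; it says nothing about how the adders are mapped to LUTs or DSP blocks.

```python
def add_offset_terms(low_val, low_shift, high_val, high_shift):
    """Add two partial products whose LSB positions differ (high_shift >= low_shift)."""
    gap = high_shift - low_shift
    bypass = low_val & ((1 << gap) - 1)       # bits below the overlap skip the adder
    adder_out = (low_val >> gap) + high_val   # overlapping and upper bits go through the adder
    return (adder_out << gap) | bypass, low_shift


# A term at bit 0 plus a term shifted left by 2: only the bits from position 2 up are added
value, shift = add_offset_terms(0b110101, 0, 0b1011, 2)
```

Because the bypassed bits are narrower than the gap, joining them to the adder output with a bitwise OR is exact, so the adder can be `gap` bits narrower than a full-width addition would require.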
Using the approaches described in this section, along with common adder-tree techniques [8] [9], we build the summation tree for our partial products.
RESULTS
We implemented a generator program in MATLAB that, using the two large multiplier operand sizes and the embedded multiplier block input sizes, generates synthesizable Verilog code for the described multiplier structure. We currently only generate combinational structures, although the above techniques could be extended to incorporate pipeline stages. The best location for the pipeline stages would depend on the depth of the adder trees.
Our generator program can target any asymmetric multiplier size, but to compare results of different multiplier design styles, we set the multiplier size parameters to match the asymmetric multipliers in the Xilinx Virtex-5 DSP48E blocks. Each of these is capable of implementing up to a 24×17-bit unsigned multiplication, and optionally also up to a 48-bit addition or accumulation. These adders can be chained through dedicated paths between DSP48E blocks that allow a 17-bit shift. This shift is designed to simplify implementation of symmetrically composed multipliers. Our multiplier designs were synthesized on the Xilinx Virtex-5 XC5VLX155 (speed grade -2) device using the XST tool (version 10.1) with optimization goal set to speed and using normal optimization effort. The device contains 128 of the DSP48E blocks. In all designs, we describe additions using Verilog, and allow XST to choose whether to implement the additions in the DSP48E blocks or in LUT-based logic.
We first test the case in the decomposition step where C_XN × C_YM = C_XM × C_YN, and demonstrate that the choice of decomposition does in fact matter in that case. For this experiment, we implemented a 64×128 multiplier (X = 64, Y = 128). For n = 24, m = 17 we get C_XN × C_YM = 3×8 = 24 and C_XM × C_YN = 4×6 = 24. Because C_XN + C_YM = 11 and C_XM + C_YN = 10, our method chooses to decompose X by m and Y by n, as this results in fewer partial product terms and hence fewer adders. Our results (Table II) confirm this choice: decomposing X by m improves combinational delay by 8.6% and saves 9.6% of the LUTs.
Next, we examined the different partial-product generation methods. We compared horizontal, vertical, and our chosen diagonal method for 64×64 multiplication. The results in Table III show that the diagonal regrouping of the terms reduces the overall combinational delay by 33.35% and reduces area by 61.21% on average compared to the horizontal and vertical groupings.
Finally, we compare our generated asymmetric multipliers (Asym) to several different baselines. First, we compare to the "naïve" (Naïve) approach of just expressing the multiplication in Verilog as Z=X*Y. Second, we use Xilinx Core Generator to create multipliers. Because Core Generator restricts generated multiplier sizes to 64×64 or smaller, these data points are only provided up to that point. Finally, we also implemented a symmetric multiplier generation method (Sym) [10], and applied it to the tested multiplier sizes. In this case we treat each 24×17 multiplier as a 17×17 multiplier. Although this is clearly inefficient, the purpose of this baseline is to highlight the importance of considering asymmetry in multiplier generation.
We compare each of these multiplier methods on the basis of DSP48E count, LUT count, and combinational delay. All multipliers are generated as combinational-only designs. Future work that adds automated adder tree pipelining will also compare pipelined versions of these multipliers. We test a set of large multiplier sizes where both operands have equal width, ranging from 17 to 128 (Figure 11 through Figure 13). We also test a set of large multiplier sizes where one operand is fixed at 64 bits, and the other varies from 17 to 128 (Figure 14 through Figure 16).
The results show that for all tested multiplier sizes, our generation method (Asym) uses fewer DSP48E blocks than any of the compared methods. Among all multiplier designs that compute digit-products entirely in DSP blocks (instead of using LUTs for some or all digit-products), the Asymmetric multipliers use the minimum possible DSP block count. This is because our generator uses the full 24×17 multiplier size, and because we choose our operand decomposition specifically to minimize the DSP block count. On the other hand, CoreGen breaks down the multiplier using symmetric 17×17 multiplications for most of the partial products. This is because the DSP blocks provide dedicated routing between them that provides a fixed 17-bit shift between chained blocks, followed by a summation of the DSP block product with the shifted value. This is very efficient in terms of LUT usage (no LUTs required for the adders), but requires underutilization of the multiplication capability of the DSP blocks.
The only exception to the use of 17×17 multipliers is at the most significant digit-products that do not require shifting; these can use the full multiplier capabilities, as given in the multiplier example from the DSP48E guide [5]. This explains why the number of DSP blocks required by CoreGen is very similar to what is needed for the Symmetric (Sym) multipliers. The Naïve results are close to those of CoreGen in terms of DSP block use; it appears to use a similar methodology.
Our generated Asymmetric multipliers also generally need fewer LUTs than the other designs (except for CoreGen, which does not use any). In a few cases, Naïve uses fewer LUTs, but not by much. Notably, our generated multipliers are faster in all but two cases: the 24×24 Naïve multiplier, and the 32×32 Symmetric multiplier. The results, however, are very close. Based on the resulting resource use, we believe that the Xilinx tools inferred multiply-add units for the Asymmetric multiplier but not for the other two, slowing the Asymmetric design slightly. Using the adders in the DSP blocks can be slower in a non-pipelined design due to increased routing lengths required to bring the other operand to the DSP block. Future work will compare pipelined versions of these architectures to determine if pipelining solves [14] .
For the 64×64 size, our generated Asymmetric multipliers were 72.8% faster than CoreGen and used only about 75% as many DSP48E blocks. It should be noted that the limitations of CoreGen did not allow us to compare against it at multiplier sizes larger than those shown.
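As a rough cross-check on the DSP48E counts, the number of digit-products each composition style needs can be counted directly. This is a back-of-the-envelope sketch, not a synthesis result: the function names are hypothetical, it counts one block per digit-product, and it ignores any blocks the tools might use for additions or any digit-products they might move into LUTs.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def asym_blocks(x_bits, y_bits, n=24, m=17):
    """Digit-products when the full 24x17 block is used with the better decomposition."""
    return min(ceil_div(x_bits, n) * ceil_div(y_bits, m),
               ceil_div(x_bits, m) * ceil_div(y_bits, n))


def sym_blocks(x_bits, y_bits, n=17):
    """Digit-products when every block is treated as a 17x17 multiplier."""
    return ceil_div(x_bits, n) * ceil_div(y_bits, n)


asym_64 = asym_blocks(64, 64)
sym_64 = sym_blocks(64, 64)
```

For 64×64 this count gives 12 blocks against 16, a ratio of 75%. That is in line with the measured ratio reported above, although the measured figure is relative to CoreGen, which uses the full block for some digit-products.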
One could use the minimum DSP48E count provided by our Asymmetric multipliers without needing LUTs for the summation of digit-products. Because only a fixed shift of 17 bits is supported in the dedicated routing paths between DSP blocks, this would necessitate using routing external to the DSP blocks to accomplish the 24-bit shift required in some cases. However, this approach would significantly increase the latency. First, external routing would be much slower than the built-in chained routing of the DSP blocks (which is limited to only the 17-bit shift). This problem could be remedied if the DSP blocks were modified to provide a configurable 24-bit or 17-bit shift, but this would also increase the size and complexity of the DSP blocks. Furthermore, regardless of external vs. dedicated routing, the resulting chained addition would be much "deeper" than the tree-style addition used in our Asymmetric multipliers. Thus, this solution, although area-efficient, may not be suitable when latency is an issue. It is therefore likely to be a better solution to make use of a small number of LUTs in our Asymmetric designs in exchange for greatly reduced latency and the faster (but less flexible) DSP blocks.
CONCLUSIONS
In this paper we presented a new automated multiplier generator technique that creates large multipliers out of asymmetric embedded multiplier blocks, as are present in some of the newer commercial FPGAs. Designing a larger multiplier out of smaller multiplier building blocks is more complex for asymmetric multiplier blocks than for symmetric multiplier blocks because there are two different shift factors involved (and various combinations of them), and partial products do not line up as exactly. We demonstrated that the decomposition of the two operands must be carefully approached, and that concatenating some of the partial products before they enter the adder tree provides significant benefit.
Although our technique could be applied to asymmetric blocks of any size, we demonstrated its benefit by applying our method to the Xilinx Virtex-5 FPGA, which contains asymmetric hard multiplier cores. We compared our generated multipliers with Naïve multipliers (using a single Verilog "*" operator to multiply the complete operands), multipliers created using Xilinx Core Generator, and multipliers created using a previous method designed for symmetric embedded multiplier blocks. In all but two of the tested cases, our generated multipliers using asymmetric blocks had a lower combinational delay. They also used the same or fewer (usually fewer) DSP blocks than all other compared designs. LUT count was equal to or lower than nearly all compared designs apart from the Core Generator version, which does not use any LUTs. However, as a percentage of overall FPGA resources, the LUT use of our multipliers is low compared to their DSP block use. Our proposed multipliers are therefore both smaller and faster than those created using these other common techniques.
