This paper introduces a heuristic solution to the multiple restricted multiplication (MRM) optimization problem. MRM refers to a situation where a single variable is multiplied by several coefficients which, while not constant, are drawn from a relatively small set of values. The algorithm involves deriving directed acyclic graphs representing multiple constant multiplication obtained for each time step and then merging these graphs into a single MRM graph. For FPGA implementation, the proposed approach results in significant area savings, especially for large problem sizes, and is time-efficient compared to a previous optimum approach using Integer Linear Programming.
I. INTRODUCTION
Many Digital Signal Processing (DSP) or arithmetic-intensive applications involve frequent use of multiplication. The multiplication operation is considered expensive, as it consumes significant logic and routing resources. For constant multiplications, it is common to use a combination of shift-and-add operations rather than full multipliers, in order to reduce hardware usage [1]. Since the shift operation can be implemented by wiring, it has negligible cost in a bit-parallel implementation. Therefore, the total hardware cost approximately corresponds to the area of the adders required.
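As a minimal illustration of the shift-and-add idea, the following Python sketch decomposes a positive constant into its binary terms and counts the adders this naive decomposition would need. The function names are hypothetical and chosen only for this example; the binary decomposition is the simplest possible scheme, not the optimized one discussed later in the paper.

```python
def shift_add_terms(c: int) -> list[int]:
    """Return the shift amounts k such that c == sum(1 << k)."""
    if c <= 0:
        raise ValueError("constant must be positive")
    return [k for k in range(c.bit_length()) if (c >> k) & 1]


def multiply_by_constant(x: int, c: int) -> int:
    """Compute c * x using only shifts and additions."""
    return sum(x << k for k in shift_add_terms(c))


def adder_count(c: int) -> int:
    """Adders used by the naive binary decomposition (number of terms - 1)."""
    return len(shift_add_terms(c)) - 1


print(shift_add_terms(165))          # shift amounts of the set bits of 165
print(multiply_by_constant(7, 165))  # same value as 7 * 165
print(adder_count(165))              # adders for the naive scheme
```

The naive scheme is rarely the cheapest. For instance, 165 = 5 + 32·5 with 5 = 1 + 4 needs only two additions, because the intermediate value 5x is reused.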
For the problem of multiple constant multiplication (MCM), common sub-expression elimination (CSE) can be applied to a set of constant multipliers in order to minimize the number of additions required [2], [3]. This paper introduces a heuristic approach to a related problem, the recently proposed multiple restricted multiplication (MRM) optimization [4]. MRM refers to a situation where a single variable is multiplied by several coefficients that, while not constant, are drawn from a finite set of constants that change with time. Such a situation arises commonly in high-level synthesis tasks due to resource sharing, for example in a folded implementation of a FIR filter [5] or polynomial evaluation [6], [7].
In summary, the main contributions of this paper are:
• A novel heuristic for the multiple restricted multiplication problem.
• FPGA implementation comparisons between the proposed approach, a standard approach, and the optimal approach previously published [4].
The remainder of this paper is structured as follows. Section II introduces the proposed heuristic, Section III presents results compared with both a standard alternative implementation and a previous optimal approach, and Section IV concludes the paper.
II. PROPOSED HEURISTIC
Existing synthesis approaches to CSE are unable to take advantage of the MRM situation, resulting in the use of expensive general multipliers, as shown in Fig. 1. Fig. 1(a) shows a Data Flow Graph (DFG) containing two multipliers with a common input x, and sets of constant multiplicands labelled {c_11, c_12, ..., c_1T} and {c_21, c_22, ..., c_2T}. The first subscript refers to the spatial index and the second to the time index, i.e. c_it is the value of multiplicand i at time t. A standard implementation using ROMs and multipliers is depicted in Fig. 1(b). In [4], it was shown that the MRM problem can be addressed by extending the basic unit of operation from an addition, used in MCM, to an adder/multiplexer combination shown in Fig. 2(a). It was further demonstrated that for Xilinx-based implementations, the Xilinx Virtex / Virtex-II slice architecture [8] can be used to implement such a basic computational unit with no area overhead compared to the equivalent adder used in MCM.
As an example, assume we have a single variable input x multiplied by two sets of coefficients {165, 132, 32} and {40, 32, 8}. An optimized implementation of this MRM problem can be seen in Fig. 2 (b), and is described below. Fig. 2 (b) is recognizable as a standard MCM solution, containing two addition nodes [9] , and generating the two values 165x and 40x. coefficients, by selecting the behaviour of the nodes from the possibilities shown in Fig. 2(a) .
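To make the example concrete, the sketch below gives a purely behavioural reference model of the two coefficient sets: at each time step, every output is the input multiplied by the coefficient selected for that step. The names `COEFFS`, `mrm_reference` and `as_time_by_output_matrix` are hypothetical, and reading position t of each set as time step t is an assumption that follows the c_it indexing convention. The model says nothing about the adder/multiplexer structure itself.

```python
# Coefficient sets from the example. Position t within each list is read as
# time step t (an assumption following the c_it convention).
COEFFS = [
    [165, 132, 32],  # coefficients of the first output over time
    [40, 32, 8],     # coefficients of the second output over time
]


def mrm_reference(x: int, t: int) -> list[int]:
    """Behavioural model: output i at time step t equals c_it * x."""
    return [row[t] * x for row in COEFFS]


def as_time_by_output_matrix(coeffs: list[list[int]]) -> list[list[int]]:
    """Rearrange into one row per time step and one column per output."""
    return [list(step) for step in zip(*coeffs)]


for t in range(3):
    print(t, mrm_reference(3, t))
print(as_time_by_output_matrix(COEFFS))
```

Any optimized MRM structure for these coefficients should reproduce the values of this model for every input and time step, which makes it a convenient check when experimenting.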
In general, we may encode an instance of the MRM problem as a T × C matrix, where T is the number of rows, corresponding to time steps, and C is the number of columns, representing distinct outputs. For example, Fig. 2 
The question arises: how many computational nodes are required for a given MRM problem? The authors have previously presented an approach that finds minimum-cost solutions to this optimization problem by formulating it as an Integer Linear Program (ILP) [4]. However, the solution time required for this ILP increases rapidly with the problem size, and in practice this approach is unsuitable for solving MRM problems with more than three time steps or three outputs. This is the motivation for the heuristic presented in this paper.
The proposed methodology involves the merging of several directed acyclic graphs (DAGs), one per time-step, where each node represents an addition operation, into a single MRM graph.
A. Generating MCM Graphs
The first step of the algorithm is therefore to generate the graphs for each time-step. This is a multiple constant multiplication (MCM) problem, for which there are many known approaches. Our synthesis system follows the algorithm proposed in [9], summarized as follows for convenience.
1. Reduce all coefficients to their equivalent odd 'fundamental', by repeated division by two.
2. Remove repeated fundamentals and the unit fundamental, as these do not incur implementation cost.
3. Create a graph with the remaining cost-1 fundamentals as nodes.
4. Try to form the remaining fundamentals through pairwise sums of the form x + 2^k x or x + 2^k y, where x and y are existing nodes, and k is a positive integer.
5. Repeat (4) until no more fundamentals can be formed.
In case this procedure does not produce all required coefficients, additional nodes must be introduced and the procedure repeated (see [9] for details). As a simple example, assume we have a 3 × 3 problem 
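Steps 1 to 5 above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of [9] itself: the names `fundamental`, `build_adder_graph` and `max_shift` are hypothetical, cost-1 fundamentals are assumed to have the form 1 + 2^k (addition only), the input x is treated as the available node 1, and the fallback that introduces additional nodes is not covered. Fundamentals that cannot be formed are simply returned as unresolved.

```python
def fundamental(c: int) -> int:
    """Step 1: strip factors of two to obtain the odd fundamental."""
    if c <= 0:
        raise ValueError("coefficient must be positive")
    while c % 2 == 0:
        c //= 2
    return c


def build_adder_graph(coeffs, max_shift=16):
    """Greedy sketch of steps 1-5; returns (nodes, unresolved).

    nodes maps each realised fundamental f to (a, b, k), meaning f = a + (b << k).
    """
    targets = {fundamental(c) for c in coeffs} - {1}  # steps 1 and 2
    nodes = {}
    # Step 3: cost-1 fundamentals, assumed here to be of the form 1 + 2^k.
    for f in sorted(targets):
        k = (f - 1).bit_length() - 1
        if f - 1 == 1 << k:
            nodes[f] = (1, 1, k)
    remaining = targets.difference(nodes)
    progress = True
    while remaining and progress:  # steps 4 and 5
        progress = False
        for f in sorted(remaining):
            available = [1, *nodes]  # the input itself plus existing nodes
            hit = next(
                ((a, b, k)
                 for a in available
                 for b in available
                 for k in range(1, max_shift)
                 if a + (b << k) == f),
                None,
            )
            if hit is not None:
                nodes[f] = hit
                remaining.discard(f)
                progress = True
    return nodes, remaining


print(build_adder_graph([165, 40]))
print(build_adder_graph([132, 32]))
print(build_adder_graph([32, 8]))
```

For the coefficient pair 165 and 40, the fundamentals are 165 and 5, and the sketch forms 5 = 1 + 4 and then 165 = 5 + 32·5, i.e. two addition nodes.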
B. Graph Merging
The novelty of the proposed approach comes in the merging of these MCM graphs to form a single implementation structure for the MRM problem. One can imagine that finding an MCM graph for a particular time step within an MRM graph is similar to subgraph isomorphism [10] , which involves mapping of a graph into a subgraph of another having the same structure.
However, the proposed process differs in that a path of the MRM graph can be mapped onto an edge of the other graph, provided that the sum of the edge weights along the path equals the weight of that edge. Such a path corresponds to a path through multiplexers in the MRM implementation.
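The path-to-edge condition can be illustrated with a short Python sketch that enumerates the paths of a graph whose edge weights add up to a required value. The representation (a dictionary from each node to its list of (predecessor, weight) pairs), the function name `paths_with_weight`, and the example weights are all hypothetical; edge weights are assumed to be non-negative integers.

```python
def paths_with_weight(preds, node, target, weight):
    """Return every path from node back to target whose weights sum to weight.

    preds maps a node to a list of (predecessor, edge_weight) pairs.
    """
    if node == target:
        return [[node]] if weight == 0 else []
    found = []
    for pred, w in preds.get(node, []):
        if w <= weight:  # weights are assumed non-negative
            for tail in paths_with_weight(preds, pred, target, weight - w):
                found.append([node, *tail])
    return found


# Hypothetical MRM graph: node 0 is the source, weights are illustrative only.
L = {
    3: [(2, 1), (1, 0)],
    2: [(0, 2), (0, 0)],
    1: [(0, 1), (0, 0)],
}

# An edge of weight 3 from node 3 to the source in the other graph can be
# mapped onto any path returned here.
print(paths_with_weight(L, 3, 0, 3))
```

An empty result means that no path of the required weight exists, so the corresponding edge cannot be mapped at that node.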
The process of merging is shown in Algorithm Graph Mapping, but before describing the algorithm, it is instructive to consider an example.
An example of merging the MCM graphs in Fig. 3 is illustrated in Fig. 4 and Fig. 5. The procedure starts by choosing one of the MCM graphs as the core MRM graph, which is then expanded as necessary. Let us take Fig. 4(a) as the MRM graph L, and Fig. 4(b) as the MCM graph S. These figures illustrate how nodes 1 and 3 of each graph can be matched with the corresponding nodes 1 and 3 of the other graph. Consider node 3, where the dashed line in Fig. 4 represents the path (3, 2, 0), which has the same weight as the edge (3, 0) in Fig. 4. Similarly, the dotted path represents another mapping, between the path (3, 1) and the edge (3, 1). The result of graph merging is depicted in Fig. 4(c), where an extra node 5 has been introduced to account for the unmapped portion of Fig. 4(b), i.e. node 2.
The process is then repeated, trying to add the MCM graph of Fig. 5(b) to the result of the previous phase, illustrated again in Fig. 5(a) . Nodes 1 and 2 in Fig. 5(b) can be matched to nodes 1 and 5, respectively, in Fig. 5(a) . The final minimized MRM graph is shown in Fig. 5(c) .
Algorithm Graph Mapping, which performs this process automatically, consists of five recursive subroutines: Main Map, Map Inedge, Check Start Nodes, Check Weights, and Check Zero Weight. The algorithm takes as input the graphs L and S, together with their source nodes l_0 and s_0, respectively. It then systematically searches the edges of L from every computational node l back to l_0; l is said to be 'matched' to a node s when there is a set of paths in the graph L, each of which can be mapped to an edge in the path from s_0 to s. In Algorithm Graph Mapping below, if Main Map(s, l) evaluates to TRUE, then we record that node l can be mapped to node s by setting mark node match[s][l]. Main Map tries to build a correspondence between two nodes l and s in the MRM and MCM graphs, respectively. Map Inedge maps a path in L onto an edge of S, which is the central difference between this procedure and one for subgraph isomorphism. Check Start Nodes checks whether the nodes s and l have reached the graph source nodes s_0 and l_0. Check Weights is called whenever both in-edges of l share the same predecessor and both in-edges of s share the same predecessor, in order to ensure that the path weight is correct. Finally, Check Zero Weight searches for an in-edge of l that has zero weight and then considers its predecessor together with s by calling Main Map.
Sometimes more than one node in L can be matched to a node in S. In order to minimize the latency of the structure when implemented, our algorithm selects the l that has the shortest unweighted path from the source l_0 to l. Graph merging terminates when all graphs have been processed.
Algorithm Graph Mapping
if pred1(l) = pred2(l) && pred1(s) = pred2(s) do
    return (Map Inedge(inedge1(l), inedge1(s), w(inedge1(s))) && Map Inedge(inedge2(l), inedge2(s), w(inedge2(s))))
           (Map Inedge(inedge1(l), inedge2(s), w(inedge2(s))) && Map Inedge(inedge2(l), inedge1(s), w(inedge1(s))))
else if pred1(l) = pred2(l) && pred1(s) = pred2(s) do
    return Check Zero Weight(w(inedge1(l)), w(inedge2(l)))
else if pred1(l) = pred2(l) && pred1(s) = pred2(s) do
    return Check Weights(inedge1(l), inedge2(l), inedge1(s), inedge2(s))
else
    return FALSE
end if
end
Map Inedge
Input: inedgex(l), inedgex(s), wx
Output: TRUE if inedgex(l) can be mapped to inedgex(s), FALSE otherwise
begin
w_dif = wx − w(inedgex(l))
if w_dif = 0 do
    return Check Start Nodes(inedgex(l), inedgex(s))
else if w_dif > 0 do
    if predx(l) = l0 do
        return Map Inedge(inedgex(l), inedge1(predx(s)), w_dif)
               Map Inedge(inedgex(l), inedge2(predx(s)), w_dif)
    else
        return FALSE
    end if
else
    return FALSE
end if
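The selection rule described above, which picks the candidate node of L closest to the source in terms of unweighted path length, can be sketched with a breadth-first search. The successor-list representation, the function names `depth_from_source` and `pick_match`, and the example graph are hypothetical; ties are broken by the smaller node label, which is an arbitrary choice made only for this illustration.

```python
from collections import deque


def depth_from_source(succs, source):
    """Breadth-first search: fewest edges from source to each reachable node."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in succs.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


def pick_match(succs, source, candidates):
    """Choose the candidate with the shortest unweighted path from source."""
    depth = depth_from_source(succs, source)
    return min(candidates, key=lambda n: (depth[n], n))


# Hypothetical graph L with source node 0.
succs = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5]}
print(pick_match(succs, 0, [5, 3]))
```

A candidate that is not reachable from the source raises a `KeyError` in this sketch, which is acceptable here because every computational node of L is assumed to be reachable from l_0.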
III. IMPLEMENTATION RESULTS
The area results for a Xilinx Virtex II [8] implementation are compared with a standard approach using ROMs and general multipliers. Experimental problems of sizes between 3 × 3 and 3 × 40, with coefficients of 4, 5 and 6 bits, are presented in this paper. Although larger problem sizes can easily be solved, this size is appropriate for comparison with the optimal approach in [4]. It should be noted that savings over the ROM/multiply implementation grow with the problem size, and the area cross-over increases with bit size, as seen from Fig. 6. The results collected show area reductions of up to 37% over the ROM/multiply implementation, and larger problems are expected to yield larger savings. Over the set of results that can also be addressed by the optimal approach in [4], the heuristic area is between 25% and 79% greater than the optimal value. This loss in area performance arises because the structure of the MCM graphs is fixed, without reference to coefficient values in other time slots. Table I compares the execution time of the proposed procedure with that of the optimal ILP from [4] for small 4-bit problems. For the biggest problem (3 × 3) that can reasonably be computed using ILP, the ILP takes 61 hours 34 minutes, whereas the proposed approach takes 2.84 seconds. This reduction in time allows the proposed procedure to be embedded within a high-level synthesis flow.
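For orientation, the ratio between the two reported execution times for the 3 × 3 problem can be worked out directly. The short Python calculation below uses only the two figures quoted above; the variable names are chosen for this illustration.

```python
# Reported execution times for the 3 x 3 problem.
ilp_seconds = 61 * 3600 + 34 * 60  # 61 hours 34 minutes, in seconds
heuristic_seconds = 2.84

speedup = ilp_seconds / heuristic_seconds
print(f"ILP: {ilp_seconds} s, heuristic: {heuristic_seconds} s")
print(f"ratio: about {speedup:,.0f}x")
```

The ratio is between four and five orders of magnitude for this single data point; it should not be read as a general scaling law.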
IV. CONCLUSION
The work presented in this paper can be considered an extension of [4], in which a novel heuristic algorithm has been proposed. It is based on merging graphs, which involves partially mapping directed acyclic graphs representing the multiple constant multiplication obtained from each set of time-step coefficients. This paper seeks to provide a framework for future research on the problem of multiple restricted multiplication for FPGA implementation. There are many ways in which the heuristic approach presented in this paper could be taken forward.
The algorithm for the MRM problem on which we have concentrated deals only with the operation of addition. It should be straightforward in future work to incorporate subtraction into our algorithm. Indeed, each node of the MRM graph could be a different operator, for instance an adder-multiplexer (as in this paper), an adder-subtractor, or a subtractor-multiplexer. Future research could also develop FPGA hardware implementations for the different types of node. Some interesting steps in this direction have recently been made by Turner [11]. Since delay is not explicitly targeted, it may also be useful to incorporate some function to improve the path delay.
