In this paper, we present a simultaneous resource allocation and binding algorithm for FPGA power minimization. To fully validate our methodology and results, our work targets a real FPGA and achieves lower power (32%) with better circuit speed (16%) compared to a traditional resource allocation and binding algorithm.
INTRODUCTION
The basic problem of high-level synthesis is the mapping of a behavioral description of a digital system into an RTL design consisting of a datapath and a control unit. A datapath is composed of three types of components: functional units (e.g., ALUs, multipliers, and shifters), storage units (e.g., registers and memory), and interconnection units (e.g., buses and multiplexers). The control unit is specified as a finite state machine, which controls the set of operations for the datapath to perform during every control step. The high-level synthesis process mainly consists of three subtasks: scheduling, allocation, and binding. Scheduling determines when a computational operation will be executed; allocation determines how many instances of resources (functional units, registers, or interconnection units) are needed; binding binds operations, variables, or data transfers to these resources.
Traditionally, designers have been more concerned with the area and power of functional units and registers. As technology advances, the area and power of multiplexers and interconnects have come to far outweigh those of functional units and registers, especially for FPGA architectures. Studies show that interconnects contribute 70-80% of the total area [27] and 75-85% of the total power [19] [20] in FPGAs. Multiplexers are particularly expensive for FPGA architectures. It is shown that the delay and power of a 32-to-1 multiplexer are almost equivalent to those of an 18-bit multiplier in 0.1 µm technology in FPGA designs [6] [7]. In general, allocating a smaller number of functional units or registers at the cost of a larger number of wide multiplexers and a larger amount of interconnect may lead to a completely unfavorable solution for both performance and power.
To tackle this increasingly alarming problem, an efficient search engine is required to explore a sufficiently large solution space while considering multiple constraining factors, such as resource allocation and binding, MUX generation, and interconnection generation, in order to optimize performance or power, or to study the tradeoff between them.
Jason Cong, Yiping Fan, Zhiru Zhang
Computer Science Department, University of California, Los Angeles {cong, fanyp, zhiruz}@cs.ucla.edu
Although low-power high-level synthesis for ASICs is an old topic, high-level synthesis for FPGA power minimization has not been widely studied. We are only aware of one previous work [7], where the optimization goal is to minimize the power of FPGA designs under performance/latency constraints. The authors adopted a simulated annealing-based algorithm, which carried out the high-level synthesis subtasks simultaneously. However, the delay model in [7] did not consider multiplexer delay, which could represent a significant portion of the critical path delay for FPGA chips. Also, [7] only worked on data-flow graphs (DFGs), and it did not model existing commercial FPGAs.
In this work we present a novel design space exploration engine, xPlore-Power, for FPGA power minimization. We concentrate on the resource allocation and binding tasks because they are the key steps that determine the interconnections during high-level synthesis. To fully validate our methodology and results, we target a real FPGA architecture, the Altera Stratix [2], which includes generic logic elements, DSP cores, and different types of memories. We design a high-level power estimator for this architecture and verify that its power estimation result is very close to that reported by Altera's gate-level power estimator, the Quartus II PowerPlay Analyzer [1]. We form, propagate, and prune binding/allocation solution points guided by our power and delay estimation. During this process, we pay attention to interconnects and multiplexers to control their power consumption and delay. Eventually, we generate a design solution curve, which can provide ideal solution points with low power and high performance.
The rest of this paper is organized as follows. In Section 2 we present related work. Section 3 provides definitions and the problem formulation. Section 4 presents CDFG simulation, and power and delay estimation. Section 5 presents a detailed description of our xPlore-Power algorithm.
Section 6 presents experimental results, and Section 7 concludes this paper.
RELATED WORK
There is extensive literature on binding and allocation problems for high-level synthesis [10] [11]. The previous work can be roughly categorized into two major groups. The first group solves register binding and functional unit binding separately. Representative algorithms include clique partitioning [28], weighted bipartite matching [15], network flow [5] [12], and k-cofamily [6]. The challenge for these approaches is how to achieve global optimization. The second group tries to address global optimality and performs simultaneous functional unit and register binding. Representative algorithms include simulated annealing [7] [8] [18], simulated evolution [21], and ILP (integer linear programming) [13] [26]. Since the subtasks of high-level synthesis are highly interrelated, simultaneous optimization approaches try to consider all the involved optimization parameters together and explore the combined solution space for overall better results. One concern for these algorithms is their scalability towards optimizing large designs. There is some work that carries out simultaneous optimization one control step at a time [16] [23]. Although this approach may have better runtime, it could lose the global optimization opportunity because it has to commit to a binding solution for each control step, which only represents a local optimum.
Most of the work mentioned above is for data-dominated behaviors, normally found in digital signal processing and image processing applications. For control-flow intensive behaviors, frequently found in network-centric systems, different optimization techniques are required to handle branch and loop conditions. There are mainly two approaches to address the hierarchical structure in the design. One approach is to process each basic block separately (a basic block contains no conditions or loops and is thus an easier problem to solve), and then handle the control flow between these blocks to reduce cost or improve performance [22] [30]. The other approach is to optimize the whole design directly on top of an internal representation (either hierarchical or flattened) for both datapath and control flow [14].
1-4244-0630-7/07/$20.00 © 2007 IEEE. 5C-4
We use a two-level CDFG representation for our input design. The first-level CDFG is a control flow graph (CFG). Each node corresponds to a basic block. The edges represent the control dependencies between the basic blocks. Each basic block contains one operation producing the control signal. If there are two or more successors, e.g., for if-then-else or switch statements, the labels on the control edges indicate the values for the respective branches to be taken. A back control edge indicates that there is a loop between the source basic block and the destination basic block. The source basic block and the destination basic block of a back edge can be the same, which indicates that the loop only crosses one basic block. At the second level, each basic block has a pure data flow graph (DFG) representation, which contains a set of operation nodes and edges (dataflows) that represent data dependencies among operation nodes. Figure 1 shows one example. After scheduling, each CDFG has a corresponding state transition graph (STG) to hold its scheduling result.
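As an illustration of the two-level representation described above, the following is a minimal Python sketch. All class and method names (`Operation`, `BasicBlock`, `ControlEdge`, `CDFG`, `single_block_loops`) are hypothetical and chosen for this example only; they are not taken from the xPlore-Power implementation. Back edges are marked explicitly by the caller rather than detected.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Operation:
    """One node of a basic block's data flow graph (second level)."""
    name: str
    kind: str                                         # e.g. "add", "mul", "cmp"
    inputs: List[str] = field(default_factory=list)   # names of producer operations


@dataclass
class BasicBlock:
    """One node of the control flow graph (first level)."""
    name: str
    operations: Dict[str, Operation] = field(default_factory=dict)
    control_op: Optional[str] = None                  # operation producing the control signal

    def add(self, op: Operation) -> None:
        self.operations[op.name] = op

    def dataflows(self) -> List[Tuple[str, str]]:
        """Return the DFG edges inside this block as (producer, consumer) pairs."""
        return [(src, op.name)
                for op in self.operations.values()
                for src in op.inputs
                if src in self.operations]


@dataclass
class ControlEdge:
    src: str
    dst: str
    label: Optional[str] = None                       # branch value that selects this edge
    is_back: bool = False                             # True for loop back edges


@dataclass
class CDFG:
    blocks: Dict[str, BasicBlock] = field(default_factory=dict)
    edges: List[ControlEdge] = field(default_factory=list)

    def add_block(self, block: BasicBlock) -> None:
        self.blocks[block.name] = block

    def connect(self, src: str, dst: str,
                label: Optional[str] = None, is_back: bool = False) -> None:
        self.edges.append(ControlEdge(src, dst, label, is_back))

    def successors(self, name: str) -> List[str]:
        return [e.dst for e in self.edges if e.src == name]

    def single_block_loops(self) -> List[str]:
        """Blocks whose back edge starts and ends at the same block."""
        return [e.src for e in self.edges if e.is_back and e.src == e.dst]


# A small example: entry branches to a one-block loop or directly to exit.
entry = BasicBlock("entry")
entry.add(Operation("a", "add"))
entry.add(Operation("c", "cmp", ["a"]))
entry.control_op = "c"

g = CDFG()
for block in (entry, BasicBlock("loop"), BasicBlock("exit")):
    g.add_block(block)
g.connect("entry", "loop", "true")
g.connect("entry", "exit", "false")
g.connect("loop", "loop", "true", is_back=True)
g.connect("loop", "exit", "false")
```

Keeping the control edges separate from the per-block DFGs mirrors the split between control dependencies and data dependencies in the text.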
Problem Formulation
High-level synthesis starting from an STG is essentially a resource allocation and binding (or sharing) problem, i.e., determining the numbers of functional units and registers, and sharing functional units among compatible operations and registers among compatible dataflows. These optimization steps have a dramatic impact on the final design quality. Careless allocation and binding will result in unaffordable interconnection resource and multiplexer usage (multiplexers are used to route data and control signals in the design), lowering the final circuit frequency and increasing the total power. Unfortunately, binding for optimizing interconnection is known to be an NP-hard problem [24]. Even binding in a general STG to minimize resource counts is a difficult problem. Meanwhile, minimizing a single objective number, e.g., interconnection unit, functional unit, or register count, cannot guarantee high design quality because these design metrics are interrelated. Therefore, instead of using resource count as the objective function, we use realistic measurements, namely performance and power, to guide our optimization. Performance is usually measured as the latency of the execution, i.e., the product of the execution path-length and the cycle time. Since the path-length is totally determined by the scheduling and thus fixed in the STG, we need only care about the frequency the final design can achieve. Therefore, our synthesis problem can be formulated as follows:
Given: A CDFG G and its STG G'.
Tasks: Construct a datapath architecture in which every functional unit is bound to a set of operations, and every register is bound to a set of dataflows.
Objectives: Maintain behavior correctness and optimize power and performance for the design on a target FPGA.
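The two quantities in this formulation can be illustrated with a short Python sketch. It computes latency as path-length times cycle time, and checks whether a functional unit binding is legal. The function names and the compatibility rule used here (operations sharing a unit must be scheduled in different control steps, each occupying a single step) are simplifying assumptions for illustration, not the paper's definition of compatibility.

```python
def latency_ns(path_length_cycles: int, fmax_mhz: float) -> float:
    """Latency = execution path-length x cycle time (cycle time in ns = 1000 / Fmax in MHz)."""
    cycle_time_ns = 1000.0 / fmax_mhz
    return path_length_cycles * cycle_time_ns


def is_valid_binding(binding: dict, step_of: dict) -> bool:
    """Check that no functional unit executes two operations in the same control step.

    binding: functional unit name -> list of operation names bound to it
    step_of: operation name -> control step assigned by scheduling (assumed single-cycle)
    """
    for ops in binding.values():
        steps = [step_of[op] for op in ops]
        if len(steps) != len(set(steps)):
            return False
    return True


# The path-length is fixed by the schedule, so a higher Fmax directly means lower latency.
slow = latency_ns(120, 100.0)
fast = latency_ns(120, 125.0)

# m2 and m3 are scheduled in the same step, so they cannot share a multiplier.
step_of = {"m1": 1, "m2": 2, "m3": 2}
shared_ok = is_valid_binding({"mul0": ["m1", "m2"], "mul1": ["m3"]}, step_of)
shared_bad = is_valid_binding({"mul0": ["m2", "m3"], "mul1": ["m1"]}, step_of)
```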
POWER AND DELAY MODELING
To efficiently search the solution space during resource allocation and binding, we need a fast and accurate high-level power and performance estimator to guide the process. We first present an efficient switching activity calculator using CDFG simulation. We then present our power characterization method for one family of commercial FPGAs, the Altera Stratix [2]. We would like to emphasize that a similar method can be applied to other types of FPGAs from other FPGA vendors as well. Finally, we present our resource characterization method to estimate the area and speed of different functional units and multiplexers for Stratix.
CDFG Simulation and Switching Activity Estimation
We carry out test vector-based CDFG functional simulation. The simulation process is iterative. In each iteration, a set of test vectors arrives on the primary inputs of the CDFG. These values follow the control and data flows and propagate through the graph until they reach the outputs of the CDFG. Then, another set of vectors arrives for the next iteration. During the propagation, the data are processed by the operators within the basic blocks and then passed on to branches or loops as determined by the conditions. Data can also be loaded from or stored into the memories. During this simulation, we can profile the CDFG and collect useful information for calculating switching activities, block visiting probabilities, worst-case latency, etc.
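The profiling loop described above can be sketched as follows. This is a simplified Python illustration, not the authors' simulator: each basic block is modeled as a hypothetical pair of a cycle count and a function that updates the variable environment and returns the next block. The visiting probability is taken here as the fraction of all block visits, which is one possible definition and an assumption of this sketch.

```python
from collections import Counter


def simulate(blocks, entry, vectors, max_visits=10_000):
    """Run every test vector through the CFG and profile it.

    blocks:  block name -> (cycles, run), where run(env) mutates env and
             returns the name of the next block, or None at the end
    vectors: list of dicts giving primary input values for one iteration
    """
    visits = Counter()
    worst_latency = 0
    for vec in vectors:
        env = dict(vec)                 # fresh variable environment per iteration
        name, latency, count = entry, 0, 0
        while name is not None:
            count += 1
            if count > max_visits:
                raise RuntimeError("simulation did not terminate")
            cycles, run = blocks[name]
            visits[name] += 1
            latency += cycles
            name = run(env)             # branch/loop decision made by the block
        worst_latency = max(worst_latency, latency)
    total = sum(visits.values())
    prob = {b: n / total for b, n in visits.items()}
    return visits, prob, worst_latency


def init(env):
    env["acc"] = 0
    return "loop" if env["n"] > 0 else "done"


def loop(env):
    env["acc"] += env["x"]
    env["n"] -= 1
    return "loop" if env["n"] > 0 else "done"   # back edge while n > 0


def done(env):
    env["out"] = env["acc"]
    return None


blocks = {"init": (1, init), "loop": (2, loop), "done": (1, done)}
vectors = [{"n": 0, "x": 5}, {"n": 3, "x": 2}]
visits, prob, worst = simulate(blocks, "init", vectors)
```

The same pass could also record the operand values seen by each operation, which is the information needed later for switching activity calculation.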
For switching activity calculation, we extend a method published in [4] , which performs simulation just once at the beginning and computes switching activities for any legal binding without repeating simulations afterwards. We add loop support in the algorithm. The handling for operations not in loops is the same as the method in [4] .
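The idea of simulating once and then evaluating any binding can be sketched in Python as follows. The sketch assumes that the toggle count on a shared functional unit input is the Hamming distance between consecutive input words in execution order, and that the bound operations execute in the listed order for every stimulus. Both are assumptions made for illustration, and loop iterations are not modeled. The names `hamming`, `toggle_count`, and `trace` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two input words differ."""
    return bin(a ^ b).count("1")


def toggle_count(trace: dict, shared_ops: list) -> int:
    """Toggle count seen by one unit input for the operations bound to it.

    trace:      operation name -> list of input words, one per primary-input
                stimulus, recorded during a single functional simulation
    shared_ops: operations bound to the unit, in execution order
    """
    num_stimuli = len(trace[shared_ops[0]])
    # Words arrive at the unit stimulus by stimulus, operation by operation.
    sequence = [trace[op][j] for j in range(num_stimuli) for op in shared_ops]
    return sum(hamming(x, y) for x, y in zip(sequence, sequence[1:]))


# Recorded once by simulation; reused for every candidate binding.
trace = {"o1": [0b0011, 0b0011], "o2": [0b1100, 0b1100]}

separate = toggle_count(trace, ["o1"]) + toggle_count(trace, ["o2"])
shared = toggle_count(trace, ["o1", "o2"])
```

In this example each operation alone sees constant inputs, while sharing one unit interleaves two different words and makes every bit toggle on each switch, which is why binding decisions affect power.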
Let $(PI_1 \to PI_2 \to \cdots \to PI_K)$ be a sequence of stimuli enforced on the primary inputs of the CDFG $G$. By performing functional simulation on $G$ with primary input stimulus $PI_j$ ($1 \le j \le K$), we can obtain input bit vector $I_{i,j}$ for operation $O_i$ ($1 \le$ … Suppose all of these operations are in the same loop with a loop iteration upper bound $B$. (We handle cases when these operations are not in the same loop as well; details are not shown due to space limit.) We define $I_{i,j}^{(x)}$ to represent the input bit vector for operation $O_i$ when the simulation takes primary input stimulus $PI_j$ and reaches loop iteration $x$ ($1 \le x \le B$) for $O_i$. The toggle count between $C_{in}(O_i, O_{i+1})$ and $C_{in}(O_N, O_1)$ under this primary-input stimulus sequence is then defined as follows: …
Table 1 shows some details of our characterization. We use Fmax = 100 MHz and toggle rate (switching activity) = 100% for the purpose of power …
When we reach node 1, we know there will be one multiplier in the solution space. When we reach node 2, there will be two cases: {1, 2} or {(1, 2)}.
{1, 2} means that 1 and 2 occupy two different multipliers, and {(1, 2)} means that 1 and 2 share the same multiplier. Each case represents one solution point for the design processed so far. When we reach node 3, we know that there have to be two multipliers in the design because 2 and 3 are not compatible. The possible solution points will be {1, 2, 3}; {(1, 2), 3}; and {(1, 3), 2}. Similarly, we will process node 4, which will have a total of seven solution points. All of the solution points on node 4 inherit the solution points generated on node 3. In other words, solution points on node 3 propagate to node 4. For example, solution points {(1, 2), … {(1, 2, 4), 3} …
We use the longest combinational path in Figure 3(a) as the delay for this solution point (a combinational path starts from one of the registers on top and ends at one of the registers on bottom). We use the estimated power value (Section 4) as the power for the solution point. Notice that multiplexers are naturally included in the power and delay calculations. Figure 3(… configurations in the partial datapaths before the end of the search, and the final desired datapath may be quite different. Therefore, we do not want to be too strict and greedy during the solution space search procedure. As long as two solution points have different delays, we keep them. Of course, there is an upper limit on the number of solution points we can keep. The more solution points we keep, the larger the solution space we are able to search, but at the cost of a larger runtime. We keep the M solution points that possess the M shortest delays explored so far. The far left point in Figure 3(b) represents the final best solution in terms of both power and delay among all the solutions. Different designs will have different curves. It is possible that a smaller power has to be achieved by sacrificing performance, or that a smaller delay has to be achieved by sacrificing power. Due to space limit, we omit a formal description of the algorithm.
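The propagation of solution points from node to node can be illustrated with a short Python sketch. It is not the omitted formal algorithm; it only enumerates bindings for the four-multiplier example. The text states that nodes 2 and 3 are not compatible; the sketch additionally assumes that nodes 3 and 4 are not compatible, an assumption made here because it reproduces the seven solution points reported for node 4. The function name `extend` is hypothetical.

```python
def extend(points, node, conflicts):
    """Propagate every solution point to a newly visited operation node.

    points:    list of solution points; each point is a list of tuples, and
               each tuple holds the operations sharing one functional unit
    conflicts: set of frozensets naming pairs of incompatible operations
    """
    new_points = []
    for point in points:
        # Case 1: the new operation gets its own functional unit.
        new_points.append(point + [(node,)])
        # Case 2: the new operation joins an existing unit, if compatible
        # with every operation already bound to that unit.
        for k, unit in enumerate(point):
            if all(frozenset((node, other)) not in conflicts for other in unit):
                new_points.append(point[:k] + [unit + (node,)] + point[k + 1:])
    return new_points


# (2, 3) comes from the text; (3, 4) is an assumption of this sketch.
conflicts = {frozenset((2, 3)), frozenset((3, 4))}

points = [[]]          # one empty solution point before any node is visited
counts = []
for node in (1, 2, 3, 4):
    points = extend(points, node, conflicts)
    counts.append(len(points))
```

Each point on node 4 is derived from exactly one point on node 3, which is the inheritance relation described above. In the full flow each point would also carry its estimated delay and power.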
We observe that a small number of solution points (e.g., M = 10) can already produce excellent results that are close to those generated with a larger number of solution points (e.g., M = 50). The exploration is fast, usually finishing within 1 minute on a 2 GHz Linux machine.
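The pruning rule (keep solution points with distinct delays, up to the M shortest delays) can be sketched in Python as below. The text does not say which point survives when two points have the same delay; keeping the lower-power one is an assumption of this sketch, as is the exact comparison of floating-point delays. The function name `prune` and the tuple layout are hypothetical.

```python
def prune(points, m):
    """Keep at most m solution points with distinct, shortest delays.

    points: iterable of (delay, power, label) tuples
    """
    best = {}
    for delay, power, label in points:
        # Assumption: among points with equal delay, keep the lowest power.
        if delay not in best or power < best[delay][1]:
            best[delay] = (delay, power, label)
    # Delays are unique keys, so sorting orders the survivors by delay.
    return sorted(best.values())[:m]


candidates = [
    (5.0, 120.0, "a"),
    (5.0, 100.0, "b"),   # same delay as "a" but lower power
    (4.0, 150.0, "c"),
    (6.5, 90.0, "d"),
    (7.0, 80.0, "e"),    # dropped when m = 3: fourth shortest delay
]
kept = prune(candidates, 3)
```

A larger `m` keeps slower but lower-power candidates such as "e" alive for later propagation, at the cost of more points to extend at every node.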
6. EXPERIMENTAL RESULTS
Simulation and Power Estimation Analysis
Figure 4: Estimated and reported power over static probability on benchmark pr (x-axis: input port static probability).
Figure 4 shows the correlation between the estimated power and reported power from another angle. The x-axis is the static probability (the probability of being logic high) on the input pins of the design. Different static probabilities on the input pins imply different switching activity on the inputs. (The switching activity for an input can be calculated as $2 P_v (1 - P_v)$, where $P_v$ is the probability of input $v$ being 1.) We observe that the two curves are very close and have a similar trend. This shows that our power estimator is sound and able to provide meaningful guidance for the low-power design space exploration. To verify the fidelity of our delay model, we carry out another experiment to compare delays reported from both xPlore-Power and Quartus II. Figure 5 …
… 1 and 0 during simulation, compared to the total number of output ports present in the netlist.
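The relation between input static probability and switching activity quoted in this section, 2 Pv (1 - Pv), can be evaluated with a few lines of Python. The function name is hypothetical. The expression is the probability that two consecutive input values differ when each is 1 with probability Pv independently of the other.

```python
def input_switching_activity(p_high: float) -> float:
    """Switching activity 2 * Pv * (1 - Pv) for an input with static probability Pv."""
    if not 0.0 <= p_high <= 1.0:
        raise ValueError("static probability must lie in [0, 1]")
    return 2.0 * p_high * (1.0 - p_high)


# Sweep the static probability as on the x-axis of the power-versus-probability plot.
sweep = {p / 10: input_switching_activity(p / 10) for p in range(0, 11)}
peak = max(sweep, key=sweep.get)
```

The activity is symmetric around Pv = 0.5, where it peaks, and falls to zero for inputs that are constantly low or constantly high.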
… using graph coloring. We only examine register allocation and binding here to narrow down the comparison criteria. Our algorithm carries out functional unit and register allocation and binding through xPlore-Power, while the graph coloring algorithm carries out the same functional unit allocation and binding through xPlore-Power but performs register allocation and binding through its own method. The goal of the graph coloring algorithm is to minimize the number of registers during allocation. Graph coloring is a well-known technique for solving binding and allocation problems. Coloring the lifetime conflict graph of the dataflows with the minimum number of colors is equivalent to finding the best clique partitioning solution on the corresponding compatibility graph of the same dataflows. Some previous work on high-level synthesis used similar algorithms [9] [28]. We compare our result to that generated from a well-known package, lmXRLF, available from [17]. lmXRLF works purely on coloring the graph with the minimum number of colors. It uses a novel search algorithm to find the best independent set in the graph, one by one, according to an objective function that is related to the number of incident edges as well as two layers of neighborhoods. For our case, finding an independent set in the conflict graph is equivalent to finding a binding solution for all the nodes in the independent set (they are all compatible with one another). To make lmXRLF power aware, we estimate the switching activities for the nodes included in an independent set, and change the original cost by considering switching activity factors. We name this variant of lmXRLF lmXRLF-Power. Table 5 shows the detailed power and Fmax values for each algorithm. On average, lmXRLF-Power only offers a 3% improvement in power consumption compared to lmXRLF. The reason is that it only models the power consumption of the registers and does not model the multiplexers generated for the datapath.
On the other hand, xPlore-Power is 32% better on power and 16% better on Fmax compared to lmXRLF. All the data are obtained after placement and routing using Quartus II. 
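For readers unfamiliar with register binding by graph coloring, the following Python sketch builds a lifetime conflict graph and colors it greedily, so that each color corresponds to one register. This is a plain greedy coloring used for illustration; it is not lmXRLF or lmXRLF-Power, it has no power-aware cost, and the lifetime intervals and function names are hypothetical.

```python
def conflict_graph(lifetimes: dict) -> dict:
    """Build the lifetime conflict graph of a set of dataflows.

    lifetimes: dataflow name -> (birth, death) as a half-open interval of
               control steps; overlapping lifetimes cannot share a register
    """
    names = list(lifetimes)
    graph = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (s1, e1), (s2, e2) = lifetimes[a], lifetimes[b]
            if s1 < e2 and s2 < e1:          # the two intervals overlap
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_coloring(graph: dict, order: list) -> dict:
    """Give each node the smallest color not used by an already colored neighbor."""
    color = {}
    for node in order:
        used = {color[n] for n in graph[node] if n in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color


lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
graph = conflict_graph(lifetimes)
order = sorted(lifetimes, key=lambda n: lifetimes[n][0])   # by birth time
register_of = greedy_coloring(graph, order)
num_registers = len(set(register_of.values()))
```

Dataflows with the same color form an independent set in the conflict graph and can therefore share one register. Minimizing the number of colors says nothing about the multiplexers this sharing creates, which is the limitation discussed above.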
