Behavior synthesis and optimization beyond the register-transfer level require an efficient utilization of the underlying platform features. This article presents a platform-based resource binding approach based on a Distributed Register-File Microarchitecture (DRFM), which makes efficient use of distributed embedded memory blocks as register files in modern FPGAs. DRFM contains multiple islands, each having a local register file, a functional unit pool, and data-routing logic. Compared to the traditional discrete-register counterpart, a DRFM allows use of the platform-featured on-chip memory or register-file IP blocks to implement its local register files, which results in a substantial saving of multiplexing logic and global interconnects. DRFM provides synthesis algorithms with a useful architectural template and a direct optimization objective: minimizing interisland connections. Given the scheduling solution and resource (functional unit) constraints, two novel algorithms in the resource binding stage are developed based on DRFM: (i) a simultaneous DRFM clustering and binding algorithm, which decides the configuration of DRFM and the assignment of operations into islands with the focus on optimizing global connections; (ii) a data-forwarding scheduling algorithm, which takes advantage of the operation slacks to handle the read-port restriction of register files. On the Xilinx Virtex4 FPGA platform, experimental results with a set of real-life test cases show a 50% logic area reduction achieved by applying our approach, with a 14.6% performance improvement, compared to the traditional discrete-register-based approach.
Also, experiments on small-size designs show that our algorithm produces the same number of total connections and
INTRODUCTION
With the advancement of integrated circuit technology, interconnects have had an increasingly large impact on the quality of results (QoR). The shrinking cycle time, combined with the growing resistance-capacitance delay, die size, and average interconnect length, increases the share of interconnect delay, especially global interconnect delay, which does not scale well with feature size. The area and power of interconnects now far outweigh the area and power of functional units and registers. For Field-Programmable Gate Arrays (FPGAs), studies show that interconnects contribute 70 to 80% of the total area [Singh et al. 2002] and 75 to 85% of the total power [Kusse and Rabaey 1998]. Multiplexors, which are collections of interconnects that route data but perform no computation, are particularly expensive on FPGA platforms. At the register-transfer level, a multiplexor is required when multiple data sources feed into a single port of a resource instance (register or functional unit) in multiple control steps (c-steps). As shown in Figure 1(a), the behavior of a design is represented in a scheduled DataFlow Graph (DFG). Round nodes represent operations, and numbered rectangles represent variables that need to be stored in registers. Variables with the same number share a common register. After resource binding using discrete registers, a datapath is generated, as shown in Figure 1(b), where two multiplexors are needed to route the dataflows in different c-steps. This is indeed the optimal datapath using discrete registers if one functional unit is the hard resource constraint. The datapath can be improved if a register-file microarchitecture is applied, as shown in Figure 1(c), where no multiplexor is required at all. In fact, the multiplexors in the first datapath are absorbed and replaced by the dedicated decoder of the 1-write, 2-read-port register file in the second datapath.
Fig. 1. Advantages of register files over discrete registers: (a) a scheduled dataflow graph with register binding indicated on each variable; (b) binding using discrete registers; (c) binding using a register file.

Table I. On-Chip RAM Blocks on Virtex4 and Stratix FPGA Devices

However, due to the limited numbers of read and write ports, a centralized register file may not work for highly parallelized applications which require multiple simultaneous data reads and writes. The port numbers of register files are limited because the implementation cost of a register file is very sensitive to its port number. As pointed out in Rixner et al. [2000], the area and power consumption of a register file grows cubically with its port number. Advanced FPGA devices, such as Virtex IV [Xilinx] and Stratix II [Altera], are not able to implement register files with more than two write ports in their on-chip memory blocks. Suppose the DFG in Figure 1 were duplicated three times horizontally; a 3-write, 6-read-port register file would then be required, which is very expensive if not impossible to implement. In comparison, a distributed 3-register-file datapath would be more efficient in this case.
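The centralized-versus-distributed comparison above can be made concrete with a rough calculation. The sketch below is only an illustration under stated assumptions: it takes the cubic growth reported by Rixner et al. [2000] at face value, treats the "port number" as the sum of read and write ports, and ignores constant factors and the number of registers per file. The function name `relative_cost` is hypothetical.

```python
def relative_cost(read_ports: int, write_ports: int) -> int:
    """Relative register-file cost under the assumed model: (total ports) ** 3."""
    return (read_ports + write_ports) ** 3


# One centralized 3-write, 6-read-port register file.
centralized = relative_cost(read_ports=6, write_ports=3)

# Three distributed 1-write, 2-read-port register files.
distributed = 3 * relative_cost(read_ports=2, write_ports=1)

print(centralized, distributed, centralized / distributed)
```

Under these assumptions the three small register files cost roughly one ninth of the centralized one, which is the intuition behind distributing the storage.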
The use of distributed register files is further encouraged on platforms with rich on-chip memory or register-file IP blocks. For example, in Xilinx Virtex series or Altera Stratix devices, memory IP blocks are abundantly distributed on the chips, so that the implementation of register files on them is "free" if they are not used for other data storage and the resource capacity bound is not exceeded. Table I shows data related to memory blocks on Virtex4 [Xilinx] and Stratix [Altera]. Since the implementation of multiplexors on FPGAs is very expensive [Chen et al. 2003] and register files reduce multiplexor use on such platforms, it is not surprising to see a dramatic improvement in area and performance when on-chip memories are used to implement distributed register files.
This article addresses the problem of full utilization of register files during behavior synthesis. In particular, the contributions of this article are as follows.
(1) A Distributed Register-File Microarchitecture (DRFM) is presented as a parameterizable microarchitecture template. DRFM contains multiple islands, each having a register file, a functional unit pool, and data-routing logic. It imposes no specific restriction for scheduling; that is, existing scheduling algorithms may not need to distinguish DRFM from the traditional discrete-register-based microarchitecture. DRFM will be particularly beneficial for reducing the interconnect complexity in FPGA designs.
(2) The properties of DRFM are investigated, and specific optimization goals, namely, the total number of interisland connections and the maximum number of feeding-in connections among all islands, are proposed for minimizing interconnect and multiplexor complexity. The complexity of the optimization problem is also analyzed.
(3) Given the scheduling solution and the resource constraint, a simultaneous resource clustering and binding algorithm is proposed to decide the DRFM configuration and target DRFM to optimize global interconnects directly.
(4) To handle the read-port restriction of the register files, a novel data-transfer scheduling algorithm is presented, taking advantage of operation slacks to minimize the required number of read ports.
(5) The resource binding problem based on DRFM is proved to be NP-hard.
For the purpose of optimality study, an Integer-Linear-Programming (ILP) formulation is presented to evaluate the quality of our heuristic.
The organization of the article is as follows. After the discussion on related work in Section 2, the DRFM concept is presented in Section 3. Following the preliminaries in Section 4 and the problem formulation and complexity analysis in Section 5, the DRFM configuration and binding algorithm is discussed in Section 6. Section 7 presents the ILP formulation, and Section 8 discusses how to handle the register-file read-port restriction by using a data-forwarding algorithm. Extensions to CDFGs and operation chaining are discussed in Section 9. Experimental results are presented in Section 10, followed by our conclusions in Section 11.
RELATED WORK AND OUR CONTRIBUTION
There is extensive literature on general binding algorithms in high-level synthesis targeting discrete registers, where functional units access all registers directly and with assumed equal cost [Huang et al. 1990; Stok and Philipsen 1991; Gajski et al. 1992; De Micheli 1994; Chang and Pedram 1995; Gebotys 1997; Chen and Cong 2004; Cong and Xu 2008]. However, the increasing interconnect effect encourages research on architectures that exploit physical locality by operating on data close to where it is stored. Jeon et al. [2001] and Kim et al. [2001] proposed a distributed-register architecture, where registers are distributed so that each functional unit can perform a computation by reading/writing data from/to its local dedicated registers. Data transfers between different functional units are regarded as global communications that may take multiple cycles, which decouples communication and computation. Further improvement is shown in Cong et al. [2004], which presents a Regular Distributed Register (RDR) microarchitecture and an architectural synthesis methodology, with the emphasis on multicycle on-chip communication for synchronous designs. The work in Huang et al. [2007] targets memory-intensive applications and proposes a method for partitioning and scheduling array data and operations into distributed architectures. However, these microarchitectures do not specifically use register files or on-chip embedded memories for register-file implementation.
One of the early research projects related to register-file architecture in behavior synthesis is the Hyper system [Rabaey et al. 1991]. It proposed using a register file to replace the cluster of discrete registers driving each multiplexor (if feasible). However, the register files are introduced only during postprocessing, after traditional binding is accomplished, and the authors did not have a register-file-based microarchitecture in mind before this step. Since a good interconnect structure using discrete registers is not necessarily good for a register-file-based microarchitecture, opportunities for optimizing interconnects and multiplexors may be lost in this approach. A simple example is illustrated in Figure 2. Suppose we are provided with only 1-write-port register files. Binding solution (c) uses a register file derived from the traditional binding solution (b), and reduces the 3-to-1 multiplexor to a 2-to-1 multiplexor. Note that registers 1 and 2 cannot be grouped into a register file since they have a "write" competition in the same control step, as do registers 2 and 3. The more aggressive binding solution, (d) to (e), uses two distributed register files by taking our approach (discussed in later sections), and eliminates one multiplexor. Note that solution (d) introduces a new register element 4 which replaces register 1 in (a).
There has been research that focuses on synthesis for minimizing register-file (or memory-module) numbers or port numbers, so that the interconnects can be optimized indirectly. In Luthra et al. [2003] the authors propose a hardware/software cosynthesis approach to allocate data to shared memories on FPGAs. Their algorithm uses lifetime information produced by a scheduling algorithm and minimizes the number of memory instances in order to simplify multiplexors indirectly. Kim and Liu [1995a] and Lee and Hwang [1995] discuss scheduling algorithms for minimizing memory numbers or port numbers. All of these approaches focus on scheduling techniques; none of them addresses the resource binding stage.
In Kim and Liu [1995b] , the authors apply interconnect minimization techniques during variable allocation (after operation binding) for datapaths with multiport memory modules. Our approach differs from Kim and Liu [1995b] in that we consider the bindings for operations and variables together in a unified way, targeting an island-based microarchitecture template.
The works introduced previously are all in the field of behavioral synthesis, which generates application-specific FPGA/ASIC designs. For general processors, there has also been extensive research on distributed architectures [Farkas et al. 1997; Rixner et al. 1998; Dally and Lacy 1999; Khailany et al. 2001; Seznec et al. 2002; Bunchua 2004]. In Bunchua [2004], the author proposes replacing the centralized register file used in traditional processors with distributed register files and shows the impact of different register-file configurations on performance, area, and power. Besides the differences in microarchitecture, such as the data-transfer method among different register files and the organization of each cluster, the fundamental distinction between Bunchua [2004] and our approach is that Bunchua [2004] determines the processor configuration in advance and then maps applications onto it dynamically or statically, while our approach optimizes and configures the whole datapath targeting one specific design.
DISTRIBUTED REGISTER-FILE MICROARCHITECTURE
The essential insight behind many approaches discussed in Section 2 is that communication (or data transfers) should be localized as much as possible so that the interconnect effect is minimized. With a similar insight in mind, we present a Distributed Register-File Microarchitecture (DRFM) for resource binding in behavior synthesis. Figure 3 presents one of the multiple computational islands of this microarchitecture. Each island contains a Local Register File (LRF), a Functional Unit Pool (FUP), and data-routing logic. The LRF plays a key role in an island, since it is used to store the values produced by the internal FUP of the island. The LRF also provides data to the FUPs in this island and in external islands. Data-routing logic is, as implied by its name, used for routing data from external islands. The multiplexors in front of the FUP may be used to select the correct data, either from the LRF or from the data-routing logic, at each control step. Note that these multiplexors, if used, are usually much smaller than those in a datapath using discrete registers. Hereafter, let $M = \{I_1, \ldots, I_K\}$ denote a DRFM configuration with K islands. Suppose I is an island; we use LRF(I) and FUP(I) to represent its LRF and FUP, respectively.
In an ideal DRFM configuration, each LRF is restricted to a single write port, but there is no restriction on the number of read ports. No data replication is allowed in this datapath; that is, a variable can only be stored in one register element of a fixed register file during its lifetime. We will investigate how to handle the read-port restriction in Section 8.
DRFM provides many advantages in behavior synthesis. First, it is a semiregular microarchitecture template. Although it has a write-port restriction on each LRF, it provides much more flexibility than the traditional VLIW and DSP architectures because DRFM has no restrictions on data-routing structures and configurations of the LRF and FUP. Those flexible configurations should be determined by the application and synthesis algorithms. Second, DRFM provides a template and specific optimization goals for synthesis algorithms. For example, the data-routing logic should be optimized by any synthesis algorithm targeting DRFMs to minimize the interconnect and multiplexor complexity. Last, modern FPGA platforms are very efficient for implementing DRFMs, given their rich on-chip memory resources. Table I in Section 1 summarizes the total number of memories for the Virtex4 series. We can see that memories are abundant and well distributed over the whole chip. Given a reasonable DRFM configuration, where most dataflows happen on the intraisland interconnects, a good physical synthesis tool will place all the resources belonging to a single DRFM island physically together. Therefore, intraisland interconnects have better timing than interisland interconnects. The design and optimization goal of our proposed algorithm in Section 6 is based on this property of DRFM. Experiments in Section 10 show that this assumption holds in practice.
PRELIMINARIES AND DRFM BINDING PROPERTIES

Preliminaries
The behavioral kernels of an application to be synthesized are represented as DataFlow Graphs (DFGs). A DFG is a Directed Acyclic Graph (DAG), G(V, E), where every node represents a computational operation, such as an addition or a multiplication, and every directed edge (u, v) represents a dataflow produced by operation u and consumed by v. In a scheduled DFG, every operation is assigned to a control step (c-step). We use T to denote the total number of control steps. Hereafter, unless stated otherwise, we assume simplified DFGs, where each operation v takes exactly one c-step and produces exactly one output variable. We use the same notation for an operation and the variable it produces.
The compatibility relation is a partial order. The compatibility graph, with respect to scheduled DFG G(V , E), is denoted as G c (V , E c ), where E c is called the compatibility-edge set, and
Note that the definition of compatibility in this article is different from the traditional one, which requires u and v to have the same operation type in addition to their scheduling relationship. In the scheduled DFG of Figure 4 , there are compatibility edges from v 1 to v 2 , and from v 7 to v 10 , etc., which are not explicitly shown in the figure.
Operations u and v are incompatible if there is no compatibility edge (u, v) or (v, u). In a valid resource binding of a scheduled DFG G, each operation is assigned to a functional unit, and each variable is assigned to a register.
Two operations cannot share one functional unit if they have different functional types or are incompatible. Two variables cannot share one register if their lifetimes conflict. In the DFG of Figure 4, operations $v_1$ and $v_6$ cannot share the same functional unit, while variables $v_6$ and $v_9$ may share the same register to hold their values since their lifetimes are disjoint.
A valid resource binding defines a complete datapath, including the multiplexors required to connect the functional units and registers. In addition to the numbers of the functional units and registers, the multiplexor structures usually impact the timing and area of the datapath dramatically.
DRFM Binding
Using the definition of DRFM in Section 3, within island I, a local operation issued in functional unit pool FUP(I) always writes its output variable into local register file LRF(I), which has only one write port. Therefore, in a valid resource binding of G onto DRFM M, if operation v is assigned to FUP(I), then its variable v must be stored into LRF(I) at c-step T (v), and any other operation cannot write data into LRF(I) at T (v). Noting this, and if we ignore the detailed way in which variables are allocated and addressed within a register file, we have the following definition.
Definition 4.3 is simpler than the traditional definition of discrete-register-based binding because it unifies the binding solution for both operations and variables. It is a hint leading to a cleaner problem formulation on DRFM binding.
Based on Definition 4.3, we have that for any feasible B(G, M), the operations bound in the same island must be in a chain in G c . Since each island has only one write port in its LRF, and each operation takes exactly one c-step and produces one variable, the operations bound in the same FUP must be scheduled into different c-steps. If the operations are sorted according to their c-steps, the compatibility edges among adjacent operations will form a chain in G c .
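The chain property above translates into a simple feasibility test and a lower bound on the number of islands. The following is a minimal sketch, assuming single-cycle operations and assuming that two operations are compatible whenever they are scheduled in different c-steps (as the chain argument suggests); functional-unit types are ignored. The function names and the example schedule are hypothetical.

```python
from collections import Counter


def is_feasible_island_assignment(cstep, island_of):
    """Return True if no island receives two operations in the same c-step.

    cstep:     dict mapping operation -> scheduled c-step
    island_of: dict mapping operation -> island name
    """
    seen = set()
    for op, island in island_of.items():
        key = (island, cstep[op])
        if key in seen:  # second write to this LRF in the same c-step
            return False
        seen.add(key)
    return True


def min_islands_lower_bound(cstep):
    """Largest number of operations sharing one c-step.

    Under the assumed compatibility rule this equals the minimum number
    of chains needed to cover all operations.
    """
    return max(Counter(cstep.values()).values(), default=0)


# Hypothetical schedule: two operations in c-step 1, one each in c-steps 2 and 3.
cstep = {"a": 1, "b": 1, "c": 2, "d": 3}
good = {"a": "I1", "b": "I2", "c": "I1", "d": "I1"}
bad = {"a": "I1", "b": "I1", "c": "I2", "d": "I2"}

print(is_feasible_island_assignment(cstep, good))
print(is_feasible_island_assignment(cstep, bad))
print(min_islands_lower_bound(cstep))
```

The binding `bad` is rejected because operations a and b would both write into the single write port of island I1 in c-step 1.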
Hereafter, for binding solution B(G, M), we will not distinguish an island and its associated chain in $G_c$; that is, I = B(v) represents both the island to which v is bound and the chain in $G_c$ that contains v. For the example in Figure 4, chain $I_a = \{v_1, v_2, v_3, v_4\}$ is bound into an island, as is chain $I_c = \{v_6, v_7, v_8\}$. In this example, at least four islands are required to obtain a feasible binding, since there are at least four chains in this scheduled DFG. This is also indicated by the following property.

THEOREM 4.4. B(G, M) will not be feasible if the number of the islands in M is less than the minimum number of node-disjoint chains in $G_c$.

Intuitively, the local dataflows within a chain are carried through local physical connections between the LRF and FUP, while interchain dataflows have to be carried by global interisland connections. Since DRFM assumes point-to-point interisland connections, two dataflows can share a global connection if and only if they are produced from a common chain at different c-steps and also consumed in another common chain at different c-steps. In the same example of Figure 4, dataflows $(v_1, v_7)$ and $(v_2, v_8)$ may share a global connection between islands $I_a$ and $I_c$. In contrast, dataflows $(v_6, v_9)$ and $(v_7, v_9)$ must use two different global connections between $I_c$ and $I_d$ since they are consumed at the same c-step.
Interisland Connections
In a feasible DRFM binding
can share a common interisland connection if and only if $u_j \neq v_j$; that is, the two dataflows are consumed by two different and compatible operations.
$IIC_B(I_i, I_j)$ cannot be larger than the maximum number of fanins among all the operations of the chain $I_j$.
For Figure 4 we have the results shown next, given that dataflows (v 1 , v 7 ) and (v 2 , v 8 ) can share an interisland connection, while (v 6 , v 9 ) and (v 7 , v 9 ) cannot share one.
As shown in Section 8, the dataflows from island $I_i$ to island $I_j$ are very likely to be sharable over a common physical connection. In general, IIC is not easy to compute because of the sharing relations, which are determined by the compatibility relations. Computing IIC is equivalent to solving a graph-coloring problem on the incompatibility graph of the dataflows from one island to the other. However, the incompatibility relations among the operations within one island are very sparse, so the IIC numbers can be computed fairly fast.
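The connection counts for the Figure 4 dataflows mentioned above can be reproduced with a short sketch. It rests on a simplifying assumption: two dataflows between the same pair of islands conflict only when they are consumed by the same operation, in which case the coloring problem reduces to taking the largest per-consumer count. The function names are hypothetical, and only the four dataflows quoted in the text are used, not the full graph of Figure 4.

```python
from collections import defaultdict


def interisland_connections(dataflows, island_of):
    """Return {(src_island, dst_island): connection count} for a binding."""
    fanin = defaultdict(int)  # (src, dst, consumer) -> number of incoming dataflows
    for producer, consumer in dataflows:
        src, dst = island_of[producer], island_of[consumer]
        if src != dst:  # intraisland dataflows need no global connection
            fanin[(src, dst, consumer)] += 1
    iic = defaultdict(int)
    for (src, dst, _consumer), count in fanin.items():
        iic[(src, dst)] = max(iic[(src, dst)], count)
    return dict(iic)


def total_and_max_iic(iic):
    """Total connections, and the largest number feeding into any one island."""
    feeding = defaultdict(int)
    for (_src, dst), count in iic.items():
        feeding[dst] += count
    return sum(iic.values()), max(feeding.values(), default=0)


island_of = {"v1": "Ia", "v2": "Ia", "v6": "Ic", "v7": "Ic", "v8": "Ic", "v9": "Id"}
dataflows = [("v1", "v7"), ("v2", "v8"), ("v6", "v9"), ("v7", "v9")]

iic = interisland_connections(dataflows, island_of)
print(iic)
print(total_and_max_iic(iic))
```

In this sketch the two dataflows from $I_a$ to $I_c$ share one connection, while the two dataflows into $v_9$ need two, matching the sharing behavior described for Figure 4.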
PROBLEM DEFINITION AND COMPLEXITY ANALYSIS
Problem Definition
The interisland connections are critical to the final DRFM qualities (also shown in Section 10.1), since for any feasible B(G, M), the input-port number of island I is equal to the number of interisland connections feeding into I. For example, island A in Figure 3 has four input ports because there are four global connections feeding into it. Figures 5(a) and 5(b) show two schemes of DFG partitioning. Figure 5(b) has more interisland connections and more MUXes because it fails to transfer as many dataflows as possible through intraisland connections. This implies that the complexity of the data-routing logic, which impacts the design area and critical path timing, is determined by the interisland connections, and it suggests the following problem formulation. The task of DRFM configuration is to decide the number of islands, K, and cluster the available resources into these K islands; that is, to decide the FUP for each island.
Problem 1 (DRFM Configuration and Binding for Minimum Interisland Connections). Given a scheduled DFG G(V ,
Since minimizing Total IIC(B) is expected to have a more global impact on design qualities, we give Total IIC(B) a higher priority than Max IIC(B) during optimization. We will show that Problems 1 and 2 are both NP-hard. The difference between Problem 1 and Problem 2 is that Problem 2 only minimizes Total IIC(B). Since Problem 1 is at least as hard as Problem 2, to prove that Problem 1 is NP-hard, we only need to prove that Problem 2 is NP-hard.
Problem 2 (DRFM Configuration and Binding for Minimum Total Interisland Connections). Given a scheduled DFG G(V , E) and the resource (functional
Complexity Analysis
In Mandal et al. [1998], the authors prove that Problem PA3U1 is NP-hard, which is defined as follows. Given a set of variables which have been placed in a particular memory with three uniform reading ports and a set of circuit points which read variables from the memory at specific control steps, the port assignment problem is to assign the three memory ports to these circuit points such that: (a) all the accesses in each control step are satisfied and (b) the cost of multiplexers in front of circuit points is minimized. In the following, we reduce Problem PA3U1 to Problem 2 and thereby prove that Problem 2 is NP-hard.

THEOREM 5.1. Problem 2 is NP-hard.

PROOF. Figure 6 shows an instance of Problem PA3U1, where the memory has three reading ports, $p_1$, $p_2$, and $p_3$. There are k circuit points, $c_i$, $1 \le i \le k$. Each circuit point $c_i$ has a queue of $d_i$ variable-reading operations, $r_{i,j}$, $1 \le j \le d_i$, which are distributed over c-steps 1 to m.
The corresponding instance of Problem 2, a scheduled DFG, is shown in Figure 7. For each circuit point $c_i$, $1 \le i \le k$, there is a node $m_i$ in the DFG scheduled at step 0. For the three reading ports of the memory, there are three nodes, $o_1$, $o_2$, and $o_3$, which are also scheduled at step 0. For each reading operation $r_{i,j}$, there is a node $o_{i,j}$ in the DFG, which is scheduled at the same step as $r_{i,j}$. Each o node ($o_1$, $o_2$, $o_3$, $o_{i,j}$) has one output to be stored in register files; the corresponding edges are omitted from the DFG of Figure 7 for simplicity. Finally, there is a dataflow from
The reduced instance is configured so that all o nodes have the same functional type f (o) and thus can share functional units, and all m nodes have a functional type f (m), different from f (o). The given resource constraint includes three functional units of type f (o) and k functional units of type f (m).
Obviously, the reduction from Problem PA3U1 to Problem 2 is polynomial. Due to the aforementioned assumption of one writing port in register files, all the nodes at step 0 have to be assigned to different islands. Therefore, there is only one feasible DRFM configuration, which has k + 3 islands, and each island includes exactly one functional unit.
A solution of Problem PA3U1 is derived from a feasible solution of Problem 2 in the following way. If $o_{i,j}$ is bound to the same island as $o_q$, $1 \le q \le 3$, the reading operation $r_{i,j}$ will access port $p_q$. Since nodes scheduled in the same c-step in Figure 7 will not be assigned to a common island, reading operations occurring in the same c-step in Figure 6 will not access the same reading port either. Hence, the solution of Problem PA3U1, derived from a feasible solution of Problem 2, is also feasible.
Since there is no dataflow among nodes $m_i$, no interisland connections are required among the k m-islands. This is also true for the three o-islands holding all o nodes. Hence, interisland connections are only required between the k m-islands and the three o-islands. If nodes $o_{i,j_1}$ and $o_{i,j_2}$, which have the same input node $m_i$, are assigned to the same island, dataflows $m_i \to o_{i,j_1}$ and $m_i \to o_{i,j_2}$ can share the same interisland interconnect. This is exactly the sharing rule between circuit points and memory ports: if a circuit point reads two variables from the same memory port, only one connection is needed. Therefore, Total IIC of Problem 2 equals the cost of multiplexers of Problem PA3U1.
Based on the preceding facts, we conclude that an optimal solution of Problem 2 leads to an optimal solution of Problem PA3U1, thus proving Problem 2 is NP-hard.
From Theorem 5.1, we have the following conclusion.

THEOREM 5.2. Problem 1 is NP-hard.
AN INCREMENTAL RESOURCE-CONSTRAINED DRFM CONFIGURATION AND BINDING ALGORITHM
Problem 1 is no easier to solve than the traditional binding problem for connectivity optimization [Pangrle 1991]. In addition, the global connections among DRFM islands may be shared by multiple dataflow edges (see Section 4.3). In this section we introduce the incremental resource clustering and binding algorithm with the goal of optimizing interisland connections.
Algorithm Flow
Before introducing the algorithm, we give the following definition, which will be used in DRFM configuration.
Definition 6.1. In a DRFM configuration M, islands I i and I j are combinable if and only if FUP(I i ) ∩ FUP(I j ) = ∅; that is, these two islands do not have functional units of the same type in common.
Due to the assumed limitation of only one write port for each register file (Section 3), a valid binding solution would not assign two or more operations scheduled at the same clock cycle to the same island. Therefore, for operations of the same type, only one functional unit of the corresponding type is needed in each island, and putting two or more functional units of the same type into one island would be a waste of resources. On the other hand, functional units of different types are allowed, and may need, to reside in the same island, since they can perform different functions at different cycles.

Figure 8 shows the main flow of the incremental configuration and binding algorithm. The basic idea is that we start from DRFM configurations with large numbers of islands and perform binding based on them. Then islands having a large amount of communication are gradually combined to hide interisland dataflows.
In the algorithm, we select the total number of available functional units as the initial island count, which means that in the initial M there are |FU| islands and each island contains a single functional unit. ISB Binding is then performed, targeting M, to assign operations to islands. The optimization goal of ISB Binding is to minimize interisland connections. Details will be introduced in Section 6.2. In the following iterations, islands that are tightly coupled are grouped together to form new islands. We always select the pair of islands that has not yet been tried and has the largest number of connections and dataflows between them. In this way, both interisland physical interconnects and logical dataflows are minimized. Whenever the DRFM configuration is updated, ISB Binding is called to verify whether a valid binding solution exists. If so, the latest DRFM configuration is accepted as the starting point for subsequent island combinations. Otherwise, the newly combined islands are split and the remaining pairs of combinable islands are tried. As the incremental process continues, islands become more and more compact, and interisland connections are gradually hidden and become intraisland connections. When no more islands can be combined, the iterations stop and the best DRFM configuration and binding solution in terms of interisland interconnects is retrieved. A postprocessing refinement, VR Binding, is then performed to achieve further optimization. VR Binding will be introduced in detail in Section 6.3.
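The incremental flow described above can be sketched as a loop around a binding routine. This is an illustrative outline only, not the authors' implementation: `try_bind` stands in for ISB Binding and `traffic` for the count of connections and dataflows between two islands; both are hypothetical callbacks supplied by the caller, and VR Binding is omitted. Islands are modeled as frozensets of `(fu_id, fu_type)` pairs.

```python
from itertools import combinations


def combinable(a, b):
    """Definition 6.1: two islands are combinable if they share no FU type."""
    return not ({t for _, t in a} & {t for _, t in b})


def incremental_configure(fus, traffic, try_bind):
    """Return the list of accepted (islands, binding) pairs, in order.

    fus:      list of (fu_id, fu_type) pairs, one per functional unit
    traffic:  traffic(binding, a, b) -> coupling score of islands a and b
    try_bind: try_bind(islands) -> a binding, or None if none is feasible
    """
    islands = [frozenset([fu]) for fu in fus]  # one functional unit per island
    binding = try_bind(islands)
    if binding is None:
        return []
    accepted = [(list(islands), binding)]
    tried = set()
    while True:
        pairs = [(a, b) for a, b in combinations(islands, 2)
                 if combinable(a, b) and frozenset((a, b)) not in tried]
        if not pairs:
            break  # nothing left to combine
        a, b = max(pairs, key=lambda p: traffic(binding, p[0], p[1]))
        tried.add(frozenset((a, b)))
        candidate = [i for i in islands if i not in (a, b)] + [a | b]
        new_binding = try_bind(candidate)
        if new_binding is not None:  # accept the merge; otherwise keep islands
            islands, binding = candidate, new_binding
            accepted.append((list(islands), binding))
    return accepted


# Hypothetical usage with stub callbacks.
fus = [("add0", "add"), ("mul0", "mul"), ("add1", "add")]
always_ok = lambda islands: {"islands": len(islands)}
uniform = lambda binding, a, b: 1
history = incremental_configure(fus, uniform, always_ok)
print([len(cfg) for cfg, _ in history])
```

The caller would then pick the best entry of `history` by its interisland connection counts. Keeping every accepted configuration, rather than only the last, reflects the statement that the best configuration is retrieved after the iterations stop.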
Here, we use both Total IIC(B) and Max IIC(B) to evaluate the quality of solutions. As explained in Section 4.3, Total IIC(B) has a higher optimization priority than Max IIC(B). Therefore, we always choose the solutions with the minimum Total IIC(B) and then use Max IIC(B) to break ties. If more than one solution still remains, we choose the one with the smallest number of islands. The cost functions used in ISB Binding and VR Binding also evaluate Total IIC(B) and Max IIC(B) with the same priority.
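The selection rule above is a lexicographic comparison, which can be expressed directly as a tuple key. In this small sketch the dictionary fields and the candidate values are hypothetical and serve only to show the ordering.

```python
def best_solution(solutions):
    """Pick by Total IIC first, then Max IIC, then number of islands."""
    return min(solutions,
               key=lambda s: (s["total_iic"], s["max_iic"], s["islands"]))


# Hypothetical candidates produced by different DRFM configurations.
candidates = [
    {"name": "A", "total_iic": 7, "max_iic": 2, "islands": 4},
    {"name": "B", "total_iic": 6, "max_iic": 4, "islands": 5},
    {"name": "C", "total_iic": 6, "max_iic": 3, "islands": 5},
    {"name": "D", "total_iic": 6, "max_iic": 3, "islands": 4},
]
print(best_solution(candidates)["name"])
```

Candidate A loses on Total IIC despite its smaller Max IIC, B loses the first tie-break, and D beats C only on island count.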
During the island reorganization, we only apply one simple scheme: island combination. This scheme works well only for designs with a few types of functional units. Otherwise, a more intelligent DRFM configuration mechanism, which should be capable of combining, splitting, and recomposing, is needed to achieve better DRFM configurations.
Note that for general cases with multicycle/pipelined operations or register files with more than one write port, combinability is not necessary for island combination.
ISB Binding Algorithm
For each DRFM configuration M, we apply an iterative control-step-by-control-step bipartite approach, ISB Binding, to assign operations to islands in M with the goal of minimizing connections. Each iteration takes in the current partial binding solution B(G, M) and constructs an updated solution. The algorithm terminates when a complete binding is obtained.
In each iteration, we consider the set of operations within the current c-step. Since these operations are pairwise incompatible, they must be assigned to different islands. We apply a minimum-weighted bipartite matching algorithm to obtain an assignment solution. A similar idea was presented in Huang et al. [1990] for general datapath allocation.
Given this set of operations and the current partial binding B(G, M), we construct a weighted bipartite graph G_bp(V ∪ V_M, V × V_M) as follows.
(1) For each operation v in the current c-step there is a node n(v) ∈ V, and for each island I_i there is a node m(I_i) ∈ V_M.
(2) For each operation v and each island I_i that has a functional unit available to execute v, there is an edge (n(v), m(I_i)) whose weight is the cost of the attempted binding of operation v to I_i, defined as

$$\mathrm{cost}(v, I_i) = \alpha \cdot \mathrm{New\,IIC}(v, I_i) + \beta \cdot \mathrm{Max\,IIC}(v, I_i),$$

where New IIC(v, I_i) is the number of new interisland connections introduced by the assignment of v to I_i, and the value of Max IIC(v, I_i) is 1 if and only if I_i has the largest number of feeding-in connections in the current partial solution. α is the weight for global interisland connections, and β is the weight for the maximum number of feeding-in connections. Since we prefer solutions with globally minimized connections, we give α a larger value than β. In our experiments, we set α to the total number of operations in G and β to 1.
A minimum-weighted bipartite matching E_match ⊆ V × V_M for G_bp can be computed optimally in O(n^2 log n + n·e) time [Fredman and Tarjan 1987], where n is the total number of nodes of G_bp and e is the total number of edges of G_bp. For each edge (n(v), m(I_i)) ∈ E_match, we bind operation v to island I_i and update B.
Obviously, any matching E_match of the bipartite graph G_bp corresponds to a feasible binding of the operations in the current c-step to M, and the total weight of E_match equals the cost of the attempted binding. Therefore, the updated binding solution obtained by the minimum-weighted bipartite matching is optimal among all possible bindings of these operations to M, as stated in the following theorem.
THEOREM 6.2. The binding of the operations in the current c-step to M produced by the aforesaid algorithm introduces a minimum number of new interisland connections to the current partial DRFM binding solution and increases the maximum number of feeding-in connections as little as possible.
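To make one ISB Binding iteration concrete, the sketch below binds the operations of a single c-step by minimizing the sum of α · New IIC + β · Max IIC over a one-to-one assignment. It is only an illustration: it enumerates assignments exhaustively instead of using the Fredman-Tarjan matching algorithm, and the cost tables `new_iic` and `max_iic`, the `allowed` set, and the toy values are all assumptions rather than data from the article.

```python
from itertools import permutations

def min_weight_binding(ops, islands, new_iic, max_iic, alpha, beta, allowed):
    """Exhaustively find the cheapest one-to-one binding of ops to islands.

    new_iic[(v, i)], max_iic[(v, i)] : assumed precomputed from the partial binding
    allowed : set of (v, i) pairs where island i has a functional unit for v
    """
    best_cost, best_binding = None, None
    for chosen in permutations(islands, len(ops)):
        pairs = list(zip(ops, chosen))
        if any(p not in allowed for p in pairs):
            continue  # no edge in the bipartite graph for this pair
        cost = sum(alpha * new_iic[p] + beta * max_iic[p] for p in pairs)
        if best_cost is None or cost < best_cost:
            best_cost, best_binding = cost, dict(pairs)
    return best_cost, best_binding


ops = ["v1", "v2"]
islands = ["Ia", "Ib", "Ic"]
new_iic = {("v1", "Ia"): 0, ("v1", "Ib"): 2, ("v1", "Ic"): 2,
           ("v2", "Ia"): 0, ("v2", "Ib"): 1, ("v2", "Ic"): 1}
max_iic = {pair: 0 for pair in new_iic}
max_iic[("v2", "Ib")] = 1  # Ib currently has the most feeding-in connections
allowed = set(new_iic)

cost, binding = min_weight_binding(ops, islands, new_iic, max_iic,
                                   alpha=10, beta=1, allowed=allowed)
print(cost, binding)
```

Here v1 takes Ia, and v2 is then indifferent between Ib and Ic in terms of new connections; the β term steers it away from Ib, the island that already has the most feeding-in connections.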
Globally, the matching algorithm performs the binding in a "horizontal" fashion in the scheduled DFG, and cannot calculate its impact on the future iterations. For this reason, we perform a postrefinement to further improve the results.
Postrefinement
After we get a reasonable and valid DRFM configuration and binding solution, we apply a vertical local-search-based refinement, VR Binding, for further improvement. The refinement process uses an idea similar to the Kernighan-Lin algorithm [Kernighan and Lin 1970] , despite the fundamental difference between our problem and the classic graph-partitioning problem. It reassigns an operation to a different chain in order to overcome the "greediness" introduced by ISB Binding, while the refinement benefits in runtime from the good initial solution obtained from ISB Binding. The algorithm is described in the following steps.
(1) Set all the operations in the current partial solution to be unlocked for movement.
(2) Find a movement of an unlocked operation from its current chain to another such that the gain is the maximum (even if the gain is negative) among all of the possible movements. This operation is then locked, and the movement history is recorded.
(3) Repeat step 2 until all operations are locked.
(4) Find the first K movements, such that their total gain is the maximum partial sum of the entire historical movement list. These K movements are committed, and the rest are recovered.
(5) Repeat steps 1 to 4 until no movement is committed.
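Step 4 is a maximum-prefix-sum selection over the recorded gains. A minimal sketch, assuming the gains of one pass are available as a list in the order the movements were made (the function name and sample gains are hypothetical):

```python
def best_prefix(gains):
    """Return (K, total): committing the first K moves maximizes the total gain.

    K == 0 means no prefix has positive total gain, so nothing is committed.
    """
    best_k, best_total, running = 0, 0, 0
    for k, gain in enumerate(gains, start=1):
        running += gain
        if running > best_total:
            best_k, best_total = k, running
    return best_k, best_total


print(best_prefix([2, -1, 3, -4, 1]))  # (3, 4): prefix sums are 2, 1, 4, 0, 1
print(best_prefix([1, -1, 0]))         # (1, 1): only the first move is kept
print(best_prefix([-1, -2]))           # (0, 0): the pass commits nothing
```

Accepting temporarily negative gains in step 2 is what lets the pass climb out of a local minimum; the prefix selection then keeps only the part of the move sequence that pays off.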
We use the simple example in Figure 4 to illustrate the process. At step 2, the maximal gain is obtained by moving operation v_9 from chain I_d to chain I_c, which reduces both Total IIC and Max IIC by one. The other movements either increase or maintain the interconnect cost. Therefore, at step 4 we commit the first K = 1 movements and recover the rest. The following iterations do not improve the partitioning further. Finally, we get the optimal solution with Total IIC = 4 and Max IIC = 2.
After the operation (variable)-to-island binding, we perform a detailed binding within each island. In particular, a register file is allocated for the set of variables assigned to it. Traditional register binding techniques, such as graph-coloring and left-edge algorithms [De Micheli 1994], may be used to minimize the size of each register file by sharing a register element among multiple compatible variables. Functional unit binding is trivial since the operations within an island are pairwise c-step compatible.
OPTIMALITY STUDY
One potential problem of the algorithm proposed in Section 6 is that the search may get stuck in local minima. To evaluate the effectiveness and optimality of the heuristic, we formulate Problem 1 as an integer linear program (ILP). An ILP either maximizes or minimizes an objective function of a set of variables, subject to a group of linear equality and inequality constraints and integrality restrictions on all of the variables.
In the following we assume the given resource constraints include only adders and multipliers. However, the ILP formulation can be easily extended for general cases.
Let A be the number of adders, and M be the number of multipliers. We give each resource instance an identifier ranging from 1 to A + M. The maximum number of islands in a feasible DRFM configuration is also A + M, reached when each island has a single functional unit. Let O be the set of operations in the DFG. We first define the following variables. Based on the preceding definitions, the constraint that each operation is assigned to one and only one island is described next.
The operations scheduled at the same c-step cannot be assigned to the same island.
∀i∈O,T (i)=t
For dataflow from o i to o j , suppose that o i is assigned to island p, and o j is assigned to island q. There will be a connection from island p to island q.
Given an operation k, suppose its inputs are produced by operations i and j, i ≠ j. If operation k is assigned to island q, and both i and j are assigned to island p, then there will be two connections from island p to island q.
The number of feeding-in connections for island p is the sum of connections coming from all the other islands.
max in is the maximum of the feeding-in connections of all islands.
If there is at least one addition assigned to island p, there will be an adder in island p. The same constraint applies to multipliers.
The given resource constraints cannot be violated.
The objective function is the weighted sum of Total IIC and Max IIC.
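One way to write the constraints described above is sketched below. The variable names are assumptions introduced here for illustration and are not the notation of the original formulation; in particular, Total IIC is taken to be the sum of the feeding-in connections over all islands, and every operation is assumed to have at most two inputs.

```latex
% Hypothetical variables (illustrative notation only):
%   x_{i,p} \in \{0,1\}     : operation i is bound to island p
%   c_{p,q} \in \{0,1,2\}   : number of connections from island p to island q
%   in_p, max_in            : feeding-in connections of island p, and their maximum
%   a_p, m_p \in \{0,1\}    : island p contains an adder / a multiplier
\begin{align}
  & \sum_{p=1}^{A+M} x_{i,p} = 1
      && \forall i \in O \\
  & \sum_{i \in O,\; T(i)=t} x_{i,p} \le 1
      && \forall p,\ \forall t \\
  & c_{p,q} \ge x_{i,p} + x_{j,q} - 1
      && \forall \text{ dataflows } (o_i, o_j),\ p \ne q \\
  & c_{p,q} \ge 2\,(x_{i,p} + x_{j,p} + x_{k,q} - 2)
      && \forall k \text{ with inputs } i \ne j,\ p \ne q \\
  & \mathit{in}_p = \sum_{q \ne p} c_{q,p}
      && \forall p \\
  & \mathit{max\_in} \ge \mathit{in}_p
      && \forall p \\
  & a_p \ge x_{i,p}\ \ \forall \text{ additions } i, \qquad
    m_p \ge x_{i,p}\ \ \forall \text{ multiplications } i \\
  & \sum_{p} a_p \le A, \qquad \sum_{p} m_p \le M \\
  & \min\ \ \alpha \sum_{p} \mathit{in}_p + \beta\, \mathit{max\_in}
\end{align}
```

Because the objective minimizes the connection counts, the lower bounds on c_{p,q} are tight at the optimum, so no explicit upper-bound constraints are needed in this sketch.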
HANDLING READ-PORT RESTRICTIONS
The algorithm in Section 6 produces a feasible DRFM binding solution. However, it ignores the read-port limitations of register files. A large number of read ports would increase the access time of register files and thus jeopardize the timing of the whole design. In a c-step, if several operations in multiple chains consume variables produced in the same chain I, then multiple read ports are needed by LRF(I). As shown in Figure 4, at c-step 4, four operations access three variables, v_1, v_2, and v_3, which are produced in chain I_a; therefore LRF(I_a) needs at least three read ports.
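The read-port requirement of a single register file can be estimated with a short sketch like the one below. It assumes that reads of the same variable in the same c-step share one port, and the input format (a list of `(c_step, variable)` pairs) is hypothetical.

```python
from collections import defaultdict

def read_ports_needed(reads):
    """Maximum number of distinct variables read from one LRF in any c-step.

    reads: iterable of (c_step, variable) pairs, one per consuming operation.
    """
    per_step = defaultdict(set)
    for c_step, var in reads:
        per_step[c_step].add(var)
    return max((len(variables) for variables in per_step.values()), default=0)


# Four operations read three variables of the same LRF at c-step 4.
reads = [(4, "v1"), (4, "v2"), (4, "v2"), (4, "v3")]
print(read_ports_needed(reads))  # 3
```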
There is an opportunity to reduce the read-port number requirement by spreading simultaneous reads throughout different c-steps, using the slacks of dataflows. In the DFG of Figure 4 , for dataflow (v 2 , v 8 ), which is produced in chain I a and consumed in I c , if we could transfer the value from LRF(I a ) to some buffer in I c at c-step 3, then at c-step 4 v 8 can access the local buffer to retrieve the data instead of accessing LRF(I a ). This way, a read port is saved for LRF(I a ).
To support this mechanism, we need a refinement on the DRFM to allow selective variable replication. In particular, we add a set of storage elements, namely input buffer, into the data-routing logic for each island, and thus allow direct data routes from an external LRF to the input buffer, as shown in Figure 3 . Such direct data routes are called data-forwarding paths. A data forwarding χ (u, v, t) reads out the value u from LRF(B(u)) and writes it into an input buffer of B(v) through a data-forwarding path at c-step t.
The problem of rescheduling a set of dataflows on data-forwarding paths under read-port constraints, named data-forwarding scheduling, is defined as follows.
Problem 3 (Data-Forwarding Scheduling). Given a positive number N and an island I ∈ M with respect to a feasible DRFM binding B(G, M), for each dataflow (u, v) where B(u) = I and B(v) ≠ I, schedule a data forwarding χ(u, v, t) where t ∈ [T(u) + 1, T(v)], such that at any c-step there are no more than N simultaneous data forwardings.
Obviously, a successful solution to Problem 3 guarantees that no more than N simultaneous reads to LRF(I) happen at any c-step, so that N read ports are sufficient for LRF(I). If no solution is returned, the given read-port number N is too tight. In this case, as a last resort, the register file is duplicated to increase its read-port number.
For each feasible data forwarding χ(u, v, t), t ∈ [T(u) + 1, T(v)] must hold; that is, it has a start time T(u) + 1 and a deadline T(v). In addition, each data forwarding takes exactly one c-step and requires exactly one read port of its source register file. If we view a data forwarding as a task with unit execution time, and view read ports as processors, then each task requires one processor, and the tasks have identical execution times but different ready times and deadlines. Furthermore, the tasks have no precedence (or dependency) relation. Therefore, the problem is a special case of deadline scheduling of tasks with ready times. This problem is solvable in O(n^2) time by an Earliest-Deadline-First (EDF) algorithm [Blazewicz 1979], where n is the number of tasks.
Revisions of the algorithm in Blazewicz [1979] are needed for minimizing the input-buffer numbers and for special cases. For example, when two dataflows are produced by the same operation, they can share one read port; when two dataflows feed into the same island but at different c-steps, they may be able to share an input buffer.
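A plain EDF scheduler for this unit-time setting can be sketched as follows. It is a simplified illustration, not the revised algorithm used in the article: it neither minimizes input buffers nor lets dataflows from the same producer share a read port, and the task tuple format is an assumption.

```python
import heapq

def edf_schedule(tasks, n_ports):
    """Schedule unit-time data forwardings on n_ports read ports.

    tasks: list of (name, ready, deadline) with integer c-steps, where
           ready = T(u) + 1 and deadline = T(v), both inclusive.
    Returns {name: c_step}, or None if some deadline cannot be met.
    """
    pending = sorted(tasks, key=lambda task: task[1])  # by ready time
    ready = []      # min-heap of (deadline, name)
    schedule = {}
    idx = 0
    t = pending[0][1] if pending else 0
    while idx < len(pending) or ready:
        if not ready and pending[idx][1] > t:
            t = pending[idx][1]  # idle gap: jump to the next ready time
        while idx < len(pending) and pending[idx][1] <= t:
            name, _, deadline = pending[idx]
            heapq.heappush(ready, (deadline, name))
            idx += 1
        for _ in range(min(n_ports, len(ready))):
            deadline, name = heapq.heappop(ready)
            if deadline < t:
                return None  # this forwarding missed its deadline
            schedule[name] = t
        t += 1
    return schedule


# Three dataflows all consumed at c-step 4 but produced at different times.
tasks = [("a", 2, 4), ("b", 3, 4), ("c", 4, 4)]
print(edf_schedule(tasks, n_ports=1))  # {'a': 2, 'b': 3, 'c': 4}
print(edf_schedule([("a", 4, 4), ("b", 4, 4)], n_ports=1))  # None
```

In the first call the slack of the dataflows lets a single read port serve three consumers that would otherwise all read at c-step 4; in the second call there is no slack, so one port is insufficient.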
EXTENSIONS TO GENERAL CASES
CDFG
Although we present the DRFM binding algorithm for dataflow graphs, there is no fundamental difficulty in extending it for general Control-Data-Flow Graphs (CDFGs). The main extension is regarding the change of the compatibility definition. Specifically, operations scheduled at the same step but under exclusive conditions (e.g., in different branches of an if or case statement) are also compatible and allowed to be allocated into the same island. Although the algorithm presented in Section 6 can be applied to CDFGs without any change, it may result in inferior solutions. The reason is that those compatible operations under the new compatibility definition are not allowed to reside in the same island, which decreases optimization opportunities and may lead to larger interisland connections.
To handle the new compatibility with CDFGs, we extend the ISB Binding algorithm presented in Section 6.2 in the following way. For the operations in each control step, we first divide them into disjoint groups such that operations in the same group are incompatible with each other and thus must be assigned to different islands. The division of operations is performed as follows. After group division, we apply the bipartite matching algorithm to sequentially assign each group to islands. Since compatible operations are in different groups, they have a chance to be allocated to the same island. In step (2) of the bipartite graph construction, besides the resource availability requirement, we also require that operation v be compatible with all operations that are already assigned to island I_i.
In addition, the postrefinement in Section 6.3 will use the new compatibility definition when moving operations among islands.
Chaining
Operation chaining may introduce tricky situations in DRFM binding [Stok 1992; De Micheli 1994] . We try to collapse and assign an entire operation chain into one island to maintain the computation locality. When we collapse a set of chained operations, the resulting larger complex operation could have multiple outputs. An example is illustrated in Figure 9 (a), where a 2-output complex operation is formed. The multioutput situation causes a difficulty for our problem formulation, which assumes single-output operations only (for the 1-write-port restriction for register files). There are several ways to handle this special case: (i) if a collapsed operation is not too large and the output number is small, we perform necessary operation duplications and split the large operation into several single-output operations (as shown in Figure 9 (b)); (ii) otherwise, we can bind the complex operation into a special island that has a multiple-write-ports register file (or simply discrete registers).
EXPERIMENTAL RESULTS
The binding algorithms for DRFMs are implemented in the UCLA xPilot synthesis framework. The complete synthesis flow is illustrated in Figure 10. In this framework a behavioral description in C/SystemC is first parsed and optimized into a dataflow graph. The synthesis engine begins with latency-driven scheduling and generates a scheduled DFG/CDFG. The DRFM configuration and binding algorithm is then applied to the scheduled DFG/CDFG to explore a desired DRFM configuration and binding solution. The data-forwarding algorithm described in Section 8 accepts the DRFM binding and tries to meet the given LRF read-port constraints. Finally, a backend program generates VHDL RTL, which is accepted by existing logic synthesis and physical design tools. In this work we report the results of experiments targeting the Xilinx Virtex4 FPGA platform [Xilinx], using ISE v9.1 as the downstream tool.
A set of real-life test cases is used in our experiments. The pure DFG test cases include several different discrete-cosine transformation algorithms, such as DIR, LEE, and WANG, and several DSP programs, such as HONDA and MCM [Srivastava and Potkonjak 1995]. Three other test cases, MATMUL, CFTMDL, and CFT1ST, are CDFGs from MediaBench [Lee et al. 1997] and an FFT package [FFT]. All the benchmarks are data-intensive applications.
Sensitivity to Interisland Connections
Table II shows how global interisland connections are correlated with the QoR on design DIR. For the same scheduling result, we perform three different DRFM binding approaches: one random binding approach and the optimizing algorithm with two different efforts; hence we obtain three solutions. The first column of Table II lists the InterIsland Connection numbers (IICs) of the resulting datapath reported by our synthesis system. The second to fourth columns are the resource results reported by Xilinx ISE after place-and-route, namely the slice, LUT, and flip-flop (FF) counts. In the Virtex4 device, a slice contains two LUTs and two FFs. The slice count represents the total resource usage, and the LUT and FF numbers show the resource distribution. The last column, CLK, is the achievable clock period (or path delay) reported by ISE's static timing analyzer. We set the timing constraint to 8ns for all the experiments.
Overall, the table consistently shows a proportional relation between the interisland connection numbers and the design area numbers. The delay numbers vary within a reasonable range, while the minimal area solution has the best performance. The results suggest that the minimization of interisland 
Comparisons of QoR
For a fair comparison, we implemented a discrete-register binding algorithm presented in Chen and Cong [2004] , which is to optimize multiplexers during register binding. We applied the same algorithm for functional unit binding. As reported in Chen and Cong [2004] , the binding results are much better than the traditional left-edge algorithm [De Micheli 1994] , which allocates a minimum number of registers but frequently generates complex multiplexor structures. It is also better than the bipartite algorithm [Chen et al. 2003 ]. In addition, we ran through another three flows for comparison, listed as follows.
(1) Discrete register: resource binding based on discrete registers [Chen and Cong 2004].
Table III shows a comparison of the QoR for the aforementioned four flows. The first column shows the benchmarks. The suffixes "1" and "2" on the benchmark names denote different resource-constraint settings. For each flow we list the resource results (slice, LUT, and FF/RAM) reported by Xilinx ISE after place-and-route. These columns have the same meaning as those in Table II, except that the "FF/RAM" columns also list the number of RAM blocks used to implement register files. Note that the results for discrete-register datapaths use no RAM blocks, since no register file is applied. This table also shows that the register-file-based approach still uses some discrete registers. The reason is that occasionally the variables produced in an island may be lifetime compatible with each other, and thus they can be merged into a single register. In other words, an LRF may be reduced to a register so that no RAM block is needed. From the results in Table III, we can see that all three register-file-based flows achieve around 50% resource reduction on average due to the saving of multiplexers (see footnote 3).
Footnote 3: For designs lee and wang, register-file-based flows do not decrease discrete registers significantly. The reason is that, to make each island physically compact, we do not put arithmetic FUs and I/O FUs in the same island. The scheduling of lee and wang determines that the executions of most I/O operations conflict with each other and thus have to be assigned to different I/O FUs in different islands. Therefore, most of the variables from these I/O FUs reside in different islands and thus are stored in discrete registers; that is, most I/O operations have their own dedicated registers. On the other hand, variables from arithmetic FUs and I/O FUs share registers in the discrete-register flow. In spite of the marginal reduction in discrete registers, our approach uses far fewer multiplexers because all the other variables produced by arithmetic FUs are stored in register files.
Figure 11 shows the comparison of clock-period results for the four experimental flows, and Table V lists the total number of interisland connections and the maximum number of feeding-in connections for the three flows targeting DRFM, but with different cost functions. For half of the cases, the timing of flow RF No Cost is even worse than Discrete Register due to its large number of global connections, which increases the size of data-routing logic and also puts extensive pressure on physical placement and routing. On the other hand, RF ICC T and RF ICC T&M are, on average, 9.3% and 14.6% better than Discrete Register, respectively. The even better performance achieved by RF ICC T&M is due to its extra optimization of the feeding-in connections, which could help relieve local congestion.
Table IV lists the power consumption of the four flows given by xPower, which comes with the Xilinx ISE toolset. The results show that, compared with Discrete Register, the register-file-based flows consume almost the same amount of power: only 1% higher on average. The main reason is that register files are implemented with block RAMs, which consume a large amount of power and offset the power savings from DRFM's smaller area and reduced interconnections.
The preceding experimental results indicate that register-file-based architectures can lead to great resource reduction, but to achieve better timing performance, synthesis algorithms should optimize global connections carefully.
Effectiveness of VR Binding
Figure 12 illustrates the comparison of interconnect results before and after performing VR Binding. VR Binding reduces Total IIC by 1 to 9, and reduces Max IIC by 1 or 2. This shows that VR Binding does help improve the quality of DRFM binding.
Optimality Study
To study the optimality of the binding algorithm in Figure 8, we tested it against the ILP formulation on small designs of around 15 operations. For the medium- or large-sized designs used in Section 10.2, ILP fails to obtain final solutions due to its long runtime. Table VI lists the comparison. We can see that RF ICC T&M achieves optimal results on almost all the designs; the only exception is "test6", for which the Max IIC of RF ICC T&M is one higher than the optimal result.
Results of FPGA Placement
As mentioned in Section 3, one of the important factors affecting the quality of the DRFM microarchitecture is that all the resources in the same island need to be placed close together. If this is not the case, the timing of designs would be jeopardized due to long intraisland interconnects. Also, the optimization goal of the proposed binding algorithm might not be right either. Since most dataflows are carried through intraisland connections rather than interisland connections due to the optimization of the binding algorithm, we expect that FPGA physical synthesis tools could make the right decisions for us based on the generated VHDL files. Figure 13(a) shows the graphical resource placement of the Xilinx Virtex4 device xc4vlx60 in FPGA Editor [Xilinx]. Figure 13(b) shows the zoomed-in detail of the rectangle area in Figure 13(a) for design dir1. The left two bars are the register file, and the bar on the right side is the multiplier DSP. They logically belong to the same island in the DRFM configuration. The other resources, like adders and data-routing logic, are implemented around them.
Results of Handling Read-Port Limitations
Table VII lists results of the data-forwarding scheduling algorithm for handling the read-port number restriction of LRFs. In this experiment, the functional unit constraints are written as xAyM, indicating that x ALUs and y multipliers are provided. During the DRFM binding stage, we choose x + y as the number of islands (i.e., each functional unit forms an FUP). For data-forwarding scheduling, we set N to 2, meaning that the read-port number is restricted to no more than 2. The first number in each cell is the number of rescheduled dataflows, and the second is the number of allocated input buffers. From the experiment, we observed that for all the test cases after DRFM binding, the resulting read-port numbers are already less than 4. The data-forwarding scheduling algorithm returns success for all the cases, reschedules up to 12 dataflows, and allocates up to 10 input buffers.
The table also shows that in many cases no rescheduling is required, since the DRFM binding solutions already meet the read-port restrictions. Note that the results are not monotone with the resource constraints, since the binding and data-forwarding algorithms are discrete and sensitive to the DFG scheduling, which in turn is very sensitive to resource constraints.
CONCLUSIONS
The Distributed Register-File Microarchitecture (DRFM) enables efficient use of distributed embedded memory blocks in modern FPGAs. It provides a useful architectural template for behavior synthesis and a direct optimization objective: minimizing interisland connections. Two novel algorithms are proposed for the resource binding stage: (i) the DRFM configuration and binding algorithm, which focuses on the minimization of interisland connections; (ii) the data-forwarding scheduling algorithm, which takes advantage of the operation slacks to handle the read-port restriction of register files. On the Xilinx Virtex4 device, our experimental results show a 50% logic area reduction, with a 14.6% improvement in design performance when compared to a traditional discrete-register-based approach. The results are consistent with the significant global-interconnect and multiplexor reductions achieved by our approach. Also, experiments show that our algorithm achieves the same number of total connections and at most one more maximal feeding-in connection compared to optimal solutions generated by the ILP formulation.
In the article we ignored many important factors in practical applications, such as pipelined operations. These cases should be handled carefully in the real DRFM binding implementation.
Despite the encouraging results, several directions may be investigated for further improvement on the DRFM synthesis topic. Since scheduling determines the parallelism of the scheduled DFG, it greatly impacts the result of DRFM binding. Forward-looking heuristics during scheduling [Wong et al. 2002] and scheduling-aware partitioning for interconnect optimization [Lim et al. 2007] should be very beneficial to the final DRFM quality. Although our current binding algorithm is efficient and flexible, a more global and less greedy binding algorithm could be developed to further minimize interisland connections. Also, our binding algorithm does not directly consider register-file optimization. Reducing the number of register files and the number of read/write ports would lead to lower power consumption.
