The design of a common architecture that can support multiple data-flow patterns (or contexts) embedded in complex control flow structures, in applications like multimedia processing, is particularly challenging when the target platform is an FPGA with a heterogeneous mixture of device primitives. In this paper, we present scheduling and mapping algorithms that use a novel area cost metric to generate resource aware context adaptable architectures. We present the results of a rigorous analysis of the methodology on multiple test cases. Post place and route results are compared against published techniques and show area savings and execution time savings of 46% each.
I. INTRODUCTION
The malleable logic and routing fabric of Field Programmable Gate Arrays (FPGAs) have always held an attraction for VLSI architecture designers, since they allow a highly customized design to be created at the register transfer level (RTL), synthesized, mapped, placed, routed and tested. In a sense, the use of an FPGA offers instant feedback to the architect when compared with the traditional ASIC design process. While designers have always taken advantage of creating custom architectures on FPGAs for data-flow graphs (DFGs), more recently a variety of published investigations have explored rapidly adaptable designs running on an FPGA. These efforts can be broadly classified into two categories: (i) polymorphic designs, which are soft configurable, i.e., do not need a change in the underlying bit-stream, hence allowing for rapid adaptability, and (ii) partial and dynamic reconfiguration (PDR) methods, which require reconfiguration of the bit-stream and are much slower than configuring polymorphic designs. In this paper we restrict the discussion to efforts carried out in the first category. Of particular interest are the methodologies to derive polymorphic circuits, viz. the scheduling and binding/mapping algorithms to obtain a common architecture that can support all the DFGs in a control-data flow graph (CDFG). The rest of this paper is organized as follows. In section II we review existing techniques for generating data-paths for control and data-flow applications, and some techniques for area estimation on FPGAs. In section III we present some preliminaries on estimating the area cost of FPGA architectures, followed by our proposed methodology including scheduling and mapping algorithms. Post place and route results of our approach in comparison with a few prior algorithms are presented in section IV and conclusions are presented in section V.
II. BACKGROUND AND RELATED WORK
A CDFG representation of an application is a commonly used intermediate format in high-level synthesis (HLS) tools. A CDFG consists of basic blocks (BBs) embedded in different forms of control structures. A DFG is a graphical representation of a set of operations and their data dependencies. It is denoted as G(V, E), where V is the set of operations (nodes) and E is the set of edges that represents data-flow among the set of nodes. In this paper, we refer to the execution of a DFG as a 'context'. The derivation of an architecture (at the RTL level) for a given CDFG involves three well known tasks in the area of HLS: (i) scheduling (ii) allocation and (iii) binding/mapping. Scheduling assigns operations to particular time steps of execution. Allocation determines the number and types of functional units to be used in the design. Binding maps the operations in the scheduled CDFG to functional units. Binding is also responsible for determining the resources for data routing. While some scheduling algorithms just deal with assigning operations to time steps, other scheduling algorithms include all the above three tasks ((i), (ii), and (iii)). As scheduling provides multiple architectural options, we need to estimate FPGA resources for each option and select the best one. In this section we present background and related work in the following three categories: (a) DFG scheduling techniques (b) CDFG scheduling techniques and (c) FPGA resource estimation techniques.
A. DFG scheduling techniques
There are two classes of DFG scheduling algorithms: (i) time-constrained scheduling (TCS), and (ii) resource-constrained scheduling (RCS). TCS algorithms try to reduce the number of resources required to schedule a DFG within the specified execution time (time constraint), whereas RCS algorithms attempt to minimize the execution time by finding the best possible schedule using the given set of resources (resource constraint). Integer linear programming (ILP) techniques have been proposed for TCS and RCS algorithms [1] to generate optimal solutions, but their execution times are large (exponential). Therefore, these techniques can only be used for very small DFGs. Alternatively, heuristic techniques were developed. Such techniques include, but are not limited to, as soon as possible (ASAP) scheduling [2], as late as possible (ALAP) scheduling [3], list scheduling (LS) [4], and force-directed scheduling (FDS) [5]. The limitation of the ASAP and ALAP algorithms is that they do not give priority to critical path nodes. Hence, they result in larger area than other scheduling techniques.
The LS algorithm is an RCS algorithm, which prioritizes the nodes based on urgency and generates a schedule satisfying the input constraints specified as the number of resources of each type (i.e., mul, add, div, etc.). FPGAs are a heterogeneous mixture of look-up tables (LUTs), flip-flops (FFs), digital signal processor units (DSP48s), and embedded block RAMs (BRAMs), which are collectively termed device primitives. A resource to be mapped on an FPGA can have multiple flavors of implementation, and each flavor can consume a different mix of device primitives. The resource constraint for a design targeting an FPGA is specified in terms of its device primitives. Therefore, before using the LS algorithm for FPGAs, we need to convert the input resource constraint, specified in terms of device primitives, to a set of possible resource constraints in terms of the number of resources of each type.
The FDS algorithm is a TCS algorithm which relies on the ASAP and ALAP algorithms. This algorithm tries to reduce the number of resources by uniformly distributing operations of the same type (using predecessor and successor forces). For designs targeting FPGAs, the output resource set from the FDS algorithm needs to be converted to the number of device primitives, to see if the design can be accommodated in the available FPGA area. The criterion used in the FDS algorithm is the force associated with each node, and this criterion is updated dynamically, whereas in the LS algorithm the criterion used is the critical path length, which is not updated. Hence FDS results in a better schedule than LS, and therefore we prefer FDS over LS in our proposed algorithm.
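Since FDS builds on the time frames produced by ASAP and ALAP, a small illustration of those two schedules may help. The following Python sketch assumes unit-latency operations and a DFG given as a node list and an edge list; the function name `asap_alap` and the example graph are hypothetical and are not taken from the cited works.

```python
from collections import defaultdict

def asap_alap(nodes, edges, latency):
    """Return (asap, alap) control steps for unit-latency nodes of a DAG.

    Steps are numbered from 1; `latency` is the time constraint used by ALAP.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Topological order (Kahn's algorithm).
    indeg = {n: len(preds[n]) for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # ASAP: one step after the latest predecessor.
    asap = {}
    for n in order:
        asap[n] = max((asap[p] + 1 for p in preds[n]), default=1)

    # ALAP: one step before the earliest successor.
    alap = {}
    for n in reversed(order):
        alap[n] = min((alap[s] - 1 for s in succs[n]), default=latency)
    return asap, alap

# Hypothetical DFG: a and b feed c, c feeds d, and e is independent.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("c", "d")]
asap, alap = asap_alap(nodes, edges, latency=3)
mobility = {n: alap[n] - asap[n] for n in nodes}
```

In this example the nodes on the critical path (a, b, c, d) have zero mobility, while e can be placed in any of the three steps; this freedom is what a force-based criterion exploits when balancing operations of the same type.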
B. CDFG scheduling techniques
As the major contribution of this paper is to generate architectures for applications involving complex control structures, in this subsection we evaluate some of the existing techniques that address this issue.
For applications involving control flow, the algorithms discussed in the previous section cannot be used directly to generate schedules. Attempts were made by Camposano [6] , Al-Sukhni et al. [7] , and Bergamaschi et al. [8] to schedule a control flow graph (CFG) by using path-based scheduling (PBS) method and its variations. PBS tries to minimize the number of control states under given timing and area constraints. CFG is first converted to a directed acyclic graph (DAG) by removing the feedback edges of loops. All paths in the DAG are scheduled using as-fast-as-possible schedule. All the schedules are combined (by overlapping) to obtain a finite state machine with least number of control states that can support the execution of the CFG. In PBS, the concurrency of operations and data dependency among operations is ignored, the execution time of the CFG is not addressed, and the order in which operations are executed in the DAG is fixed (same as the order in which operations are present in the input description). This might result in slower and inefficient schedules.
A conditional resource-sharing algorithm using hierarchical reduction is proposed by Kim et al. [9] . In this approach, a CDFG (not containing loops) is transformed to a DFG by replacing conditional blocks by equivalent nonconditional blocks. This is achieved by first determining the time frame of operations using ASAP and ALAP algorithms. Operations from the conditional branches are paired, only if their time frames overlap. A ratio is associated with each node and pairing is done in the decreasing order of these ratios. The resulting DFG is scheduled using force-directed list scheduling (FDLS) algorithm [10] . Schedule of the original CDFG with conditional branches is obtained by transforming the schedule information of the DFG. They assume unit latency for all the operation nodes and fork nodes. Though this algorithm was proven to perform better than PBS algorithm [6] in terms of number of control steps, there is no support for handling loops and complex control structures.
Lakshminarayana et al. [11] propose a more comprehensive scheduling algorithm, called Wavesched-spec, which can support branches and loops. Their algorithm is an RCS algorithm that simultaneously speculates branches and loops (by performing loop unrolling) in order to minimize the number of clock cycles. The downside of this approach is the excessive use of registers to store the speculated values of unrolled loops (results were not presented for register usage).
Moreano et al. [12] propose a datapath merging approach to synthesize a single data-path that can be partially reconfigured (through multiplexers) to support multiple DFGs in a CDFG. Their algorithm merges only those DFGs that are present inside loops, as they are the major candidates for hardware acceleration. The remaining DFGs are executed on a SPARC-v8 microprocessor. Their approach first builds a compatibility graph and then uses maximum weighted clique partitioning to find the common data-path. They do not support sharing of a resource among multiple operations within the same DFG. Though sharing of resources across DFGs is achieved by adding MUX trees, the edge connectivity when multiple DFGs are merged makes the interconnect and the MUX tree very complex.
Guo et al. [13] , Kastner et al. [14] , and Cong et al. [15] propose to accelerate control and data-flow applications by extracting hardware templates for configurable processors. All the three works try to minimize the number of distinct templates and the total number of templates. A template here refers to a set of connected nodes. Guo et al. [13] , first generate a set of subgraphs of varying sizes, from size one to a predefined size (maxsize), using the process of node growing. These sub-graphs are then used to cover the input CDFG, and a set of templates is generated such that the number of distinct templates and the total number of sub-graphs are minimized. As the template selection is computationally intensive, the authors propose a heuristic method. They associate an objective function for each template which is based on the number of nodes present in a template and the number of non-overlapped matches of the template in the original CDFG. The objective function is used as the heuristic and a set of templates is selected such that the objective function is maximized. However, they do not provide the exact representation for the objective function.
Kastner et al. [14] propose a template generation algorithm for hybrid reconfigurable architectures. Their algorithm starts by profiling the graph for the frequency of edge types (e.g., mul-mul, add-mul, mul-add, etc.). Based on the frequency of edge types, nodes are clustered to form super nodes. The resulting graph is again profiled and nodes are clustered. This process is repeated until a sufficient number of super nodes (templates) are generated to cover the graph. These templates are then used to cover the graph such that both the number of distinct templates and the number of instances of each template are minimized. However, each edge is clustered in a locally optimal manner, which might result in a globally sub-optimal solution.
Cong et al. [15] propose an algorithm for generating application-specific instructions to improve the performance of configurable processors. Patterns (templates) are first generated by a node clustering algorithm. Such patterns can have multiple inputs ($|IN(p_i)| \le N_{in}\ \forall i$) but only a single output ($|OUT(p_i)| \le 1\ \forall i$) and are subject to the area constraint ($area(p_i) \le A$). $IN(p_i)$ and $OUT(p_i)$ are the input set and output set of pattern $p_i$, and $A$ is the input area constraint. The value of $N_{in}$ limits the number of parallel nodes present in the pattern. Each pattern is then characterized by gain and area. For a pattern, the hardware execution time ($T_{hw}$) is calculated to be its critical path length, and the software execution time ($T_{sw}$) is calculated as the summation of the software execution times of all nodes. The speedup of a pattern is calculated as the ratio of $T_{sw}$ to $T_{hw}$. Finally, the gain of the pattern is calculated as the product of the speedup and the occurrence of that pattern. From the set of templates generated so far, a set of templates is obtained such that the overall gain is maximized and the area constraint is satisfied. This set of templates is then used to cover the input graph while minimizing the execution time. This step is formulated as a 0-1 Knapsack problem. In all three template-based approaches ([13], [14], [15]), the objective of minimizing the total number of templates might result in architectures with longer execution times. Also, the number of templates generated can be exponentially large for CDFGs with a large number of nodes, thereby increasing the computational complexity of the algorithm. The complexity analysis of the algorithm is provided in [16].
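The gain computation described for [15] can be illustrated with a few lines of Python. This is only a sketch of the arithmetic stated above (speedup as the ratio of software to hardware execution time, gain as speedup times occurrence); the function name and the cycle counts are hypothetical and do not come from [15].

```python
def pattern_gain(node_sw_times, hw_critical_path, occurrences):
    """Gain of a pattern: (T_sw / T_hw) multiplied by its occurrence count.

    node_sw_times    -- software execution time of every node in the pattern
    hw_critical_path -- hardware execution time (critical path length)
    occurrences      -- number of times the pattern appears in the graph
    """
    t_sw = sum(node_sw_times)          # software runs the nodes one after another
    speedup = t_sw / hw_critical_path  # hardware time is bounded by the critical path
    return speedup * occurrences

# Hypothetical pattern: three nodes taking 2, 3 and 1 software cycles,
# a hardware critical path of 2 cycles, and 4 occurrences in the graph.
gain = pattern_gain([2, 3, 1], hw_critical_path=2, occurrences=4)
```

Here the speedup is 6 / 2 = 3, so the gain is 3 x 4 = 12; a pattern that is fast in hardware but rare can therefore rank below a modest pattern that occurs often.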
Bilavarn et al. [17] propose an algorithm to generate FPGA based architectures for CDFGs using area and delay estimations. They perform design space exploration for a specific RTL architecture template (bus-based architecture), by defining a parameterized architecture, rather than building one. They use time-constrained list-scheduling algorithm [10] for scheduling DFGs with multiple time constraints to generate multiple schedules. The schedules of the DFGs are combined hierarchically based on the control structure resulting in a set of solutions. Non-optimal solutions are then eliminated through Pareto optimality [18] . They categorize FPGA resources into (i) logic cells, and (ii) dedicated cells. Logic cells include LUTs and FFs, and dedicated cells include DSP48s and BRAMs. The output of the algorithm is a set of architectures with different area-delay trade-offs for each type of FPGA resource. It is up to the designer to choose a common architecture from the solution sets of each FPGA resource that best fits the FPGA. After all the DFGs are processed, area estimation for each type of FPGA resource is computed separately for each solution. Unlike [17] , in our proposed methodology we perform area estimation during architecture exploration offering instant feedback to the scheduling algorithm, which might yield better results.
None of the algorithms discussed in this section (excluding [17] ) consider FPGAs as the target architectures, and none of them consider the possibility of different implementations of a resource mapped on an FPGA. Our algorithm takes advantage of these facts to generate architectures for a CDFG. A quantitative comparison of [12] , [13] , [14] , and [15] , in terms of resource utilization and execution time, relative to each other and our proposed algorithm is provided in the results section. Results for [17] were not presented in this paper because sufficient information was not available in the publication.
C. FPGA resource estimation techniques
One of the most important constraints required when designing architectures for FPGAs is that the design fits inside the specified FPGA area. When exploring multiple architectures for a design, there is a need for early area estimations. This subsection discusses three approaches ( [19] , [20] , [17] ) towards generating FPGA architectures using some form of area estimation.
Nayak et al. [19] perform design space exploration on the input behavioral specification in MATLAB and pessimistically estimate the number of configurable logic blocks (CLBs) utilized by the generated hardware architecture. The MATLAB code is first converted to VHDL, which is scheduled using FDS [10] to obtain concurrency of operations, from which the number of function generators are determined. Register allocation is performed to calculate the total number of registers. The number of CLBs is then estimated using the number of function generators and number of registers. Their estimation is targeted specifically at the Xilinx XC4010 device and they do not include memory units.
Kulkarni et al. [20] propose an iterative compilation based algorithm that translates high-level single-assignment code (SA-C) into hardware using quick estimations. The objective of their estimation tool is to reduce the development time by quick estimations rather than to increase the accuracy. The SA-C code is first translated to a DFG representation whose nodes are associated with approximation formulae. Their estimation does not incorporate scheduling, resource allocation, and binding algorithms, but it takes into account some synthesis optimizations. As in [19] only one type of FPGA resource (LUT in this case) is considered for implementing arithmetic/logic units. Also, memory units are not handled.
In [17], Bilavarn et al. propose area estimation of RTL architectures in terms of logic cells (lc) and dedicated cells (dc), computed separately. The total number of logic cells is estimated using (1), where $n_k$ is the number of operations of type $k$ and $N_k^{lc}$ is the number of logic cells needed to implement a resource on the FPGA that supports execution of an operation of type $k$. The total number of dedicated cells is estimated in a similar manner. Unlike [19] and [20], Bilavarn et al. estimate the resources for memory units.
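Equation (1) itself does not appear in this text. A form consistent with the symbol definitions above would be the following; the name $N^{lc}$ for the total and the placement of the indices are assumptions made for illustration, not a quotation from [17].

```latex
N^{lc} = \sum_{k} n_k \, N_k^{lc}
```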
A comparison of the above approaches is provided in Table I . It can be observed that the absolute error of the proposed methodology is the least.
III. PROPOSED METHODOLOGY
In this section, we present some preliminaries on area estimation of architectures implemented on FPGA, solution representation, our architecture template, algorithm overview, heuristic resource selection, resource aware scheduling algorithm, and mapping algorithm.
A. Preliminaries
As FPGAs have fixed logic and routing resources, certain designs might not fit on a given FPGA. For CDFGs involving multiple DFGs and control structures, the designer has to explore multiple designs which satisfy the latency constraints and fit on the chip with the least area requirements. A heuristic algorithm for resource selection is presented in this paper. It is based on resource estimations and avoids the time-consuming synthesis, map, and place and route (P&R) tools. As already mentioned in Section II, FPGAs are a heterogeneous mixture of device primitives (LUTs, FFs, DSP48s, and BRAMs), and a resource to be mapped on an FPGA can have multiple flavors of implementation, each of which can consume a different mix of device primitives. Therefore we propose a weighted sum of device primitives (WSDP), a metric to evaluate the relative area cost of resources mapped onto FPGAs. The WSDP for a resource $r$ is calculated using (2), where $k_i$ is the number of device primitives of type $i$ required to implement the resource $r$ and $a_i$ is the number of available device primitives of type $i$ on the FPGA. The standard deviation for a resource $r$ is calculated using (3).
The final area cost of a circuit can then be calculated using (4), where $R$ represents the set of distinct resources, R = {add, mul, div, etc.}, and $n_r$ is the number of resources of type $r \in R$, obtained from the scheduling algorithm (explained later in section III-D). $W_{cumulative}$ represents the relative area cost and hence can be used to compare the area cost of two different circuits. The WSDPs for multiplexers ($W_{mux}$) and delay registers ($W_{reg}$), necessary for routing data among FUs based on the scheduled DFG, are calculated by using (2) in the mapping algorithm (explained later in section III-E).
A solution (architecture for a CDFG) is represented as a 4-tuple $(S, W, A, T)$ calculated using (4), (5), (6), and (7). A resource set $S$ (shown in (5)) represents the set of resources obtained after scheduling a DFG/CDFG using a time constraint. $W$ represents the cumulative WSDP of the circuit (from (4)). $A$ is the set of available device primitives. $T$ represents the execution time of the CDFG (shown in (7)). The execution time of a CDFG ($T_{CDFG}$) corresponding to a resource set can be calculated as a summation, over all the DFGs, of the product of the execution time ($T_i$), weighting factor ($f_i$), and total number of effective iterations ($N_i$) of DFG $i$. The weighting factor of a DFG can be either probabilistic (i.e., when the branch conditions and loop terminating conditions surrounding a DFG depend on the input data set), or deterministic if they can be determined at compile time. The probability values can be obtained by profiling the application on various input data sets.
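Equation (2) is not reproduced in this text, so the following Python sketch assumes one plausible reading of it: each required primitive count $k_i$ is weighted by the inverse of the currently available count $a_i$ and the terms are summed. The function name `wsdp`, the primitive counts, and the device sizes are all hypothetical; the actual weighting in (2) may differ.

```python
def wsdp(required, available):
    """Assumed form of (2): sum over primitive types i of k_i / a_i.

    required  -- {primitive type: count needed by one implementation}
    available -- {primitive type: count still free on the FPGA}
    """
    total = 0.0
    for primitive, k in required.items():
        if k == 0:
            continue
        a = available.get(primitive, 0)
        if a <= 0:
            return float("inf")  # a needed primitive type is exhausted
        total += k / a
    return total

# Two hypothetical flavors of the same multiplier (counts are made up).
MUL_DSP = {"DSP48": 1}
MUL_LUT = {"LUT": 300, "FF": 100}

plenty = {"LUT": 1000, "FF": 1000, "DSP48": 10}
scarce = {"LUT": 1000, "FF": 1000, "DSP48": 1}

# The relative cost depends on what is still free on the device.
cheaper_when_plenty = min((MUL_DSP, MUL_LUT), key=lambda r: wsdp(r, plenty))
cheaper_when_scarce = min((MUL_DSP, MUL_LUT), key=lambda r: wsdp(r, scarce))
```

Under this assumed form, the DSP48 flavor is cheaper while DSP48s are plentiful, and the LUT flavor becomes cheaper once only one DSP48 remains, which is the kind of availability-dependent ranking the metric is meant to provide.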
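Written out from the verbal description above, the execution time in (7) takes the following form. The numerical example uses assumed values for a CDFG with two DFGs and is included only to show the arithmetic.

```latex
T_{CDFG} = \sum_{i} T_i \, f_i \, N_i
% Example with assumed values:
%   T_1 = 4,  f_1 = 1,    N_1 = 10
%   T_2 = 6,  f_2 = 0.5,  N_2 = 10
%   T_{CDFG} = 4 \cdot 1 \cdot 10 + 6 \cdot 0.5 \cdot 10 = 40 + 30 = 70 \text{ clock cycles}
```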
B. Context Adaptable Architecture Template
The architecture template used in our approach is composed of a data path and a controller. The RTL model of this template is shown in Fig. 1 . The data path consists of a set of functional units (FUs) and data routing network (i.e., delay registers and multiplexers). The outputs of FUs can be connected to the inputs of other FUs directly or through multiplexers and/or through delay registers, governed by the data dependencies present in the scheduled DFGs. FUs are determined by the number, type, and implementations of the resources obtained through the scheduling algorithm. FUs are generated using Xilinx CORE generator, whose characteristics are classified in an architecture library file and used by the scheduling and mapping algorithms. Characteristics here refer to the latency, implementation type, and area usage in terms of FPGA device primitives.
The multiplexers that are used to select the input data for the FUs are controlled by the schedule information (control words) stored inside the schedule table of the respective DFGs, which are controlled by the Global Controller depending on the context of the CDFG. The Global Controller is a simple finite state machine, which uses the FU outputs (for conditional branches in the input CDFG). The delay registers used are SRL16 shift register look-up-tables present on the Xilinx Virtex 4 FPGAs and are also generated using Xilinx CORE generator. This template is used to represent a custom architecture, defining the number of FUs, number of delay registers, number and size of multiplexers, and bus connections among all the blocks.
C. Algorithm Overview
The purpose of the algorithm is to generate an architecture that can support the execution of a CDFG and that can fit in the given FPGA area (area constraint). Given the heterogeneous nature of FPGAs and their flexibility to support multiple implementations of resources, the architectural options available are exponential. Our algorithm explores multiple designs and generates a set of architectures with different area-time tradeoffs which can support execution of the given CDFG. Now, we present a high level overview of the algorithm. Individual DFGs of a CDFG are scheduled using FDS with different latencies (time constraints) and different resource sets are obtained. These resource sets are merged across all DFGs to obtain common resource sets (explained in Section III-D), each of which can support the execution of all the DFGs. During the merge process, resource selection is done by evaluating WSDPs for all resource implementations. The number of available device primitives is updated after resources are allocated for the implementation with the least WSDP. As the resource sets associated with individual DFGs have different latencies, the merged resource sets have different latencies (CDFG execution time) and different relative area costs. A common resource set represented by its area cost (WSDP) and CDFG execution time (calculated using (7)) is termed as a partial solution. A partial solution set is obtained after resource sets of two DFGs are merged. We use Pareto optimality [18] to retain the set of optimal resource sets from partial solution set, before merging the resource sets of another DFG.
The common resource sets obtained after all the DFGs are processed are passed to a mapping algorithm (explained in Section III-E). The mapping algorithm calculates the resources required to route the data among FUs in order to support all the DFGs. Data routing resources refer to multiplexers and delay registers. A resource set along with the data routing information and the corresponding CDFG execution time is termed a solution. The relative area cost of the resource sets is updated with the area cost of data routing resources, and non-optimal solutions are again removed using Pareto optimality. The final set of Pareto optimal solutions represents the final set of context adaptable architectures.
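The Pareto filtering step used throughout the overview can be sketched as follows. The sketch treats each solution as an (area cost, execution time) pair and keeps the pairs that no other pair dominates; the function name and the sample values are hypothetical, and a real implementation would carry the resource set and schedule along with each pair.

```python
def pareto_front(solutions):
    """Return the (area, time) pairs not dominated by any other pair.

    A pair dominates another if it is no worse in both objectives
    and strictly better in at least one (both are minimized).
    """
    front = []
    for i, (w, t) in enumerate(solutions):
        dominated = any(
            (w2 <= w and t2 <= t) and (w2 < w or t2 < t)
            for j, (w2, t2) in enumerate(solutions)
            if j != i
        )
        if not dominated:
            front.append((w, t))
    return front

# Hypothetical partial solutions as (WSDP, CDFG execution time in cycles).
candidates = [(0.30, 100), (0.25, 120), (0.35, 100), (0.40, 90), (0.30, 130)]
kept = pareto_front(candidates)
```

In this example (0.35, 100) is discarded because (0.30, 100) is as fast and smaller, and (0.30, 130) is discarded because it is slower than (0.30, 100) at the same cost; the three remaining points are genuine area-time tradeoffs.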
D. Context Adaptable Architecture Exploration (CAAE) algorithm
The proposed CAAE algorithm, which includes resource selection, scheduling and mapping algorithms, generates a set of context adaptable architectures for a given CDFG. It takes as inputs, a CDFG to be synthesized, weighting factors of DFGs, number of iterations of all loops, and a set of available FPGA device primitives for use. The only constraint to the scheduling algorithm is that the design should fit in the given area (A) and there is no hard constraint on the execution time. Hence, the algorithm generates a set of best possible solutions with different execution times and area tradeoffs. It is then left to the designer to choose a solution with desirable execution time.
Algorithm: CAAE
Input: CDFG, Weighting factors ($f$), Number of loop iterations ($N$), Available device primitives ($A$).
Output: Set of context adaptable architectures.
Notation: Let $n_{DFG}$ be the number of DFGs present in the CDFG.
1. Calculate the critical path latencies, in clock cycles, of all the DFGs using the ASAP algorithm.
2. Calculate the lower bound on the execution time of the CDFG ($T_{min}$) using the critical path latencies and (7).
3. Create an initial solution ($S_{in}$, $W_{in}$, $A_{in}$, $T_{in}$), where $S_{in}$ is an empty resource set, $W_{in}$ is zero, $A_{in}$ is the input available device primitives, and $T_{in}$ is $T_{min}$.
4. Using FDS, find all schedules and resource sets of $G_1$.
5. For all resource sets, evaluate $G_1$ for WSDPs and CDFG execution time (using the Evaluate procedure) and generate a set of partial solutions ($P_{current}$).
6. If $P_{current}$ is an empty set, the algorithm is terminated.
7. For $i$ ← 2 to $n_{DFG}$, do
   a. Using FDS, find all schedules of $G_i$.
   b. $X_{new}$ ← Empty solution set.
   c. For a solution $m$ in the set $P_{current}$, do
      i. For all resource sets obtained in step-7a, evaluate $G_i$ for WSDPs and CDFG execution time, and generate a set of partial solutions ($X_m$).
      ii. Append solutions in set $X_m$ to set $X_{new}$.
      End For loop
   d. If $X_{new}$ is an empty set, the algorithm is terminated.
   e. Retain the set of partial Pareto optimal solutions ($P_{current}$) from set $X_{new}$.
   End For loop
8. Calculate $W_{mux}$ and $W_{reg}$ using a mapping algorithm for each solution in the Pareto optimal solution set ($P_{current}$), and add them to the WSDP of each solution, resulting in a new set $P_{current}$.
9. Retain Pareto optimal solutions from $P_{current}$, resulting in a new set $P_{final}$.
10. The final set of Pareto optimal solutions ($P_{final}$), along with their corresponding schedules, represents the set of context adaptable architectures.
The algorithm starts by calculating the critical path latencies, in clock cycles, of all the DFGs using the ASAP algorithm (step-1). In step-2, the lower bound on the execution time of the CDFG ($T_{min}$) is calculated using the critical path latencies and (7). Steps 3 through 7 of the algorithm generate an initial partial solution set using one DFG, and then iteratively update the solution set using the remaining DFGs one by one. These five steps are explained in the following paragraphs.
In step-3, an initial solution ($S_{in}$, $W_{in}$, $A_{in}$, $T_{in}$) is created, where $S_{in}$ is an empty resource set, the initial WSDP $W_{in}$ is zero, $A_{in}$ is the input available device primitives, and $T_{in}$ is the lower bound of the execution time ($T_{min}$). In step-4, DFG $G_1$ is scheduled using FDS with different latency constraints and multiple resource sets are obtained. The latency constraint is relaxed starting from the critical path latency until the resource set obtained from FDS contains one resource for each operation type. This terminating condition is chosen because relaxing the latency constraint beyond this point does not result in a smaller resource set. The resource set corresponding to the critical path latency is the most parallel schedule (fastest execution in terms of clock cycles) that can be achieved for $G_1$. Similarly, the resource set corresponding to the terminating latency constraint is the most sequential schedule (slowest execution) that can be achieved for $G_1$. In step-5, each resource set of $G_1$ is evaluated for area cost (WSDP) and execution time of the CDFG using the Evaluate procedure.
The Evaluate procedure takes as inputs a DFG, its corresponding resource sets, and an input solution. It returns a set of partial solutions with their associated area cost and execution time. In step-1, an empty solution set ($X$) is created. In step-2b, for each new additional resource present in $S_j$ over $S_{in}$, the following three steps are repeated: (i) WSDPs of all resource implementations are recomputed based on the current available device primitives, (ii) the implementation with the least WSDP is selected for the current resource (i.e., resource selection) and resources are allocated, and (iii) the WSDP of the current resource set and the number of available device primitives are updated. During resource selection, if two implementations of a resource have the same WSDP, then we choose the implementation that has the least standard deviation (3). The execution time ($T$) of the CDFG is calculated next using (7) for each resource set $S_j$. Execution times of individual DFGs corresponding to the input solution are used, except for the current DFG, for which $j$ is taken as the execution time. If the available device primitives are sufficient for the current DFG (i.e., if their values are non-negative at the end of all resource allocations), the current resource set $S_j$, its corresponding WSDP, the updated available device primitives, and the execution time represent a solution, which is appended to the solution set $X$ (step-2d). The resulting solution set $X$ (which represents a set of partial solutions) is returned to the CAAE algorithm. If the available device primitives are not sufficient to accommodate at least one resource set of the input DFG, the Evaluate procedure returns an empty set to the CAAE algorithm, indicating insufficient FPGA resources.
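The resource selection loop described for step-2b can be sketched in Python as shown here. The sketch rests on several assumptions that are not confirmed by this text: the WSDP is taken as the sum of required-to-available ratios, the tie-breaking standard deviation is computed over those same ratios, and an implementation that no longer fits is skipped immediately rather than detected by a non-negativity check at the end. The function name, library layout, and primitive counts are hypothetical.

```python
from statistics import pstdev

def select_and_allocate(new_resources, library, available):
    """Pick, for each added resource, the implementation with the least cost.

    new_resources -- resource types to add, one entry per instance
    library       -- {type: {implementation name: {primitive: count}}}
    available     -- {primitive: count still free}
    Returns (chosen, total_wsdp, remaining) or None if something does not fit.
    """
    avail = dict(available)  # work on a copy of the input solution's A
    chosen, total_wsdp = [], 0.0
    for rtype in new_resources:
        candidates = []
        for name, need in library[rtype].items():
            if any(avail.get(p, 0) < k for p, k in need.items()):
                continue  # this implementation no longer fits
            # (i) recompute the cost against the current availability
            ratios = [need.get(p, 0) / a for p, a in avail.items() if a > 0]
            spread = pstdev(ratios) if ratios else 0.0
            candidates.append((sum(ratios), spread, name))
        if not candidates:
            return None  # insufficient device primitives
        # (ii) least cost wins; the spread breaks ties (assumed reading of (3))
        cost, _, name = min(candidates)
        # (iii) update availability and the running area cost
        for p, k in library[rtype][name].items():
            avail[p] = avail.get(p, 0) - k
        chosen.append((rtype, name))
        total_wsdp += cost
    return chosen, total_wsdp, avail

# Hypothetical library with two multiplier flavors (counts are made up).
library = {"mul": {"dsp": {"DSP48": 1}, "lut": {"LUT": 300, "FF": 100}}}
free = {"LUT": 1000, "FF": 1000, "DSP48": 4}
result = select_and_allocate(["mul", "mul", "mul"], library, free)
```

With these numbers the first two multipliers take the DSP48 flavor, and the third switches to the LUT flavor because the shrinking DSP48 pool has raised the relative cost of the DSP48 flavor above that of the LUT flavor.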
Procedure: Evaluate
Input: Data-flow graph ($G$), resource sets of $G$, and input solution ($S_{in}$, $W_{in}$, $A_{in}$, $T_{in}$).
Output: Set of partial solutions ($X$).
Notations:
1) Let $S_j$ represent the resource set of a DFG scheduled with latency $j$. $S_j = \{n_r \mid r \in R\}$.
2) Let $S_{in}$ represent the resource set in the input solution. $S_{in} = \{n_r \mid r \in R\}$.
1. $X$ ← Empty solution set.
2. For each resource set ($S_j$), perform sub-steps 2a through 2d, as described in the text above.

The solution set obtained from the Evaluate procedure is represented as $P_{current}$ in the CAAE algorithm. In step-7 of the CAAE algorithm, another DFG is selected and multiple resource sets are obtained as explained before. For each partial solution in the set $P_{current}$, resource sets of the new DFG are evaluated for WSDPs and CDFG execution times, resulting in a new set of partial solutions represented by $X_{new}$. As $X_{new}$ can have non-optimal solutions with a new DFG added, we use Pareto optimality [18] to retain optimal solutions in the current partial solution set. The optimal solution set is again represented as $P_{current}$. The above process (steps-7a through 7e) is repeated for all the remaining DFGs.
After all the DFGs are processed in step-7, each solution in the set P_current represents a resource set that can support all the DFGs. In step-8, the area cost of data routing resources (i.e., multiplexers and delay registers) is estimated using the mapping algorithm (explained later in Section III-E) for each solution in P_current and added to the area cost of the solution itself, resulting in a new solution set P_current. All invalid solutions, which require more device primitives than those available, are pruned in the mapping algorithm. The final step is to retain the Pareto optimal solutions from the set of updated solutions (P_current), resulting in the final set of context adaptable architectures represented as P_final.
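The Pareto pruning applied to the partial and final solution sets can be sketched as follows. This is a minimal illustration that assumes each solution is reduced to an (area cost, execution time) pair with both objectives minimised; the function name `pareto_front` is hypothetical.

```python
def pareto_front(solutions):
    """Return the solutions not dominated by any other solution.

    Each solution is an (area_cost, exec_time) tuple; lower is better for both.
    A solution is dominated if another is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for i, (area, time) in enumerate(solutions):
        dominated = any(
            o_area <= area and o_time <= time
            and (o_area < area or o_time < time)
            for j, (o_area, o_time) in enumerate(solutions)
            if j != i
        )
        if not dominated:
            front.append((area, time))
    return front
```

The quadratic scan is adequate for the small solution sets that survive each DFG; sorting by one objective first would be the usual refinement for larger sets.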
E. The Mapping algorithm
The mapping algorithm is used to calculate the resources required to route data among FUs, in order to support all the DFGs of a CDFG using the schedule information obtained from the CAAE algorithm. Data routing resources refer to the multiplexers, which steer data from one FU to another, and delay registers, which delay the output data of an FU by the appropriate number of clock cycles before it is consumed by another FU. In step-8 of the CAAE algorithm, the mapping algorithm is invoked for each solution in the Pareto optimal solution set (P_current). The mapping algorithm takes as inputs the schedules of all the DFGs, the resource set of a solution, and the initial available device primitives (before the CAAE algorithm is invoked). It outputs the estimated WSDP values for multiplexers and registers.
Mapping of a scheduled DFG is carried out using a matrix called the reservation table, with the number of rows equal to the latency of the DFG and the number of columns equal to the total number of available resources present in the final resource set corresponding to the input solution. This table holds the associations of data nodes present in the DFG to the hardware resources. In order to reduce the number and size of multiplexers, when a node is considered for mapping to a specific resource, the mappings of its parent nodes are compared with the mappings of the parent nodes of the nodes already mapped to that resource. If a match is found, the existing bus connections can be reused and hence the current node is mapped to that resource. If a match is not found, the current node is mapped to the resource with the least number of node mappings, to distribute resource mappings. Since the parent node mappings are used during child node mappings, data nodes are processed in increasing order of their schedules. Once the reservation table is populated with the mapping information of a DFG, a functional unit list (FU List) is generated. As a resource has two input ports, for each entry, FU List maintains two lists, one for the left port and one for the right port. Each port list is populated with only distinct node and/or register mappings. As there are multiple DFGs, the reservation table is cleared and reused for each DFG and the port lists inside FU List are updated. After all the DFGs are processed, the number of entries in each port list indicates the size of the multiplexer required. The pseudo code for estimating LUT usage for multiplexers, given the number of inputs, is shown in Fig. 2. Register entries in each port list indicate the number of delay cycles and the number of delay registers required. Using this information and (8, 9), the device primitive usage (LUTs and FFs) for delay registers is estimated.
These equations are formulated by creating several SRL16 cores for various delay values using Xilinx CORE generator and observing the area requirements from the post P&R reports. The estimated device primitives are returned to the CAAE algorithm.
$$LUT_{reg} = \frac{\text{number of delay cycles}}{16} \times dataWidth \qquad (8)$$
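The two quantities read off the port lists can be sketched in a few lines of Python. The sketch assumes that a port list is simply a collection of source labels gathered across all DFGs, and that the quotient in (8) is rounded up, since one SRL16 provides at most 16 cycles of delay for one bit; both the rounding and the function names are assumptions.

```python
import math

def mux_inputs(port_sources):
    """Multiplexer size for one FU input port.

    port_sources: source labels (FU outputs or delay registers) seen at this
    port across all DFGs; only distinct sources need a multiplexer input.
    """
    return len(set(port_sources))

def delay_register_luts(delay_cycles, data_width, srl_depth=16):
    """LUTs for a delay line, following (8) with the quotient rounded up.

    ASSUMPTION: rounding up, because each SRL16 delays one bit by up to
    srl_depth cycles.
    """
    if delay_cycles <= 0:
        return 0
    return math.ceil(delay_cycles / srl_depth) * data_width
```

For example, under these assumptions a 32-bit value delayed by 20 cycles would need two SRL16s per bit, i.e. 64 LUTs.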
IV. RESULTS
This section presents and analyzes the results generated by the proposed methodology in comparison with prior work for various benchmarks taken from multiple application domains. Table II presents the test cases, taken from applications including the GNU scientific library (GSL), MPEG-2 video codec, MPEG-4 audio decoder, and H.264 video encoder. The control structures of these test cases are shown in Fig. 3. Each test case represents a CDFG with BBs embedded in different types of control structures. BBs shaded in gray are composed of non-trivial DFGs (multiple nodes), which are the major candidates for the architecture generation process, and BBs shown in white are composed of trivial DFGs (single node). The weighting factors of the BBs are calculated wherever the information is known at compile time; otherwise random values (in the range 0 to 1.0) are assumed.
Section IV-A discusses the accuracy of the proposed area estimation technique. Section IV-B provides analysis of the proposed methodology. Sections IV-C and IV-D present a comparison with prior work including the datapath merging approach [12] and template based algorithms [13], [14], [15]. Results are provided for architectures in terms of (a) relative area cost (in WSDP units), (b) execution time (in clock cycles) and (c) resource utilization (in terms of number of individual device primitives). As FPGAs are composed of four types of device primitives, the resource utilization is provided for each of the four device primitives for all the test cases. For all test cases, we consider the Xilinx Virtex 4 family of FPGAs as the target device and Microblaze (v7.10) [21] as the soft core processor wherever used. As the proposed methodology generates a set of solutions with different execution times and area costs, during comparison, only one solution is chosen which is at least as good as or better than architectures generated by other approaches.

[Fig. 2: pseudo code Estimate LUT MUX(Number of inputs (n), dataWidth)]
[Table II, partial row: Wavelet transform, WT, Floating point, GSL, 3. Note: All floating point data types are single precision.]
A. Accuracy of the proposed area estimation technique
The context adaptable architectures derived using the proposed methodology are implemented in Verilog and synthesized, and the post P&R resource utilization is obtained using Xilinx ISE 10.1. The Xilinx Virtex-4 SX35 FPGA [22] is used as the target device to realize the hardware architectures.
An estimation technique was proposed earlier in Section III-A to estimate the resource utilization of a circuit. To confirm the correctness and accuracy of this technique, we compared the estimated values against the actual post P&R resource utilization of architectures generated by our proposed methodology, as shown in Fig. 4. A comparison of LUT utilization is shown in Fig. 4(a). It can be observed from Fig. 4(b) that the percentage error in LUT estimation was less than 4.5% for all the test cases. Similarly, the percentage error in FF, DSP48, and BRAM resource utilization estimation was observed to be 0.2%, 0%, and 0% respectively.
B. Analysis of proposed methodology
1) Updating available device primitives: In the proposed methodology, scheduling and resource selection are performed simultaneously. For each additional resource required by the resource set of a scheduled DFG, WSDPs of all implementations are calculated, the option with the least WSDP is selected, and the number of available device primitives is updated. The advantage of re-evaluating the WSDPs of implementations after the allocation of each resource is discussed here. Consider a scheduled DFG (say G_1), which requires three floating point multiplication units and two floating point addition units. Assume that the device primitives available for mapping this DFG are 1500 LUTs, 1710 FFs, and 18 DSP48s (note that the available device primitives for mapping a DFG/CDFG need not be the total number of device primitives present on the un-configured FPGA). Routing resources are neglected for the sake of discussion. Table III shows the different implementations of the multiplier and adder, along with their corresponding area costs (in WSDPs) for different sets of available device primitives.
For the initial set of available device primitives, resources FMUL-2 and FADD-2 have the least area cost among the implementations of their respective types. If these two implementations are chosen for the multipliers (all three) and adders (both) of DFG G_1, then a total of 1148 LUTs, 1588 FFs, and 20 DSP48s are consumed. As there are only 18 DSP48s available, this is an invalid solution. Therefore, we propose a new approach of selecting an implementation for one functional unit at a time and then recomputing the area costs of the different implementations (as shown in columns 5 through 9 and the last row). In column 5, as FMUL-2 has the least area cost among all multiplier implementations, resources are allocated for it, the available device primitives are updated, and area costs are recomputed in column 6. Again FMUL-2 has the least area cost. However, in column 8 we see that FADD-1 has the least area cost, but once the device primitives are updated in column 9, FADD-2 has the least area cost. Using this approach, the final circuit consumes 1394 LUTs, 1710 FFs, and 16 DSP48s, which is a valid solution.

2) Optimal resource selection: In this sub-section, we compare solutions generated by our heuristic resource selection algorithm with optimal solutions. The optimal solutions are generated by exhaustively listing all possible combinations of resources and their implementations. To reduce the complexity of testing, we made a few assumptions: (i) there is only one operation type in the entire CDFG, (ii) each DFG has only one schedule (say, the critical path schedule), and (iii) resources available for mapping operations are implemented using only LUTs and/or DSP48s (FFs and BRAMs are not considered). Note that these assumptions are not limitations of the algorithm itself. As each DFG has only one schedule, there exists only one final solution for comparison, rather than a set of solutions as explained in the CAAE algorithm.
Also, the execution time of the CDFG remains the same for our heuristic resource selection and the optimal resource selection. The input available device primitives (LUTs and DSP48s only) are varied, resulting in a large number of test points for each test case, as shown in column 3 of Table IV. Column 4 shows the number of test points at which our algorithm generates optimal solutions. The last two columns indicate the maximum and average area cost overhead (in percent) of our heuristic resource selection when compared to the optimal solution. As we can see, though the maximum area overhead is approximately 20%, the average area overhead was not more than 3%. However, it is important to note that, as the proposed algorithm is heuristic, these values are greatly dependent on the amount of available device primitives at the beginning of the algorithm. The computational complexity of the proposed resource selection algorithm is O(nk), as opposed to O(n^k) for the exhaustive search algorithm, where n is the maximum number of resources of any type and k is the maximum number of flavors for any resource.
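The exhaustive baseline used for this comparison can be sketched as an enumeration over every combination of implementations for n instances of a single operation type. The sketch below is illustrative only: it assumes the same stand-in cost as a usage-to-available ratio summed over primitive types (not necessarily the WSDP of (2)), and `exhaustive_select` is a hypothetical name.

```python
from itertools import combinations_with_replacement

def exhaustive_select(n, implementations, available):
    """Try every multiset of n implementations and keep the cheapest feasible one.

    implementations: {impl_name: {primitive: count}} for one operation type.
    available: {primitive: count} at the start of the search.
    Returns (cost, combination) or None if no combination fits.
    ASSUMPTION: cost = sum over primitives of total usage / available.
    """
    best = None
    for combo in combinations_with_replacement(sorted(implementations), n):
        total = {}
        for name in combo:
            for prim, amount in implementations[name].items():
                total[prim] = total.get(prim, 0) + amount
        # Skip combinations that exceed any available primitive
        if any(amount > available.get(prim, 0) for prim, amount in total.items()):
            continue
        cost = sum(amount / available[prim]
                   for prim, amount in total.items() if amount)
        if best is None or cost < best[0]:
            best = (cost, combo)
    return best
```

The number of combinations grows quickly with n and the number of flavors, whereas the heuristic evaluates at most k options for each of the n instances.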
C. Comparison with datapath merging approach
This section evaluates the proposed architecture against the architectures generated using the datapath merging (DM) approach proposed by Moreano et al. [12]. For the DM approach, only DFGs present inside loops are processed, in decreasing order of number of nodes. The remaining DFGs (not present inside loops, but capable of hardware acceleration) are instead mapped onto a soft-processor. The area of the final merged architecture is calculated using (10, 11). Area_static is the area cost associated with a soft-processor (Microblaze on Xilinx FPGAs), which has a constant value depending on the type of operations it has to execute, as shown in Table V, and on the number of available device primitives (using (2)). In (11), C_j represents the total number of operations of type j present in the merged graph, V_R is the number of distinct operations, and R_j is the area cost (WSDP) of a resource that can implement operation j. Area_multiplexers is calculated using the pseudo code shown in Fig. 2.
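Read against the symbol definitions above, the merged-architecture area plausibly takes the following form. This is an assumed reading rather than the authors' stated equations, and the names Area_DM and Area_FU are introduced here only for illustration:

```latex
\mathrm{Area}_{DM} = \mathrm{Area}_{static} + \mathrm{Area}_{FU} + \mathrm{Area}_{multiplexers},
\qquad
\mathrm{Area}_{FU} = \sum_{j=1}^{V_R} C_j \, R_j
```

Under this reading, each of the V_R distinct operation types contributes its instance count C_j times the WSDP R_j of the resource implementing it, and the soft-processor and multiplexer costs are added on top.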
The DM approach is applied to the seven test cases and a comparison against the proposed method is provided in Fig. 5. Depending on the size of the test case, different sets of available device primitives are fed to the CAAE algorithm, as shown in Table VI. It can be seen that the proposed method generates faster and smaller circuits, with large area savings. Fig. 5(c) shows the final resource utilization for each of the individual device primitives for all seven test cases. It can be observed that, though our method consumes more DSP48s, it results in a lower final area cost (Fig. 5(a)) of the circuit by balancing out the device primitives. Note that the DM approach failed to generate a circuit that can fit the target FPGA for one test case (IDCT), where the number of data nodes is large (∼80).
D. Comparison with template based approaches (prior work)
The algorithms presented by Guo et al. [13], Kastner et al. [14], and Cong et al. [15] propose to accelerate applications by extracting hardware templates for configurable processors. Hence, we extend their methodology to applications involving control flow and compare it with our proposed approach. For each test case, templates are identified using these three approaches and are used to cover the DFGs of a CDFG, individually. These templates are potential candidates for hardware acceleration, and are hence termed non-trivial templates (NT) (multiple nodes). The uncovered operations of all DFGs are identified as trivial templates (single node) and are executed on a software processor (Microblaze). As there is no mention of the number of instances of NT templates, only one instance of each template is assumed to be present in the circuit. Unlike [13] and [14], the hardware templates used in [15] are assumed to have non-overlapped execution, because only one application specific instruction corresponding to a template can be executed at any given time. For every NT template, the corresponding DFG is scheduled using FDS and a resource set {n_r | r ∈ R} is obtained. The area cost of a single NT template k is shown in (12), where W_r is the relative area cost (WSDP) of resource r. The overall area cost is computed as the summation of the area costs of all the m NT templates, as shown in (13). In the presence of trivial templates, the area cost of the Microblaze (A_static) is added.

Fig. 6 shows the relative area cost and execution time of architectures generated by our proposed approach and the three template based approaches, for all the test cases. For each test case, the available device primitives used at the beginning of the algorithm are shown in Table VII. From Fig. 6(a) it can be seen that the architectures generated by the proposed methodology have the least relative area cost for all the test cases, except for IDCT when compared with Cong et al. [15].
This is because the number of distinct NT templates present in the graph is very small (though the number of templates required to cover the graph is large) and we assumed only one instance of each template to be present in the hardware, which requires less circuitry. (Note that the area cost for our architecture includes the resources for data routing (i.e., delay registers and multiplexers), whereas the area cost for the architectures generated using the other approaches includes only the area cost of arithmetic/logic resources.) However, the penalty can be observed in the execution time difference. From Fig. 6(b) it can be observed that the execution time for the architecture generated using Cong et al.'s method is 5072 cycles, whereas that of the architecture generated using our proposed algorithm is only 600 cycles. That is to say, when compared with Cong et al.'s method, though our algorithm has an area overhead of 5%, savings of 88% can be observed in execution time.

No.  Test case  LUT    FF      DSP  BRAM
1    ADE        3000   2000    10   50
2    IDCT       6700   6050    20   50
3    SOP        3000   2000    10   50
4    SS         9000   11000   20   50
5    CB         7000   6000    20   50
6    AS         7000   6000    20   50
7    WT         6000   5000    20   50
V. CONCLUSION
This paper presents a methodology to derive context adaptable architectures for FPGAs that can support multiple DFGs contained in the CDFG of an application. A novel area metric (WSDP) that is based on the heterogeneous mixture of device primitives in an FPGA is proposed and is used to guide resource selection when multiple implementations of a particular resource type are available. A context adaptable architecture (CAA) template is presented and a CAA exploration (CAAE) algorithm, which includes heuristic-based scheduling, resource selection, and mapping algorithms, is described in detail. The key features of the CAAE algorithm are as follows: (i) every instance of resource selection is aware of the FPGA primitives that remain after all the prior resource selections, (ii) heuristics and estimation formulas reduce the design time, and (iii) WSDP is used as the cost function in resource selection.
Architectures generated by the proposed methodology are compared against those generated using other published techniques. The test cases used for benchmarking are obtained from multiple application domains. Overall WSDP, execution times, and post P&R resource utilization (in terms of number of device primitives) are used as metrics for comparison. The proposed methodology outperformed the other published techniques by generating smaller (an average savings of 46% in WSDP) and faster (an average savings of 46% in execution time) architectures.
