As wire delay increasingly becomes a significant performance bottleneck in monolithic architectures, there is a strong motivation to move to dataflow architectures. In this paper, we propose a set of placement algorithms for generic dataflow architectures. Our timing-driven and profile-driven placement algorithms target streaming and non-streaming applications, respectively. Compared to the conventional wirelength-driven algorithm, our timing-driven placer reduces the longest path delay by 23% and the maximum slack by 26% at the cost of a 10% increase in wirelength for streaming applications. In addition, our profile-driven placer reduces the total execution time of non-streaming applications by 17%. Lastly, our simultaneous timing/profile-driven placer reduces the total execution time of non-streaming applications by 13% on average.
I. INTRODUCTION
Wire delay is increasingly becoming a significant performance bottleneck in monolithic architectures, whose centralized structure requires fast communication over long distances. This creates a strong motivation to move to dataflow architectures. Dataflow computing is a computing paradigm with an abundance of computing units that can be statically or dynamically configured to match the computing requirements of the given application. Dataflow architectures distribute their ALUs, storage, and communication paths over a 2-dimensional grid and enable enormous parallelism in computation and communication by eliminating complex centralized control. They fire operations into ALUs as soon as the required input operands become available. The results are then routed to other ALUs waiting on them. Wire delay has a reduced impact since only neighboring architectural entities are allowed to communicate within a single clock cycle. This allows dataflow architectures to be extremely scalable compared to the traditional von Neumann architecture.
Streaming applications are particularly well suited to execute on dataflow architectures. Such applications tend to have very regular and fixed communication patterns that can be statically mapped to the dataflow grid, so they have the potential to achieve very high performance on a dataflow architecture. A number of dataflow architectures that target streaming applications are being developed by industry and the research community. These include TRIPS [1] and MONARCH [2]. In addition, there is an ongoing research effort on architectures that target non-streaming applications, such as TRIPS [1] and WaveScalar [3].
II. PRELIMINARIES

A. Target Architecture
In this paper, two types of architecture are studied. The first is the MONARCH [2] architecture, and the second is an extension of MONARCH that includes some features from WaveScalar [3]. MONARCH, developed by USC/ISI and Raytheon Corporation, is a polymorphic architecture consisting of groups of arithmetic and memory clusters structurally tiled onto a surface, with RISC processors and DRAMs located on the sides of the chip. Intra- and inter-cluster interconnections allow communication between both neighboring and non-neighboring components. Connections are made based on the compilation result of the dataflow graph. The model of computation is also defined by the compiler, which provides streams and kernels that operate on the application.
The WaveScalar [3] architecture is based on the dataflow architecture paradigm. However, the von Neumann memory system is preserved so that programs written for von Neumann machines can be recompiled and run on WaveScalar architectures. The tag system from the dataflow concept is adopted so that different instances of the same variable can be in flight concurrently. This architecture targets non-streaming applications and has been shown to minimize execution time. Figure 1 shows an illustration of the mapping of a dataflow graph (leftmost) to a dataflow fabric (middle). Squares in the dataflow fabric represent arithmetic clusters and circles represent memory clusters. To support this WaveScalar feature, the input operands of each ALU contain a tag field, and the tags must match before the ALU executes the instruction. This extension therefore requires additional buffers, comparison hardware, and an extended operand field that holds the tag assignments. Along with the additional hardware, additional instructions are inserted by the compiler to handle this extended feature [4]. A tag generation mechanism is included for the modifications made to the ALUs. Enabling this tag system allows many iterations of a loop to be executed in parallel whenever resources are available. Throughout our experiment, we assume that each arithmetic cluster consists of eight ALUs and four multiplexers, whereas each memory cluster consists of four memory nodes and four multiplexers.
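The tag-matching rule described above can be illustrated with a minimal sketch. The class name `TaggedAlu`, its `receive` method, and the per-tag operand buffer are hypothetical names introduced only for illustration; they are not taken from MONARCH or WaveScalar. The sketch assumes a two-input ALU that fires as soon as all operands carrying the same tag have arrived.

```python
from collections import defaultdict


class TaggedAlu:
    """Illustrative ALU that fires only when operands with equal tags are present."""

    def __init__(self, op, arity=2):
        self.op = op                       # function applied to the matched operands
        self.arity = arity                 # number of input ports
        self.waiting = defaultdict(dict)   # tag -> {port: value}, the operand buffer

    def receive(self, tag, port, value):
        """Store one operand; return (tag, result) if the instruction fires, else None."""
        slots = self.waiting[tag]
        slots[port] = value
        if len(slots) == self.arity:       # all operands of this tag have arrived
            del self.waiting[tag]          # free the buffer entry
            return tag, self.op(*(slots[p] for p in range(self.arity)))
        return None


alu = TaggedAlu(lambda a, b: a + b)
alu.receive(0, 0, 1)            # iteration 0, left operand: waits
alu.receive(1, 0, 10)           # iteration 1, left operand: waits
print(alu.receive(1, 1, 20))    # iteration 1 completes first
print(alu.receive(0, 1, 2))     # iteration 0 completes later
```

Because operands are buffered per tag, a later loop iteration can fire before an earlier one, which is the source of the loop-level parallelism mentioned above.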
B. Dataflow Graph
Dataflow graphs (DFGs) are used to identify which instructions are the producers of data needed by other instructions. Loops in a program are captured as cycles in the DFG. After the compiler constructs a DFG, each node in the DFG is mapped to a processing element on the dataflow architecture grid. A general processing element consists of ALUs and input buffers that store operands. Processing elements can execute and communicate in parallel, subject to the dataflow constraints captured by the DFG.
III. PROBLEM FORMULATION
A. Design Flow
An application is first fed into the system through the front-end compiler, which generates the dataflow graph and performs high-level, machine-independent optimization. Low-level optimization is then invoked during back-end compilation. We modify the Trimaran compiler [4] to target dataflow architectures such that the placer reads the annotated dataflow graph, performs placement, and annotates the placement solution back for the assembler. The assembler then generates the binary for a given architecture. Note that the architecture description is read by the compiler, placer, and assembler, so that minor architecture modifications can be made without modifying the system. Statistical information about the given application is extracted by the front-end compiler and is available to the other parts of the flow.
B. Dataflow Graph Placement Problem
We model a dataflow graph using a graph $G = (V, E)$, where $V$ is the set of dataflow nodes and $E$ is the set of dataflow edges. Let path $p$ denote a set of nodes in $V$ such that two successive nodes are connected with an edge in $E$ and there is only one primary input and one primary output in $p$. We model the dataflow architecture using a graph $A = (M, W)$. $M$ is the set of elements (the smallest units) of the dataflow machine. Each node $m \in M$ has location information assigned to it. If $m_i, m_j \in M$ are in the same cluster, they have the same location information. $W$ is the set of wires. A Dataflow Graph Placement solution is a one-to-one mapping $V \rightarrow M$ such that the mapping from $v \in V$ to $m \in M$ satisfies type constraints. There are three types of dataflow nodes: compute nodes, memory nodes, and switch nodes. There are three types of machine elements: arithmetic elements, memory elements, and multiplexing elements. The mapping is limited to compute nodes onto arithmetic elements, memory nodes onto memory elements, and switch nodes onto multiplexing elements.
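The legality conditions above (a one-to-one mapping that respects the three type pairings) can be checked mechanically. The following is a minimal sketch; the function name `is_legal_placement`, the dictionary-based representation, and the type strings are assumptions made for illustration only.

```python
# Allowed machine-element type for each dataflow-node type, as stated in the text.
ALLOWED = {
    "compute": "arithmetic",
    "memory": "memory",
    "switch": "multiplexing",
}


def is_legal_placement(node_type, elem_type, placement):
    """Check a placement {node: element} against the constraints above.

    node_type: {node: "compute" | "memory" | "switch"}
    elem_type: {element: "arithmetic" | "memory" | "multiplexing"}
    """
    # Every dataflow node must be placed, and nothing else.
    if set(placement) != set(node_type):
        return False
    # One-to-one: no machine element hosts two nodes.
    if len(set(placement.values())) != len(placement):
        return False
    # Type constraint for each (node, element) pair.
    return all(
        elem_type.get(elem) == ALLOWED[node_type[node]]
        for node, elem in placement.items()
    )


node_type = {"n1": "compute", "n2": "memory", "n3": "switch"}
elem_type = {"alu0": "arithmetic", "alu1": "arithmetic",
             "mem0": "memory", "mux0": "multiplexing"}
print(is_legal_placement(node_type, elem_type,
                         {"n1": "alu0", "n2": "mem0", "n3": "mux0"}))
```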
Our objective function includes wirelength, longest path delay, maximum slack, and total execution time. Wirelength is calculated as the half-perimeter of the bounding box of each edge in the dataflow graph $G$. Longest path delay is the maximum arrival time among all primary output nodes in $G$. Maximum slack is the maximum slack value among all edges in $G$. Maximum slack is proportional to the size of the buffer used to store data when one or more inputs are not yet available. Hence, reducing the maximum slack decreases the complexity of the machine required to run a streaming application. The total execution time is explained in the following section.
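For an edge with two endpoints, the half-perimeter of the bounding box equals the Manhattan distance between the two placed locations. The sketch below assumes that the total wirelength is the sum over all edges and that locations are integer grid coordinates; the function names are hypothetical. Nodes placed in the same cluster share a location and therefore contribute zero wirelength.

```python
def half_perimeter(a, b):
    """Half-perimeter of the bounding box of two grid locations (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_wirelength(edges, location):
    """Sum of per-edge half-perimeters (summation is an assumption of this sketch)."""
    return sum(half_perimeter(location[u], location[v]) for u, v in edges)


# n2 and n3 sit in the same cluster, so they share one location.
location = {"n1": (0, 0), "n2": (2, 1), "n3": (2, 1)}
edges = [("n1", "n2"), ("n2", "n3"), ("n1", "n3")]
print(total_wirelength(edges, location))  # 3 + 0 + 3 = 6
```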
C. Execution Time Estimation
Since our dataflow computer has no speculation, we estimate the total execution time of a given application as follows. The estimation is based on the number of times each node and edge are executed. For each path $p$, the execution time of $p$, denoted $exec(p)$, is computed as follows:

$$exec(p) = \sum_{e \in p} f(e) \cdot d(e)$$

where $d(e)$ is the delay of edge $e$ and $f(e)$ denotes the access frequency collected by profiling.
During profiling, high-level simulations are performed on the C source code. By running the application on sample input sets, statistical information, such as how many times each path is executed, is collected. This information is then annotated back onto each edge in the DFG. Then the total execution time is estimated as

$$exec(G) = \max_{p \in G} exec(p)$$

In other words, we compute the weighted longest path delay where frequency information is used as the weight on each edge. Thus, a topological-ordering-based $O(n)$ timing analysis is enough to compute the total execution time. To compute the longest path, cycles have to be removed first. However, the back edges are included during the execution time computation: we add the delay of the source gate and the delay of the back edge to all paths that contain this back edge.
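The weighted longest path computation can be sketched as a single pass in topological order. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes back edges have already been removed, that each edge carries a delay in cycles and a profiled frequency, and it omits the back-edge correction described above. The function name `estimate_exec_time` and its arguments are hypothetical.

```python
from graphlib import TopologicalSorter


def estimate_exec_time(nodes, edges, delay, freq):
    """Frequency-weighted longest path over an acyclic dataflow graph.

    edges: list of (u, v); delay[(u, v)] in cycles; freq[(u, v)] from profiling.
    """
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    best = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        best[v] = max(
            (best[u] + freq[(u, v)] * delay[(u, v)] for u in preds[v]),
            default=0,  # input nodes start at 0
        )
    return max(best.values(), default=0)


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]
delay = {("a", "b"): 1, ("b", "d"): 1, ("a", "c"): 3, ("c", "d"): 3}
freq = {("a", "b"): 10, ("b", "d"): 10, ("a", "c"): 1, ("c", "d"): 1}
print(estimate_exec_time(nodes, edges, delay, freq))  # 20, via a -> b -> d
```

In this example the path with the largest plain delay is a, c, d (6 cycles), yet the frequently executed path a, b, d dominates the estimate, which is why a longest-path objective alone can miss the execution-time bottleneck.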
To evaluate our placement results, we assume perfect memory and sufficiently large input queue buffers. Note that WaveScalar [3] also assumes a perfect L1 data cache and unbounded input queues. Assuming that dataflow architectures are governed by parallelism, performance can be calculated based on how many times each path is executed as well as its delay. We use a training input set for profiling and a reference input set for performance evaluation.
IV. PLACEMENT ALGORITHM
We present three simulated-annealing-based algorithms for dataflow graph placement: timing-driven, profile-driven, and simultaneous timing/profile-driven placement. Timing slack is used to compute the weighted wirelength in timing-driven placement. In profile-driven placement, frequency information provided by the compiler is used to compute the weighted wirelength. Lastly, the simultaneous placer uses a cost function that combines the timing-driven and profile-driven costs.
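All three placers minimize a weighted wirelength and differ only in how the edge weights are derived. The sketch below shows the shared cost and one possible way to combine two weight sets. The max-normalization and the mixing parameter `alpha` are assumptions made for illustration; the combination actually used by the simultaneous placer is not specified here.

```python
def weighted_wirelength(edges, location, weight):
    """Sum over edges of weight times Manhattan distance between the endpoints."""
    total = 0.0
    for u, v in edges:
        (x1, y1), (x2, y2) = location[u], location[v]
        total += weight[(u, v)] * (abs(x1 - x2) + abs(y1 - y2))
    return total


def normalize(weights):
    """Scale weights into [0, 1] so that differently ranged sources are comparable."""
    hi = max(weights.values(), default=0)
    return {e: (w / hi if hi else 0.0) for e, w in weights.items()}


def blended_weights(timing_w, profile_w, alpha=0.5):
    """Hypothetical combination: convex mix of normalized timing and profile weights."""
    t, p = normalize(timing_w), normalize(profile_w)
    return {e: alpha * t[e] + (1 - alpha) * p[e] for e in t}


edges = [("a", "b"), ("b", "c")]
location = {"a": (0, 0), "b": (1, 2), "c": (1, 2)}
timing_w = {("a", "b"): 2, ("b", "c"): 4}
profile_w = {("a", "b"): 10000, ("b", "c"): 0}
w = blended_weights(timing_w, profile_w)
print(weighted_wirelength(edges, location, w))  # 0.75 * 3 + 0.5 * 0 = 2.25
```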
A. Timing-Driven Placement
We begin by removing the back edges that result in cycles in the dataflow graph. We compute the arrival time of node $v$ in the given DFG as follows:

$$arr(v) = \max_{(u,v) \in E} \{ arr(u) + d(u,v) \}$$

where $E$ is the set of edges of the DFG and $d(u,v)$ is the delay of edge $(u,v)$. The arrival time of all dataflow input nodes is set to 0. Then we traverse forward from the dataflow input nodes to compute $arr(v)$. Let $l$ be the longest path delay, which is the maximum $arr(v)$ among all dataflow output nodes (nodes with no outgoing edge). We compute the required time of node $v$ in the given DFG as follows:

$$req(v) = \min_{(v,w) \in E} \{ req(w) - d(v,w) \}$$

To compute the required time, all dataflow output nodes are set to $l$. Then we traverse backward from the dataflow output nodes to calculate the required time. The timing slack of node $v$ is then computed as $slack(v) = req(v) - arr(v)$. We compute the slack of an edge as follows:

$$slack(u,v) = req(v) - arr(u) - d(u,v)$$
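The forward and backward traversals can be sketched as follows. This is a minimal illustration using the standard static timing analysis recurrences, assuming that back edges have already been removed and that all delay is attributed to edges; the exact delay model of the placer may differ, and the function name `timing_analysis` is hypothetical.

```python
from graphlib import TopologicalSorter


def timing_analysis(nodes, edges, delay):
    """Return (arrival, required, node_slack, edge_slack, longest) for an acyclic DFG."""
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    order = list(TopologicalSorter(preds).static_order())

    # Forward pass: input nodes arrive at time 0.
    arr = {}
    for v in order:
        arr[v] = max((arr[u] + delay[(u, v)] for u in preds[v]), default=0)

    # Longest path delay: maximum arrival time over output nodes.
    longest = max((arr[v] for v in nodes if not succs[v]), default=0)

    # Backward pass: output nodes are required at the longest path delay.
    req = {}
    for v in reversed(order):
        req[v] = min((req[w] - delay[(v, w)] for w in succs[v]), default=longest)

    node_slack = {v: req[v] - arr[v] for v in nodes}
    edge_slack = {(u, v): req[v] - arr[u] - delay[(u, v)] for u, v in edges}
    return arr, req, node_slack, edge_slack, longest


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]
delay = {("a", "b"): 1, ("b", "d"): 1, ("a", "c"): 3, ("c", "d"): 3}
arr, req, node_slack, edge_slack, longest = timing_analysis(nodes, edges, delay)
print(longest)        # 6
print(edge_slack)     # critical edges a->c and c->d have zero slack
```

Edges on the critical path have zero slack, so a timing-driven weight would typically grow as slack shrinks.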
B. Profile-Driven Placement
The motivation for our profile-driven placement is to reduce the total execution time of the application. Profiling-based optimization has been studied in [5]. In [5], the most frequently executed blocks are prioritized and then optimized aggressively, with more emphasis placed on instruction scheduling in hot-spot locations. For dataflow computing, by using speculative execution [1] [...]

[...] is wirelength. At the beginning of each temperature level, timing slack is computed for timing-driven placement and remains fixed during that temperature level. Our cost function minimizes weighted wirelength, which is calculated based on the target objective: timing-driven, profile-driven, or timing/profile-driven. We use an initial temperature of 10000, a final temperature of 100, and a cooling rate of 0.95.
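The stated schedule (initial temperature 10000, final temperature 100, cooling rate 0.95) yields 90 temperature levels. The sketch below pairs that schedule with a generic annealing loop. The pairwise swap move, the Metropolis acceptance rule, the number of moves per level, and the function names are assumptions made for illustration; only the three schedule parameters come from the text.

```python
import math
import random


def temperature_levels(t_init=10000.0, t_final=100.0, rate=0.95):
    """Yield the geometric cooling schedule described in the text."""
    t = t_init
    while t > t_final:
        yield t
        t *= rate


def anneal(placement, cost, moves_per_level=100, seed=0):
    """Generic annealer: placement[i] is the slot of node i; cost maps a placement to a number."""
    rng = random.Random(seed)
    cur = list(placement)
    cur_cost = cost(cur)
    best, best_cost = list(cur), cur_cost
    for t in temperature_levels():
        # Edge weights (e.g. timing slack) would be recomputed here, once per level.
        for _ in range(moves_per_level):
            i, j = rng.sample(range(len(cur)), 2)
            cur[i], cur[j] = cur[j], cur[i]          # propose: swap two nodes
            new_cost = cost(cur)
            delta = new_cost - cur_cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cur_cost = new_cost                  # accept the move
                if cur_cost < best_cost:
                    best, best_cost = list(cur), cur_cost
            else:
                cur[i], cur[j] = cur[j], cur[i]      # reject: undo the swap
    return best, best_cost


print(len(list(temperature_levels())))  # 90
```

Recomputing the weights once per temperature level, rather than after every move, keeps the per-move cost evaluation cheap.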
V. EXPERIMENTAL RESULTS
Our placement algorithms were implemented in C++, compiled with gcc 3.2.2, and run on a dual-processor 2.4 GHz Pentium IV machine. Table I lists the characteristics of our MediaBench and SPEC2000 benchmark applications. We report (i) the name of the function that is selected, (ii) the number of compute nodes (#co), total nodes (#nodes), and edges (#edges) in each dataflow graph, and (iii) the dimensions of the arithmetic cluster array in terms of width and height (#aclusters). For the MediaBench benchmarks, the most frequently executed function was selected for evaluation. For the SPEC2000 benchmarks, the function that best captured the benchmark behavior reported in [6] was selected. Throughout the experiment, wirelength was measured in terms of relative Manhattan distance. Delay and total execution time were measured in cycles and hundreds of millions of cycles, respectively (the clock cycle is fixed throughout our experiment). The runtime was measured in seconds.
The comparison of our timing-driven placement to wirelength-driven placement is shown in Table II. We use streaming applications from MediaBench. The wirelength (wire), longest path delay (dly), maximum slack (slk), and runtime are reported. Our timing-driven placement increases wirelength by 10% on average. However, the improvements in longest path delay and maximum slack are 23% and 26% on average, respectively. Thus, our edge-weight-based timing-driven placement is effective in reducing both the longest path delay and the maximum slack. The runtime overhead is minimal since our $O(n)$ timing analysis is performed only once during each temperature level.

TABLE II
             wirelength-driven      timing-driven
benchmark    wire    dly   slk      wire    dly   slk
cjpeg         223     51    35       247     35    20
djpeg       19965     77    52     21067     64    57
gsmde       10595     59    53     11451     43    39
gsmen       12432     72    68     13084     49    45
mpeg2d      10765     69    56     11605     48    42
mpeg2e      10773    102    93     12025     79    69
rawc          441     55    45       511     53    34
rawd          270     47    34       315     37    21
RATIO        1.00   1.00  1.00      1.10   0.77  0.74
TIME        31659                  34239

Table III illustrates the results for non-streaming applications on the extended architecture. We report wirelength (wire), longest path delay (dly), and total execution time (exec) discussed in Section III-C. We also report the total elapsed CPU time of all benchmarks. Note that the runtime covers only running our simulated annealing and evaluating the result. It does not include the compile time used by the compiler. First, our timing-driven algorithm is about 16% better than the wirelength-driven algorithm in terms of longest path delay. However, wirelength and runtime are increased by 19% and 43%, respectively. Surprisingly, timing-driven placement does not obtain a better execution time than wirelength-driven placement. This is because the longest path is not always the most frequently executed path in the program. An aggressive minimization of such a path may even degrade the total execution time.
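As a consistency check on Table II, the RATIO row matches the arithmetic mean of the per-benchmark ratios (timing-driven value divided by wirelength-driven value). That this is how the row was computed is an inference from the numbers, not something stated in the text. The sketch below reproduces the row from the tabulated values.

```python
# Table II: benchmark -> ((wire, dly, slk) wirelength-driven, (wire, dly, slk) timing-driven)
TABLE_II = {
    "cjpeg":  ((223, 51, 35),     (247, 35, 20)),
    "djpeg":  ((19965, 77, 52),   (21067, 64, 57)),
    "gsmde":  ((10595, 59, 53),   (11451, 43, 39)),
    "gsmen":  ((12432, 72, 68),   (13084, 49, 45)),
    "mpeg2d": ((10765, 69, 56),   (11605, 48, 42)),
    "mpeg2e": ((10773, 102, 93),  (12025, 79, 69)),
    "rawc":   ((441, 55, 45),     (511, 53, 34)),
    "rawd":   ((270, 47, 34),     (315, 37, 21)),
}


def mean_ratios(table):
    """Arithmetic mean over benchmarks of timing-driven / wirelength-driven, per metric."""
    sums = [0.0, 0.0, 0.0]
    for base, timing in table.values():
        for k in range(3):
            sums[k] += timing[k] / base[k]
    return tuple(s / len(table) for s in sums)


wire_ratio, delay_ratio, slack_ratio = mean_ratios(TABLE_II)
print(f"{wire_ratio:.2f} {delay_ratio:.2f} {slack_ratio:.2f}")  # 1.10 0.77 0.74
```

Note that the averages hide per-benchmark variation: for djpeg the maximum slack increases from 52 to 57 under timing-driven placement.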
Second, our profile-driven placement improves the total execution time by 17% compared to wirelength-driven placement. However, wirelength, longest path delay, and runtime are increased by 27%, 1%, and 48%, respectively. We note that the range of profile-based edge weights is very wide: [0, 10000]. The range of timing-based edge weights, on the other hand, is much smaller. This in turn causes a large wirelength penalty for our profile-driven placement. The runtime overhead includes the time to acquire frequency information from the compiler. Third, our hybrid timing/profile-driven placement improves the total execution time by 13% compared to wirelength-driven placement. It also improves both the wirelength and the longest path delay over the profile-driven placement.
VI. CONCLUSIONS
In this paper, we proposed a compiler-driven placement framework that supports a generic dataflow architecture. We presented timing-driven placement for a streaming dataflow architecture and profile-driven placement for a non-streaming dataflow architecture. We also proposed a simultaneous timing/profile-driven placement that combines the benefit of both approaches.
