The ability of a compiler to exploit loop-level parallelism in a reconfigurable array is significantly affected by the amount of flexibility in the interconnect architecture. A less flexible interconnect will make it more difficult for the compiler to find efficient loop-level pipelined schedules, leading to reduced instruction throughput and larger configuration bit storage area. In this paper, we determine the optimum flexibility and topology for a point-to-point interconnect architecture in a reconfigurable system. We present four topologies, and show that their performance per unit area is significantly better than that which would be obtained if a fully-connected network had been used.
Introduction
Today's multimedia applications require more processing power than ever before. This processing power can be supplied by standard processors, digital signal processors, application-specific standard products, or reconfigurable systems. Of these, reconfigurable systems provide a balance between design time, performance, customizability, power dissipation, and cost. These advantages have motivated many academic and commercial reconfigurable systems [1-12]. The heart of any reconfigurable system is a reconfigurable fabric, upon which highly parallel parts of an application can be executed. Reconfigurable fabrics can be classified as fine-grained, in which the basic unit of computation on the fabric is a lookup table, or coarse-grained, in which the basic unit of computation is larger, such as an arithmetic/logic unit (ALU). Compared to fine-grained architectures, coarse-grained architectures have a smaller reconfiguration overhead, leading to lower power, faster achievable clock speeds, and a more predictable mapping.
Coarse-grained architectures share much in common with both fine-grained FPGAs and general-purpose (multi-function unit) processors. Like a processor, a coarse-grained architecture consists of basic blocks which are ALUs that operate on several bits at a time. Like a fine-grained FPGA, the "personality" of each ALU in each time step can be configured using configuration bits (just as the configuration bits in a lookup table indicate [...]).
In this paper, we investigate the architecture of the network that connects coarse-grained functional units (CFU's) within a reconfigurable fabric.
This interconnect architecture is important. Ideally, each CFU will be performing useful work every clock cycle. To achieve this, data must be supplied to the inputs of each CFU in a timely fashion. In addition, the pattern of data transfers will change every cycle. Providing paths for this data transfer for all CFU inputs on the chip requires a flexible interconnect. On the other hand, if the network is too flexible, it will consume a large area on the chip, reducing the number of CFU's that can be integrated onto the fabric, as well as increasing the power dissipation.
The interconnect architecture within a reconfigurable system has been studied before. Compton has studied the automatic generation of segmented interconnect architectures in 1-D reconfigurable systems [13]. Unlike her work, we focus on 2-D fabrics and consider point-to-point networks. In [14], Bansal also studied the routing architecture in a reconfigurable system, but considered only small hand-placed list-scheduled benchmarks, and did not quantify the area impact of the proposed architectures. Also unlike both of these previous works, our architecture is intended to support loop-level parallelism (software pipelining), which places significantly more constraints on the routing architecture.
To make our results concrete, we focus on the ADRES architecture, which was developed to accelerate the digital signal processing requirements of multimedia systems [15].
ADRES consists of a VLIW processor tightly coupled to a reconfigurable fabric containing a heterogeneous grid of CFU's. Section 2 contains a description of the ADRES architecture, as well as the DRESC tool which maps programs to the device [16]. Section 3 will describe the interconnect architectures we considered, and Section 4 will evaluate these architectures experimentally.
Context
In this section, we set the stage by describing our architectural assumptions and the mapping technology used for our architectures.
a) Architectural Framework
Our baseline architecture consists of a configurable array coupled with a general-purpose VLIW processor [15]. Highly pipeline-able loops are identified and executed on the configurable array, and sequential code is executed by the processor. The processor and the reconfigurable unit communicate via a global register file.
The reconfigurable fabric consists of a 4x4 heterogeneous array of configurable functional units (CFU's), as shown in Figure 1(a). Each CFU can perform a subset of forty-five 32-bit operations in each cycle. All CFU's can perform a variety of arithmetic (signed and unsigned), logic, shifting, and comparison instructions. The CFU's in the third column can also perform signed and unsigned multiplication. Finally, the CFU's in the top row can access data from a shared register file, thereby interacting with a general-purpose CPU.
Figure 1(b) shows the structure of each CFU. The CFU receives two data inputs and one predicate input, and performs one of several functions, producing a single 32-bit data output and two predicate outputs (predicate signals are single-bit signals that can be used to remove branches from pipeline-able loops). In addition, each CFU contains one 4-entry 32-bit data register file, and one 4-entry single-bit banked predicate register file.
The input multiplexer select lines, the function performed by the functional unit, and the select and write-enable lines for the register files are controlled by configuration bits. We assume a multi-context device, in which the behaviour of each CFU can change from cycle to cycle (we will vary the number of contexts, as described in Section 4). Unlike a traditional processor, the number of contexts is small (in the experiments in this paper, we rarely need more than 16 contexts). Thus, it is possible to store all contexts on chip, and select between them using a multiplexer. This means we can change the context each cycle, just as a processor can change the operation executed by an ALU each cycle. We assume that once the configuration bits are loaded, they do not change throughout the implementation of a loop (so, if there are 16 contexts, only one of the 16 will be used at any time throughout the execution of the entire loop).
The CFU's are interconnected using a flexible point-to-point network. Section 3 will describe the networks considered in this paper.
b) Mapping Technology
This section describes the CAD flow that maps programs onto our coarse-grained architecture. The IMPACT tool is used as a front-end to parse C code, do some optimization and analysis, and emit a data-flow graph [17]. Highly parallelizable loops within the code are then identified, and the DRESC tool uses a modulo scheduling algorithm to schedule each loop on the target architecture, assigning each operation in the loop to a specific CFU and a specific time slice [16]. The DRESC tool takes into account the characteristics of the target architecture, including the capabilities of each CFU and the interconnect between the CFUs; this ensures that the resulting schedule can be implemented on the reconfigurable device. In essence, the DRESC tool performs the tasks of scheduling, placement, and routing simultaneously. An example from [16] illustrates the scheduling problem, and provides insight into the interconnect requirements in the reconfigurable fabric. Consider the implementation of the dataflow graph in Figure 2(a) on the fabric of Figure 2(b). If this dataflow graph represents one iteration of a loop that is executed many times, the execution of subsequent iterations can be overlapped in time as shown in Figure 3 (in Figure 3, the architecture is drawn as a 1-D array for clarity; time is the vertical axis). At time t=0, CFU 3 executes operation N1 for iteration 1. At time t=1, CFU's 1 and 4 execute operations N2 and N3 for iteration 1, while CFU 3 starts iteration 2 by executing operation N1. Potential interconnect paths between the functional units are shown as dotted and solid lines in Figure 3; those interconnects that are actually used to transfer data are shown as solid. Note that, in this example, the configuration of each CFU does not change over time; instead, the data is transferred to the appropriate CFU each cycle. This suggests that this interconnect architecture must be very flexible.
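The overlap of iterations described above can be sketched in a few lines of Python. This is only an illustration: the placement table contains just the three operations named in the text (N1, N2, N3), the rest of the dataflow graph in Figure 2(a) is not reproduced, and the function name `overlapped_schedule` is hypothetical rather than part of DRESC.

```python
# Placement of one loop iteration, taken from the example in the text:
# (operation, CFU, start cycle relative to the start of the iteration).
# Only the operations named in the text are listed (an assumption of this sketch).
ONE_ITERATION = [("N1", 3, 0), ("N2", 1, 1), ("N3", 4, 1)]


def overlapped_schedule(one_iteration, ii, num_iterations):
    """Return {cycle: [(cfu, op, iteration), ...]} for a software-pipelined loop.

    Iteration k (numbered from 1) starts (k - 1) * ii cycles after iteration 1.
    """
    schedule = {}
    for it in range(num_iterations):
        offset = it * ii
        for op, cfu, start in one_iteration:
            schedule.setdefault(offset + start, []).append((cfu, op, it + 1))
    return schedule


sched = overlapped_schedule(ONE_ITERATION, ii=1, num_iterations=3)
for cycle in sorted(sched):
    print(cycle, sorted(sched[cycle]))
```

With an interval of one cycle, every CFU repeats the same operation each cycle once the pipeline has filled, which is why a single configuration suffices in this example.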
In some cases, it may not be possible to schedule a new iteration to start every cycle. This may be because of data dependencies, because there are not enough CFUs, or because the interconnect is not rich enough to make the required data transfers. The Iteration Interval (II) of a schedule is defined as the number of cycles between the initiation of consecutive iterations. In Figure 3, II=1. In this example, the instruction executed by each CFU does not change over time; thus, one context is sufficient for storing the configuration of each multiplexer and the functional unit control bits. In general, an architecture with II contexts is required to implement a schedule [16], where a context is a complete set of configuration bits needed to store the values of all multiplexer and functional unit select lines in the fabric. DRESC begins the scheduling task by attempting to schedule the dataflow graph on an architecture with II=1. If this is not successful, it increases II by one and repeats until a successful mapping is found. In this way, DRESC finds the smallest value of II that can be used to implement the dataflow graph.
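The search over II described above amounts to a simple loop around the scheduler. The following Python sketch assumes a hypothetical callback `try_schedule(ii)` that returns a mapping or `None`; it is not the DRESC interface. The toy stand-in used here checks only whether there are enough CFU slots, and ignores data dependencies and the interconnect, both of which the real scheduler must respect.

```python
from typing import Callable, Optional, Tuple


def find_min_ii(try_schedule: Callable[[int], Optional[object]],
                max_ii: int = 64) -> Tuple[int, object]:
    """Try II = 1, 2, ... and return the first (ii, mapping) that succeeds."""
    for ii in range(1, max_ii + 1):
        mapping = try_schedule(ii)
        if mapping is not None:
            return ii, mapping
    raise RuntimeError("no valid schedule found up to max_ii")


def make_toy_scheduler(num_ops: int, num_cfus: int = 16):
    """Hypothetical stand-in for the scheduler: resource check only."""
    def try_schedule(ii: int) -> Optional[str]:
        # With II contexts, each CFU offers II operation slots per iteration.
        return "mapped" if num_ops <= num_cfus * ii else None
    return try_schedule


ii_small, _ = find_min_ii(make_toy_scheduler(num_ops=18))
ii_large, _ = find_min_ii(make_toy_scheduler(num_ops=184))
print(ii_small, ii_large)
```

Because the toy check omits dependency and routing constraints, the values it returns are lower bounds on what a real mapper could achieve on a 16-CFU fabric.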
c) Interaction between Iteration Interval, Execution time, and Routing Architecture
Intuitively, the more flexible the routing architecture, the smaller the value of II for which DRESC will find a valid mapping solution. One of the constraints DRESC must respect when scheduling is the available interconnect; these constraints are less onerous if the interconnect architecture is more flexible. A smaller value of II will lead to a smaller chip area, since the number of contexts that must be present on the chip is equal to II. On the other hand, this must be balanced with the fact that a more flexible interconnect will require more chip area.
The flexibility of the interconnect can affect the execution time of algorithms in two ways. A more flexible interconnect will likely be slower, since larger routing multiplexers and potentially longer wires are required. This effect will be small, however, since, in a coarse-grained array, the cycle time of the CFU's themselves is significantly larger than the interconnect delay (this is different from fine-grained FPGA's, where the interconnect delay dominates). A more important impact on speed is that, as the flexibility of the interconnect is increased, the achievable II decreases. From the example of Figure 3, it is clear that the smaller the value of II, the higher the throughput of the program running on the architecture. Thus, a more flexible routing architecture will tend to lead to faster execution time.
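The link between II and throughput can be made explicit with the usual software-pipelining relation, in which a new iteration is initiated every II cycles. The sketch below is illustrative only; the function names and the schedule length of a single iteration are assumptions, not quantities reported in this paper.

```python
def total_cycles(schedule_length: int, ii: int, num_iterations: int) -> int:
    """Cycles to run a pipelined loop: the last iteration starts
    (num_iterations - 1) * ii cycles in and takes schedule_length cycles."""
    return schedule_length + (num_iterations - 1) * ii


def average_ipc(ops_per_iteration: int, schedule_length: int,
                ii: int, num_iterations: int) -> float:
    """Operations executed per cycle over the whole loop."""
    cycles = total_cycles(schedule_length, ii, num_iterations)
    return ops_per_iteration * num_iterations / cycles


def steady_state_ipc(ops_per_iteration: int, ii: int) -> float:
    """Limit of average_ipc for a long-running loop."""
    return ops_per_iteration / ii


# Halving II doubles the steady-state throughput of the same loop body.
print(steady_state_ipc(18, 2), steady_state_ipc(18, 1))
```

For loops with many iterations the fill and drain cycles become negligible, so the throughput is governed almost entirely by II.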
Candidate Interconnect Architectures
The connections between the outputs of each functional unit and the inputs of other functional units can be implemented using direct links, segmented interconnect schemes (as in an FPGA) [13], or using a more complex network [18]. In this paper, we limit ourselves to point-to-point direct links; that is, if the output of functional unit x can drive the input of functional unit y, there is a dedicated, unshared wire between functional unit x and functional unit y. Each functional unit can drive multiple destinations. A multiplexer is used for each data input to select a signal to pass on to the functional unit (see Figure 1(b)). The investigation of the effectiveness of non-point-to-point networks (where some segments within the network are shared) is left for future work.
A direct point-to-point interconnect architecture can be characterized using two parameters. The first parameter is the flexibility of the architecture. We quantify the flexibility as the number of choices for each functional unit input. In this paper, we denote this quantity $F_{input}$.
In Figure 1( [...] The second parameter is the topology of the interconnect. Given that each input multiplexer can select one of $F_{input}$ sources, the topology dictates which $F_{input}$ signals (from among all the outputs of the other functional units) can be selected.
We have identified several interconnect topologies, as described below. Each topology can be thought of as a family of architectures; each member of the family has a different value of $F_{input}$.
In the first topology, which we call "Closest", each functional unit can be driven by the outputs of the $F_{input}-1$ "closest" functional units.
An example is shown graphically in Figure 4(a). In this example, each input of the shaded functional unit can be driven by one of two sources. Since each input can also be driven by the output of the shaded functional unit itself, $F_{input}=3$ in this example.
For other values of $F_{input}$, Figure 4(b) shows the general pattern for the inputs to the shaded CFU. In the figure, each functional unit i is labeled with a label $l_i$. An architecture with a given value of $F_{input}$ can be constructed by connecting the output of all CFU's with label $l_i \le F_{input}-1$ to one of the inputs of the shaded CFU. This is repeated for each CFU, following the same pattern shown in Figure 4(b), with connections "wrapping around" the top/bottom and edges. In this way, we can construct a "Closest" topology interconnect for any value of $F_{input}$.
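A construction of this kind can be sketched as follows. The labeling of Figure 4(b) is not reproduced here, so this sketch substitutes an assumed ordering: neighbours are ranked by wrap-around Manhattan distance on the 4x4 array, with ties broken by grid coordinate. The actual labels in the figure may order the CFU's differently, and the function names are hypothetical.

```python
ROWS, COLS = 4, 4


def torus_distance(a, b):
    """Manhattan distance between two (row, col) positions with wrap-around."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, ROWS - dr) + min(dc, COLS - dc)


def closest_sources(target, f_input):
    """Sources selectable by each input of `target`: the CFU itself plus
    its f_input - 1 nearest neighbours (assumed ranking, see lead-in)."""
    others = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) != target]
    others.sort(key=lambda cfu: (torus_distance(target, cfu), cfu))
    return [target] + others[:f_input - 1]


# Build the source list for every CFU in the array for F_input = 5.
topology = {(r, c): closest_sources((r, c), 5)
            for r in range(ROWS) for c in range(COLS)}
print(topology[(0, 0)])
```

Applying the same ranking to every CFU gives a regular pattern in which each input multiplexer has exactly the requested number of sources, including the wrap-around connections at the array boundary.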
The "Clique", "Directional", and "Heterogeneous" topologies are shown in Figure 5. For each topology, an interconnect with any value of $F_{input}$ can be constructed, as described above.
The "Clique" architecture is a generalization of the MorphoSys interconnect, in which each CFU can be connected to all CFUs in the same row or column [3]. In the "Directional" topology, each CFU can be driven by CFU's in previous rows (wrapping around at the top). The "Heterogeneous" topology takes advantage of the fact that the CFU's in the third column can perform multiplication; this topology gives preference to connections to multipliers when $F_{input}$ is low.
In Section 4, we will compare these topologies to each other, as well as to a "full" interconnect, in which every CFU output can be connected to every CFU input.
Experimental Comparisons
In this section, we experimentally compare the topologies described in Section 3, as well as seek the optimum value for $F_{input}$.
We first consider the impact of the interconnect architecture on the number of contexts required to implement a kernel, and the instructions per cycle achievable by each fabric on each benchmark. Then, we show the impact of the number of contexts and $F_{input}$ on the area required to implement the fabric. Finally, we combine these results to find the best topology and value of $F_{input}$.
In all of these experiments, we used ten benchmark kernels derived from the C reference code of TI's DSP and Mediabench benchmarks. Each benchmark kernel is a single loop containing between 18 and 184 operations per iteration.
a) Impact of Interconnect on the Number of Contexts and Instructions per Cycle
For each topology, and each value of $F_{input}$, we used DRESC to find the minimum number of contexts (II) required to implement each of our ten benchmark kernels. Figure 6(a) shows how II is affected by $F_{input}$ for the "Closest" topology. Each line in the graph represents one benchmark kernel. As expected, in general, a higher value of $F_{input}$ (a more flexible interconnect) means fewer contexts are required, since the mapper has fewer constraints on where operations can be placed on the array. Figure 6(b) shows the results averaged over all benchmark kernels for all five topologies (for the "full" topology, the parameter $F_{input}$ is not relevant, so a solid line is shown). From this graph, it is clear that the number of contexts drops as $F_{input}$ increases for all topologies, and that the number of contexts approaches the minimum number that can be obtained by a full interconnect as $F_{input}$ grows beyond four or five. [...] the value of II, these results follow those in Figure 6 closely. Again, it is clear that as $F_{input}$ increases beyond approximately four, the IPC approaches what would be achieved with a full interconnect.
b) Impact of II and $F_{input}$ on Chip Area
Intuitively, as II and $F_{input}$ increase, the chip area required by the reconfigurable fabric increases. Figure 8 shows this graphically. For each value of $F_{input}$ and II, we synthesized a VHDL description of the architecture with Synopsys (using a 0.18 µm TSMC process), and measured the area (previous work has shown that the post-synthesis area estimate correlates well with the post-layout area that would be obtained had the synthesized circuit been taken through physical design [19]). In gathering the results in Figure 8, the topology is immaterial; for a given value of $F_{input}$, all topologies will require the same post-synthesis area. This may not be true if the architecture was laid out by hand, since longer wires may consume more chip area; however, this would likely be a second-order effect which would not impact our results significantly.
c) Overall Results
By dividing the IPC measurements by the area estimates, we obtain the average number of instructions per cycle per mm² for each benchmark kernel on each architecture. The results, averaged for all kernels, are shown in Figure 9. As can be seen, the optimal value of $F_{input}$ for the "Closest", "Heterogeneous", and "Directional" topologies is between four and six. Beyond that, the extra area required to implement the interconnect is not made up for by an increase in IPC. Below the optimum value of $F_{input}$, the IPC/mm² metric drops off quickly; for these low values of flexibility, DRESC is unable to find efficient implementations of each kernel, and thus must compensate by increasing the number of contexts (II).
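The metric itself is a simple ratio, and choosing the best flexibility is a matter of taking its maximum. The sketch below shows the computation only; the IPC and area numbers are invented placeholders chosen for illustration and are not measurements from Figures 7, 8, or 9, and the function name is hypothetical.

```python
def best_flexibility(ipc_by_f, area_by_f):
    """Return (best F_input, {F_input: IPC per unit area}).

    ipc_by_f and area_by_f map each candidate F_input to the average IPC
    and the fabric area (in mm^2) obtained for that value.
    """
    scores = {f: ipc_by_f[f] / area_by_f[f] for f in ipc_by_f}
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder inputs (NOT data from the paper): IPC saturates as F_input
# grows, while area keeps increasing.
placeholder_ipc = {2: 4.0, 4: 8.0, 8: 9.0}
placeholder_area_mm2 = {2: 10.0, 4: 12.0, 8: 20.0}

best, scores = best_flexibility(placeholder_ipc, placeholder_area_mm2)
print(best, scores)
```

Whenever IPC saturates while area continues to grow, the ratio peaks at an intermediate flexibility, which is the qualitative behaviour described above.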
For all architectures, once $F_{input}$ rises above 3, the partially interconnected fabric is significantly better than the fully-connected fabric. From Figures 4 and 5, it may seem that the fully-connected fabric should be very similar to a fabric in which $F_{input}$ is large (six or seven). The reason that this is not the case is that, in the fully-connected architecture, each CFU output is connected to both inputs of every CFU; thus, $F_{input} = 16$. This suggests that it is usually unnecessary to connect a CFU output to both CFU inputs.
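To put the $F_{input} = 16$ figure in perspective, the following sketch counts the select bits needed by the data-input multiplexers alone, assuming binary-encoded select lines, 16 CFU's, and two data inputs per CFU. It deliberately ignores the predicate input, the function select, and the register file control bits, so it is a rough illustration rather than the area model used in this paper.

```python
NUM_CFUS = 16            # 4x4 array
DATA_INPUTS_PER_CFU = 2  # two data inputs per CFU


def select_bits(f_input: int) -> int:
    """Select bits for a multiplexer with f_input sources
    (binary encoding assumed), i.e. ceil(log2(f_input))."""
    return (f_input - 1).bit_length()


def input_mux_config_bits(f_input: int, ii: int) -> int:
    """Data-input multiplexer select bits stored across all II contexts."""
    return NUM_CFUS * DATA_INPUTS_PER_CFU * select_bits(f_input) * ii


for f in (4, 6, 16):
    print(f, select_bits(f), input_mux_config_bits(f, ii=1))
```

Under these assumptions a fully-connected fabric needs twice as many input select bits per context as a fabric with four sources per input, in addition to the much wider multiplexers themselves.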
From Figure 9 , we can also see that the "Heterogeneous", "Closest", and "Directional" topologies all show similar results. It is not meaningful to rank these topologies based on these results. The Tlique" architecture, however, is different. For F , , , , = 4 , the average IPC/nnn' is much lower than that for the other topologies. This point corresponds to an architecture in which each CFU can be driven hy all CFUs in the same row or column. One reason that this architecture performs worse than the other F,,,,=4 architectures is that, as described in Section 2, we assume that a single row of CFU's contain multipliers, while all other CFU's do not contain multipliers. In the "Clique" topology, all CFU's in a given row are tightly connected to each other, at the expense of connections between the multipliers and other CFU's. This has a negative impact on the ability of DRESC to find a valid schedule. To verify this. we modified the architecture such that the multipliers are "staggeres' along the diagonal of the array (this is equivalent to creating an interconnect pattern similar to "Clique" except that the vertical connections nm diagonally). The results obtained from this architecture for F,",,,=4 were close to those obtained from the other three topologies.
Conclusions
In this paper, we compared four different topologies for the interconnect architecture in a coarse-grained reconfigurable array. We found that the ability of a compiler to exploit loop-level parallelism (software pipelining) on such a device is significantly affected by the amount of flexibility in the interconnect architecture. A less flexible interconnect will make it more difficult for the compiler to find efficient loop-level pipelined schedules, leading to reduced instruction throughput and larger configuration bit storage area. Of our four topologies, we found that three gave very similar results. The fourth suffered in that it did not connect the multipliers to the rest of the array efficiently. For all topologies, as the number of choices for each functional unit input grew larger than four or five, the overall performance per unit area of the architecture was significantly better than that which would be obtained if a fully-connected interconnect had been used.
