This paper describes several system-level interconnection strategies for a coarse-grained reconfigurable fabric designed for low-energy hardware acceleration. A small, representative sub-graph for signal and image processing applications is used to predict the success of mapping larger applications onto the fabric device with these different interconnection strategies, which include 32:1, 8:1, 5:1, 4:1, 3553:1 (3:1, 5:1, 5:1, 3:1) and 355:1 (3:1, 5:1, 5:1) cardinalities. Three mapping techniques are presented and used to complete mappings onto several of these fabric instances: a mixed integer linear programming technique, a constraint programming approach, and a greedy heuristic. We present results for area (in number of required rows), power, delay, and energy, as well as run times, for mapping a set of signal and image processing benchmarks onto each of these interconnects. Our results indicate that the 5:1 interconnect provides the best overall results and requires no more hardware resources than the baseline 4:1 technique. When compared with other implementation strategies, the energy consumption of the reconfigurable fabric using the 5:1-based interconnect is within 5-10X of a direct ASIC implementation, is 10X better than a Virtex II Pro FPGA, and is 100X better than an Intel XScale processor.
Introduction
Reconfigurable devices mitigate many of the problems encountered with the development of Application Specific Integrated Circuits (ASICs) for hardware acceleration. For example, reconfigurable devices amortize the rapidly increasing mask and non-recurring engineering (NRE) costs over many more generic devices. Computer Aided Design (CAD) flows are often simplified for these devices. Thus, the design cycle is much reduced, which can significantly decrease the time to market. The tradeoff for using these reconfigurable devices is a compromise in performance and most notably power/energy consumption. To reduce the overhead of using a reconfigurable device, particular care must be given to designing the system-level interconnect of the device. It is the interconnect that largely determines the power and performance characteristics of the device (see Section 2.1).
This paper describes a system-level interconnect prediction strategy for the SuperCISC low-energy reconfigurable fabric target. In previous work, the fabric architecture was developed based on multi-bit functional units and multi-bit routing structures configured through multiplexers [13]. As part of the previous work, an architectural design space exploration was completed that determined parameters for the functional units and routing. However, the structure of the routing was considered only briefly. In this paper, a commonly recurring graph structure from applications in the multimedia and signal processing domains that is difficult to map onto the fabric is used to drive the interconnect strategy. This graphical structure is applied to several interconnection strategies within the fabric architecture to help determine their viability for larger applications.
The remainder of this paper is organized as follows: Section 2 provides some background material on the SuperCISC project and the mapping concept including a study of related work within these areas. The system-level interconnect architectures and evaluation strategies are described in detail in Section 3. Section 4 presents the three mapping strategies employed in this work. Power, delay, and energy results are presented in Section 5. Section 6 discusses several conclusions and considers future work.
2 Background and Literature Review
2.1 System Overview
While FPGAs are the most commonly used general purpose reconfigurable fabrics, they exhibit poor power characteristics. The dynamic power consumption in FPGAs has been shown to be dominated by interconnect power. For example, the reconfigurable interconnect in the Xilinx Virtex II FPGA consumes more than 70% of the total power dissipated in the device [16]. Contributing to the static power consumption of FPGAs are the SRAMs for programming the state of the device. This is exacerbated by the necessity of bit-level control for the computational and switch blocks. Additionally, because the device is designed to handle sequential logic, clock trees and storage registers are required, which also contribute to power consumption. Thus, to create a low-power computational fabric, it is desirable to remove or reduce as many of these power consuming characteristics as possible.
The SuperCISC low-power fabric was designed to operate within the SuperCISC processor architecture summarized in [10]. The idea is to accelerate the high-incidence code segments (e.g., loops) that account for large portions of the application runtime, called kernels, while assigning the control-intensive portion of the code to a core processor. These kernels are converted into entirely combinational hardware functions generated automatically from C using a design automation flow [9]. Using hardware predication, a Control Data Flow Graph (CDFG) can be converted into a Super Data Flow Graph (SDFG) [9]. SDFG-based hardware functions operate asynchronously from the processor core. Removing the sequential logic also makes the hardware fabric for implementing the SDFG much simpler.
Certain assumptions of the SuperCISC flow, such as entirely combinational hardware and source computation in C, which implies 8-, 16-, and 32-bit operation granularities, make it possible to reduce the high-power characteristics of an FPGA and obtain a more efficient reconfigurable device. Moving to multi-bit functional units significantly reduces routing complexity and leads to a lower power device. Removal of sequential logic eliminates clock trees and local storage, which also contributes to power reduction. Thus, the interconnect prediction in Section 3 begins with these assumptions. Unlike MATRIX [14], whose basic functional unit consists of an 8-bit ALU and an SRAM, the basic functional unit in our fabric is a coarse-grained ALU having variable data width. However, for this paper we fix the ALU data width at 32 bits. Our approach differs from GARP [8] insomuch as we tailor the hardware co-processor to the application domain. Compared to RaPiD [4], which has smaller RAMs and registers to store data and intermediate results, our fabric is purely combinational. The programmable connections in the data-path interconnect in our fabric are modeled as multiplexers somewhat similar to those in RaPiD. Unlike RAP [5], whose ALUs are arranged in a chess board style, the fabric model used in our research has a striped configuration like that of PipeRench [15] but without register files.
Reconfigurable Fabric Target
Several methods have been proposed in the past few years for design space exploration of reconfigurable architectures [1, 2, 6]. However, these methods are either too technology-dependent or too architecture-dependent. Bossuet et al. [3] proposed a design space exploration method that can be used to cover a wide domain of reconfigurable fabrics. They used the architectural processing use rate and the communication hierarchical distribution as metrics to investigate a power-efficient architecture. In contrast, our work studies the impact of varying the interconnect strategy based on a representative sub-graph to reduce the power/delay of coarse-grained reconfigurable architectures.
Mapping Problem Statement
In order to use our fabric with a given benchmark circuit, it is necessary to map the circuit onto the fabric. Such a mapping consists of an assignment of operators in the circuit to ALUs of the fabric such that the logical structure of the circuit is preserved and the parameters of the fabric are respected, particularly the width, height, and interconnect design. This mapping problem is central to the use of the fabric, and we consider it in several forms described as follows. Our approaches for solving the mapping problems appear later in Section 4. All of the problems assume a fixed fabric width and interconnect design.
Minimum Size Mapping: In order to reduce power consumption, it is desirable to use as few rows in the fabric as possible. Given a fabric width, fabric interconnect design, and circuit to be mapped, the Minimum Size Mapping problem is to find a mapping that uses the minimum number of rows in the fabric. The mapping may use pass gates as necessary.
Feasible Mapping: Given a fully-specified fabric and a circuit, Feasible Mapping is the problem of identifying any mapping that preserves the logical structure of the circuit and respects the fabric parameters. Note that for a given height, width, and interconnect design, some circuits may have no feasible mapping.
Feasible Mapping with Fixed Rows: One of the more complicated parts of creating a mapping is the introduction of pass gates to fit the row-wise structure of the fabric. A successful approach that we have used is to work in two stages. In the first stage, pass gates are introduced heuristically and operators are assigned to rows so that all edges go from one row to the next. The second stage assigns the operators to columns so that the fabric interconnect is respected. This second stage is called Feasible Mapping with Fixed Rows. Note that depending on the interconnect design, there may or may not exist such a feasible mapping.
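The first stage can be illustrated with a minimal Python sketch. It assumes an ASAP (longest-path) row assignment, which is only one possible heuristic and is not necessarily the one used in this work. The function name `levelize_with_pass_gates` and the `pass_` labels are hypothetical, and each long edge receives its own chain of pass gates with no sharing.

```python
from collections import defaultdict

def levelize_with_pass_gates(nodes, edges):
    """Assign operators to rows and insert pass gates on long edges.

    nodes: operator names in topological order (an assumption of this sketch).
    edges: list of (src, dst) pairs.
    Returns (row_of, new_edges); every edge in new_edges spans exactly one row.
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    # ASAP rows: an operator sits one row below its deepest predecessor.
    row_of = {}
    for n in nodes:
        row_of[n] = 1 + max((row_of[p] for p in preds[n]), default=-1)

    new_edges = []
    counter = 0
    for src, dst in edges:
        prev = src
        # Chain one pass gate through every intermediate row.
        for r in range(row_of[src] + 1, row_of[dst]):
            gate = f"pass_{counter}"
            counter += 1
            row_of[gate] = r
            new_edges.append((prev, gate))
            prev = gate
        new_edges.append((prev, dst))
    return row_of, new_edges
```

The output of this stage is the input to Feasible Mapping with Fixed Rows: rows are fixed, and only the column of each operator and pass gate remains to be chosen.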
Optimal Graded-Cost Mapping: In this problem, we assume that a full interconnect is available in the fabric, but we assign different costs to each ALU depending on what size of interconnect it needs for its inputs. Interconnects with larger fan-in are given a higher cost than those with lower fan-in, on a graded scale. The problem is to find a mapping with minimum total cost. Our reason for investigating this problem is to see how often large interconnects are needed and to gain insight into possible interconnect designs. We will solve this problem and show the results in Section 3.2.
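As a rough illustration of how a graded cost could be evaluated for one candidate placement, consider the following Python sketch. The tier boundaries assume a symmetric window around each ALU's own column, and the cost values are purely illustrative; neither is taken from the formulation used in this work.

```python
# Hypothetical tiers: (maximum column offset, mux label, cost).
# The cost values are illustrative assumptions only.
TIERS = [(2, "5:1", 1), (4, "9:1", 2), (8, "17:1", 4)]
WIDER_COST = 8  # assumed cost for anything wider than 17:1

def alu_cost(col, source_cols):
    """Cost of one ALU given the previous-row columns it reads from."""
    reach = max((abs(s - col) for s in source_cols), default=0)
    for max_offset, _label, cost in TIERS:
        if reach <= max_offset:
            return cost
    return WIDER_COST

def graded_cost(col_of, edges):
    """Total cost of a placement; col_of maps operator name to column."""
    inputs = {}
    for src, dst in edges:
        inputs.setdefault(dst, []).append(col_of[src])
    return sum(alu_cost(col_of[dst], srcs) for dst, srcs in inputs.items())
```

An optimizer would search over placements for the one minimizing `graded_cost`; the sketch only scores a given placement.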
3 System-level Interconnection
A fundamental problem that we explore in this paper is the Fabric Design Problem: Given a set of representative benchmarks, determine a good fabric design. This is a subjective problem in that it may involve tradeoffs between the interconnect design and the size of the fabric.
Benchmark Driven Interconnect
Architectural innovations are often developed in somewhat of a vacuum, without consideration of the tools needed to program the architectures and without consideration of the needs of the applications to run on the architectures. For a reconfigurable fabric such as the architecture described here, the temptation is to create an interconnect that has a regular structure. However, depending on the needs of the applications to be implemented, this regular structure may not be appropriate.
As described in Section 2.2, our reconfigurable fabric consists of a multiplexer-based interconnect. Based on our initial design space exploration studies, a 4:1 interconnect was selected to create enough flexibility in the routing while reducing power consumption and delay in the fabric [12, 13]. An example of this interconnection is shown in Figure 3. Each operand of the ALU has a multiplexer in this configuration, including the third operand for the predica-
The connectivity shown in Figure 6 is a compromise that allows an emulation of a 5:1 multiplexer without increasing the architectural complexity beyond 4:1 multiplexers.
In this case, the three internal ALUs, 1-3, are shared on both operands' multiplexers. The outermost ALUs, 1 and 5, are only available on the left and right operand multiplexer, respectively. The rationale for this is that if an operand is placed to the far left or right ALU, the other operand cannot occupy the same space, thus there is no conflict for the resource. The biggest limitation to this approach is that for non-commutative operations such as subtract, there is some restriction as to which operand may be retrieved from the far left or far right.
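The sharing scheme can be sketched in Python as follows. The offset sets are our reading of the description above (three shared inner sources, the far-left source wired only to the left operand, the far-right source wired only to the right operand) and are an assumption rather than a transcription of Figure 6. The function name `can_route` is hypothetical.

```python
# Column offsets (source column minus destination column) reachable by each
# 4:1 operand multiplexer, as assumed in this sketch.
LEFT_OFFSETS = {-2, -1, 0, 1}
RIGHT_OFFSETS = {-1, 0, 1, 2}

def can_route(dst_col, left_src_col, right_src_col, commutative):
    """Return True if both operands can reach the ALU at dst_col."""
    def fits(a, b):
        return (a - dst_col) in LEFT_OFFSETS and (b - dst_col) in RIGHT_OFFSETS

    if fits(left_src_col, right_src_col):
        return True
    # Commutative operators may swap operands to find a legal routing.
    return commutative and fits(right_src_col, left_src_col)
```

For example, under these assumptions a subtract at column 2 whose left operand comes from column 4 and right operand from column 0 cannot be routed, whereas an add with the same sources can, because its operands may be swapped.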
As shown in Figure 7, it is relatively easy to embed the graph from Figure 4(a) into the connectivity provided by
While we have demonstrated that our 5:1 multiplexer is an effective interconnect for signal and image processing applications, it may be possible to further optimize this strategy. We mapped our benchmark set to a fabric with fully interconnected stripes (i.e., any ALU in the previous stripe can be connected to any ALU in the current stripe).
Using a mixed-integer linear programming (MILP) formulation that applied an increasing penalty to the use of 5:1, 9:1, and 17:1 or larger routes, we generated the multiplexer usage statistics in Figure 8. The IP technique eliminated the need for nearly all multiplexers larger than 5:1 in most cases.
Consider the goal of replacing one-third of the 5:1 multiplexers with 3:1 multiplexers, built from mirrored 2:1 multiplexers. Now consider a structure in which two adjacent 5:1 multiplexers are followed by a 3:1 multiplexer, and this pattern is repeated. We call this a 355:1 interconnect. The graph from Figure 4(a) can be directly mapped.
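The repeating patterns can be made concrete with a small Python sketch that assigns a multiplexer fan-in to each column of a stripe. The pattern tuples follow the cardinalities listed in the abstract. The starting phase of the pattern at column 0 and the symmetric reach of each multiplexer are assumptions of this sketch.

```python
from itertools import cycle, islice

# Repeating fan-in patterns; where the pattern starts is an assumption.
PATTERNS = {"355:1": (3, 5, 5), "3553:1": (3, 5, 5, 3)}

def column_fan_in(pattern_name, width):
    """Fan-in of the multiplexer at each column of a stripe."""
    return list(islice(cycle(PATTERNS[pattern_name]), width))

def reachable_offsets(fan_in):
    """Column offsets a multiplexer can select, assuming a symmetric window."""
    half = fan_in // 2
    return range(-half, half + 1)
```

With a fabric width that is a multiple of three, `column_fan_in("355:1", width)` yields exactly one 3:1 multiplexer for every two 5:1 multiplexers, matching the one-third replacement goal.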
4 Mapping Strategies
As discussed in Section 2.4, a central problem in working with the architecture we discuss here is that of mapping a circuit onto the fabric. In this section we discuss the approaches we used to solve the various forms of the mapping problem.
Mixed Integer Linear Program
MILP is a common modeling and solution technique for combinatorial optimization. We used MILP to solve Feasible Mapping with Fixed Rows as well as for Optimal Graded-Cost Mapping. The objective function used in MILP is to minimize the number of required edges that are outside the interconnect design. If the MILP formulation finds a solution with objective value zero, then the solution is a valid mapping for the given interconnect design. If the optimal objective value is greater than zero, then there is no feasible mapping for that interconnect design. We used CPLEX 9.0 to solve the MILP. The details of the IP formulation can be found in [11].
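The objective can be illustrated with a Python sketch that replaces the MILP solver with exhaustive search, which is practical only for very small instances. It is not the formulation of [11]. The interconnect is assumed to be a symmetric window of `max_offset` columns, every edge is assumed to connect adjacent rows, and all function names are hypothetical.

```python
from itertools import permutations, product

def violations(col_of, edges, max_offset):
    """Count edges whose column offset falls outside the interconnect window."""
    return sum(abs(col_of[s] - col_of[d]) > max_offset for s, d in edges)

def best_column_assignment(rows, edges, width, max_offset):
    """Exhaustively search column assignments with rows held fixed.

    rows: list of lists of operator names, one list per fabric row.
    Returns (minimum violation count, column assignment achieving it).
    """
    per_row = [list(permutations(range(width), len(ops))) for ops in rows]
    best_cost, best_cols = None, None
    for choice in product(*per_row):
        col_of = {op: c
                  for ops, cols in zip(rows, choice)
                  for op, c in zip(ops, cols)}
        cost = violations(col_of, edges, max_offset)
        if best_cost is None or cost < best_cost:
            best_cost, best_cols = cost, col_of
            if cost == 0:
                break  # objective value zero: a valid mapping was found
    return best_cost, best_cols
```

As in the MILP, a returned cost of zero corresponds to a valid mapping, and a positive minimum means that no feasible mapping exists for that window.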
Greedy Heuristic
The Greedy Heuristic Mapper follows a top-down mapping approach to provide a Feasible Mapping for any given benchmark. Starting with the top row, it completely places each individual row using a limited look-ahead of two rows. After each row is mapped, the mapper will not modify the mapping of any portion of that row. While the limited information available to the mapper does not often allow it to produce Optimal Mappings or Minimum-Size Mappings, its relative simplicity provides a decidedly short runtime. By default it tries to map the given benchmark to a fabric with width equal to the largest individual row, and height equal to the longest path through the graph. Although the width is static throughout a single mapping, the height can increase as needed. The details can be found in [11].
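The default fabric dimensions described above can be estimated with a short Python sketch. It assumes that rows are obtained by a longest-path levelization of a non-empty graph and it ignores any pass gates, so it may not match the exact procedure of the mapper in [11]. The function name `default_fabric_size` is hypothetical.

```python
from collections import Counter, defaultdict

def default_fabric_size(nodes, edges):
    """Return (width, height) for a graph given in topological order."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    # Level of an operator = length of the longest path reaching it.
    level = {}
    for n in nodes:
        level[n] = 1 + max((level[p] for p in preds[n]), default=-1)

    per_row = Counter(level.values())
    width = max(per_row.values())     # largest individual row
    height = max(level.values()) + 1  # rows along the longest path
    return width, height
```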
Results
In order to evaluate power and performance, a set of core signal processing benchmarks was selected from the MediaBench benchmark suite, including the ADPCM encoder (enc), ADPCM decoder (dec), GSM channel encoder (gsm), and the MPEG II decoder (row, col). We added the Sobel (sob) and Laplace (lap) edge detection algorithms to the benchmark suite. Using the SuperCISC compilation flow [9], computational kernels were extracted for these applications and converted into SDFGs, which we used as our benchmark circuits. These SDFGs were then mapped to the fabric model using the IP program, constraint program, and greedy heuristic as described in Section 4. We targeted fabric hardware using 32:1, 8:1, and 4:1 multiplexers. We also targeted 5:1, 3553:1 and 355:1 multiplexer-based interconnects as described in Section 3. A detailed study of varying other parameters of the fabric, such as the bit-width of the ALUs, can be found in [13]. Table 1 provides a summary of the area requirements of the benchmarks mapped to the fabric using the different multiplexing cardinalities and the previously mentioned mapping strategies. Table 2 summarizes the runtimes required by the mapping algorithms for each multiplexing cardinality. Items marked '-' could not be mapped. Once all benchmarks were mapped using a specific mapping technique to a fabric with a particular interconnect, the fabric size was fixed to the smallest size that could fit all seven benchmarks.
In the subsequent sections we study the power, delay, and energy impact of varying the interconnect strategy within the fabric. To calculate the power and delay of the design, the fabric was synthesized into an Oki cell-based ASIC design with a feature size of 0.16 µm using Synopsys Design Compiler. The post-synthesis design was simulated in Mentor Graphics ModelSim to calculate the delay of each design and these simulations were used as stimulus to the Synopsys PrimePower tool to estimate the power consumption of the device. Energy was calculated by computing the product of the power and delay of the design.
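The energy calculation reduces to a single product, shown here as a Python sketch. The units (milliwatts and nanoseconds, giving picojoules) are an assumption made for illustration, and the numbers in the usage note are not results from this paper.

```python
def energy_pj(power_mw, delay_ns):
    """Energy in picojoules from power in milliwatts and delay in nanoseconds.

    1 mW * 1 ns = 1e-3 W * 1e-9 s = 1e-12 J = 1 pJ, so no scaling is needed.
    """
    return power_mw * delay_ns
```

For instance, a hypothetical design dissipating 100 mW with a 20 ns delay would consume `energy_pj(100.0, 20.0)` = 2000 pJ per evaluation.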
8:1 mappings
The size of the 8:1 multiplexer-based interconnection fabric was set to 20x18, as each of the three mapping strategies was able to map all seven benchmarks within a fabric of this size. Table 3 summarizes the power, delay, and energy results for mapping to this particular fabric instance. In most cases, the results are fairly consistent across all mappers.
5:1 mappings
While the greedy heuristic performs reasonably well for 8:1 mappings, it does not perform as well for 5:1 mappings. The first indication is that, while the constraint programming and IP approaches require no increase over the 20x18 fabric used in the 8:1 case, the greedy heuristic requires a 20x19 device. This leads to an increase in power and delay. It should be noted that the improvement offered by the constraint programming and IP approaches comes at a cost of, at best, a 10X increase in mapping time. A summary of all power, delay, and energy results is shown in Table 4.
Table 5. Because the IP and constraint programming solutions were unable to map several instances, a fixed-size device could not be established and thus the power, delay, and energy results cannot be included. The results show that mappings to these interconnects were difficult, and that all mappers required some compromise. The greedy heuristic was forced to add so many rows that they counteracted the savings due to the simplified interconnect. Figure 10 shows a comparison of the best energy results for each type of interconnect, be it from the constraint, IP, or greedy mapping approach. Based on these results, the 5:1 interconnect is clearly the best overall solution for energy.
Fabric Technology Comparison
To provide some context, we compare the "best" fabric architecture, which uses the 5:1 multiplexing interconnect, with implementations of the design on other digital hardware technologies, as shown in Figure 11. Delay of the FPGA-based design was computed using post place-and-route simulation in ModelSim, and power was estimated in Xilinx XPower using the results of the delay simulations. The delay of the XScale processor was calculated using the SimpleScalar ARM simulator and the XTREM [7] tool to estimate the power con-
Table 4. Power, delay, and energy results for a 5:1 multiplexer-based interconnect.
Our planned future work is to further investigate different interconnect structure patterns for reconfigurable architectures. We also plan to improve the mapping techniques and, by doing so, expect to further improve power/performance results.
