Abstract-This paper presents a parameterizable, coarse-grained, reconfigurable fabric model that attempts to maintain [...] performance due to the variation of different parameters such as datawidth and interconnection flexibility has been studied.

1-4244-0054-6/06/$20.00 ©2006 IEEE

I. INTRODUCTION

Hardware acceleration using Field Programmable Gate Arrays (FPGAs) has become increasingly popular for computationally intensive Digital Signal Processing (DSP) applications. Unfortunately, while FPGAs have a reasonably tractable Computer Aided Design (CAD) flow and performance, they have poor power characteristics when compared to direct Application Specific Integrated Circuit (ASIC) fabrication. However, ASICs require a more complex CAD flow than FPGAs and incur large Non-Recurring Engineering (NRE) costs.

A reconfigurable device that exhibits ASIC-like power qualities and FPGA-like costs and tool support is desirable to fill this void. Several coarse-grained fabric architectures have been proposed during the last decade, such as MATRIX [1], RaPiD [2], and PipeRench [3]. Most of these have focused on performance and area-efficient architectural techniques; reduced power consumption is the notable omission.

This paper proposes an approach to reconfigurable fabric architectural space exploration with an emphasis on both performance and energy efficiency. A parameterizable reconfigurable fabric model is presented that allows design parameters to be adjusted within the architecture. The impact of varying different design parameters, such as the width of functional units and the granularity of interconnect, is studied for its implications on power and performance.

The remainder of this paper is organized as follows: Section II describes relevant previous work, concentrating particularly on the design space exploration of reconfigurable computational fabrics. The fabric architecture and configurable parameters are described in Section III. Results for design space exploration are presented in Section IV. Conclusions and future work are discussed in Section V.

[...] [4][5][6]. However, these methods are either too technology-dependent or too architecture-dependent. Due to this drawback [...] heterogeneous fabrics. They used the architectural processing use rate and the communication hierarchical distribution as metrics to investigate a power-efficient architecture. However, our work is focused on the study of the impact of varying different design parameters, such as the data width of the basic functional units and the granularity of interconnect, on power as well as performance of the reconfigurable architectures.

The proposed low-power fabric was designed to operate within the SuperCISC processor architecture. The SuperCISC processor was developed with a 4-way very long instruction word (VLIW) core with a shared register file [8]. The idea is to accelerate the high incidence code segments (e.g. loops) that require large portions of the application runtime, called kernels, while also accelerating the remaining non-kernel code with the VLIW. These kernels are converted into entirely combinational hardware functions generated automatically from C using a design automation flow [8]. In order to create a combinational hardware function, a technique called hardware predication is employed to remove the need for sequential logic. As a result, the Super Data Flow Graph (SDFG) can be transformed into a combinational hardware implementation.

III. PARAMETERIZED RECONFIGURABLE FABRIC MODEL

SDFGs retain a data flow structure, allowing computational results to be computed in one ALU and flow onto others in the system. The proposed reconfigurable fabric model is designed to mimic this computational style. As shown in Figure 1, ALUs are organized into rows or computational stripes within which each functional unit operates independently. The results of these ALU operations are then fed into interconnection stripes constructed using multiplexers. The fabric model is [...]
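The stripe organization described above can be illustrated with a small behavioral sketch. This is not the authors' model: the class name `Fabric`, the operation set `OPS`, and the wrap-around routing window used by each multiplexer are all assumptions made for illustration. Only the two parameters, datawidth and multiplexer cardinality, and the alternation of ALU stripes with multiplexer-based interconnect stripes come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical operation set; the source does not list the ALU operations.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "pass": lambda a, b: a,
}


@dataclass
class Fabric:
    width: int        # datawidth of every functional unit, in bits
    cardinality: int  # number of inputs of each interconnect multiplexer

    def alu_stripe(self, ops: Sequence[str],
                   operands: Sequence[Tuple[int, int]]) -> List[int]:
        """Computational stripe: each ALU works independently on its own pair."""
        mask = (1 << self.width) - 1
        return [OPS[op](a, b) & mask for op, (a, b) in zip(ops, operands)]

    def interconnect_stripe(self, values: Sequence[int],
                            selects: Sequence[Tuple[int, int]]
                            ) -> List[Tuple[int, int]]:
        """Interconnect stripe: two multiplexers per ALU of the next row.

        Assumption: the multiplexers feeding ALU i can reach outputs
        i, i+1, ..., i+cardinality-1 of the previous row (wrapping around).
        """
        n = len(values)
        routed = []
        for i, (sel_a, sel_b) in enumerate(selects):
            for sel in (sel_a, sel_b):
                if not 0 <= sel < self.cardinality:
                    raise ValueError(f"select {sel} exceeds mux cardinality")
            routed.append((values[(i + sel_a) % n], values[(i + sel_b) % n]))
        return routed


# Two computational stripes joined by one interconnect stripe.
fabric = Fabric(width=8, cardinality=4)
row0 = fabric.alu_stripe(["add", "sub", "and", "or"],
                         [(200, 100), (5, 7), (0xF0, 0x3C), (1, 2)])
routed = fabric.interconnect_stripe(row0, [(0, 1), (0, 0), (1, 3), (3, 0)])
row1 = fabric.alu_stripe(["add", "pass", "or", "sub"], routed)
print(row0, row1)
```

Because every stripe is a pure function of the previous one, the whole evaluation is combinational in the sense used above: no state is carried between rows. Lowering `cardinality` shrinks the set of producers each ALU can reach, which is the flexibility trade-off the design space exploration varies.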
The Hardware bar corresponds to the power results for the blocks independently synthesized for each operation. The ALU bar corresponds to a synthesizable ALU built structurally from the Mentor Moduleware components. It executes each function in parallel and selects the result using a multiplexer after the computation completes. Finally, the Optimized ALU represents a low-power ALU in which latches are used at the input to each operation to avoid unnecessary switching of the rest of the hardware blocks when only a single operation is executed. The optimized ALUs are used as computational elements of the fabric.

The datawidth of each functional unit has a significant impact on power dissipation of the fabric. Thus, 8, 16 and 32-bit ALUs, which are candidates to be used as computational elements, have been power profiled for several ALU operations. Consider the dedicated width case (from left to right, the [...] decreasing to about 30% of the 32-bit version. There is a similar power trend between 16-bit and 8-bit operations.

Figure 6 describes the latency of each bit-width. While, as expected, the latency is lowest for the 8-bit ALU operations, the change is only a nominal decrease over a 16-bit ALU. Even compared to a 32-bit ALU, the delay improvement is less than 50% at the cost of 3/4 of the bandwidth for the computation.

The energy results of ALUs of different datawidths are shown in Figure 7. This chart includes the power consumption of using a 32-bit wide ALU to compute both 16-bit and 8-bit values in comparison to computing them directly on a 16-bit or 8-bit wide ALU. This was done to calculate the energy consumption overhead of a 32-bit ALU used for lower bit width operations.

Figure 7 legend: 32-bit ALU, 16-bit ALU, 32-bit ALU (16-bit), 8-bit ALU, 32-bit ALU (8-bit).

[...] using a 32-bit ALU for 16-bit operations as compared to a 16-bit ALU is 2.5X on the average. The same trend is observed if we compare the energy results of a 32-bit ALU used for 8-bit operations and an 8-bit ALU. However, when the multiplier is removed from consideration, the overheads are much lower. While it appears that the overhead actually decreases for some of the logic operations, this is unlikely, and the difference between the two calculations does not exceed the estimated inaccuracy from using the PrimePower [...]

[...] the group of 2:1, 4:1, [...] Figure 3, it is desirable to utilize 4:1 multiplexers wherever possible. Even though results suggested to use lower cardinality multiplexers, fixing the multiplexer cardinality was [...]

V. CONCLUSION AND FUTURE WORK

In this paper, we described a generic and parameterized fabric model that exhibits ASIC-like power characteristics and FPGA-like programmability and tool support. The design space exploration results suggested the use of power optimized 32-bit width computational elements interconnected by low cardinality multiplexers like 4:1 multiplexers.

Our planned future work is to develop a mapper that can handle limited cardinality routing. By doing so, we expect to further improve power and performance results.
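As a rough illustration of how the energy overhead of running narrow operations on a wide ALU could be derived from per-operation energy figures, the sketch below computes the wide-to-narrow ratio for each operation and averages it with and without the multiplier. The function names and all energy values are hypothetical placeholders in arbitrary units; they are not the paper's measurements and are not meant to reproduce the reported 2.5X figure.

```python
from statistics import mean
from typing import Dict, Iterable


def overhead_by_op(wide: Dict[str, float],
                   narrow: Dict[str, float]) -> Dict[str, float]:
    """Energy of the wide ALU divided by energy of the narrow ALU, per operation."""
    return {op: wide[op] / narrow[op] for op in wide if op in narrow}


def average_overhead(ratios: Dict[str, float],
                     exclude: Iterable[str] = ()) -> float:
    """Mean overhead over all operations not listed in `exclude`."""
    skipped = set(exclude)
    return mean(r for op, r in ratios.items() if op not in skipped)


# Placeholder energies (arbitrary units); NOT values from the paper.
e32_on_16bit_data = {"add": 3.0, "and": 1.0, "mul": 32.0}
e16 = {"add": 2.0, "and": 1.0, "mul": 8.0}

ratios = overhead_by_op(e32_on_16bit_data, e16)
print("all operations:    ", average_overhead(ratios))
print("without multiplier:", average_overhead(ratios, exclude={"mul"}))
```

With placeholder data of this shape, the multiplier dominates the average, which mirrors the qualitative observation that the overheads are much lower once the multiplier is removed from consideration.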
