ABSTRACT
INTRODUCTION
Reconfigurable devices mitigate many of the problems encountered with the development of Application Specific Integrated Circuits (ASICs) for hardware acceleration. For example, reconfigurable devices amortize the rapidly increasing mask and non-recurring engineering (NRE) costs over many more generic devices. Computer Aided Design (CAD) flows are often simplified for these devices. Thus, the design cycle is much reduced, which can significantly decrease the time to market.
The tradeoff for using these reconfigurable devices is a compromise in performance and most notably power/energy consumption. To reduce the energy consumption of a reconfigurable device, particular care must be given to designing both functional units and interconnect of the device.
Stripe-based fabrics in particular (e.g., see Figure 2) are quite promising due to their good fit to a data flow graph structure [5, 25, 8, 9]. When a data flow graph is mapped to a stripe-style structure, however, data dependency edges often traverse multiple rows. Mapping of a data flow graph onto a reconfigurable fabric is described in detail in Section 3.1. In these fabrics, arithmetic and logic units (ALUs) must often pass these values through without doing any computation. In other words, the ALUs function merely as pass-gates. It was observed for some of the signal and image processing applications, for example, that more than 50% of the functional units in the fabric were used for routing by configuring the ALU as a pass-gate, as shown in Figure 1 [8].
Figure 1: Comparison of ALUs used for routing and computation
However, these ALUs used as pass-gates are an area-inefficient and power-inefficient method for vertical routing. One alternative that has been studied is to use a simple routing structure that could only pass a value, i.e., a dedicated pass-gate. Using an ALU as a pass-gate requires over an order of magnitude more power than such a direct vertical route implementation. Previous research has found, for example, that an architecture that adds 50% DPs to an existing fabric provides 19% energy savings and 30% area savings [8].
There are a variety of possible ways to route inputs in coarse-grained fabrics. It is not obvious which approach is better: because some of them require additional hardware, we must understand how well that hardware is utilized. To better understand the tradeoffs, we present a quantitative study of different architectures, described briefly as follows. In this paper, we study (i) the integrated constants (IC) approach, where constants are loaded in registers local to the functional units; (ii) inputs coming from the side (ICS), where both constants and variable inputs can be routed to the stripe directly where needed; (iii) ICS with extended vertical interconnect (ICS-EV); and (iv) a combination of dedicated pass gates (DPs) with the standard, IC, ICS, and ICS-EV architecture styles.
In the standard implementation, inputs are routed from the top of the fabric and functional units are used for passing information. This leads to inefficient resource utilization because functional units that could have been used for actual computations are being used for pass operations. Since there are many inputs that stay constant during the execution cycle, they can be loaded into registers local to the functional units. This approach uses additional registers for loading constants but saves some functional units from being used only for passing information. It provides 22% area savings and 13% energy savings on average over the baseline.
In the ICS architecture, we introduce small multiplexers at the inputs of each functional unit to provide the flexibility to read inputs from the top row or directly from outside the fabric. We find that the use of such multiplexers allows substantial area savings by allowing smaller fabrics to carry the same benchmark suites. The small additional power and energy cost of the additional hardware is recovered easily because the overall fabric is smaller and fewer functional units are used as pass gates. This approach achieves 51% area savings and 27% energy savings over the baseline.
We extended the ICS approach by introducing multi-level vertical interconnect in the fabric. A functional unit can now reach not only the functional units in the row above but also the functional units in the grandparent and great-grandparent rows in the same column. We use bigger multiplexers than in the ICS approach to provide that reachability, but we can now implement the same benchmarks on even smaller fabrics. It provides 60% area savings and 27% energy savings over the baseline. In addition to these, we also studied the combination of adding dedicated vertical routes to the standard, IC, ICS, and ICS-EV techniques. Adding dedicated pass gates to these architectural options further increases the area and energy savings.
Journal of VLSI design & Communication Systems (VLSICS) Vol.4, No.1, February 2013
While our technique applies to stripe-based reconfigurable fabrics in general, such as PipeRench ([13, 14]) and Kilocore ([4]), and conceptually to the larger class of coarse-grained reconfigurable fabrics, our technique is demonstrated using the low-energy domain specific fabric (DSF) target ([5]) shown in Figure 2.
The remainder of this paper is organized as follows: Section 2 provides some background material in the area of reconfigurable computing and coarse-grain architectures in general. An overview of the fabric target used in this paper to demonstrate the impact of the inputs coming from the side is presented in Section 3. Section 4 includes results and an analysis of energy consumption for a suite of benchmark circuits. Section 5 discusses conclusions.
BACKGROUND AND LITERATURE REVIEW
A tremendous amount of effort has been devoted to the area of reconfigurable computing for application acceleration with custom hardware. While FPGAs are the most commonly used general purpose reconfigurable devices, they exhibit poor power characteristics.
Recently, the development and use of coarse-grained fabrics for computationally complex tasks has received a lot of attention as a possible alternative to FPGAs. [1] ), and the coarse-grained architectures developed by [2], [27].
MATRIX (Multiple ALU architecture with Reconfigurable Interconnect eXperiment) [3] is comprised of a two-dimensional array of identical 8-bit functional units with a configurable network. Each functional unit consists of a 256x8-bit memory, an 8-bit ALU, and control logic. The Garp [23], the Chimaera [28], the MorphoSys [24], and the SuperCISC [20] architectures combine a reconfigurable computing device with a processor in order to do hardware acceleration. RaPiD (Reconfigurable Pipelined Datapath) [7, 15], mainly intended for computation-intensive applications, consists of a linear array of application-specific functional units. PipeRench [13, 14] and Kilocore [4] have a striped configuration and are comprised of an interconnected network of configurable logic blocks and storage elements. Such a fabric consists of a set of physical pipeline stages called stripes, and each stripe contains a set of processing elements, register files, and an interconnection network. The CFPA (Computational Field Programmable Architecture) [11] consists of Partial Add, Subtract, and Multiply (PASM) blocks for implementing data path operations of computationally intensive applications. The PASM blocks operate on 4-bit operands and can be connected together to implement adders, subtracters, and multipliers of various sizes. The HFPGA (Hierarchical Field Programmable Gate Array) [10] allows the creation of coarse grain blocks built from traditional 4-input lookup tables. These coarse grain blocks have dedicated routing channels. ADRES [18] implemented and evaluated several interconnection topologies that include a simple mesh and more complex schemes, where one functional unit can transmit data to non-adjacent functional units in the same row or non-adjacent functional units in the same column.
Pact XPP Technologies [21] proposed the XPP architecture, which has a hierarchical array of coarse-grained adaptive computing elements called Processing Array Elements (PAEs) and a packet-oriented communication network. An XPP core is comprised of a rectangular array of ALU-PAEs and RAM-PAEs with I/O. These reconfigurable fabric architectures have a sequential structure and use local registers or shared register files for storing data values. Of these, PipeRench and Kilocore are stripe-based coarse grain fabrics. These fabrics use pass register files to manage constants and to pass computed values from one stripe to the other. [22] describes how to manage short-lived and long-lived values in coarse-grained fabrics. They discuss various architectural options for storing values when optimizing for area and energy. They consider constants as long-lived values and store them in register files. In this paper, we present a detailed energy and area analysis of various architectural techniques including integrated constants, inputs coming from the side, the hybrid of the IC and ICS approaches with extended vertical interconnect, and the combination of dedicated pass gates with standard, IC, and ICS. Dedicated pass gates are also incorporated to reduce the usage of functional units as pass gates to pass computed values from one stripe (producer) to another stripe (consumer), especially when the consumer is separated by multiple stripes from the producer.
In our previous research, we studied the impact of varying different design parameters such as the width of the functional units, homogeneous vs. heterogeneous functional units, various functional unit implementation techniques, granularity of the interconnect, interconnect patterns, and horizontal and vertical routing on physical characteristics like power, performance, and area [5, 25, 8, 9]. We attempted to minimize the cardinality of the interconnect and the number of operations supported by each ALU, and to maximize the use of dedicated pass gates in the fabric. We observed that even with all of the optimizations, a very large number of ALUs used as pass-gates remain and the results appear to be area-inefficient, which motivates the idea of exploring alternative approaches in this paper. To our knowledge, no one has yet presented a systematic exploration of input routing alternatives as considered in this paper.
DOMAIN SPECIFIC FABRIC OVERVIEW
Stripe-based hardware fabrics are designed to easily map data flow graphs (DFGs) from the application onto the device. We illustrate our results by modifying the domain specific fabric architecture shown in Figure 2, although a similar approach could be used for other stripe-based architectures.
The fabric model was implemented in parameterized VHDL using the generic capability of the VHDL language. The fabric size is determined by parameters specifying the width of the fabric, W, and the height of the fabric, H. W dictates the number of ALUs in each computational stripe. H determines the number of computational and interconnection stripes in the fabric model shown in Figure 2. The fabric architecture also has several early exit rows, spaced evenly in the device. For example, for a fabric with height 18, every alternate row is connected to the exit row. As soon as an output is computed, it can be sent to the nearest exit row, which is connected to the final output of the device. If the output is available in row 9, it will go to the nearest exit row, row 10, and then to the final output of the device. This saves a significant number of functional units in the successive rows from being used to pass outputs down the rows.
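The early exit behavior described above can be illustrated with a small sketch. It assumes that rows are numbered from 1, that data only flows downward, and that "every alternate row" means the even-numbered rows are exit rows (which is consistent with the row 9 to row 10 example). The function name and the `exit_spacing` parameter are hypothetical and are not part of the VHDL fabric model.

```python
def nearest_exit_row(output_row: int, height: int, exit_spacing: int = 2) -> int:
    """Return the first exit row at or below output_row.

    Assumptions (not from the paper): rows are numbered 1..height and
    exit rows are the rows whose index is a multiple of exit_spacing.
    """
    if not 1 <= output_row <= height:
        raise ValueError("output_row must lie inside the fabric")
    # Round the row index up to the next multiple of exit_spacing.
    candidate = -(-output_row // exit_spacing) * exit_spacing
    # Assumed fallback: if no exit row lies below, use the last row.
    return min(candidate, height)


# An output computed in row 9 of an 18-row fabric leaves through row 10.
print(nearest_exit_row(9, 18))
```

Under these assumptions an output never has to be passed through more than one extra row before it reaches an exit row.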
Mapping of applications onto domain-specific reconfigurable fabric
A mapping of a data flow graph (DFG) onto a reconfigurable fabric consists of an assignment of operators in the DFG to ALUs in the reconfigurable fabric such that the logical structure of the DFG is preserved and the architectural constraints of the fabric are followed. This mapping problem is critical to the use of the fabric because a mapping solution must be available each time the fabric is reprogrammed for a specific DFG. Because of the layered nature of the fabric, the mapping is also allowed to use ALUs as pass-gates, which take a single input and pass the input value to one or more outputs. In general, not all of the available ALUs and edges will be used. An example DFG and a corresponding mapping are shown in Figure 3 and Figure 4. The DFG from Figure 3 is implemented on a baseline architecture where inputs and constants are routed from the top of the fabric. ALUs used as operators are shown as white squares with the operators marked in them; ALUs used as pass gates are shown in blue and labeled "P". The inputs and outputs are shown as white ovals. Consider the ALU in row 11 and column 10, i.e., ALU (11, 10), shown in yellow. One of its inputs is a constant that is routed all the way from the top of the fabric, which uses 10 ALUs just to pass this input to the desired location. Clearly, routing alternatives for passing input values are needed.
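The routing cost in the ALU (11, 10) example can be reproduced with a minimal sketch. It assumes that a value moves down exactly one row per stripe in the baseline fabric, that rows are numbered from 1, and that an input entering at the top of the fabric is treated as produced in a virtual row 0. The names `pass_gates_needed` and `TOP_OF_FABRIC` are hypothetical.

```python
TOP_OF_FABRIC = 0  # assumed virtual row for inputs entering at the top


def pass_gates_needed(producer_row: int, consumer_row: int) -> int:
    """Count ALUs configured as pass gates for one vertical edge.

    Assumption: every row strictly between the producer and the
    consumer must spend one ALU forwarding the value.
    """
    if consumer_row <= producer_row:
        raise ValueError("the consumer must be below the producer")
    return consumer_row - producer_row - 1


def total_pass_gates(edges) -> int:
    """Sum the pass-gate cost over (producer_row, consumer_row) edges."""
    return sum(pass_gates_needed(p, c) for p, c in edges)


# A constant routed from the top of the fabric to an ALU in row 11.
print(pass_gates_needed(TOP_OF_FABRIC, 11))  # 10
```

An edge between adjacent rows costs nothing in this model, which is why only edges that traverse multiple rows contribute to the pass-gate count.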
This DFG has two outputs, one of which is computed and available very early in the fabric (in row 4). Because of early exit rows in the fabric, this output can come out directly to the final output without using any ALUs in the successive stripes for the pass operation.
Architectural exploration case studies
In order to conduct architectural exploration case studies, we selected a set of core signal processing benchmarks from the MediaBench benchmark suite, including the ADPCM encoder (enc), ADPCM decoder (dec), GSM channel encoder (gsm), and the MPEG II decoder (row, col). We added the Sobel (sob) and Laplace (lap) edge detection algorithms to the benchmark suite. Table 1 shows the number of operations and the number of constants contained in each benchmark. Operations include only regular arithmetic, logic, and shift operations such as addition, multiplication, AND, OR, right-shift, etc. The table also shows the number of pass gates required to pass inputs and constants to the functional units where they are needed in the baseline architecture. As can be seen, a large number of functional units are wasted on routing inputs and constants. For example, in "enc", 105 pass gates are used to route only 3 inputs and 14 constants.
Fabric architecture with dedicated pass gates (DP)
In order to reduce the power consumption due to large numbers of ALUs being used as pass gates, the use of dedicated pass gates, which simply route data vertically from one row to the next, has been explored ([8]). A dedicated pass gate can also be set to an idle state when not being used. Figure 5 shows the data flow graph (DFG) from Figure 3 mapped onto the architecture with 33% DPs (1 out of 3). ALUs used as operators are shown as white squares with the operators marked in them; ALUs used as pass gates are shown in blue and labeled "P"; the dedicated pass gates are shown in green and labeled "DP"; and the empty white squares are idle. Our goal here is to minimize the usage of ALUs for pass operations. As can be seen, the number of ALUs used as pass gates (shown in blue) has been reduced relative to the baseline architecture, but there are still many ALUs being used for pass operations.
Fabric architecture with integrated constants (IC)
To implement the Integrated Constants (IC) architecture, we used a register to store a constant and a 2:1 multiplexer for each operand of an ALU as shown in Figure 6. Each multiplexer can take inputs from the stripe above and from a register. The first stripe of ALUs in the fabric architecture takes variable inputs from the top and constant inputs from the registers; the ALUs in the rest of the stripes can get their operands either from the predecessor stripe or from the register. Figure 7 shows the DFG from Figure 3 mapped onto the architecture where constants are routed directly to the functional units where needed using registers. In order to keep the figures simple, we show the constants integrated inside the ALUs, and the variables are in bubbles off to the sides. Constants are labeled within an ALU as "LC" and "RC". "LC" stands for left constant and means that the left operand of the ALU is a constant. "RC" stands for right constant and means that the right operand of the ALU is a constant. The same graph, which used a 16x14 standard fabric, needs only a 13x14 fabric with IC. It requires 19% fewer functional units to implement the same DFG on the fabric with IC than the standard implementation.
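The 19% figure follows directly from the two fabric sizes. The short sketch below shows the arithmetic, assuming that the number of functional units in a fabric is simply width times height; the function name is hypothetical.

```python
def functional_unit_savings(baseline, candidate):
    """Percentage of functional units saved relative to the baseline.

    Both arguments are (width, height) tuples. Assumption: every
    position in a W x H fabric counts as one functional unit.
    """
    base_units = baseline[0] * baseline[1]
    cand_units = candidate[0] * candidate[1]
    return 100.0 * (base_units - cand_units) / base_units


# 16x14 standard fabric (224 units) versus 13x14 IC fabric (182 units).
print(round(functional_unit_savings((16, 14), (13, 14))))  # 19
```

The same calculation applies to any other pair of fabric sizes reported for this DFG.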
Fabric architecture with inputs coming from side (ICS)
To implement the ICS architecture, we used a 2:1 multiplexer for each operand of an ALU as shown in Figure 8. Each multiplexer can take inputs from the stripe above and from the side. The first stripe of ALUs in the fabric architecture takes all inputs from the top. No multiplexers are needed for the first ALU stripe. The ALUs in the rest of the stripes can get their operands either from the predecessor stripe or from the side. Each stripe has two busses, one for the left operand and one for the right operand. Inputs are stacked in a single multi-bit signal that is sent along the bus, and required inputs are selected from this value by the left or right multiplexer. Figure 9 shows the DFG from Figure 3 mapped onto the architecture where inputs and constants are routed directly to the functional units where needed. The same graph, which used a 16x14 standard fabric, needs only a 4x14 fabric with ICS. It requires 75% fewer functional units to implement the same DFG on the fabric with ICS than the baseline.
Fabric architecture with ICS with extended vertical interconnect (ICS-EV)
To implement this architecture, we used a 4:1 multiplexer for each operand of an ALU as shown in Figure 10. Each operand can come either from the stripe above, the grandparent stripe ALU (same column), the great-grandparent stripe ALU (same column), or from the side, and the 4:1 multiplexer provides this flexibility and reachability. The first stripe of ALUs in the fabric architecture takes variable inputs from the top and constant inputs from the registers. Each stripe has two busses, one for the left operand and one for the right operand. Inputs are stacked in a single multi-bit signal that is sent along the bus, and required inputs are selected from this value by the left or right multiplexer. Figure 11 shows the DFG from Figure 3 mapped onto the architecture where inputs and constants are routed directly to the functional units where needed. The same graph, which used a 16x14 standard fabric, needs only a 3x14 fabric with the hybrid approach. It requires 81% fewer functional units to implement the same DFG on this new architecture compared to the baseline.
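The operand selection described for the side bus and the 4:1 multiplexer can be modeled behaviorally as follows. This is only a sketch under stated assumptions: the 8-bit word width, the enum and function names, and the simplification that each vertical source contributes a single value are all hypothetical and are not taken from the VHDL model.

```python
from enum import Enum


class Source(Enum):
    PARENT = 0             # stripe directly above
    GRANDPARENT = 1        # two stripes above, same column
    GREAT_GRANDPARENT = 2  # three stripes above, same column
    SIDE = 3               # value taken from the side bus


WORD_BITS = 8  # assumed operand width, for illustration only


def pack_bus(words, word_bits=WORD_BITS):
    """Stack several inputs into one multi-bit bus value."""
    bus = 0
    for i, word in enumerate(words):
        if not 0 <= word < (1 << word_bits):
            raise ValueError("word does not fit in word_bits")
        bus |= word << (i * word_bits)
    return bus


def select_from_bus(bus, index, word_bits=WORD_BITS):
    """Extract the index-th stacked input from the bus value."""
    return (bus >> (index * word_bits)) & ((1 << word_bits) - 1)


def select_operand(source, column_values, bus, bus_index=0):
    """Behavioral model of one 4:1 operand multiplexer.

    column_values holds the outputs of the three rows above in the
    same column, nearest row first.
    """
    if source is Source.SIDE:
        return select_from_bus(bus, bus_index)
    return column_values[source.value]


bus = pack_bus([7, 42, 255])
print(select_operand(Source.SIDE, [1, 2, 3], bus, bus_index=1))  # 42
print(select_operand(Source.GRANDPARENT, [1, 2, 3], bus))        # 2
```

Restricting `Source` to `PARENT` and `SIDE` gives the corresponding model of the 2:1 multiplexer used in the plain ICS architecture.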
Fabric architecture with
Fabric architecture with ICS with horizontal interconnect (ICS-HI)
To implement this architecture, we used a 4:1 multiplexer for each operand of an ALU, as shown in Figure 12. Each operand can come from the stripe above, the left ALU (same row), the right ALU (same row), or the side; the 4:1 multiplexer provides this flexibility and reachability. The first stripe of ALUs in the fabric architecture takes variable inputs and constants from the top and results from the neighbor ALUs. Each stripe has two busses, one for the left operand and one for the right operand. Inputs are stacked in a single multi-bit signal that is sent along the bus, and the required inputs are selected from this value by the left or right multiplexer. Figure 13 shows the DFG of Figure 3 mapped onto the architecture, where inputs and constants are routed directly to the functional units where needed. The same graph, which used a 16x14 standard fabric, uses only a 6x6 fabric with the hybrid approach. The new architecture requires 84% fewer functional units than the baseline to implement the same DFG.
Figure 13: A DFG shown in Figure 3 mapped on the ICS-HI architecture.
RESULTS
We performed detailed area and energy analysis on various architectural options including standard, IC, ICS, ICS-EV, and a combination of dedicated pass gates with these approaches.
Table 2 provides a summary of the size requirements of the seven signal and image processing benchmarks mentioned in Section 3.2 mapped to various fabric architecture styles. The fabric size is given as Width x Height. Compared with the baseline, the benchmarks fit in narrower fabrics under each of the architecture alternatives. Benchmarks with a larger number of constants, such as "enc", "dec", "col", and "gsm", show large area improvements. For example, "gsm" implemented on the standard fabric with 33% DPs used a 16-wide fabric, whereas the same benchmark implemented on the fabric with ICS takes only a 3-wide fabric.
Once all benchmarks were mapped to a fabric using a particular architecture, the fabric size was fixed to the smallest size that could fit all seven benchmarks. The benchmarks can be mapped onto smaller fabrics for the ICS architectures than for the standard architectures, as shown in Tables 4, 5, 6, and 7. For example, the benchmarks implemented on the standard architecture with no DPs used a 20x18 fabric, whereas the same set of benchmarks can now be implemented on a 9x18 fabric with ICS.
Table 8 shows the percentage savings in the number of functional units per benchmark mapped onto the standard, IC, ICS, and hybrid architectures. We computed the number of functional units required to map each benchmark for a particular architecture and then compared every architectural option with our reference baseline architecture to obtain the savings. The IC architecture requires 27% fewer functional units than the baseline. The ICS architecture saves 52% of the functional units compared to the standard architecture. The ICS-EV architecture requires 62% fewer functional units than the baseline architecture. The combination of ICS and 50% DPs needs 64% fewer functional units than the baseline.
Using the parameterized fabric model described in Section 3, we generated various instances of fabric architectures. We synthesized the fabric VHDL into a Synopsys cell-based ASIC design with a feature size of 90 nm using Synopsys Design Compiler. Figure 14 shows the area consumption of the standard, dedicated pass gate, ICS, and hybrid architectures having both ICS and DPs. The hybrid architecture with ICS and 50% DPs consumes the least area, providing 61% area savings compared to the standard architecture with no DPs.
We also examined the utilization of ALUs for pass operations for various fabric architecture implementations. In Table 9, we compare std, IC, ICS, ICS-EV, and a combination of DPs with these techniques.
The number of ALUs used as pass gates is reduced significantly in the architectures that combine ICS and DPs, compared with the baseline architectures. Consider the case of "gsm". When we mapped this benchmark onto the standard fabric with no dedicated pass gates, 139 out of 360 ALUs were used for pass operations. When we added 33% dedicated pass gates to the architecture, the number of ALUs used as pass gates dropped to 46. When we also introduced ICS in the fabric, ALUs were no longer required for passing values down the fabric. Even in the hybrid architecture with IC, ICS, and extended vertical interconnect, only 2 functional units are used for passing information.
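The size and utilization figures quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the tooling used in this study: the function names are hypothetical, the rule that a benchmark fits whenever the fabric is at least as wide and as tall as it needs is an assumption, and the per-benchmark sizes passed to `enclosing_fabric` are placeholders rather than values from Table 2.

```python
from typing import Iterable, Tuple

Size = Tuple[int, int]  # (width, height) in functional units


def functional_units(size: Size) -> int:
    """Number of functional units in a width x height fabric."""
    width, height = size
    return width * height


def enclosing_fabric(sizes: Iterable[Size]) -> Size:
    """Smallest fabric that fits every benchmark.

    Assumption: a benchmark fits whenever the fabric is at least as wide
    and at least as tall as the benchmark requires.
    """
    sizes = list(sizes)
    return (max(w for w, _ in sizes), max(h for _, h in sizes))


def savings_percent(baseline: Size, candidate: Size) -> float:
    """Percentage reduction in functional units relative to the baseline."""
    base = functional_units(baseline)
    return 100.0 * (base - functional_units(candidate)) / base


def pass_fraction(pass_alus: int, total_alus: int) -> float:
    """Fraction of ALUs that only pass values through."""
    return pass_alus / total_alus


def relative_reduction(before: int, after: int) -> float:
    """Fractional drop in a count, e.g. ALUs used as pass gates."""
    return (before - after) / before


# Fabric sizes quoted in the text.
print(savings_percent((16, 14), (6, 6)))   # example DFG: standard vs. hybrid
print(savings_percent((20, 18), (9, 18)))  # full suite: standard (no DPs) vs. ICS

# "gsm" on the 20x18 standard fabric: 139 of 360 ALUs act as pass gates,
# dropping to 46 once 33% dedicated pass gates are added.
print(f"{pass_fraction(139, functional_units((20, 18))):.1%}")
print(f"{relative_reduction(139, 46):.1%}")

# Placeholder per-benchmark sizes (NOT from Table 2), for illustration only.
print(enclosing_fabric([(3, 12), (9, 18), (7, 10)]))
```

The first call gives roughly 84%, matching the figure quoted for the example DFG. The second gives 55% for the suite-level fabric, which is a different quantity from the per-benchmark averages reported in Table 8, so the two need not agree. Only the drop in the pass-gate count is computed for the 33% DP case, because the total ALU count of that fabric is not stated in the text.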
We also conducted energy simulations on the architectures discussed in this paper. The energy results are shown in Figure 15. For each architecture, we compute the energy for all the benchmarks examined and then compute the average consumption over all the benchmarks. The combination of ICS and DPs consumes the least energy. The hybrid-EV architecture shows average energy consumption similar to that of the ICS and DPs combination. This architecture does not use any dedicated pass gates to pass information from the producer to the consumer. Instead, it has extended vertical interconnect that increases the reachability of the functional units.
The energy savings results for different fabric instances are shown in Table 10. Energy was calculated as the product of the power and delay of the design. To calculate the power and delay of the design, the fabric VHDL is synthesized into a Synopsys cell-based ASIC design with a feature size of 90 nm using Synopsys Design Compiler. The post-synthesis design was simulated in Mentor Graphics ModelSim to calculate the delay of each design, and these simulations were used as stimulus to the Synopsys PrimeTime-PX tool to estimate the power consumption of the device. The fabric with IC provides energy savings of 13% compared to the standard fabric. With ICS, we achieved energy savings of 27% compared to the baseline architecture. By combining DPs and ICS, we achieved energy savings of up to 32% compared to the baseline architecture, averaged over all benchmarks. The ICS-EV fabric provides energy savings of 27%.
Table 11 shows the percentage area savings. The IC fabric provides 22% area savings compared to the standard implementation. The ICS and ICS-EV architectures achieve 51% and 60% area savings, respectively. The combination of ICS and DPs provides up to 62% area savings when compared with the baseline.
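The energy metric described above, power multiplied by delay and averaged over the benchmarks, can be sketched as follows. This is an illustration under stated assumptions, not the scripts used in the study: the function names and units (mW and ns, giving pJ) are hypothetical, the power and delay values are placeholders rather than measured results, and computing savings from the averaged energies (instead of averaging per-benchmark savings) is one plausible reading of the procedure.

```python
from statistics import mean
from typing import Dict, Tuple

# benchmark name -> (average power in mW, delay in ns); units are assumed
Measurements = Dict[str, Tuple[float, float]]


def energy_pj(power_mw: float, delay_ns: float) -> float:
    """Energy as the power-delay product; mW * ns gives pJ."""
    return power_mw * delay_ns


def average_energy(measurements: Measurements) -> float:
    """Mean energy over all benchmarks for one architecture."""
    return mean(energy_pj(p, d) for p, d in measurements.values())


def energy_savings_percent(baseline: Measurements,
                           candidate: Measurements) -> float:
    """Percentage reduction in average energy relative to the baseline."""
    base = average_energy(baseline)
    return 100.0 * (base - average_energy(candidate)) / base


# Placeholder numbers for two hypothetical benchmarks (not measured data).
standard = {"bench_a": (2.0, 10.0), "bench_b": (4.0, 5.0)}
candidate = {"bench_a": (1.5, 10.0), "bench_b": (3.0, 5.0)}

print(average_energy(standard))
print(energy_savings_percent(standard, candidate))
```

With real data, the dictionaries would hold the delay obtained from the post-synthesis simulation and the power estimate from the power analysis tool for each benchmark and architecture.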
Figure 16 shows the energy and area savings for various fabric architecture implementations for the suite of signal and image processing applications examined here. For each architecture, we show the energy and area savings achieved over the baseline architecture, averaged over all the benchmarks. The ICS option achieves greater energy and area savings than the IC architecture. When ICS is combined with DPs, the energy and area improvements are even higher. A similar level of area and energy savings can also be achieved using the ICS-EV style.
CONCLUSIONS
In this paper, we discussed various styles of routing constants and variable inputs in a stripe-based coarse-grained reconfigurable fabric, including (i) the integrated constants (IC) approach, where constants are loaded in the registers local to the functional units; (ii) inputs coming from the side (ICS), where both constants and variable inputs can be routed to the stripe directly where needed; (iii) ICS with extended vertical interconnect (ICS-EV); and (iv) a combination of dedicated pass gates (DPs) with the standard, IC, ICS, and ICS-EV architecture styles. We implemented these architecture styles using a 90 nm ASIC process from Synopsys. We performed a detailed area and energy analysis on these architectures using signal processing benchmarks from the Mediabench benchmark suite and some image processing applications. We observed that the fabric with ICS and 50% DPs is the best among these options, providing 31% energy savings and 62% area savings over a baseline architecture for our benchmark set.
