Abstract - In current reconfigurable architectures, the interconnect structures increasingly contribute to the delay and power consumption budget. The demand for increased clock frequencies and logic availability (smaller area footprint) makes the problem even more important, leading among others to a rapid elevation in power density. Three-dimensional (3D) architectures are able to alleviate this problem by accommodating a number of functional layers, each of which might be fabricated in a different technology. Since power consumption is a critical challenge for implementing applications onto reconfigurable hardware, a novel power-aware placement and routing (P&R) algorithm targeting 3D FPGAs is introduced. The proposed algorithm redistributes the switched capacitance over identical hardware resources in a more "balanced" profile, reducing among others the number of hotspot regions, the maximal values of power sources at hotspots, as well as the percentage of device area that consumes high power. For evaluation purposes, the proposed approach is realized as a new CAD tool, named 3DPRO (3D-Placement-and-Routing-Optimization), which is part of a complete framework named 3D MEANDER. Compared to alternative solutions, the proposed one reduces the percentage of silicon area that operates under high power by 63% and leads to energy savings of about 9%, with an almost negligible penalty in the application's delay, ranging from 1% up to 5%.
INTRODUCTION
For decades, semiconductor manufacturers have been shrinking transistor size in integrated circuits (ICs) to achieve the yearly increases in speed and performance described by Moore's Law. These gains existed because the RC delay was negligible in comparison to the signal propagation delay [1]. For submicron technology, however, the RC delay becomes a dominant factor. This has generated many discussions concerning the end of device scaling as it was known before, and has hastened the search for solutions beyond the perceived limits of current 2D systems.
Three-dimensional (3D) architectures mitigate many of the limitations introduced by existing design methodologies. Among others, they provide: (i) higher logic density in the same footprint area, (ii) shorter interconnections among the functional blocks, (iii) reduced signal propagation delay, (iv) greater versatility and resource utilization, and (v) lower power consumption.
One of the most critical challenges for implementing designs in 3D Field Programmable Gate Arrays (FPGAs) is power management, and hence the thermal problem, which has already been studied for 2D architectures [3, 6, 7, 17, 18]. This problem is exacerbated in 3D architectures for two reasons: (i) the vertically stacked layers cause a rapid increase of power density [9] and (ii) the thermal conductivity of the dielectric inserted between device layers for insulation is very low compared to silicon and metal.
According to the "A-power" law [3], the power density will continue increasing in future technologies. A side effect of this increase is higher on-chip temperatures. Thermal management of FPGA devices is more critical compared to ASIC solutions, as they exhibit increased power dissipation. Additionally, the leakage current increases exponentially with temperature, causing a positive feedback loop. Consequently, the power consumption issue needs to be considered during every stage of implementing applications onto 3D FPGAs.
Thermal management of reconfigurable architectures is critical, mainly for three reasons:
a) FPGAs exhibit high power dissipation, while their operating temperature often exceeds the critical device temperature (in the absence of elaborate cooling mechanisms).
b) The trend towards high power dissipation, and consequently the thermal stress, is going to become more severe for fabrication technologies at 65nm and below.
c) The leakage current increases exponentially with temperature, causing a positive feedback loop between leakage power and temperature.
Power consumption of FPGAs is generally grouped into three categories: i) dynamic power, ii) static power, and iii) interface (I/O) power. These components are governed by the process technology and traditionally maintain constant percentages of the device's total power. However, the dynamic part of power consumption (formulated in Equation (1)), which occurs due to signal transitions (as the load capacitance is charged or discharged), still dominates the total power dissipation. This equation expresses dynamic power in terms of the clock frequency, the supply voltage, and the capacitance and switching activity values of the hardware elements that form each network.
When a lower bound on the supply voltage is set by external constraints (as often happens in real-world designs), or when the performance degradation due to lowering the supply voltage is intolerable, the only means of reducing power consumption is to appropriately manage both the effective capacitance and the switching activity (i.e. the switched capacitance). Throughout the paper, we discuss an algorithm for managing the spatial distribution of this product of capacitance and switching activity, leading to a more "uniform" power profile over the 3D FPGA.
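The relation between switched capacitance and dynamic power can be illustrated with a short Python sketch. It assumes the commonly used form P = 0.5 · f · Vdd² · Σ(C · a); the 0.5 factor and the names `Net`, `switched_capacitance` and `dynamic_power` are assumptions made for illustration and are not taken from Equation (1) or from the 3DPRO tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Net:
    """One network of the design (hypothetical record, not a 3DPRO type)."""
    capacitance_farads: float   # capacitance of the resources forming the net
    switching_activity: float   # average transitions per clock cycle


def switched_capacitance(net: Net) -> float:
    """Product of capacitance and switching activity for one net."""
    return net.capacitance_farads * net.switching_activity


def dynamic_power(nets: List[Net], clock_hz: float, vdd_volts: float) -> float:
    """Assumed form: P = 0.5 * f * Vdd^2 * sum(C_i * a_i)."""
    total_switched = sum(switched_capacitance(n) for n in nets)
    return 0.5 * clock_hz * vdd_volts ** 2 * total_switched
```

With the clock frequency and supply voltage fixed, the only quantity left to manage in this expression is the per-net product returned by `switched_capacitance`, which is the quantity the discussion above focuses on.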
Reducing and managing power consumption for reconfigurable architectures requires appropriate algorithmic support. Realizing applications on 2D FPGAs is a well-studied problem [6, 10, 18]; however, there are only a few solutions focusing on 3D architectures [8, 13, 14, 16].
In [13] a P&R approach for 3D ICs is presented, whose criteria are to minimize the total wirelength, the application's delay, and the on-chip temperature. Even though the framework supports reconfigurable architectures, the thermal feature is available solely for ASIC designs. A similar approach is shown in [14], where the P&R algorithm optimizes the energy consumption and the thermal profile of a 3D standard-cell device under the supplied timing constraint. The employed algorithm focuses on the energy consumption of interconnect-related components. Unfortunately, the software implementation is not publicly available, and hence it is not feasible to evaluate it against our proposed solution. In [16] a thermal-driven 3D floor-planning algorithm that provides a tradeoff between runtime and quality is presented. The algorithm, compared to a non thermal-driven approach, tries to reduce the total wirelength, as well as the maximum on-chip temperature. In [8] a P&R algorithm and its software implementation for exploring alternative heterogeneous interconnection schemes for 3D FPGAs are introduced. The employed cost functions aim to minimize the application's delay, the power/energy consumption, as well as the total wirelength. Even though such an approach exhibits remarkable results, the spatial distribution of these parameters is not taken into account.
A common limitation among all the relevant approaches from the literature is that they try to handle (or to minimize) the total power/energy consumption of the design, ignoring the spatial distribution of its sources. This results in high thermal variation across the 3D device, leading among others to increased fabrication cost, as there is a need for more advanced packaging solutions.
In this paper, we propose a novel power-aware P&R algorithm, as well as its software implementation targeting 3D FPGAs. The proposed algorithm achieves similar performance gains compared to timing-aware P&R approaches, while it also alleviates some critical design issues related to the distribution of power sources. These gains are tightly coupled to the enhanced management of switched capacitance. More specifically, the proposed P&R algorithm achieves a rather uniform distribution of switched capacitance over the layers of the 3D FPGA (i.e. a more balanced temperature profile), as well as a reduction of the maximal power consumption values, with an almost negligible performance degradation. Consequently, we achieve a functionally identical application implementation in which the percentage of area that operates under high power values (i.e. exhibits high temperature values) is reduced, resulting in increased device reliability.
The efficiency of the proposed power-aware P&R solution was evaluated against existing (i.e. timing-driven) algorithms using the MCNC benchmarks [15]. Unfortunately, there is no other publicly available tool for application P&R onto 3D FPGAs that is aware of the distribution of power consumption sources. The results show a significant reduction (about 63%) in the percentage of area that operates under high power (hotspots). More specifically, we use the term hotspot to refer to device area that consumes more than 70% of the maximum power dissipation.
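The hotspot definition above can be expressed as a small Python sketch. The grid representation (a list of layers, each a list of rows of per-cell power values) and the function name `hotspot_area_percentage` are assumptions made for illustration; they do not describe the data structures of 3DPRO.

```python
def hotspot_area_percentage(power_map, threshold=0.70):
    """Percentage of cells whose power exceeds threshold * maximum power.

    power_map: list of layers, each a list of rows of per-cell power values.
    """
    # Flatten the 3D grid into one list of per-cell power values.
    cells = [p for layer in power_map for row in layer for p in row]
    if not cells:
        return 0.0
    peak = max(cells)
    if peak <= 0:
        return 0.0  # no power dissipated anywhere, hence no hotspots
    hot = sum(1 for p in cells if p > threshold * peak)
    return 100.0 * hot / len(cells)
```

Because the threshold is relative to the maximum value of the same map, this metric compares the shape of two power profiles; the maximal value itself has to be reported separately.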
The rest of the paper is organized as follows. In Section 2 we describe the key features of the 3D FPGA employed in our exploration/evaluation procedure, while in Section 3 we describe the application encoding into a hypergraph form. The proposed algorithms for the P&R step are discussed in detail in Section 4, whereas Section 5 evaluates the power-aware algorithm against alternative application implementations. Finally, conclusions are summarized in Section 6.
TARGET 3D RECONFIGURABLE ARCHITECTURE
For exploration/evaluation purposes, each of the benchmarks is mapped onto a 3D FPGA whose architecture is inspired by the one proposed in [8]. Such a device is constructed by stacking a number of identical 2D FPGAs on individual functional layers, while the appropriate communication among them is provided by Through-Silicon-Vias (TSVs). These TSVs are realized inside 3D Switch Boxes (SBs). The main architecture characteristics of the employed 3D FPGA are summarized as follows:
• Each application is placed and routed onto the smallest 3D FPGA that can accommodate it.
• For each application, the 3D devices employed for the timing-aware and the power-aware P&R are identical.
• The 3D FPGA consists of four functional layers.
• The hardware resources (both logic and interconnection) among layers are identical.
• The percentage of 3D SBs per layer is 30% (as derived in [8]).
• The spatial location (x, y) of each 3D SB per layer remains invariant.
• Every 3D SB consists of four wires (there is a 4-bit connection between adjacent layers per 3D SB).
• The technology of the employed TSV, known as 2nd Generation TSV, is considered the state-of-the-art solution [5].
Even though in this paper we evaluate our proposed power-aware P&R algorithm with a specific 3D FPGA, the algorithm is platform independent, and consequently can be used for application mapping onto any 3D reconfigurable architecture. However, as was already shown in [8], the employed device exhibits remarkable results (in terms of delay, power consumption and area utilization), as compared to alternative 3D FPGAs found in the literature [13, 14].
APPLICATION ENCODING
In order to handle applications with our proposed P&R algorithm, they have to be appropriately encoded into a directed hypergraph form. In this form, each vertex of the hypergraph represents the application's logic functionality implemented onto a logic block, while each directed hyperedge corresponds to the communication link between the logic blocks it connects. In order to encode the application's functionality, apart from the hypergraph construction, we associate a number of weights with the hyperedges and vertices.
More specifically, for each hyperedge, the weights define the communication bandwidth from its source vertex to its sink vertex (i.e. the number of wires between the two logic blocks), the power consumption, the timing criticality, as well as the electrical characteristics (in order to classify different wirelengths, TSVs, wires placed on each layer, etc.). Similarly, the weights of each vertex represent the area occupied by the block, its power consumption, its switched capacitance criticality, as well as its delay.
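The encoding described above can be sketched as a small Python data structure. All class and field names (`Vertex`, `Hyperedge`, `Hypergraph`, `wire_class`, and so on) are hypothetical and chosen only to mirror the weights listed in the text; the actual representation used by the tool is not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Vertex:
    """Logic function mapped onto one logic block (hypothetical record)."""
    name: str
    area: float                      # area occupied by the block
    power: float                     # estimated power consumption
    switched_cap_criticality: float  # importance for power balancing
    delay: float                     # block delay


@dataclass
class Hyperedge:
    """Directed net from one source block to one or more sink blocks."""
    source: str
    sinks: List[str]
    bandwidth: int                   # number of wires between the blocks
    power: float
    timing_criticality: float
    wire_class: str = "intra-layer"  # electrical class, e.g. "tsv"


@dataclass
class Hypergraph:
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    edges: List[Hyperedge] = field(default_factory=list)

    def add_vertex(self, vertex: Vertex) -> None:
        self.vertices[vertex.name] = vertex

    def add_edge(self, edge: Hyperedge) -> None:
        # Reject edges that reference blocks missing from the graph.
        for name in [edge.source, *edge.sinks]:
            if name not in self.vertices:
                raise KeyError(f"unknown vertex: {name}")
        self.edges.append(edge)

    def fanout(self, name: str) -> List[Hyperedge]:
        """Hyperedges driven by the given vertex."""
        return [e for e in self.edges if e.source == name]
```

Keeping the weights on the vertices and hyperedges themselves allows them to be updated in place between iterations of the partitioning, placement and routing steps.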
The values of these weights are calculated as follows:
• Power consumption: We employ the models proposed in [2], appropriately extended in order to be aware of the extra hardware of the 3D FPGA (i.e. 3D SBs).
• Delay: We employ the Elmore delay model [12].
• Area: We use an area estimation model proposed in [20].
POWER-AWARE PLACEMENT AND ROUTING ALGORITHM
Fundamentally, the problem of 3D P&R is related to the topological arrangement of the application's logic functions on the available hardware blocks of the 3D FPGA, while satisfying the design's timing, power and area constraints. Subject to the timing constraints, the goal of our proposed P&R algorithm is to minimize the gradient of switched capacitance, in order to obtain a uniform distribution of power sources, resulting in a better temperature profile. This approach can be thought of as a power management strategy, resulting in an application mapping with fewer hotspot regions, as compared to a conventional (i.e. non power-aware) implementation. More specifically, the hardware resources that belong to a hotspot region are defined as the device area that operates under more than 70% of the maximum power consumption value.
The 3D MEANDER framework also employs CAD tools from the corresponding 2D toolset [20], which do not need to be aware of the three-dimensional FPGA topology (i.e. they are technology platform independent). To the best of our knowledge, this toolset is the first complete framework in academia that covers the flow from the hardware description language (HDL) up to configuration file generation.
In this paper, we propose a novel power-aware P&R algorithm that performs the application partitioning, placement and routing on the target 3D architecture. This algorithm (shown in Figure 3) was implemented within a CAD tool named 3DPRO, and it was applied as an extension to the 3D MEANDER framework. The algorithmic complexity of this approach is similar to that of existing solutions [8, 10, 13, 14]. A detailed description of the employed functions (i.e. partitioning, placement and routing) that compose our proposed P&R algorithm is provided in the following sections.
Regarding the power/energy consumption, the estimations were retrieved by employing the models introduced in [2], appropriately extended in order to be aware of the third dimension. These models are integrated in a new tool, named 3DPower. As far as our experimental results are concerned, the transition density is assumed to be equal to 50% at the primary inputs. By calculating the transition density of each utilized hardware element, we retrieve the spatial distribution of switched capacitance over the 3D device. The term transition density refers to the average number of transitions occurring in a signal per unit time. Such an approach has proved to be faster and more reliable [10], compared to simulation-based activity estimations.
Algorithm for Power-Aware Partitioning and Layer Assignment
The first step of the proposed power-aware P&R algorithm deals with partitioning the application across the device layers. The employed partitioning algorithm is based on FM (Fiduccia and Mattheyses) [21], which was shown to be more efficient as compared to other candidate approaches (e.g. the Kernighan-Lin heuristic [22]). The FM algorithm, rather than improving the current partition by swapping pairs of vertices belonging to the two subsets of the current partition (as Kernighan-Lin does), moves only one vertex at a time. Also, consecutive moves are made in opposite directions.
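The single-vertex move at the core of FM can be illustrated with a simplified Python sketch. It assumes a two-way partition, weighted nets given as `(weight, members)` tuples, and no balance constraint or gain-bucket structure; the names `fm_gain` and `best_single_move` are hypothetical, and the multi-layer partitioner of the proposed algorithm is more involved than this.

```python
def fm_gain(vertex, nets, side):
    """Reduction in weighted cut if `vertex` moves to the other side.

    nets: iterable of (weight, members) tuples.
    side: dict mapping each vertex to 0 or 1.
    """
    src = side[vertex]
    gain = 0.0
    for weight, members in nets:
        if vertex not in members:
            continue
        on_src = sum(1 for v in members if side[v] == src)
        on_dst = len(members) - on_src
        if on_src == 1:
            gain += weight  # vertex is alone on its side: net becomes uncut
        if on_dst == 0:
            gain -= weight  # net is currently uncut: the move cuts it
    return gain


def best_single_move(nets, side, locked=frozenset()):
    """Unlocked vertex with the highest gain, or None if all are locked."""
    candidates = [v for v in side if v not in locked]
    return max(candidates, key=lambda v: fm_gain(v, nets, side), default=None)
```

In a full FM pass the moved vertex is locked, the gains of its neighbours are updated, and the best prefix of moves is kept at the end of the pass; those steps are omitted here for brevity.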
The implementation of the FM algorithm in [11] focuses on minimizing the edge-cut, while it also balances the size of the derived partitions. Even though such goals might be acceptable for partitioning VLSI designs targeting multi-chip devices, they are not sufficient for 3D architectures. This occurs due to the fact that the implementation in [11] mainly targets minimizing the edge-cut (number of connections among partitions). However, based on current fabrication processes [5], it is feasible to build devices with a high density of TSVs, in contrast to the inter-chip communication medium, which exhibits increased resistance and capacitance values. Consequently, minimizing the number of connections among partitions is no longer considered the primary goal, while additional constraints (e.g. heat dissipation) have to be taken into consideration. More specifically, during the proposed algorithm (shown in Figure 4), rather than only minimizing the inter-partition communication, we also make an effort to optimize the spatial distribution of the application's switched capacitance, without affecting its operation frequency or the total power/energy consumption.
Since the net length is tightly coupled to its resistance and capacitance values, we manage to control the spatial locations of power sources by appropriately weighting each net according to its switched capacitance. In addition to that, during the partitioning step, there is an effort to minimize the weighted sum of the networks that cross each partition. Networks that exhibit high switched capacitance values should be grouped (provided this is acceptable based on the timing or area constraints) in the same partition, so that they are highly localized and therefore shorter. Similarly, logic functions with high bandwidth requirements should be placed on the same functional layer, as there is a plethora of routing resources there compared to the limited vertical connectivity. Even though such an approach seems to result in increased wirelength, it can alleviate problems related to the limited amount of interlayer connectivity. More specifically, such an approach improves application implementation onto 3D FPGAs, whenever:
• Only a subset of the available SBs forms connections to the rest of the layers [8]. Consequently, the number of interlayer connections (i.e. TSVs) is significantly limited, as compared to the routing resources of each layer. The reduction of the interlayer connections adds more stress on the routing algorithm when connecting logic blocks placed on different layers, resulting (probably) in increased wirelength.
• The partitioning algorithm is not aware either of the relative Manhattan distance among the application's functionality assigned to each layer (application's placement), or of the spatial location of the available (unutilized) interlayer routing resources (i.e. 3D SBs).
• The proposed P&R algorithm is a general-purpose solution, which has to be aware of alternative technologies for 3D integration (e.g. wire-bonding, TSV, Face-to-Face). The equivalent electrical characteristics of these technologies exhibit increased resistance/capacitance values, as compared to the wires placed on a layer.
Equally important to the application partitioning is the task of layer ordering, as this procedure alleviates numerous design issues related to the operation frequency and thermal stress. More specifically, by placing close to each other layers that contain logic blocks belonging to critical networks, we might obtain shorter lengths, and hence smaller delay. Similarly, it is possible to prevent thermal problems by discouraging layers that consume high amounts of power from being assigned close to each other or in the middle of the 3D stack (as it is more difficult to dissipate heat there).
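The thermal side of layer ordering can be illustrated with a deliberately simple Python sketch. It models only one of the criteria mentioned above (keeping high-power layers away from the middle of the stack) and ignores timing criticality and adjacency; the penalty definition and the names `ordering_penalty` and `best_layer_order` are assumptions made for illustration, not the method of the proposed algorithm.

```python
from itertools import permutations


def ordering_penalty(order, layer_power):
    """Sum of layer power weighted by depth inside the stack.

    Depth is 0 for the two outer positions and grows towards the middle,
    so high-power layers in the middle are penalized most.
    """
    n = len(order)
    return sum(layer_power[layer] * min(pos, n - 1 - pos)
               for pos, layer in enumerate(order))


def best_layer_order(layer_power):
    """Exhaustively search all orderings (fine for a handful of layers)."""
    layers = range(len(layer_power))
    return min(permutations(layers),
               key=lambda order: ordering_penalty(order, layer_power))
```

For a four-layer device there are only 24 orderings, so an exhaustive search of this kind is inexpensive; a realistic cost would additionally reward placing layers that share critical networks next to each other.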
In order to quantify each of the derived solutions, we employ the cost function depicted in Equation (2). As our algorithm can be used for architecture-level exploration, rather than providing only a single application partitioning, we calculate the Pareto-based space of alternative application partitions. These solutions balance the area occupied by active hardware resources, the number of interlayer connections and the variation of power consumption (i.e. power sources) among layers.
The cost function of Equation (2) combines three terms: the variation of power sources over the 3D FPGA, the total amount of hyperedge-cut, and the area balance among the device layers. The employed weighting factors provide higher flexibility to the cost function, as they can be used to tune the algorithm for further optimizing the partitioning result. Finally, we have to mention that the cost function, as well as the criticalities of the hypergraph (i.e. weights of vertices and edges), are updated after each iteration, while the partitioning task stops when both the distribution of switched capacitance and the timing constraints are met.
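A cost of this kind can be sketched in Python as a weighted sum of the three quantities named above. The weighted-sum form, the use of variance for the power term, the max-minus-min measure of area imbalance, and the parameter names `w_power`, `w_cut` and `w_area` are all assumptions made for illustration; they are not a reconstruction of Equation (2).

```python
from statistics import pvariance


def partition_cost(layer_power, layer_area, cut_size,
                   w_power=1.0, w_cut=1.0, w_area=1.0):
    """Weighted sum of power variation, hyperedge cut and area imbalance.

    layer_power: total power assigned to each layer.
    layer_area:  active area assigned to each layer.
    cut_size:    number (or weighted sum) of hyperedges crossing layers.
    """
    power_variation = pvariance(layer_power)
    area_imbalance = max(layer_area) - min(layer_area)
    return (w_power * power_variation
            + w_cut * cut_size
            + w_area * area_imbalance)
```

Since the three terms have different units, in practice each would be normalized before weighting; sweeping the weights is one simple way to trace out a set of alternative partitions for architecture-level exploration.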
Power-Aware Placement Algorithm
After the application's partitioning, the placement algorithm (shown in Figure 5) maps the logic functionality to the available hardware modules placed at spatial locations (x, y, z). As the majority of applications realized onto FPGAs utilize only a subset of the available hardware resources, this non-uniformity leads to high variation of power consumption sources across the device [19].
In contrast to existing placement approaches, which mainly focus on minimizing the perimeter of the bounding box of active hardware elements, the proposed one also takes into consideration the spatial variation of switched capacitance across the device layers. More specifically, it tries to place the logic functions of each layer in a way that reduces the maximal switched capacitance values in hotspot regions, without affecting other design parameters (e.g. the application's delay). Also, it makes an effort to distribute the power sources more uniformly across the 3D FPGA, in order to obtain a more balanced temperature profile.
The efficiency of the proposed placement algorithm is based on a better management of switched capacitance. As the switching activity depends mostly on the application's functionality implemented on logic blocks, while the physical capacitance of wires is proportional to the network length and the number of hardware elements that form each network, the proposed algorithm handles the switched capacitance through an efficient application placement. More specifically, by placing on adjacent spatial locations logic functions connected through nets with high switching activity, these nets will probably be shorter (exhibit smaller capacitance), leading to reduced power consumption.
Unfortunately, it is not always feasible to place closely all the logic modules connected with high switching activity networks, as it might increase application's delay (i.e. delay of the slowest path).
The proposed placement algorithm is based on simulated annealing, and it can be explained as follows: Starting from an initial application placement on the available hardware blocks of each layer, pairs of logic functionalities implemented in hardware blocks are randomly selected and swapped. The efficiency of each new placement is quantified by means of the employed cost function. Whenever the value of this cost is reduced, the swap is kept. However, if the cost increases, the probability of keeping the swap is reduced with the execution time.
Even though such an approach can reach arbitrarily close to the global minimum if the cooling schedule is slow enough, it suffers from long run times for large circuits. In contrast to most of the existing approaches, which start from a random initial placement, our solution employs a more "sophisticated" assignment of logic blocks, leading to shorter runtimes. This is achieved by taking into consideration during the initial placement, apart from the timing and the wirelength constraints, the minimization of switched capacitance variation. Such information is available from the partitioning step (described in the previous section).
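The swap-and-accept loop described above can be sketched in Python as follows. The sketch assumes the usual Metropolis acceptance rule exp(-Δ/T) and a geometric cooling schedule; the parameter values and the names `accept_swap` and `anneal` are illustrative assumptions, and the cost function is passed in as an arbitrary callable rather than the placement cost of the paper.

```python
import math
import random


def accept_swap(delta_cost, temperature, rng=random):
    """Keep improving swaps; keep worsening ones with probability exp(-d/T)."""
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)


def anneal(placement, cost, t_start=10.0, t_end=0.01, alpha=0.9,
           moves_per_temp=50, seed=0):
    """Randomly swap pairs of entries while the temperature cools down."""
    rng = random.Random(seed)
    current = list(placement)
    current_cost = cost(current)
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            i, j = rng.sample(range(len(current)), 2)
            current[i], current[j] = current[j], current[i]
            new_cost = cost(current)
            if accept_swap(new_cost - current_cost, temperature, rng):
                current_cost = new_cost
            else:
                # Undo the rejected swap.
                current[i], current[j] = current[j], current[i]
        temperature *= alpha  # worsening swaps become less likely over time
    return current, current_cost
```

Starting from a good initial placement, as the proposed algorithm does, allows a lower starting temperature or fewer moves per temperature, which is where the runtime saving would come from in a schedule of this kind.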
The mathematical expression of the cost function employed during the power-aware placement is depicted in Equation (3). Such an approach handles three design parameters, namely the variation of the application's power consumption across the 3D FPGA, its delay, as well as the total wirelength (in contrast to relevant algorithms found in the literature [13, 14], which deal only with the last two of them).
The weighting factors of the cost function balance the effort among the parameters that guide the placement algorithm, in order to improve the power distribution, the application's delay or its total wirelength. The delay term denotes the delay between nodes i and j (a source-sink path of a network), the factor const is a constant, and a criticality value expresses how close the network is to the critical path. The switching activity of a network is calculated by summing the transition density of all the hardware elements that form this network. The wirelength term uses the dimensions of the 3D bounding box of each network, together with a scaling factor of the bounding box that gives more accurate wirelength estimations for nets with more than 3 terminals [10]. A further parameter controls how strongly the channel width influences the cost: at one extreme the cost reduces to the conventional bounding box approach, whereas the higher its value, the more the tracks in the narrowest routing channels are penalized compared to wider channels. We employ a different relative cost for the vertical interconnections, as this type of connection is very limited compared to horizontal wires. Consequently, the placement algorithm has to make an effort to employ them in the optimal manner. Finally, by using an additional factor, we discourage the placer from putting functions that exchange data in blocks that belong to different layers. This factor further improves the gains obtained from partitioning, regarding the distribution of power sources.
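The wirelength part of such a placement cost can be illustrated with a short Python sketch based on the 3D bounding box of a net. The linear form, the default value of `vertical_weight`, and the function names are assumptions made for illustration; the scaling factor `q` stands for the bounding-box correction mentioned above and is simply passed in by the caller.

```python
def bounding_box_3d(terminals):
    """Extent of a net along x, y and z; terminals are (x, y, z) tuples."""
    xs, ys, zs = zip(*terminals)
    return max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs)


def net_wirelength_cost(terminals, q=1.0, vertical_weight=4.0):
    """Scaled half-perimeter estimate with a higher price on layer crossings."""
    bb_x, bb_y, bb_z = bounding_box_3d(terminals)
    return q * (bb_x + bb_y + vertical_weight * bb_z)
```

Raising `vertical_weight` makes a net that spans several layers look longer to the placer, which pushes communicating blocks towards the same layer and reserves the scarce vertical connections for nets that really need them.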
Power-Aware Routing Algorithm
After defining the placement, the routing algorithm (shown in Figure 6) forms the appropriate connections through the available interconnection fabric. This algorithm aims at spreading the switched capacitance in a more uniform manner across the 3D device, while respecting the timing constraints. To achieve this, the routing algorithm avoids forming paths that cross regions with high power dissipation (i.e. high switched capacitance). However, similar to the placement algorithm, this is not always feasible, as it might either increase the application's delay or exceed the total power budget.
As the interlayer connections (i.e. TSVs) are limited compared to horizontal tracks, the proposed routing algorithm sets their weight to a higher value, in order to discourage unnecessary bends (created by a horizontal and two vertical wires). Additionally, this forces the router not to connect logic blocks placed on one layer by using interconnection fabric from different layers.
The proposed approach is based on Pathfinder negotiated congestion [4]. During the first iterations, a number of networks are allowed to share the same hardware resources; however, this is gradually prohibited, until the final routing, where each network uses dedicated routing fabric. The employed power-aware router is guaranteed to find the narrowest horizontal and vertical channel widths for which the application is fully routable.
The efficiency of a derived application routing is quantified with a cost function. The mathematical expression of this function is shown in Equation (4).
In this function, the delay term corresponds to the delay of each hardware resource, while a weighting factor defines the importance of power control during the routing procedure. By appropriately tuning these parameters, the resulting routing can be optimized to improve the application's delay, the distribution of power sources across the device layers, or any combination (trade-off) between them. The remaining parameters represent the base cost, the historical congestion cost and the present congestion cost of each hardware element, respectively. In order to arrive at acceptable solutions (i.e. to prevent the overuse of routing resources), the congestion penalty increases with the execution time.
A further factor corresponds to the normalized capacitance of each resource, while an activity criticality term refers to the importance of controlling the switching activity of each network. The mathematical expression of this parameter is shown in Equation (5).
Here, an upper bound limits the value that the switching activity criticality of a network may take, while the ratio of a network's switching activity to the maximum one gives the normalized switching activity over all the application's interconnection networks. As the value of this upper bound approaches 1, more and more interconnection networks with high switching activity are taken into consideration during routing.
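How a timing term, a congestion term and a power term can be traded off inside a negotiated-congestion router is illustrated by the Python sketch below. The formulation follows a commonly used power-aware extension of PathFinder-style cost functions and may differ from Equations (4) and (5); the blending form, the default bound of 0.99, and the names `activity_criticality` and `routing_node_cost` are assumptions made for illustration.

```python
def activity_criticality(activity, max_activity, max_act_crit=0.99):
    """Normalized switching activity of a net, capped by an upper bound."""
    if max_activity <= 0:
        return 0.0
    return min(activity / max_activity, max_act_crit)


def routing_node_cost(delay, base, hist, present, capacitance,
                      timing_crit, act_crit):
    """Cost of using one routing resource for one connection.

    timing_crit in [0, 1]: 1 means the connection is on the critical path.
    act_crit in [0, 1]:    1 means the net has the highest switching activity.
    """
    congestion = base * hist * present
    # High-activity nets weigh capacitance; others weigh congestion.
    power_aware = act_crit * capacitance + (1.0 - act_crit) * congestion
    # Timing-critical connections weigh delay; others weigh the blend above.
    return timing_crit * delay + (1.0 - timing_crit) * power_aware
```

With `act_crit` set to zero the expression reduces to a purely timing-driven negotiated-congestion cost, which makes the power-aware behaviour easy to switch off when comparing against a conventional router.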
Compared to the alternative cost functions found in relevant references [8, 10, 13, 14], the proposed one has a similar timing parameter. So, whenever a connection is timing critical, the router makes an effort to reduce the congestion and optimize the delay. However, the feature for handling the spatial distribution of switched capacitance results in a more balanced profile of power consumption sources, and hence in a smaller gradient of the on-chip temperature over the 3D FPGA.
EXPERIMENTAL RESULTS
We implemented the proposed power-aware P&R algorithm in C++, as a new open-source tool for 3D FPGAs, named 3DPRO [23]. This section provides qualitative and quantitative comparisons between the proposed power-aware P&R algorithm and the alternative solutions found in the relevant literature. More specifically, we compare the efficiency of our 3DPRO tool against PR3D [14] and TPR [13], which are the only available tools for P&R on 3D FPGAs. The results are summarized in Table I.
Given a 3D topology, the proposed power-aware approach, and hence the 3DPRO CAD tool, supports architecture-level exploration with a plethora of parameters such as delay, energy/power consumption, leakage power and silicon area. Furthermore, it supports the evaluation of 3D architectures in terms of fabricated TSVs, in contrast to the TPR tool, which employs only a scenario where all the SBs form connections to the adjacent layers. A reduced number of TSVs was shown in [8] to result in smaller fabrication and yield costs, without any performance degradation.
Regarding the PR3D tool, it is not publicly available and hence it is not possible to evaluate it against the rest of the implementations (the corresponding features of this tool shown in Table I were retrieved from the literature).
The efficiency of our proposed power-aware P&R was also demonstrated with quantitative comparisons against conventional (i.e. timing-aware) mappings, considering the MCNC benchmarks [15]. In order to show the complexity of the employed benchmarks, Table II gives a detailed description of their required number of 4-input LUTs, the minimum dimensions of the array for the 2D and the 3D FPGA, as well as the percentage of utilized logic resources.
From Figure 7, which shows the variation of switched capacitance for the conventional approach, we conclude that the switched capacitance varies a lot, even for hardware resources assigned to adjacent spatial locations on the same layer. Also, it is possible to locate regions on the layers with excessively high values of switched capacitance. In order to understand the thermal characteristics and prevent circuit failure, it is important to detect such hotspot regions. By specifying their spatial distribution, the designer can concentrate his/her efforts on controlling the switched capacitance in certain regions only, rather than on the whole device, reducing among others the design/fabrication cost [19].
The proposed power-aware P&R algorithm can help to provide a solution to this problem, as it is aware of the distribution of switched capacitance across the 3D FPGA. Figure 8 plots the corresponding variation of switched capacitance for the same application and 3D device, for the proposed algorithm. In contrast to the conventional approach (shown in Figure 7), the proposed one exhibits a more balanced variation of switched capacitance, and hence of power consumption and on-chip temperature. Additionally, the maximal values of switched capacitance are lower, leading to cheaper and more reliable devices. One more conclusion might be derived from these two graphs. More specifically, the proposed approach distributes the switched capacitance more efficiently for the layers placed in the middle of the 3D stack. This feature is critical for the thermal efficiency of the target 3D architecture, as it is more difficult to dissipate heat from these layers.
For the sake of completeness, we also provide results for two flavors of the proposed power-aware P&R algorithm. The first (non-aggressive) reflects an approach in which the application's delay is considered as important as the distribution of power sources, whereas in the second (aggressive) the employed cost functions are tuned to achieve an even more uniform power profile (i.e. fewer hotspot regions) across the 3D FPGA. Figure 9 gives the average distribution of power sources (over the MCNC benchmarks) for the two alternative setups of the proposed P&R algorithm against a conventional (i.e. timing-aware) approach. As can be seen, the proposed solution reduces the percentage of area that operates under high power values (i.e. belonging to hotspot regions), while spreading these power sources more uniformly over the rest of the device. This is especially critical for designing reliable and cheaper devices: transferring power consumption from hotspot regions to the rest of the architecture increases device reliability and removes the need for expensive packaging solutions, which reduces fabrication cost.
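The difference between the two flavors can be sketched as a single weighted cost in which one parameter trades delay against power uniformity. This is only an illustration of the idea, not the cost functions actually used by the proposed algorithm: the weighted-sum form, the use of the coefficient of variation as an imbalance measure, the weight values, and all names and sample numbers below are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def power_imbalance(switched_cap: Sequence[float]) -> float:
    """Coefficient of variation of per-tile switched capacitance (0 = uniform)."""
    avg = mean(switched_cap)
    return pstdev(switched_cap) / avg if avg > 0 else 0.0


def placement_cost(norm_delay: float, switched_cap: Sequence[float],
                   power_weight: float) -> float:
    """Weighted sum of normalized delay and power imbalance (hypothetical form)."""
    return (1.0 - power_weight) * norm_delay + power_weight * power_imbalance(switched_cap)


# Hypothetical weights: equal importance vs. emphasis on a uniform power profile.
NON_AGGRESSIVE = 0.5
AGGRESSIVE = 0.8


def pick(candidates: List[Dict], power_weight: float) -> str:
    """Name of the candidate mapping with the lowest cost for the given weight."""
    best = min(candidates,
               key=lambda c: placement_cost(c["delay"], c["cap"], power_weight))
    return best["name"]


# Made-up candidates: A is faster but slightly uneven, B is 5% slower but uniform.
candidates = [
    {"name": "A", "delay": 1.00, "cap": [2.9, 3.0, 3.0, 3.1]},
    {"name": "B", "delay": 1.05, "cap": [3.0, 3.0, 3.0, 3.0]},
]

print(pick(candidates, NON_AGGRESSIVE))
print(pick(candidates, AGGRESSIVE))
```

With these sample values the equal-weight setting keeps the faster mapping, while the power-heavy setting accepts the 5% delay penalty in exchange for a flat power profile, which is the qualitative behavior described for the two flavors.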
The following tables summarize the evaluation results of applying the proposed strategy to the 20 biggest MCNC benchmarks. The results were obtained using the 3DPRO tool, without and with the power-aware P&R feature. The target 3D FPGA device was described in Section 2, while the array dimensions for each benchmark are taken from Table II . Table III compares the total wirelength for alternative application implementations, for both 2D and 3D FPGAs. More specifically, it shows that the proposed power-aware approaches targeting 3D FPGAs result in increased wirelength (between 6% and 22%) compared to timing-aware P&R. However, both of them are better, by about 19%, than the corresponding results reported in reference [8] for 2D architectures. As we will show later, this increase in wirelength does not outweigh the advantages of realizing applications with the proposed power-aware P&R algorithm.
Table IV gives the delay of each of the 20 biggest MCNC benchmarks under alternative P&R algorithms for 2D and 3D FPGAs. Based on the results, the proposed power-aware P&R algorithm slightly increases the application's delay, by 1% up to 5%, while the performance improvement compared to the solution targeting 2D FPGAs [18] is up to 27%. This almost negligible performance degradation (due to the extra constraints for forming connections) is acceptable, as it does not lead to significant variation of the application's functionality.
Finally, we study the percentage of device area that operates under high power. This part of the device area is referred to as the hotspot region, and its reduction is the main goal of this research.
Compared to conventional (i.e. timing-aware) P&R for the same 3D FPGA, the two flavors of the proposed power-aware P&R algorithm reduce this percentage by 30% up to 63%. Moreover, the reduction in area coverage of hotspot regions compared to existing approaches for 2D FPGAs [18] is about 58%.
The results presented in this section show that the proposed power-aware P&R reduces the percentage of silicon area that operates under high power/temperature (hotspot regions), which is the main goal of our research, without significant impact on other critical design parameters, even though total wirelength increases. This is due to the better application partitioning, partition-to-layer assignment, placement and routing.
The gains of employing the proposed power-aware P&R approach targeting 3D FPGAs can be summarized as follows: (i) it spreads the power sources across the 3D FPGA so that heat is easier to dissipate, (ii) it reduces the peak values of power sources, lowering the cost of cooling, (iii) it reduces the total energy consumption, increasing among others the battery life and the system's reliability, and (iv) it significantly reduces the percentage of silicon area that operates under high power consumption values (i.e. hotspot regions), and can therefore be regarded as a power/temperature management approach.
CONCLUSIONS
A novel power-aware P&R algorithm targeting 3D FPGAs, as well as its software implementation in the 3DPRO tool, was presented. This approach can be regarded as a power management strategy, because it redistributes the total power budget over identical hardware resources in a way that makes the produced heat easier to dissipate. More specifically, by appropriately controlling the spatial distribution of switched capacitance, the proposed P&R algorithm reduces the percentage of device area that operates under high power by about 63% on average. Moreover, this reduction is achieved together with energy savings of about 9% (on average), with an almost negligible penalty (ranging from 1% up to 5%) in the application's delay.
ACKNOWLEDGMENTS

