Abstract-Integrating more functionality in a smaller form factor with higher performance and lower power consumption is pushing semiconductor technology scaling to its limits. 3-D chip stacking is touted as the silver bullet technology that can keep Moore's momentum and fuel the next wave of consumer electronic products. Additionally, the growing complexity of digital designs makes computer-aided design algorithms harder and slower. This paper introduces a framework for application implementation onto 3-D reconfigurable architectures. In contrast to existing approaches, the proposed solution is customizable according to constraints posed by the application and the target 3-D device in order to improve performance metrics. Experimental results highlight the effectiveness of our framework, as we achieve average enhancements in terms of maximum operation frequency and power consumption of 35% and 47%, respectively, compared to state-of-the-art algorithms.
I. INTRODUCTION
FIELD-PROGRAMMABLE gate arrays (FPGAs) were first introduced almost two and a half decades ago. Since then, they have exhibited rapid growth and have become a popular implementation medium for digital systems. The programmable nature of their logic and routing interconnect makes them flexible and general purpose, but at the same time it makes them larger, slower, and more power consuming than standard-cell application-specific integrated circuits (ASICs).
For decades, semiconductor manufacturers have been shrinking transistor size in integrated circuits to achieve the yearly increases in performance described by Moore's law. This scaling trend held only because the RC delay was negligible compared to the signal propagation delay [1]. However, for submicrometer technology, the RC delay becomes a dominant factor. Previous studies showed that at the 130 nm technology node, approximately 51% of a microprocessor's power is consumed by the interconnect fabric [2]. This has generated many discussions concerning the end of device scaling as we know it, and has hastened the search for solutions beyond the perceived limits of current devices. 3-D integration is considered a technology that will extend Moore's law over the coming years. Stacking multiple dies along the vertical axis and interconnecting them using very fine-pitch through-silicon vias (TSVs) enables the creation of chips with very diverse functionalities, implemented in different process technologies, in a very small form factor. Moreover, the locality along the z-axis enables shorter routing paths, which in turn improves performance and power metrics compared to the corresponding 2-D system implementations [3].
The benefits of using 3-D integration in FPGAs are especially great, since these architectures suffer from data communication problems; the delay and power consumption of the routing network are the main bottlenecks compared to ASIC implementations [4], [5]. The reconfigurable industry has already shown interest in designing 3-D reconfigurable architectures. Typical examples are the 3-D FPGAs provided by Tezzaron [6], as well as the 2.5-D Xilinx Virtex-7 and UltraScale devices [7]. In addition, Xilinx recently announced the 16 nm 3-D UltraScale+ FPGA family, which exhibits a 2-5× performance-per-watt advantage over comparable systems designed with Xilinx's 28 nm devices [8].
Apart from the advancement in process technology, performance also depends on the employed computer-aided design (CAD) algorithms. Table I provides a qualitative comparison among recently proposed academic tool flows for application implementation onto 3-D FPGAs. A number of conclusions can be derived from this analysis. Among others, the majority of existing tools focus only on homogeneous 3-D platforms. Although FPGAs are commonly designed as homogeneous architectures, the integration of different process technologies (e.g., logic, memory) into the same chip [9] and the design of domain-specific 3-D platforms [10] achieve significantly higher performance metrics [1], [3], [4]. Moreover, none of the existing CAD algorithms are customizable according to requirements posed either by the application or by the 3-D device. Finally, it is worth mentioning that the available algorithms [11]-[18] rely mostly on straightforward extensions of existing 2-D tools, which cannot fully exploit the benefits of 3-D technology. On the contrary, physical design in the 3-D realm requires fresh ideas (i.e., new algorithms and cost functions). For instance, existing flows strive to minimize the connections between layers. However, upcoming sections highlight that such an objective is not appropriate for 3-D FPGAs, since these interlayer connections are fabricated in advance of application implementation. Hence, one might expect that a more efficient utilization of these TSVs will improve performance metrics.
0278-0070 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This paper introduces a framework for application implementation onto 3-D reconfigurable architectures. In contrast to the rest of the flows summarized in Table I, the algorithms of the introduced framework can be customized according to the application's and the device's characteristics. In particular, the contributions of this paper are as follows.
1) We prove that existing frameworks cannot fully benefit from the architectural features found in 3-D FPGAs in order to improve performance metrics.
2) We introduce a novel framework for physical design onto 3-D FPGAs. The CAD algorithms of this framework can be customized according to the application's and the platform's characteristics.
3) We introduce algorithms for addressing the netlist partitioning, placement, and routing problems targeting 3-D reconfigurable platforms.
4) We provide a software-supported systematic methodology for performing power/energy estimation regarding designs mapped onto 3-D FPGAs.
The proposed framework is evaluated with various benchmarks and 3-D FPGA devices. Specifically, experimental results show average reductions of total wire-length, critical path delay (i.e., maximum operation frequency), and power consumption by 9%, 35%, and 47%, respectively, compared to state-of-the-art algorithms. Additionally, we show that the proposed algorithmic customization enhances the efficiency of our framework without imposing any overhead in terms of execution run-time.
The rest of this paper is organized as follows. Section II presents the architectural template for the underlying 3-D FPGA, whereas Section III describes the motivation behind this paper. The proposed framework, as well as the employed CAD algorithms, are discussed in Section IV. Experimental results that quantify the efficiency of our solution against state-of-the-art relevant tools are provided in Section V. Finally, Section VI summarizes the conclusions of this paper.
II. ARCHITECTURAL TEMPLATE FOR 3-D FPGA
This section describes in detail the architectural template of the target 3-D FPGA, depicted schematically in Fig. 1. The 3-D architectures have up to five layers (l ≤ 5), each of which is modeled as a homogeneous array of logic blocks in a two-level hierarchy. More specifically, the first level of the hierarchy, referred to as the basic logic element (BLE), consists of a K-input lookup table (LUT) and a flip-flop, and can implement any K-input logic function. The second level of the hierarchy, referred to as the configurable logic block (CLB), is formed by a group of N BLEs. For our experimentation, we consider an architecture with K = 4 and N = 5. Previous studies have shown that such an architectural template provides a good compromise among the application's critical path delay, the power consumption, and the area overhead [19]. Although there are a total of K × N inputs and N outputs inside the CLB, our architecture, similar to commercial devices, uses clusters that are less than fully connected. More specifically, the number of inputs to the logic cluster (I) was set based on the cluster size (N) and the LUT size (K) using the formula I = (K/2) × (N + 1) = 12 in order to ensure that 98% of the BLEs can be utilized [20].
The interconnection infrastructure inside each layer, consisting of routing channels among CLBs, is also of high importance. A routing channel contains a number ($W_H$) of individual routing tracks, while each track is formed by segments that span a distance of L CLBs before being interrupted by a programmable switch found inside a switch box (SB). The routing network modeled within this paper relies on a multisegment interconnection scheme with segments L1, L2, L6, and longlines (which span the entire device), whereas, similar to Xilinx devices, the distribution of these segments per channel is 8%, 20%, 60%, and 12%, respectively [7]. As connections with longer segments pass through fewer switches, they lead to higher performance and lower power consumption. On the contrary, shorter segments provide higher flexibility for routing bends. Consequently, the employed multisegment interconnection scheme maximizes the flexibility to route a signal within a layer. The interlayer connectivity is realized with vertically aligned TSVs [21] found inside SBs. Fig. 2 gives an example of the connections that take place inside a 3-D SB. Additional details about the design of these components can be found in [10]. Since the fabrication of these vertical interconnects with high density and yield is a complicated task [3], [4], their number is significantly limited compared to the remaining routing tracks ($W_H$). For our experimentation, we assume 3-D FPGAs with $W_V = 3$ and $W_H = 50$. Table II summarizes the architectural properties of the target 3-D FPGA. The physical and electrical equivalent characteristics for the technology parameters are based on relevant publications [21]-[23]. Although our experiments make use of this fairly standard island-style template, the introduced framework is also applicable to any other architectural organization. The selection of a homogeneous device is made solely for comparison purposes against relevant tools.¹
¹ Additionally, as we will discuss later, the selected number of TSVs per 3-D SB ($W_V = 3$) is the minimum one for successful placement and routing (P&R) of the application with existing (reference) state-of-the-art CAD tools [12], [13]. Note that, for the sake of completeness, the value of $W_V$ is constant for all experiments.
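The architectural numbers quoted in this section can be cross-checked with a few lines of arithmetic. The following is a minimal sketch: the function names are hypothetical, and rounding each segment share to the nearest whole track is an assumption, since the text gives only percentages.

```python
# Architectural parameters quoted in Section II of the text.
K = 4      # inputs per LUT
N = 5      # BLEs per CLB
W_H = 50   # routing tracks per horizontal channel
W_V = 3    # TSVs per 3-D switch box

# Share of each segment type per channel, in percent (L1, L2, L6, longlines).
SEGMENT_PERCENT = {"L1": 8, "L2": 20, "L6": 60, "longline": 12}


def cluster_inputs(k: int, n: int) -> int:
    """Number of CLB inputs, I = (K / 2) * (N + 1)."""
    return k * (n + 1) // 2


def tracks_per_segment(w_h: int, percent: dict) -> dict:
    """Split w_h tracks by percentage; rounding to the nearest track is an assumption."""
    return {name: round(w_h * pct / 100) for name, pct in percent.items()}


I = cluster_inputs(K, N)
TRACKS = tracks_per_segment(W_H, SEGMENT_PERCENT)
print(I, TRACKS)
```

With K = 4 and N = 5 the formula gives I = 12, and a 50-track channel splits into 4, 10, 30, and 6 tracks for L1, L2, L6, and longlines, respectively.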
III. MOTIVATION AND CHALLENGES
The efficiency of application mapping onto a 3-D FPGA is tightly coupled to the employed CAD algorithms. This imposes the requirement for more powerful, native 3-D physical design tools that are built from the ground up and are capable of handling 3-D aware optimization objectives. In contrast, the majority of existing tools rely on straightforward extensions of algorithms targeting 2-D platforms [11]-[18]. In order to demonstrate that these tools do not fully benefit from the 3-D technology, Fig. 3 plots the percentage of utilized TSVs for devices consisting of l ∈ {2, 3, 4, 5} layers. For demonstration purposes, the vertical axis is normalized over the number of fabricated TSVs per device.² The target 3-D FPGA follows the architectural template discussed in Section II, while the application implementation is performed with state-of-the-art academic CAD algorithms for 3-D FPGAs, namely the hMetis netlist partitioning [18] and the TPR P&R [13].
The results of Fig. 3 indicate that on average only 5%, 9%, 12%, and 15% of the fabricated TSVs are utilized for devices consisting of 2-5 layers, respectively. Specifically, as we will discuss later, only a small subset (on average 5%) of the 3-D SBs fully utilize the fabricated TSVs. Furthermore, apart from a few benchmarks that exhibit spikes in the utilization ratio (ranging up to 25%-30%), the majority of benchmarks utilize even fewer TSVs than the corresponding average values. One more conclusion can be derived from this analysis: application implementation onto devices consisting of additional layers does not impose a higher utilization ratio for the TSVs. Note that the overall picture of Fig. 3 is not affected by the employed benchmark suite.
These outcomes occur mainly because the CAD tools rely on algorithms initially developed for 2-D platforms (e.g., multichip architectures). Since the interchip connectivity of these platforms exhibits considerably higher electrical equivalent characteristics (RLC) compared to a TSV connection,³ the physical implementation algorithms focus mostly on min-cut partitions (i.e., minimizing the number of connections among partitions) with respect to the balance of partition sizes (number of CLBs per partition). However, in the 3-D domain, this is not the case, because the spatial locality among the application's functionalities assigned to different layers reduces the overall routing wire-length. Thus, notable performance enhancement is feasible provided that the TSVs are appropriately employed. Additionally, utilizing the interlayer connections more efficiently is of utmost importance in the FPGA domain, since the TSVs are fabricated in advance of application implementation; hence, no additional cost is imposed if the design utilizes these TSVs.
IV. PROPOSED FRAMEWORK
This section introduces the proposed framework for application implementation onto 3-D FPGAs. This framework, illustrated in Fig. 4, consists of three phases, namely training, customization, and physical implementation. In contrast to relevant approaches, the employed CAD algorithms are customizable according to the application's and the platform's characteristics with an artificial neural network (ANN).
The efficiency of an ANN highly depends on the weights of its neurons. These weights are computed with a special procedure, referred to as training. During this phase, a representative number of benchmarks are profiled to extract various characteristics both of application (i.e., number of primary inputs/outputs, max/average fanout, and number of blocks per net) and platform (i.e., number of layers, availability of resources, and TSVs). Based on this data, the ANN is trained to provide the optimum customization of CAD algorithm's parameters per application and target 3-D FPGA. Table III summarizes these algorithmic parameters for customization, as well as their range. Having a trained ANN, we proceed to the physical implementation phase, which consists of two consecutive steps. Initially, the CAD algorithms are customized according to the ANN's output and then the optimized algorithms perform netlist partitioning, placement, and routing onto the target 3-D FPGA.
The introduced framework also has two refinement loops (marked with red dotted lines) that further improve the quality of algorithmic customizations. Specifically, the output of netlist partitioning affects the actual utilization of TSVs during the application's routing. As we will discuss in Section IV-B, this parameter can be estimated. Similarly, the outcome of the placement algorithm imposes additional constraints on netlist routing, i.e., based on the spatial locality among CLBs. Thus, the refinement loops "(1)" and "(2)" provide additional flexibility to the ANN in order to derive either a more or a less aggressive customization of CAD algorithms, depending on the availability of nonutilized routing resources. Note that each tool from the physical implementation phase is executed only once.
The rest of this section describes the employed ANN, as well as the proposed algorithms for addressing the netlist partitioning, placement, and routing problems.
A. Customization of CAD Algorithms
The algorithmic customization discussed in Fig. 4 is performed with an ANN, which is an interconnected group of artificial neurons that uses a mathematical, or computational model, to realize complex relationships between inputs and outputs. The rest of this section describes in detail the ANN's architecture, as well as its training procedure.
1) Architecture of Artificial Neural Network:
The architecture of our ANN consists of three layers namely input, hidden, and output. Typically, each layer receives input(s) from the previous layer and forwards its output(s) to the next layer (feed-forward structure). Designing an ANN therefore means coming up with the number of hidden layers, as well as defining the internal organization for each layer.
An ANN's layer is modeled using a weight matrix W, a bias vector b, and an output vector a. The modeling of multiple layers requires that, instead of a unique weight matrix W, we use an input weight (IW) and a layer weight (LW) matrix. In this notation, superscripts denote the source (second index) and the destination (first index) of network elements. The ANN's input is encoded with a vector p, whereas the net input n is computed as the sum of the bias b and the product W × p, i.e., n = W × p + b. Then, this sum is passed to the transfer function f in order to get the neuron's output a = f(n).
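The layer model described above can be illustrated with a small feed-forward pass. This is a minimal sketch rather than the authors' implementation: the function names and the list-of-rows matrix layout are assumptions, and the sigmoid (hidden) and linear (output) transfer functions follow the combination that the text reports later in this section.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic transfer function, assumed here for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x: float) -> float:
    """Identity transfer function, assumed here for the output layer."""
    return x


def layer_output(W, p, b, f):
    """Compute a = f(W x p + b) for one layer; W is a list of weight rows."""
    a = []
    for row, bias in zip(W, b):
        n = sum(w * x for w, x in zip(row, p)) + bias  # net input n
        a.append(f(n))
    return a


def forward(p, IW, b1, LW, b2):
    """Feed-forward pass: input -> hidden (IW, b1) -> output (LW, b2)."""
    hidden = layer_output(IW, p, b1, sigmoid)
    return layer_output(LW, hidden, b2, linear)
```

For example, with all-zero input weights the hidden neurons output sigmoid(0) = 0.5, so `forward([0, 0], [[0, 0], [0, 0]], [0, 0], [[1, 1]], [0.5])` evaluates the output neuron on two hidden activations of 0.5 plus a bias of 0.5.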
Fig. 5 plots the block diagram of the proposed ANN. This network relies on the multilayer perceptron algorithm,⁴ which is a widely studied and used ANN classifier. Among others, such an ANN can model complex functions, while it is robust to noise (good at ignoring irrelevant inputs) and flexible to adapt its weights in response to environment changes.
2) Determining Network Parameters: There are a number of parameters that have to be decided upon when designing an ANN. These parameters are as follows.
1) Number of Hidden Layers:
The hidden layer(s) is(are) necessary to capture nonlinear dependencies between the data's features and the variables that we are trying to predict. Additional layers of hidden neurons enable greater processing power and system flexibility (i.e., model more complex functions) but they come at the cost of higher complexity for training.
2) Number of Neurons Per Hidden Layer:
With too many hidden neurons, the system is overspecified and is incapable of generalization. On the other hand, too few hidden neurons prevent the system from properly fitting the input data, which reduces the system's robustness.
3) Transfer Function Per Layer:
The transfer function $f_l$ translates the input signals to output signals for layer l. Previous studies have shown that the combination of different transfer functions improves the ANN's efficiency [25], [26].
4) Momentum:
It adds a fraction of the previous weight update to the current one in order to prevent the system from converging to a local minimum or saddle point. Although a high momentum parameter increases the speed of the system's convergence, it also creates a risk of overshooting the minimum, causing the system to become unstable. On the other hand, a low momentum coefficient leads to an ANN that cannot reliably avoid local minima, while slowing down the training phase as well.
5) Learning:
What has attracted the most interest in ANNs is the possibility to learn. This task is performed during the training phase, which relies on a combination of learning paradigms, rules, and algorithms [25].
A widely accepted approach for optimizing the ANN's parameters is trial and error. However, this approach is inefficient and might miss the optimum network design. In order to overcome this limitation, we allow networks to be tested across the design parameter space with the procedure depicted in Fig. 6. For our analysis, the ANN's training is performed with a representative number of benchmarks from Altera's QUIP suite [27]. The parameter i in Fig. 6 denotes the algorithm for customization (i ∈ {partition, placement, routing}), j refers to a parameter of algorithm i listed in Table III, while k is a value that parameter j takes within its range. We assume that k takes 15 uniformly distributed values per parameter in the range given in the last column of Table III. Thus, in order to quantify benchmark z, this benchmark is partitioned, placed, and routed with the customized flavor of the CAD algorithms according to the k values.
⁴ Supervised predictor of an input into one of several possible nonbinary outputs.
Fig. 7 outlines the exploration results for alternative numbers of hidden layers and neurons. Each of these parameters spans from 1 to 10, while the evaluation of the ANN's architecture is performed according to the root mean square error (RMSE) metric defined by
$$\mathrm{RMSE} = \sqrt{\frac{1}{B} \sum_{z=1}^{B} \left( Y_z - \hat{Y}_z \right)^2} \tag{1}$$
where $Y_z$ is the optimal combination of the ANN's parameters as retrieved from the exhaustive search-space exploration discussed in Fig. 6, $\hat{Y}_z$ is the result predicted by the ANN for benchmark z, and B denotes the total number of benchmarks from the QUIP suite [27]. The term optimal refers to the combination of algorithmic parameters that minimizes the average energy×delay product (EDP) among the studied benchmarks. This analysis indicates that optimal results are retrieved for an ANN with one hidden layer, which contains
The transfer function per hidden/output layer is another design parameter that has to be specified. For this purpose, a number of well-established functions, namely hardlim, logsig, radbas, satlin, translg, sigmoid, and linear [25] , [26] , [28] were instantiated and their efficiency for our problem was quantified. Based on this experimentation, we found that minimum time for convergence is retrieved when the hidden and output layers employ the sigmoid and linear transfer functions, respectively.
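Two ingredients of the exploration described above lend themselves to a short illustration: the 15 uniformly distributed values swept per algorithmic parameter, and the RMSE score used to rank candidate network designs. The sketch below is an assumption-laden simplification: the function names are hypothetical, each target is treated as a single scalar rather than a combination of parameters, and the example range is invented because Table III is not reproduced here.

```python
import math


def uniform_grid(lo: float, hi: float, steps: int = 15):
    """Return `steps` uniformly spaced values covering [lo, hi]."""
    return [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]


def rmse(actual, predicted) -> float:
    """Root mean square error between optimal and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    B = len(actual)  # number of benchmarks
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(squared / B)


# Hypothetical parameter range; the real ranges come from Table III.
grid = uniform_grid(0.0, 7.0)
print(len(grid), grid[0], grid[-1])
```

A candidate network that reproduces every optimal value exactly scores an RMSE of zero, and larger values indicate a poorer fit over the benchmark set.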
After defining the ANN's architecture, we proceed to the training phase, where the weight per neuron is determined. These weights represent an abstraction of the mapping of input vectors to the output signal for those benchmarks to which the ANN was exposed during the training phase. The training of our ANN relies on a back-propagation with momentum algorithm (see Algorithm 1), since previous analysis has proven the efficiency of such an approach [25]. In more detail, the momentum feature improves the rate of convergence (it speeds up the learning process) compared to the conventional back-propagation algorithm, while further improvement is feasible by employing bias neurons with a constant input equal to 1. In order to perform efficient training, careful selection of representative benchmarks (according to the corresponding characteristics of real-life applications) is necessary. Of the selected benchmarks, 70% are employed for the ANN's training, whereas validation and testing are performed with the remaining 15% and 15%, respectively.
The initial network weights are initialized to random values. This is a widely accepted approach to avoid becoming trapped in local error minima [25], [26]. However, such a selection implies that two identical networks can produce very different calibration models once trained. Repeated evaluation of the ANN is therefore essential for getting a clear idea of the capability of each design. To this end, for each permutation of network parameters, we train five distinct networks in order to reduce the impact of the random initial starting conditions.
During the training phase, the weights are updated based on a modified delta rule, where an error signal is calculated for each node and it is backwardly propagated through the network, starting at the output layer and weighted back through the previous layers. Fig. 8 quantifies the ANN's efficiency, where different colors denote the errors for training, validation and testing. These results indicate that the error is small enough for the majority of studied benchmarks. Hence, one might claim that our ANN can address sufficiently the algorithm customization problem for almost any benchmark.
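The momentum mechanism used during training can be sketched as a single weight-update step. This is not a reproduction of Algorithm 1; it is the textbook momentum rule, in which the new update is the negative scaled gradient plus a fraction of the previous update. The function name and the learning-rate and momentum values are hypothetical.

```python
def momentum_step(weights, gradients, velocity, lr=0.1, momentum=0.9):
    """One update: delta_w(t) = momentum * delta_w(t-1) - lr * gradient."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, gradients)]
    new_weights = [w + dv for w, dv in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy error surface E(w) = w^2 with gradient 2w (illustrative only).
w, v = [1.0], [0.0]
for _ in range(2):
    grad = [2.0 * wi for wi in w]
    w, v = momentum_step(w, grad, v)
print(w, v)
```

The second step moves further than plain gradient descent would, because 90% of the first update is carried over; this is the convergence speed-up, and also the overshoot risk, discussed in the text.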
B. Partitioning and Layer Ordering
Netlist partitioning is an important problem associated with physical design onto 3-D architectures. Mathematically, netlist partitioning is an NP-hard problem, for which no polynomial-bounded algorithm for finding the global optimal solution is likely to exist. Given an undirected hypergraph G = (V, H), with V and H being the sets of vertices and hyperedges, respectively, a k-way balanced partitioning of G is defined as a function $p : V \to \{1, 2, \ldots, k\}$ that distributes the vertices of V among k disjoint subsets $S_1 \cup S_2 \cup \ldots \cup S_k = V$ of roughly equal size. The partitioning function induces a new hypergraph $G_p = G_p(S, H_c)$, where $S = \{S_1, S_2, \ldots, S_k\}$ and a hyperedge $(S_i, S_j) \in H_c$ exists if there are two adjacent vertices u, v ∈ V such that $u \in S_i$ and $v \in S_j$. The set $H_c$ corresponds to the set of cutting hyperedges of G induced by the partition. Regarding our implementation, the term vertex refers to application functionality technology-mapped onto a CLB, while a hyperedge corresponds to a network path. Throughout this paper we assume that benchmarks are represented as hypergraphs; hence, the terms benchmark, application, netlist, and hypergraph are used interchangeably.
Algorithm 2 (excerpt):
    CandidateList ← empty;
    for (P_candidate ∈ P_best neighborhood) do
        if (ContainsAnyFeatures(P_candidate, TabuList)) then
            CandidateList ← P_candidate;
The majority of existing CAD algorithms aim to minimize the connections among partitions, subject to design constraints such as the maximum partition size and the maximum path delay [18]. However, in Section III we showed that these algorithms cannot benefit from the availability of fabricated TSVs in order to derive shorter routing paths. Therefore, this section introduces our iterated tabu search algorithm (depicted in Algorithm 2) for addressing the netlist partitioning and layer ordering problems. The proposed solution relies on an existing algorithm [16], which was extensively revised for additional flexibility. More specifically, apart from the pure 3-D aware objectives, a number of algorithmic parameters (summarized in Table III) are customized with the proposed ANN.
The introduced algorithm is a global optimization strategy based on hill climbing, which is characterized by aggressive local search during each iteration. The tabu search maintains a short-term memory (TabuList) to prevent the algorithm from returning to recently visited areas of the search space, referred to as cycling. For this purpose, the short-term memory is managed by a mechanism that involves historical information about the last moves. Note that the short-term memory alone does not ensure that the search will be effective. Thus, two complementary memory mechanisms are employed: 1) the intermediate-term memory (ITM), which focuses the search on promising areas of the search space (intensification), called aspiration, and 2) the long-term memory (LTM), which encourages useful exploration of the broader search space, called diversification. The sizes of these memories are customized according to the ANN's output.
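The role of the short-term memory can be illustrated with a compact tabu search over single-vertex relocations. The sketch below is a generic textbook variant under stated assumptions, not the authors' Algorithm 2: it minimizes the plain hyperedge-cut instead of the paper's cost function, treats recently moved vertices as the tabu features, enforces balance through a hypothetical size cap, and omits the ITM and LTM mechanisms. All names and default values are hypothetical.

```python
import math
import random
from collections import Counter


def hyperedge_cut(nets, part):
    """Count nets whose vertices are spread over more than one partition."""
    return sum(1 for net in nets if len({part[v] for v in net}) > 1)


def tabu_partition(vertices, nets, k, iters=100, tabu_len=3, top=3, slack=1, seed=0):
    """Relocation-based tabu search; returns (best partition, best cut)."""
    rng = random.Random(seed)
    max_size = math.ceil(len(vertices) / k) + slack    # balance cap (assumed)
    part = {v: i % k for i, v in enumerate(vertices)}  # round-robin start
    best, best_cost = dict(part), hyperedge_cut(nets, part)
    tabu = []                                          # short-term memory
    for _ in range(iters):
        sizes = Counter(part.values())
        moves = []
        for v in vertices:
            if v in tabu:
                continue                               # recently moved: forbidden
            for target in range(k):
                if target == part[v] or sizes[target] >= max_size:
                    continue
                trial = dict(part)
                trial[v] = target
                moves.append((hyperedge_cut(nets, trial), v, target))
        if not moves:
            break
        moves.sort()
        cost, v, target = rng.choice(moves[:top])      # random pick among the best
        part[v] = target
        tabu.append(v)
        if len(tabu) > tabu_len:
            tabu.pop(0)                                # forget the oldest move
        if cost < best_cost:
            best, best_cost = dict(part), cost
    return best, best_cost
```

Because the move is drawn from the few best candidates even when none of them improves the cut, the search can climb out of local minima, while the tabu list keeps it from immediately undoing a relocation.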
Apart from the ITM and LTM customizations, the proposed implementation exhibits a number of differences compared to similar tabu-based heuristic engines. Specifically, instead of random vertex selection, we favor moves that improve to some degree the quality of solutions, as defined by the cost function (2). Note that we also use some form of randomization, since during each iteration a move is selected randomly from a list of the most attractive candidates. Additionally, given a partition $S_i$, our algorithm defines and evaluates the relocation neighborhood $RN(S_i)$ as the set of all solutions (i.e., partitions) that can be obtained from $S_i$ by relocating a single vertex. The evaluation of derived partitions is based on the cost function given by
where:
1) $W_{balance}$ balances the total weight of vertices among partitions. Let |u| denote the weight of a vertex u. Then, the weight $w(S_i)$ of a subset $S_i$ is equal to the sum of the weights of the vertices in $S_i$, i.e., $w(S_i) = \sum_{u \in S_i} |u|$. Since throughout this paper we target homogeneous 3-D FPGAs, the weight of each vertex equals 1. Thus, the $W_{balance}$ factor leads to solutions with a similar number of vertices per partition;
2) $W_{hcut}$ minimizes the hyperedge-cut according to the aggressiveness factor g, as it is derived from the ANN. The hyperedge-cut is a first-order metric for the number of utilized TSVs after successful P&R of the application. Smaller values of the g factor lead to partitions that utilize fewer TSVs. On the other hand, higher values are expected to improve performance metrics due to spatial locality, but they impose the utilization of additional TSVs;
3) $W_{order}$ minimizes the distance (in terms of the 3-D FPGA's layers) between the source node u and all of the network's sink nodes. In contrast to existing platform-agnostic partitioning algorithms, this objective minimizes the wastage of TSVs by addressing the layer ordering subproblem.
The values of parameters c and ξ are equal to 0.6, as they are defined based on the detailed search-space exploration depicted in Fig. 9. Note that such a combination of weighting factors gives higher importance to minimizing the hyperedge-cut according to the aggressiveness factor g.
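The raw quantities behind the three objectives can be computed directly from a partition. The sketch below only illustrates those quantities; it does not reproduce the cost function (2), whose exact form and weighting are not shown here. The net representation (a source plus a list of sinks), the use of the partition index as the layer index, and all function names are assumptions.

```python
def partition_weights(part, k, weight=None):
    """w(S_i) = sum of |u| over vertices in S_i; |u| = 1 when weight is None."""
    totals = [0] * k
    for u, i in part.items():
        totals[i] += 1 if weight is None else weight[u]
    return totals


def imbalance(part, k):
    """Difference between the heaviest and the lightest partition."""
    w = partition_weights(part, k)
    return max(w) - min(w)


def cut_nets(nets, part):
    """Number of nets with at least one sink outside the source's partition."""
    return sum(1 for src, sinks in nets if any(part[s] != part[src] for s in sinks))


def layer_distance(nets, part):
    """Sum over all nets of |layer(source) - layer(sink)| for every sink."""
    return sum(abs(part[src] - part[s]) for src, sinks in nets for s in sinks)


# Hypothetical four-vertex netlist spread over three layers.
part = {"a": 0, "b": 0, "c": 1, "d": 2}
nets = [("a", ["b", "c"]), ("c", ["d"]), ("a", ["d"])]
print(partition_weights(part, 3), cut_nets(nets, part), layer_distance(nets, part))
```

In this toy case the net from "a" to "d" crosses two layers, so it contributes twice as much to the layer distance as a net between adjacent layers, which is the effect the layer ordering objective is meant to penalize.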
The outcome of netlist partitioning highly affects the number of utilized TSVs. Since the actual utilization of these resources is known only after successful routing of the application, an estimation would be a valuable instrument for system designers. Such an estimation becomes far more important if we take into account that P&R is a time-consuming task. Thus, a notable reduction in design cost is feasible by pruning the design space solutions that are known to violate the system's constraints (e.g., exceed the number of fabricated TSVs). Such an estimation is also taken into consideration during the refinement loops discussed in Fig. 4.
The results of this analysis for the employed benchmark suite [29] are visualized in Fig. 10, where the left vertical axes give the ratio [according to (3)] between the hyperedge-cut and the number of utilized TSVs, as they are retrieved after netlist partitioning and physical implementation, respectively. In other words, such a ratio corresponds to the overestimation of the actual demand for TSVs based on the hyperedge-cut metric. For demonstration purposes, we also plot (with dots) the average ratios per layer. Based on this analysis, we might conclude that the ratio decreases monotonically as we proceed to devices with additional layers, since these layers provide higher flexibility to address the layer ordering subproblem ($W_{order}$ objective).
Apart from the average values, the variation of this ratio (i.e., the difference between its maximum and minimum value) is also crucial for accurately estimating the number of TSVs. In order to evaluate this, the right axes in Fig. 10 give the average variation of ratios among the studied benchmarks according to (4). Although the TPR tool [Fig. 10(a)] leads to smaller average ratios in comparison with the proposed framework [Fig. 10(b)], our solution reduces the corresponding variations by 50% on average. Thus, our framework provides higher fidelity to the designer about the actual demand for TSVs early in the design phase
$$\mathrm{ratio} = \frac{\text{hyperedge-cut after netlist partitioning}}{\text{utilized TSVs after netlist routing}} \tag{3}$$
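The ratio in (3) and the max-minus-min variation described in the text are simple to compute once both counts are available. The following minimal sketch uses hypothetical function names and invented sample counts; it does not reproduce the averaging in (4), which is not shown here.

```python
def tsv_ratio(hyperedge_cut: int, utilized_tsvs: int) -> float:
    """Ratio (3): hyperedge-cut after partitioning over TSVs used after routing."""
    if utilized_tsvs <= 0:
        raise ValueError("utilized_tsvs must be positive")
    return hyperedge_cut / utilized_tsvs


def variation(ratios) -> float:
    """Spread of the ratio: maximum value minus minimum value."""
    return max(ratios) - min(ratios)


# Invented (hyperedge-cut, utilized TSVs) pairs for three benchmarks.
samples = [(30, 20), (40, 20), (25, 20)]
ratios = [tsv_ratio(cut, used) for cut, used in samples]
print(ratios, variation(ratios))
```

A narrower spread means that the hyperedge-cut known after partitioning predicts the eventual TSV demand more consistently, even if the ratio itself is not close to one.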
C. Placement
Placement is one of the most influential steps in the FPGA CAD flow, as it is directly responsible for determining the relative locations of CLBs in order to minimize wire-length (an extension of traditional 2-D half perimeter bounding box by adding the offset of the z coordinate) and critical path delay subject to the minimum possible penalty in routing congestion.
This section presents a high-quality placement algorithm targeting 3-D FPGAs. The proposed placer, depicted in Algorithm 3, is based on simulated annealing, a heuristic for minimizing an objective cost function that takes real values over a set of states. Given a perfect cooling schedule, simulated annealing will ultimately converge to an optimal solution; however, the design space exploration must satisfy some conditions for optimality to be assured [30].
Algorithm 3 (excerpt):
        while (counter < max_iter) do
            new_place ← perturbation(place);
            delta ← cost(new_place) − cost(place);
            r ← random(0, 1);
            if (r < e^(−Δcost/T)) then
                place ← new_place;
            end if
            counter++;
        end while
        counter ← 0;
        temp ← schedule(temp);
    end while
    return place;
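The annealing loop can be turned into a small runnable sketch. Only the inner loop, the acceptance test, and the temperature update follow the listing of Algorithm 3; the outer stopping rule (`temp > min_temp`), the explicit shortcut for improving moves, and the toy one-dimensional placement problem with its nets, swap move, and geometric cooling factor are all assumptions made for illustration.

```python
import math
import random


def anneal(place, cost, perturbation, schedule, temp, min_temp, max_iter, rng):
    """Simulated annealing; the stopping rule temp > min_temp is an assumption."""
    while temp > min_temp:
        for _ in range(max_iter):
            new_place = perturbation(place, rng)
            delta = cost(new_place) - cost(place)
            # Accept improvements outright; accept degradations with
            # probability e^(-delta/temp), as in the acceptance test above.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                place = new_place
        temp = schedule(temp)
    return place


# Toy problem (hypothetical): order five blocks on a line to shorten two-pin nets.
NETS = [(0, 4), (1, 3), (2, 4), (0, 2)]


def wirelength(order):
    pos = {block: slot for slot, block in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in NETS)


def swap_two(order, rng):
    i, j = rng.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order


result = anneal([0, 1, 2, 3, 4], wirelength, swap_two,
                schedule=lambda t: 0.9 * t, temp=10.0, min_temp=0.01,
                max_iter=50, rng=random.Random(0))
print(result, wirelength(result))
```

Testing `delta <= 0` before calling `math.exp` avoids a floating-point overflow for strongly improving moves at low temperature; it does not change the acceptance behavior, since the exponential exceeds 1 in that case.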
The evaluation of candidate placements is performed based on the cost function described by (5). This function aims at minimizing the application's timing (T_cost), wire-length (W_cost), and power consumption (P_cost) metrics, while the weights α and β define the relative importance of these factors
where delay(u, v) denotes the estimated delay between nodes u and v (a source-sink path of a network), crit(i) gives the importance of network i in terms of how close it is to the critical path, and const is a constant value. The bb_x(i), bb_y(i), and bb_z(i) correspond to the dimensions of the bounding cube for network i, while the parameters C_av,x(i), C_av,y(i), and C_av,z(i) give the average width of routing tracks across the x, y, and z directions, respectively, over the bounding cube of net i. The q(i) parameter scales the wiring model discussed previously to take into consideration nets with more than three terminals [31]. The value of this parameter is application-specific and it is retrieved from the ANN. Specifically, the q(i) parameter depends on the number of terminals of net i; q is 1 for nets with 3 or fewer terminals, and slowly increases to 2.65 for nets with 50 terminals [31]. Finally, act(i) is the switching activity for network i, γ is a constant that denotes the inverse channel width utilization ratio, λ is the average number of inputs used in a logic block, r is the Rent's exponent, and n_c is the number of CLBs required for a given circuit [32], [33]. The switching activity per network is computed with an extended version of the ACE 2.0 tool [34], [35], which is well established in academia. Exhaustive exploration with various benchmarks and 3-D FPGAs indicates that the optimum EDP is obtained for α = 0.3 and β = 0.7 (see Fig. 11).
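To make the bounding-cube parameters more tangible, the following sketch estimates the wire-length contribution of a single net. The weighted sum q · (bb_x/C_av,x + bb_y/C_av,y + bb_z/C_av,z) is an assumption modeled on the usual 2-D bounding-box cost extended with a z term; it is not claimed to be the exact wire-length equation of this paper, and the function names are hypothetical. The q value is passed in by the caller rather than derived here.

```python
def bounding_cube(terminals):
    """Extent of a net along x, y, and z, given (x, y, z) terminal coordinates."""
    xs, ys, zs = zip(*terminals)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def net_wire_cost(terminals, q=1.0, c_av=(1.0, 1.0, 1.0)):
    """Assumed wire-length term for one net.

    q    -- correction factor for multi-terminal nets (1 for <= 3 terminals)
    c_av -- average track width along x, y, z over the net's bounding cube
    """
    bb = bounding_cube(terminals)
    return q * sum(extent / width for extent, width in zip(bb, c_av))


# Hypothetical net with a source on layer 0 and two sinks, one on layer 1.
net = [(0, 0, 0), (3, 4, 1), (1, 2, 0)]
print(net_wire_cost(net))                         # equal channel widths
print(net_wire_cost(net, c_av=(1.0, 1.0, 0.5)))   # scarcer vertical tracks
```

Dividing each extent by the average channel width makes a span through a narrow (congested) direction more expensive, which is why a small C_av,z penalizes layer crossings when TSVs are scarce.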
The quality of placements is tightly coupled to the employed cooling scheme, which defines the maximum allowed distance for CLB swapping [20]. To study the impact of this parameter more thoroughly, we evaluate the efficiency of six alternative cooling schemes (depicted in Fig. 12). The vertical axes (T_k) in this figure denote the temperature at iteration k, where 0 ≤ k ≤ A and A is the total number of moves during the annealing process. For the sake of completeness, the initial (T_A) and final (T_0) temperatures are identical among the studied cooling schemes.
These schemes were implemented as part of our placer and their efficiency was quantified based on the size of the bounding cube. Fig. 13 plots the average values for this metric among the studied MCNC benchmarks [29]. For demonstration purposes, the vertical axis is normalized over the maximum value among the benchmark suite. Based on this analysis, we found that the cooling scheme described by (9) [Fig. 12(e)] minimizes the size of the bounding cube, and therefore it is selected for the rest of our experimentation
As FPGAs continue to grow in size, the long run-times incurred by simulated annealing are becoming prohibitive. Accordingly, unlike the conventional simulated annealing engine, which usually spends time revisiting previously explored states, the proposed algorithm is also forced to explore states neighboring already known good solutions. These directed moves are likely to reduce the time required for the annealing process to find the final (lowest cost) placement. Specifically, the radius (Manhattan distance) for the neighbor locations, denoted ϕ, ranges between 5% and 25% of the maximum allowed distance for CLB swapping per temperature T_k [20].
Typically, a given directed move will not produce the same change in the cost function at different temperatures. Ideally, directed moves are employed when they are most likely to improve the cost function. To accomplish this goal, our framework incorporates a parameter (σ) that defines the percentage of directed moves relative to the total swaps per temperature. These moves have to be applied very carefully; otherwise the risks of oscillation and of converging to a local minimum increase. Based on our exploration, we found that as the temperature decreases, the directed moves become more effective than the random moves, because at lower temperatures almost all the logic blocks are placed in their optimal median ranges and timing paths are relatively straight. Consequently, the random perturbations become less effective than the simpler directed moves. Considering that both the ϕ and σ parameters are application-dependent, their values are retrieved from the employed ANN.
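One way to read the roles of ϕ and σ is sketched below. This is an interpretation under stated assumptions, not the paper's move generator: with probability σ the target location is drawn within a Manhattan radius of ϕ times the swap-distance limit around a known good location (called `anchor` here), and otherwise it is drawn within the full limit around the block's current location. All names, the 2-D grid, and the clamping rule are hypothetical.

```python
import random


def manhattan_offset(radius, rng):
    """Draw an (dx, dy) offset with |dx| + |dy| <= radius."""
    dx = rng.randint(-radius, radius)
    dy_max = radius - abs(dx)
    dy = rng.randint(-dy_max, dy_max)
    return dx, dy


def propose_target(current, anchor, r_limit, phi, sigma, grid_size, rng):
    """Pick a target location for a CLB move on one layer.

    current   -- (x, y) of the block being moved
    anchor    -- (x, y) of a known good location (assumed to be tracked elsewhere)
    r_limit   -- maximum allowed swap distance at the current temperature
    phi       -- directed-move radius as a fraction of r_limit (0.05 .. 0.25)
    sigma     -- fraction of moves that are directed (0 .. 1)
    """
    if rng.random() < sigma:
        centre, radius = anchor, max(1, round(phi * r_limit))
    else:
        centre, radius = current, r_limit
    dx, dy = manhattan_offset(radius, rng)
    # Clamp to the device boundary; clamping never increases the distance.
    x = min(max(centre[0] + dx, 0), grid_size - 1)
    y = min(max(centre[1] + dy, 0), grid_size - 1)
    return (x, y)


rng = random.Random(1)
print(propose_target((2, 2), (10, 10), r_limit=20, phi=0.25, sigma=0.5,
                     grid_size=32, rng=rng))
```

Raising σ as the temperature drops would shift effort from random to directed moves, matching the observation that directed moves are more effective late in the anneal.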
Finally, in our previous work we have shown that reducing the number of swaps evaluated per temperature T_k improves the placer's execution run-time with a controllable overhead in the quality of the derived solutions [36]. Specifically, existing placers based on simulated annealing, both for 2-D [20] and 3-D FPGAs [10], [12]-[15], [17], perform Q = 10 × G^1.33 swaps per temperature, where G is the number of utilized CLBs. In this paper, we have extended the concept initially proposed in [36] in two orthogonal directions: 1) to be applicable to 3-D FPGAs and 2) to customize the number of swaps per temperature (Q) according to the ANN's output. For this purpose, two flavors of our framework are evaluated: the first, referred to as "normal," performs Q = 10 × G^1.33 swaps per temperature, similar to existing placers. In the second flavor, referred to as "fast," the number of swaps is application-dependent and is derived from the ANN. As depicted in Table III, in this case the number of evaluated swaps per temperature ranges between 10 × G^0.05 and 10 × G^1.33.
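The effect of the exponent on the annealing effort can be seen with a small calculation. The sketch below evaluates Q = 10 × G^e for the two ends of the range quoted above; the function name and the rounding to an integer are assumptions made for illustration.

```python
def swaps_per_temperature(num_clbs, exponent=1.33):
    """Q = 10 * G**exponent, with the exponent limited to the quoted range."""
    if not 0.05 <= exponent <= 1.33:
        raise ValueError("exponent outside the range 0.05 .. 1.33")
    return int(round(10 * num_clbs ** exponent))


G = 5910  # average number of CLBs per benchmark reported in the evaluation
print("normal flavor:", swaps_per_temperature(G))
print("fast flavor, lower bound:", swaps_per_temperature(G, 0.05))
```

Because G enters as a power, even a modest reduction of the exponent removes a large share of the swaps for circuits with thousands of CLBs, which is where the run-time savings of the fast flavor come from.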
D. Routing
The routing problem in the FPGA domain consists of determining which programmable switches have to be turned on to connect all the CLB input and output pins required for the application's functionality. This problem becomes even more challenging for 3-D devices, in the sense that interlayer connectivity is provided only by a limited (as compared to planar routing wires) number of TSVs. Algorithm 4 gives the pseudocode for the proposed router, which is based on a modified pathfinder negotiated congestion/delay algorithm [37]. Flavors of the pathfinder router are also used both in commercial (e.g., Xilinx and Altera) and in academic tool flows [11]-[14], [20].

Algorithm 4
    ...
        for (each net i) do
            rip-up routing tree (net i);
            for (each sink v of net i in decreasing crit(i, v) order) do
                breadth first routing (sink v);
                for (all nodes in the path from u to v) do
                    update(congestion cost ε);
                end for
            end for
        end for
        update (historic congestion);
        compute (routing tree delay);
        update (timing graph);
        update (net crit(i, v));
    end while
    compute (timing, power, wiring);
    return (route);
Initially, all signals in a placed design are routed in the best manner possible (e.g., minimum delay/wire-length), permitting shorts between the signals (i.e., two or more signals may use the same wire). Then, the penalties associated with the shorts are gradually increased, and the signals are rerouted along the lowest cost path available, in order to avoid shorts where possible. The process of increasing the penalties for shorts and rerouting signals continues iteratively until all shorts are removed and the routing is feasible.
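The negotiation process described above can be sketched on a toy routing graph. This is a minimal illustration of the classic negotiated-congestion idea (node cost = (base + history) × present-congestion penalty), not the modified cost metric of this paper; the graph, the two-pin nets, the unit capacity, and all parameter values are hypothetical.

```python
import heapq

CAPACITY = 1  # each routing node can legally carry one net


def shortest_path(graph, cost, src, dst):
    """Dijkstra over node costs; assumes dst is reachable from src."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in graph[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def pathfinder(graph, nets, max_iters=50, pres_fac=0.5, pres_mult=2.0, hist_fac=1.0):
    occ = {n: 0 for n in graph}       # nets currently using each node
    hist = {n: 0.0 for n in graph}    # accumulated historical congestion
    routes = {}
    for _ in range(max_iters):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):   # rip up the previous route
                occ[n] -= 1

            def cost(n):
                overuse = max(0, occ[n] + 1 - CAPACITY)
                return (1.0 + hist[n]) * (1.0 + pres_fac * overuse)

            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                occ[n] += 1
        overused = [n for n in graph if occ[n] > CAPACITY]
        if not overused:
            return routes                    # no shorts remain
        for n in overused:
            hist[n] += hist_fac * (occ[n] - CAPACITY)
        pres_fac *= pres_mult                # sharing gets more expensive
    raise RuntimeError("no legal routing found")


# Both nets prefer the shared node "m"; each also has a longer private detour.
graph = {
    "s1": ["m", "a"], "s2": ["m", "b"], "m": ["t1", "t2"],
    "a": ["a2"], "a2": ["t1"], "b": ["b2"], "b2": ["t2"],
    "t1": [], "t2": [],
}
routes = pathfinder(graph, {"n1": ("s1", "t1"), "n2": ("s2", "t2")})
print(routes)
```

In the first iteration both nets share "m"; once its history and present-congestion costs grow, the nets stop competing for it and the routing becomes legal.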
Equation (10) gives the cost metric for quantifying a routing track h that forms a connection from source u to sink v. The value of parameter d equals 0.4, as it minimizes the EDP for the exploration depicted in Fig. 14
We define the criticality of a connection, crit(i, v), as where D_max is the delay of the circuit's critical path and slack(i, v) refers to the slack of the connection between the source and sink v of net i. The exponent ϑ, as well as MaxCrit, are parameters that control how a connection's slack impacts the congestion-delay tradeoff in the cost function [depicted in (10)]. The delay(h) refers to the Elmore delay for routing track h, while act(i) is the switching activity for net i. Moreover, cb(h), ch(h), and cp(h) correspond to the base cost, the historical congestion cost, and the present congestion cost of routing track h, respectively. Hence, the cost of using a track is a function of its current overuse and any overuse that occurred in prior routing iterations [see (11)]. The cost of an oversubscribed routing resource, cp(h), is gradually increased after the completion of each iteration to discourage resource sharing. This forces networks with alternative routes to avoid using the oversubscribed resource, leaving it to the net that needs it most. Finally, the parameter ε (derived from the ANN) scales cong(h) according to the availability of TSVs and the value of the hyperedge-cut.⁵ Higher values of ε favor applications with a small hyperedge-cut; hence additional TSVs might be utilized to improve performance metrics without leading to unroutable designs.
E. Evaluation and Power Analysis
The last step in our framework deals with the evaluation of the application's implementation onto the target 3-D FPGA. To this end, a number of models that are well established in academia are employed. Specifically, the maximum operation frequency and the total wire-length are computed based on the Elmore delay model [38] and a wiring model [20], respectively.
Furthermore, power dissipation is a major concern for 3-D FPGAs [5]. Understanding the variation of this design parameter within these devices is the first step in developing power-efficient architectures and CAD tools. Although the problem of power dissipation has already been noted both by industry [1], [3] and academia [4], up to now there are no tools to explore the impact of different design selections on power/energy consumption. To overcome this limitation, we have developed a systematic procedure, depicted in Fig. 15, to perform power estimation for designs mapped onto 3-D FPGAs. Given the routing file, the switching activity per net, the maximum operation frequency, and a number of technology parameters (e.g., power supply and transistor size) as inputs, it is feasible to calculate both the static and the dynamic power consumption of a design. The switching activity per net is computed with the ACE 2.0 tool [34], [35], which is well established in academia.
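As an illustration of how the listed inputs combine in an activity-based estimate, the sketch below uses the textbook CMOS expressions P_dyn = ½ · V_dd² · f · Σ act(i) · C(i) and P_static = V_dd · I_leak. These formulas, the per-net capacitance values, and the function names are assumptions chosen for illustration; the procedure of Fig. 15 may differ in its details.

```python
def dynamic_power(nets, vdd, freq_hz):
    """Textbook dynamic power: 0.5 * Vdd^2 * f * sum(activity * capacitance).

    nets -- iterable of (switching_activity, capacitance_in_farads) per net
    """
    return 0.5 * vdd ** 2 * freq_hz * sum(act * cap for act, cap in nets)


def static_power(vdd, leakage_current):
    """Textbook static power: supply voltage times total leakage current."""
    return vdd * leakage_current


def total_power(nets, vdd, freq_hz, leakage_current):
    return dynamic_power(nets, vdd, freq_hz) + static_power(vdd, leakage_current)


# Hypothetical routed nets: (activity, capacitance of the routed wire + TSVs).
nets = [(0.5, 2e-12), (0.1, 5e-12), (0.25, 1e-12)]
print(total_power(nets, vdd=1.0, freq_hz=100e6, leakage_current=1e-4))
```

In such a model a shorter routed net lowers its capacitance term directly, which is the link between wire-length reduction and power savings.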
In particular, the three previously mentioned models for delay, wire-length, and power estimation were extended to be aware of 3-D reconfigurable architectures. To this end, the platform's resource graph was annotated with the TSV connections. The procedure depicted in Fig. 15 is flexible, in that it can estimate the power consumption for a wide variety of 3-D reconfigurable architectures; hence, throughout this paper it was also applied to the reference tool flows [12], [13]. Although more accurate techniques for power estimation can be found in relevant publications, they are beyond the scope of this paper because these techniques are either not general-purpose (i.e., not applicable to different 3-D architectures) or they impose considerable computational complexity. In contrast, the proposed activity-based approach is fast, as it does not require extensive simulations.
V. PERFORMANCE EXPLORATION AND COMPARISON RESULTS

This section quantifies the efficiency of the introduced framework compared to the relevant state-of-the-art algorithms. The underlying 3-D FPGA follows the well-established hierarchical island-style architecture discussed in Section II. Two flavors of our framework, with and without fast placement, referred to as fast and normal, respectively, are evaluated against two publicly available flows, where netlist partitioning is computed with simulated annealing [12] and hMetis [13] algorithms, respectively. Similar to our framework, the reference flows rely on a simulated annealing placer and a pathfinder router. For comparison purposes, we also quantify the efficiency of the normal flavor without the algorithmic customization feature (referred to as "normal without ANN"). In this case, each of the algorithmic parameters presented in Table III is set equal to the average value of its respective range of possible values (depicted in the last column of Table III). Unfortunately, we cannot provide comparisons against the remaining flows summarized in Table I (MEVA-3D [11], VPR3D [14], and 3-D-tree [15]), as these tools are not publicly available.
The evaluation is performed with the 20 largest MCNC benchmark circuits [29]. To further increase the complexity of these benchmarks, a preprocessing step combines them in groups of 10 with a technique discussed in [39]. This results in an average complexity of 5910 CLBs per benchmark.⁶ Although more complex benchmark suites are available (e.g., Titan [40]), these benchmarks are not compatible with existing TPR-based flows [12], [13], as they assume devices with heterogeneous blocks (e.g., memories, DSPs, and CPUs). Table IV summarizes the array size per benchmark when different 3-D FPGA devices are considered. These sizes refer to the minimum array required for successful P&R with existing tools [12], [13],⁷ whereas for the sake of completeness the target platform per benchmark is identical among flows. Additionally, unless mentioned explicitly, the results discussed in this section are average values among the 20 MCNC benchmarks, while for demonstration purposes these results are plotted normalized over the corresponding design metric for the 2-D FPGA. The results for the 2-D FPGA are computed with VPR [20], a tool widely accepted in academia, which also relies on simulated annealing placement and negotiated pathfinder routing algorithms. Finally, in order to avoid the impact of CAD noise [41], all the algorithms are run with constant seed values.

⁶ On average each benchmark consists of 29 496 four-input LUTs.
A. Analysis of Utilized TSVs
The efficiency of the introduced framework relies on exploiting the fabricated TSVs better than existing tools. To study this parameter more thoroughly, Fig. 16 plots the additional utilization of TSVs for the alternative flavors of the proposed framework, in comparison to the results discussed in Fig. 3. This analysis exhibits a predictable and reasonable picture. Specifically, the additional utilization of TSVs compared to the state-of-the-art TPR tool [13] increases monotonically with the number of device layers. This occurs mainly because the refinement loops discussed in Fig. 4 enable our framework to estimate more accurately the actual demand for interlayer connections during the partitioning and placement algorithms. Thus, it is possible to tune the algorithm's aggressiveness to form more or fewer connections between CLBs assigned to different layers. Precisely, the introduced framework (normal flavor) utilizes on average 19%-45% additional TSVs in comparison to the reference flows, which in turn enables the employed CAD algorithms to form shorter routing paths.
Apart from the total number of utilized TSVs, their spatial distribution over each layer is also important for addressing the routing congestion problem. Fig. 17 gives the variation of this parameter for the ex1010×10 benchmark mapped onto an FPGA with three layers. For this analysis, both the existing TPR [13] [Fig. 17(a)] and the normal flavor of our framework [Fig. 17(b)] are employed.
Each point (x_i, y_i) of this contour graph depicts the percentage⁸ of utilized TSVs inside a 3-D SB placed at spatial location (x_i, y_i). This graph indicates that the utilization of TSVs varies between two arbitrary points (x_1, y_1, z_1) and (x_2, y_2, z_2) of the device, even for SBs placed at adjacent locations within the same layer. This observation, in conjunction with the hardware uniformity of FPGAs, implies a considerable waste of fabricated TSVs, especially for the existing flows that rely on min-cut netlist partitioning. More specifically, only a small portion (about 5% on average) of the SBs utilize all their TSVs (i.e., utilize W_V TSVs per SB) when existing algorithms are employed to perform physical implementation. On the other hand, the introduced framework increases this percentage by 3-4× in comparison to the results depicted in Fig. 17(a). Note that our framework does not lead to unroutable designs, since the majority of SBs utilize on average 50%-75% of the fabricated TSVs.

⁸ As mentioned in Section II, there are three TSVs per SB (W_V = 3).
B. Evaluation of Performance Enhancement
The primary advantage of using a 3-D architecture is the spatial locality (i.e., reduced Manhattan distance) among resources assigned to different layers, which in turn is expected to improve the application's critical path delay and power consumption. To study this topic more thoroughly, Fig. 18 plots the total wire-length for successful P&R. Based on this analysis, we might conclude that the introduced framework outperforms existing TPR-based flows [12], [13], as we achieve an additional reduction in total wire-length of 9%-14% on average. These gains are due to the better manipulation of TSVs during the netlist partitioning and routing algorithms. Furthermore, we have to highlight that the algorithms of our framework surpass existing tools even without customization. Specifically, based on Fig. 18, the flavor without ANN achieves average wire-length reductions of 7% and 3%, as compared to the "TPR (sim. annealing) [12]" and "TPR (hMetis) [13]," respectively.
The shorter routing paths impose lower resistance (R) and capacitance (C) values, which in turn lead to a smaller critical path delay. To study this metric, Fig. 19 gives the average variation of maximum operation frequency for different 3-D FPGAs. The timing analysis for both the 2-D and 3-D devices is performed with the Elmore delay model [38], in view of its high fidelity with respect to the actual circuit delay [42].
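The link between path length and delay can be illustrated with the Elmore delay of a simple RC ladder, where each wire segment contributes a series resistance and a capacitance to ground. The sketch below is the standard ladder formula (each resistance multiplied by all capacitance downstream of it); the segment values are hypothetical and the function is not taken from the paper's timing analyzer.

```python
def elmore_delay(segments):
    """Elmore delay from the driver to the far end of an RC ladder.

    segments -- list of (resistance, capacitance) pairs ordered from driver
                to sink; each capacitance sits at the far end of its resistor.
    """
    delay = 0.0
    downstream_cap = sum(c for _, c in segments)
    for r, c in segments:
        # Every resistor charges all the capacitance located after it.
        delay += r * downstream_cap
        downstream_cap -= c
    return delay


# Hypothetical identical wire segments: 100 ohm and 10 fF each.
segment = (100.0, 10e-15)
print(elmore_delay([segment] * 4))   # longer planar route
print(elmore_delay([segment] * 2))   # shorter route through another layer
```

For identical segments the delay grows roughly quadratically with the number of segments, so halving a route's length reduces its delay by more than half.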
The results of this analysis indicate that the introduced framework achieves a higher operation frequency than the TPR-based flows [12], [13]. More specifically, the normal flavor achieves an average increase in clock speed of 35% and 53%, as compared to the TPR (sim. annealing) [12] and TPR (hMetis) [13] flows, respectively. These gains are in line with those already discussed for the wire-length reduction, which result from the better manipulation of fabricated TSVs.
The wire-length reduction also leads to power savings, as depicted in Fig. 20. For this analysis, each of the alternative flows is evaluated with the procedure proposed in Fig. 15. Based on this graph, we conclude that our framework outperforms existing flows. Specifically, the normal flavor achieves additional power savings of 47% on average in comparison with TPR (hMetis) [13], while the corresponding power reduction as compared to TPR (sim. annealing) [12] is almost 2×.
Moreover, it is worth mentioning that the gains in terms of maximum operation frequency and power consumption are higher when the proposed framework incorporates algorithmic customization through the ANN. Precisely, in this case, the average improvements for the critical path delay and the total power consumption are 26% and 53%, respectively, compared to the case where the ANN is not employed. This conclusion highlights the necessity of performing algorithmic customizations similar to those discussed throughout this paper.
Note that the power savings discussed previously are complementary to the gains in the application's maximum operation frequency. Hence, one might expect that the proposed framework further improves either the critical path delay under the same power budget or the power savings for an identical operation frequency. This topic is studied in Fig. 21, where the average power consumption per benchmark is computed assuming the minimum clock frequency among frameworks. From this analysis, we might conclude that our framework further reduces the average power consumption by 64%, as compared to the state-of-the-art TPR (hMetis) [13].
Apart from the performance metrics discussed so far, the execution run-time of CAD algorithms is also of high importance. To study this aspect more thoroughly, Fig. 22 plots the execution run-time for performing physical implementation with the alternative flows. These results indicate that the fastest tool flow is TPR (hMetis) [13], which requires almost half the time of the existing TPR (sim. annealing) [12] tool. The normal flavor imposes an overhead of almost 1.6× in execution run-time versus the fastest tool flow. This penalty occurs mainly due to the additional computational complexity imposed by the employed tabu partitioning algorithm, as compared to the hMetis-based approach found in [13]. One more conclusion might be derived from this analysis. Specifically, there is a monotonic increase in execution run-time as we employ 3-D architectures with additional layers. Although the additional layers result in smaller array sizes per layer, as depicted in Table IV, the number of TSVs increases the complexity of netlist routing. Finally, it is worth highlighting that there is no penalty for the algorithmic customization, since the execution of the ANN is very fast after it has been trained.
VI. CONCLUSION
A novel framework for application implementation onto 3-D FPGAs was introduced. This framework incorporates mechanisms to customize the employed CAD algorithms in order to better manipulate constraints posed either by the target application or by the underlying 3-D architecture. Experimental results with various benchmarks demonstrate the effectiveness of the proposed solution compared to existing state-of-the-art flows in academia. Specifically, we achieve average gains in terms of total wire-length, critical path delay, and power consumption of 9%, 35%, and 47%, respectively.
