We describe a design and exploration methodology to optimize 3-dimensional (3D) Heterogeneous Tree-based FPGAs (HT-FPGAs) by introducing a break-point at a particular tree-level interconnect to optimize speed, power consumption and area. A defining feature of the flow is its ability to choose a horizontal or vertical partitioning of the programmable tree network based on design specifications. The break-point of the vertically partitioned tree is designed to balance the placement of logic blocks and switch blocks across multiple tiers, while the horizontally partitioned tree is designed to optimize the interconnect delay of the programmable tree network. Finally, we evaluate the performance, area and power of the proposed 3D HT-FPGA using the newly developed flow and show that the vertically and horizontally partitioned 3D stacked HT-FPGAs improve speed by 16% and 55% respectively compared to a 2D planar design.
INTRODUCTION
Modern Field Programmable Gate Arrays (FPGAs) have become a viable alternative to cell-based design technology by providing reconfigurable computing platforms with improved performance and higher density. While reconfigurability provides flexibility, FPGAs also incur area and performance overhead in comparison to cell-based custom integrated circuits (ICs). Thus, to combine the advantages of both FPGAs and custom ICs, heterogeneous FPGAs have emerged as an attractive solution for system-on-chip implementations. Heterogeneous FPGAs include design components such as digital signal processors, on-chip memory blocks, multipliers, adders, and entire processors. Examples of such heterogeneous FPGAs include Xilinx Virtex 5 and 6 FPGAs, Altera's Stratix II and III families and the latest 2.5D heterogeneous Xilinx Virtex 7 FPGAs.
To provide the required reconfigurable functionality, FPGAs provide a large amount of programmable interconnect resources, which consume 80-90% of the total FPGA area [10, 6]. Since die area is one of the main factors that determine manufacturing cost, reducing the silicon footprint of the programmable routing resources may lead to significant improvements in performance and manufacturing cost. Three-dimensional integration technology with Through Silicon Vias (TSVs) has the potential to reduce the length of programmable interconnects by bringing the logic components closer together, which leads to significant improvements in functionality, scale of integration, silicon area and performance of integrated circuits. In an interconnect-dominated FPGA, 3D integration can address problems pertaining to routing congestion, limited I/O connections, low resource utilization, and long wire delays. In 3D integration, multiple dies are integrated and interconnected using TSVs. The impact of TSV count and location in 3D design and different stacking methodologies are described in [9].
A number of recent publications have proposed novel 3D architectures and physical design methodologies to improve FPGA performance compared to existing planar FPGAs [10, 6, 1, 11]. There are two major types of 3D FPGA architecture in the literature. The first is developed by monolithic stacking, whereby the active devices are lithographically built in between metal layers [6]; the second evolves from the original 2D structure by extending the 2D switch boxes (SBs) to 3D ones [11, 1]. So far, there are two design and exploration frameworks targeting 3D FPGA architectures: three-dimensional place and route (TPR) [1] and 3D MEANDER [11]. In TPR, all SBs are assumed to be 3D-SBs and the number of TSVs is assumed to be unlimited, which is an impractical assumption as far as the design and manufacturing of 3D chips are concerned. Meanwhile, 3D MEANDER is a fully-fledged design framework for 3D FPGAs that provides the capability of analyzing the impact of different deployment strategies for 3D-SBs. It proposes a family of 3D FPGA architectures in which 2D-SBs and 3D-SBs are intermittently used in certain regular spatial patterns. Nonetheless, the number of available TSVs within a 3D-SB is assumed to be fixed, which means the framework does not investigate the impact of different numbers of TSVs in a 3D-SB. A mesh-based heterogeneous 3D FPGA architecture optimization is presented in [5].
MOTIVATION AND PROBLEM FORMULATION
In a Tree-based FPGA architecture [7, 3], the Logic-Blocks (LBs) and Hard-Blocks (HBs) are grouped into clusters located at different levels. Each cluster contains a switch block to connect local LBs, I/Os and HBs. A comparison of performance and placement of HBs on 2D heterogeneous Tree-based and Mesh-based FPGAs is presented in [3]. The experimental results presented in [3] lead to the conclusion that placing HBs at the higher levels of the Tree-based programmable routing network gives a better trade-off between area and speed of FPGAs. The Heterogeneous Tree-based FPGA (HT-FPGA) architecture unifies two unidirectional upward and downward interconnect networks using a Butterfly Fat-Tree (BFT) topology to connect Downward mini switch blocks (DMSBs) and Upward mini switch blocks (UMSBs) to the inputs and outputs of LBs and HBs. A two-dimensional layout design of a Tree-based FPGA is a challenging task, since the wire length increases exponentially as the tree grows to higher levels [8]. As illustrated in Figure 1, we propose two types of 3D stacking methodologies using horizontal and vertical partitioning of HT-FPGA to improve the density and network delay of the 3D heterogeneous BFT-based interconnect network. Figure 1 illustrates the horizontal and vertical partitioning methodologies using a 3-level, arity-4 HT-FPGA architecture.
In the case of horizontal partitioning, the BFT-based programmable interconnect network is partitioned at a particular tree level, called the break-point, and interconnected using TSVs to optimize network delay. In this case, the logic density and interconnects below the break-point are placed in tier 1, and the interconnect networks along with I/Os and HBs above the break-point are placed in tier 0 of the 3D stacked chip. In vertical partitioning, on the other hand, the location of the break-point is fixed at the highest tree level, as illustrated in Figure 1. In this case, the logic units and interconnect networks, including HBs, are equally partitioned into multiple tiers of the 3D stacked chip. Thus, the silicon area and power consumption of the partitioned dies are balanced and design complexity is reduced. The horizontal partitioning method provides better speed and additional design flexibility to re-distribute the I/Os, HBs and higher-level switch blocks to optimize the network delay and inter-layer heat dissipation of tier 0. A 7-level horizontally partitioned 3D homogeneous Tree-based FPGA architecture with the break-point placed between levels 3 and 4 is proposed in [8].
In the homogeneous 3D design, we have 20% whitespace silicon area in tier 0 of 3D stacked chip due to a mismatch between partitioning of logic density and programmable interconnects into multiple active layers.
SUMMARY OF RESULTS
In this paper we focus on introducing HBs along with programmable interconnect levels that are placed either in tier 0 or tier 1, depending on the type of partitioning being used. In the horizontal partitioning methodology, the HBs are positioned at levels above the break-point and are therefore placed in tier 0. In vertical partitioning, HBs are distributed equally between both tiers of the 3D stacked chip regardless of the tree levels to which they are connected. We propose design and exploration methodologies to improve the speed, area and power consumption of 3D HT-FPGA using vertical and horizontal partitioning of the programmable tree interconnect network. Using Rent-based analytical wire length distribution models [7], we propose an optimization methodology to minimize the total TSV count and programmable routing resources. Using an extensive set of large benchmarks from OpenCores, VTR Toronto and Altera, we analyze the speed, power consumption and area of the 3D stacked HT-FPGA. Using a comprehensive experimental setup, we show that the 3D HT-FPGA achieves a 16.3% and 55.8% improvement in speed for the vertical and horizontal partitioning methods, respectively, compared to its 2D counterpart.
This paper is organized as follows. Section 4 describes the experimental and software flow developed in our laboratory to design and evaluate 3D HT-FPGA. Section 5 explains the physical design and implementation of vertically and horizontally partitioned 2-tier 3D HT-FPGA. Section 6 discusses the experimental methods and analysis of 3D HT-FPGA architecture optimization. Section 7 explains the power and thermal optimization analysis, and finally section 8 concludes the paper.
EXPERIMENTAL FLOW
The proposed experimental flow for design and exploration of the 3D HT-FPGA architecture is illustrated in Figure 2. The HDL code generator is designed to generate VHDL code based on a hierarchical design approach that partitions the design into smaller sections, which implement clusters separately and assemble them together at the final design phase. The physical design experiments are performed using the layout generated with the Global Foundries 130nm technology node (Tezzaron Design platform) [8]. Mentor's circuit simulator Eldo is used to estimate the wire delay and power consumption of switches and interconnection networks at the Tree levels [8]. The thermal model presented in [2] is augmented to extract the thermal profile of the multi-layer chip based on layout geometrical features and the power consumption of the FPGA functional units. The 2D physical design and 3D floorplan development of the Tree-based FPGA are presented in [8]. The floorplan tool is augmented to include the flexibility to create horizontal and vertical partitioning in the Tree-based interconnect network based on the 3D HT-FPGA timing and design specifications.
The design netlists are obtained in .NET format. The LUTs, HBs and I/Os are first partitioned into different clusters in such a way that the inter-cluster communication is minimized. After partitioning is complete, a placement file is generated that contains the positions of the different blocks in the architecture. This placement file, along with the netlist file, is then passed to the 3D router, which is responsible for routing the netlist. The router is based on the PathFinder [7, 3] routing algorithm, which uses an iterative, negotiation-based approach to successfully route all nets in the circuit netlist. The 3D timing analyzer generates a directed acyclic timing graph of the routed circuit to evaluate the speed of the 3D HT-FPGA. Based on the routing result, the different sub-paths are identified and each edge is annotated with the delay of the corresponding sub-path. Graph edges that cross between active layers of the 3D stacked HT-FPGA, that is, the pins that the circuit specifies as 3D nets, are additionally annotated with the corresponding TSV delay. To optimize the TSV count and routing resources, a Rent-based wire length distribution model integrated into the 3D router is used. The TSV count optimization methodology is presented in section 6.2. After completing the TSV count minimization and architecture optimization, the router estimates the critical path delay, TSV count, area and power consumption of the optimized 3D HT-FPGA.
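The timing-analysis step can be illustrated with a minimal sketch. It assumes a timing graph given as a list of edges, each flagged when it crosses tiers; the node names, the edge delays and the per-TSV delay constant are hypothetical and are not taken from the flow described here. The longest path is computed in topological order, which is the usual way to obtain a critical path delay on a directed acyclic graph.

```python
from collections import defaultdict, deque

TSV_DELAY_NS = 0.01  # hypothetical per-TSV delay, not a value from the paper


def critical_path_delay(edges):
    """Return the longest source-to-sink delay of a timing DAG.

    edges: iterable of (src, dst, delay_ns, crosses_tier) tuples.
    An edge that crosses between tiers has the TSV delay added to it.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, delay, crosses_tier in edges:
        weight = delay + (TSV_DELAY_NS if crosses_tier else 0.0)
        graph[src].append((dst, weight))
        indegree[dst] += 1
        nodes.update((src, dst))

    # Kahn's algorithm: visit nodes in topological order and relax arrivals.
    arrival = {n: 0.0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt, weight in graph[node]:
            arrival[nxt] = max(arrival[nxt], arrival[node] + weight)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(nodes):
        raise ValueError("timing graph contains a cycle")
    return max(arrival.values(), default=0.0)


# Hypothetical routed circuit: two edges are 3D nets that pass through a TSV.
example_edges = [
    ("in", "lut_a", 0.5, False),
    ("lut_a", "hb_mult", 1.2, True),
    ("in", "lut_b", 0.4, False),
    ("lut_b", "hb_mult", 0.9, False),
    ("hb_mult", "out", 0.7, True),
]
print(f"critical path: {critical_path_delay(example_edges):.2f} ns")
```

In this toy graph the critical path runs through both 3D nets, so the TSV delay is counted twice along it.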
3D DESIGN METHODOLOGIES
We define two types of partitioning for 3D HT-FPGA, as illustrated in Figure 1: 1) horizontal partitioning, introduced at a particular tree level called the break-point to optimize interconnect delay, and 2) vertical partitioning, introduced to balance the silicon area and power consumption of the 3D chip.
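The two partitioning styles can be sketched as simple tier-assignment functions. This is only an illustration under stated assumptions: levels are numbered from 0 at the bottom, the break-point is given as the first level placed in tier 0, and vertical partitioning is modeled as an even split of the sub-trees under the top level. The function names are hypothetical.

```python
def horizontal_tiers(num_levels, break_point):
    """Map each tree level to a tier for horizontal partitioning.

    Levels below the break-point (logic and local interconnect) go to
    tier 1; levels at or above it (upper interconnect, I/Os, HBs) go to
    tier 0.
    """
    return {lvl: (1 if lvl < break_point else 0) for lvl in range(num_levels)}


def vertical_tiers(top_level_subtrees, num_tiers=2):
    """Split the sub-trees under the top tree level evenly across tiers."""
    if top_level_subtrees % num_tiers:
        raise ValueError("sub-trees cannot be split evenly across tiers")
    per_tier = top_level_subtrees // num_tiers
    return {s: s // per_tier for s in range(top_level_subtrees)}


# 7-level tree with the break-point between levels 3 and 4.
print(horizontal_tiers(num_levels=7, break_point=4))
# Arity-4 top level split across 2 tiers.
print(vertical_tiers(top_level_subtrees=4))
```

With these inputs, levels 0 to 3 land in tier 1 and levels 4 to 6 in tier 0, while the vertical split places two of the four top-level sub-trees in each tier.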
Vertical Partitioning
The main focus of the vertical partitioning method is to partition the total power consumption and silicon area equally between the active layers of the 3D stacked HT-FPGA chip. The logic blocks, HBs and programmable routing resources are equally partitioned across the dies of the stacked 3D chip. The break-point is set at the highest level of the Tree and interconnected using TSVs, as illustrated in Figure 3. Since the break-point is set at the highest Tree level, the TSV count required for vertical interconnection increases. The vertically partitioned test chip with 7 Tree levels and 16K LUTs requires 20480 TSVs for a fully connected (Rent=1) 2-tier 3D HT-FPGA. Using the Rent-based TSV count minimization model of section 6.2, we removed 7782 (38%) of these TSVs. The wire delay increases exponentially as the Tree grows to higher levels. Only the longest wire, located at the highest level of the HT-FPGA, is replaced by TSVs, and only limited wire length optimization is possible at the other levels. This makes the vertically partitioned 3D HT-FPGA 3.3 times slower than the horizontally partitioned HT-FPGA. Nevertheless, compared to horizontal partitioning, vertical partitioning offers reduced design complexity, a 50% reduction in area and balanced power consumption across the tiers of the 3D stacked HT-FPGA.
Horizontal Partitioning
In the horizontal partitioning methodology, the location of the break-point is decided based on optimization of the interconnect network delay. The interconnect delay of the tree network increases exponentially [7, 8] as the tree grows to higher levels. The setup used for wire length estimation and delay measurement is reported in [8]. The measured interconnect delays for 2D and 3D layouts are illustrated in Figure 5. The horizontal partitioning is designed to stack the programmable interconnect resources of HT-FPGA on top of the LBs, and the vertical interconnect layers are implemented using Tezzaron's TSV technology. Figure 4 shows the 3D layout representation of horizontally partitioned HT-FPGA, with LUTs and local programmable interconnects placed in tier 1 and higher-level programmable routing resources, along with I/Os and HBs connected to Tree levels 4, 5 and 6, placed in tier 0 of the 2-tier 3D stacked chip. The horizontal break-point is set between Tree levels 3 and 4. The location of the break-point is decided based on the delay measurements of the tree levels illustrated in Figure 5, in which the delay between levels 3 and 4 is greater than 2ns. We have approximately 20% white space in tier 0 of the horizontally partitioned homogeneous Tree-based FPGA due to unbalanced hardware partitioning. While designing 3D HT-FPGA, we used the white space available in tier 0 to stack Hard-blocks at multiple levels of the Tree-based programmable interconnect network placed in tier 0. Previous experiments conducted in our laboratory led to the conclusion that placing hard-blocks at higher levels of the Tree-based interconnect network can lead to a better trade-off between area and speed of 3D HT-FPGAs [3]. In our horizontally partitioned 2-tier 3D HT-FPGA test chip, the horizontal break-point is placed between levels 3 and 4. The test chip contains 16K LUTs placed in tier 1, with 4096 vertical input pins and 1024 vertical output (feedback) pins.
A fully connected 3D test chip therefore requires 5120 TSVs for communication between tiers 0 and 1. Using the TSV count minimization model described in section 6.2, we removed 2099 (41%) of these TSVs. We used a six-metal 130nm process provided by Global Foundries, modified to include TSVs according to the specification of Tezzaron Semiconductor. The Tezzaron process produces very small TSVs that are approximately 1.2µm wide, with 2.5µm minimum pitch and 6µm height [4]. The area around each TSV has been expanded to include keep-out zones [4] so that TSVs fit within 4 to 6 standard cell areas, which is essential to maintain the [...]. The measured values of TSV resistance $R_{TSV}$ and capacitance $C_{TSV}$ are ≈ 600mΩ and 15fF respectively. The wire delay estimation of the tree levels for the 3D stacked HT-FPGA is extracted from the 2-tier layout developed using the Tezzaron process. The break-point delay is optimized using the TSV model from [8, 9, 11]. In tier 0, the locations of TSVs and interconnect switches, along with HBs, are re-distributed to optimize the wire delay at the higher levels of the Tree.
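The TSV figures quoted for the horizontally partitioned test chip can be cross-checked with a few lines of arithmetic. The sketch below also forms the product of the measured TSV resistance and capacitance as a rough first-order time constant; treating a TSV as a single lumped RC element is a simplifying assumption made here for illustration and is not the TSV model of [8, 9, 11]. The helper name `tsv_summary` is hypothetical.

```python
def tsv_summary(fully_connected, removed):
    """Return (remaining TSV count, percentage of TSVs removed)."""
    return fully_connected - removed, 100.0 * removed / fully_connected


# Vertical pins of the horizontally partitioned test chip.
vertical_inputs, vertical_outputs = 4096, 1024
full = vertical_inputs + vertical_outputs
remaining, pct = tsv_summary(full, removed=2099)
print(full, remaining, round(pct))  # 5120 3021 41

# Lumped RC product of one TSV (illustrative first-order estimate only).
r_tsv = 600e-3   # ohms, measured value quoted in the text
c_tsv = 15e-15   # farads, measured value quoted in the text
tau = r_tsv * c_tsv
print(f"RC product: {tau * 1e15:.1f} fs")
```

The same helper applied to the vertically partitioned chip, `tsv_summary(20480, 7782)`, gives the 38% reduction reported for that case.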
3D Physical Design Flow
The physical design process begins with the RTL description of the Tree-based FPGA generated using the VHDL code generator described in section 4. In the case of horizontal partitioning, tier 1 contains LUTs and local programmable interconnects from levels 0 to 3 (design1) and tier 0 contains the programmable interconnect above the break-point along with the Hard-blocks (design2), as illustrated in Figure 4. We then used Cadence design compiler to compile the VHDL into structural Verilog for each die. The compiled Verilog is then input into Cadence Encounter to perform semi-automated physical design steps. Due to the limitations of the design tool, we used the Face-to-Face (F2F) stacking methodology with I/Os interconnected using TSVs. We also used many add-on tools integrated in Encounter to perform early analysis of the design before sign-off analysis is undertaken. We used the GDS-merger tool integrated into the Tezzaron 3D flow to merge design1 (tier 1) and design2 (tier 0), and the TSV tool to assign TSV connections. Figure 6 shows the 3D design flow, including design partitioning, merging of tiers 0 and 1, and design sign-off analysis. Figure 7 shows the TSV assignment for the I/O pads on tier 0 of the 3D chip before flipping, and the tier 1 die, which implements the logic blocks.
EXPERIMENTAL RESULTS
Evaluation of the vertical and horizontal partitioning methodologies of the 3D HT-FPGA architecture is performed using the experimental flow described in section 4. Since the benchmark circuits play a major role in FPGA exploration, three sets of benchmarks [3] were chosen based on the trends of communication between different blocks. In SET I benchmarks, the major percentage of the total communication is between HBs; in SET II, it is between HBs and LBs; and in SET III, it is covered by LBs alone. The benchmark circuit details regarding the type and count of adders and multipliers are listed in [3].
Performance Analysis
To validate the performance of the 3D HT-FPGA architecture, we used a fully connected HT-FPGA architecture with 7 levels and arity 4 (4x4x4x4x4x4x4) for each SET of benchmarks. After partitioning, the individual netlists are placed and routed using the experimental flow described in section 4, excluding the architecture optimization section. The performance analysis of the vertically and horizontally partitioned 3D HT-FPGA is reported in Table 1. In the case of horizontal partitioning, the performance gains measured for SET I, II and III are 51.7%, 55.8% and 50.2% respectively compared to the 2D planar design. For the vertical partitioning method, the respective performance gains measured for SET I, II and III benchmarks are 9.7%, 16.3% and 14.1%. As explained in section 5.1, the vertically partitioned 3D HT-FPGA is optimized to reduce silicon footprint.
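As a quick consistency check on the architecture size, the number of leaf logic blocks in a complete tree follows directly from the arity and the number of levels. The sketch below assumes a complete tree and a level numbering in which level 0 sits directly above the leaves; the function name is hypothetical.

```python
def tree_shape(arity, levels):
    """Return (leaf_count, clusters_per_level) for a complete tree.

    clusters_per_level[i] is the number of clusters at level i, with
    level 0 directly above the leaves (an assumed numbering).
    """
    leaf_count = arity ** levels
    clusters = [arity ** (levels - 1 - i) for i in range(levels)]
    return leaf_count, clusters


leaves, clusters = tree_shape(arity=4, levels=7)
print(leaves)    # 16384
print(clusters)  # [4096, 1024, 256, 64, 16, 4, 1]
```

The leaf count of 16384 matches the 16K LUTs of the test chip described in section 5.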
Area And TSV Count Optimization
To optimize the 3D HT-FPGA architecture, the experiments were performed individually for each netlist, including the architecture optimization section described in section 2. The architecture definition, partitioning, placement, routing and optimization are performed individually for each netlist listed in Table 1. To make the 3D HT-FPGA more efficient in terms of design and manufacturing, it is essential to minimize the TSV count, because a TSV consumes more silicon area than horizontal interconnects [9, 5]. The TSV and architecture optimization are performed based on Rent's parameter [7] "p", defined for a Tree-based heterogeneous architecture as shown in equation 1. Here $\ell$ is the Tree level, k is the cluster arity, c is the number of in/out pins of an LB, $a_x$ is the number of in/out pins of an HB of type x, x is the level at which the HBs are located, $b_x$ is the number of HBs at that level, z is the number of HBs supported by the architecture and IO is the number of in/out pins of a cluster located at level $\ell$. Since there is more than one type of HB, their contributions are accumulated and then added to the LB(p) term of equation 1 to calculate p. The value of p determines the total number of interconnects at each level of the Tree-based architecture, and it is averaged across all the levels to determine the p for the architecture.
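To make the role of p concrete, the sketch below uses the classical Rent's-rule form for a homogeneous tree, IO = c * k^(level * p). This is an assumption made for illustration: it covers only the logic-block contribution and is not the heterogeneous expression of equation 1, which also accumulates the hard-block terms. The function names and the pin count c = 5 are hypothetical.

```python
import math


def rent_io(c, k, level, p):
    """Cluster pin count under the homogeneous form IO = c * k**(level * p)."""
    return c * k ** (level * p)


def rent_p(io, c, k, level):
    """Invert the same relation to recover p from a cluster pin count."""
    return math.log(io / c) / (level * math.log(k))


def average_p(p_per_level):
    """Architecture-level p, taken as the mean over all tree levels."""
    return sum(p_per_level) / len(p_per_level)


# Hypothetical LB with 5 in/out pins in an arity-4 tree.
full = rent_io(c=5, k=4, level=3, p=1.0)      # fully connected (Rent = 1)
reduced = rent_io(c=5, k=4, level=3, p=0.8)   # depopulated level
print(full, round(reduced, 1))
print(round(rent_p(reduced, c=5, k=4, level=3), 3))
print(round(average_p([1.0, 0.9, 0.8]), 3))
```

Lowering p at a given level shrinks the number of pins, and hence the number of wires or TSVs, that the level must provide, which is the lever the optimization uses.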
The optimization program considers the architecture break-point level with different p values. The purpose is to find, for all benchmark circuits, the architecture with the smallest necessary number of TSV interconnects at the break-point level. Using equation 1, the p value is calculated for each iteration and, once the break-point level optimization is finished, the optimizer randomly chooses other tree levels to optimize the routing resources. Table 2 presents the TSV and architecture optimization results. An average reduction of 40.1% and 38% in TSV count is recorded for the horizontal and vertical break-points respectively. An average speed degradation of 1.7% and 0.8% is observed for the horizontal and vertical break-points, compared to a 6.8% reduction in a Mesh-based horizontally partitioned FPGA for the same logic resources. The optimized silicon area for individual interconnect levels is reported in Table 2. Using our optimization flow, the total interconnect area is reduced by 37% compared to the fully connected architecture. Figure 8 shows the interconnect power at different levels of the 3D HT-FPGA. The Rent-parameter-based architecture optimization shows a 35.13% reduction in the total power consumption of the 7-level Tree-based 3D interconnect network. This is very promising for FPGA architecture in terms of power and silicon area, since an FPGA is an interconnect-dominated architecture and it is impossible to manufacture it with a huge number of TSVs and switches.
Figure 8: Static power estimation and analysis of 3D stacked heterogeneous Tree-based FPGA. Legend: Rent=1, Power_2D; Rent=p, Horizontal_Power_3D; Rent=p, Vertical_Power_3D.
The inter-layer temperature is optimized by considering TSV area and location [2]. The 3D thermal model considers the impact of copper TSVs while estimating the temperature profile. The effective thermal conductivity of the active layers in the 3D stacked chip is calculated by equation 2:
$$k_{eff} = k_{cu} \cdot TSV_{Area} + K_{th} \cdot (LevelBP_{Area} - TSV_{Area}) \qquad (2)$$
POWER AND THERMAL OPTIMIZATION
Here $k_{cu}$ and $K_{th}$ are the thermal conductivities of copper and of the silicon active layer, respectively. The measured peak temperature of the 2D HT-FPGA is 351K and the average temperature is 346K. With our localized rearrangement of HBs and switch blocks along with the TSV area, the peak and average temperatures are optimized to 355K and 351K respectively for the 3D stacked HT-FPGA.
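Equation 2 can be evaluated with a short sketch. The conductivity constants below are typical textbook values and the break-point level area is a hypothetical figure; neither is taken from this work. The TSV area assumes a square 1.2µm cross-section for each of the 3021 TSVs remaining after optimization, which is also an assumption. The normalized variant, which divides by the break-point level area to yield a value in W/(m·K), is added here only for illustration.

```python
K_CU = 400.0  # W/(m*K), typical textbook value for copper (assumption)
K_SI = 150.0  # W/(m*K), typical textbook value for silicon (assumption)


def k_eff_weighted(tsv_area, level_bp_area, k_cu=K_CU, k_th=K_SI):
    """Equation 2 as written: area-weighted sum of the two conductivities."""
    if not 0 <= tsv_area <= level_bp_area:
        raise ValueError("TSV area must lie between 0 and the level area")
    return k_cu * tsv_area + k_th * (level_bp_area - tsv_area)


def k_eff_normalized(tsv_area, level_bp_area, k_cu=K_CU, k_th=K_SI):
    """Same sum divided by the level area (illustrative variant only)."""
    return k_eff_weighted(tsv_area, level_bp_area, k_cu, k_th) / level_bp_area


# 3021 TSVs with an assumed square 1.2 um x 1.2 um cross-section.
tsv_area_um2 = 3021 * 1.2 * 1.2
level_area_um2 = 1.0e6  # hypothetical break-point level area (1 mm^2)
print(round(k_eff_normalized(tsv_area_um2, level_area_um2), 3))
```

Because the TSVs occupy well under 1% of the assumed area, the normalized conductivity stays close to that of silicon; a denser TSV region would move it towards that of copper.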
CONCLUSIONS
An innovative design and exploration methodology for 3D HT-FPGA, along with experimental results, has been presented. The ability to choose a horizontal or vertical partitioning based on the design specification is a defining feature. A 3D HT-FPGA architecture and TSV count optimization methodology has been introduced. The overall interconnect area is reduced by 37%, and the TSV count is reduced by 41% for the horizontal and 38% for the vertical break-point. The experimental analysis shows that the horizontal partitioning method is 3.3 times better in performance than the vertical one. These results position the 3D Tree-based FPGA architecture as a viable alternative for building high-performance reconfigurable heterogeneous 3D systems.
