Abstract-Thermal management is one of the critical issues in the design of 3D many-core processors. 3D many-core floorplanning has so far focused only on the configuration of cores and memories across layers. However, 3D floorplanning should also take die stack ordering into account, because the characteristics of dies may vary due to growing process variations. We propose a new 3D floorplanning approach that covers die stack ordering. The evaluation shows that the peak steady state temperature is reduced by about 2 K without any overhead in the manufacturing process.
I. INTRODUCTION
IC design has been facing rising demands for higher performance, more functional integration, and, at the same time, lower power consumption. However, miniaturization is approaching its limit due to the increase in leakage current. 3D stacking technology is a promising alternative that enables even higher integration without an advance in process technology. In particular, 3D stacking with through-silicon vias (TSVs) widens bandwidth and shortens wires, which improves performance and lowers power consumption.
Computer architecture is now shifting from multi-core to many-core due to the power wall [1]. Introducing 3D stacking technology into many-core design is beneficial in terms of both area and performance: a larger number of cores can be integrated in a processor, and the communication latency between cores can be considerably decreased. With 3D stacking, a wider variety of design options has to be addressed at the early design stage. Recently, 3D many-core processor design has been studied from the aspect of cost [2]. The wafer, bonding, package, and cooling costs are modeled, and optimal system-level partitioning strategies are proposed as a result. System-level design guides such as [2] can effectively tackle new design issues in 3D many-core processors.
One of the most challenging design issues of 3D ICs is thermal management. The heat generated in a layer may raise the temperature of the vertically adjacent layers, and moreover, the heat generated in inner layers is difficult to dissipate. Therefore, thermal management has to be taken into account from the early stage of 3D IC design. Thermal-aware floorplanning for 3D many-core processors has been studied in [3] and [4]. These studies focus on partitioning cores and on-chip memories across layers, and they assume that all cores and dies in a many-core system are symmetric in terms of performance and power dissipation. With growing process variation, however, the asymmetry in performance and power dissipation between cores and dies may make a large difference [5]. Therefore, floorplanning for 3D many-core processors has to include not only the configuration of each layer, but also an issue from an even earlier design stage: die stack ordering. We show how die stack ordering affects the efficiency of thermal management of 3D many-core systems and provide a guide to die stack ordering.
II. PRELIMINARIES
This section describes the background, the experimental settings, and the methodology.
A. 3D Stacking
3D stacking is categorized into three techniques: wafer-to-wafer (W2W), die-to-wafer (D2W), and die-to-die (D2D). W2W bonding first stacks layers of wafers, which are then sliced and packaged into single 3D ICs. D2W bonding starts from a base wafer and stacks the remaining dies on it; afterwards, the base wafer is sliced and packaged. D2D bonding stacks layers of known-good dies (KGDs). The stacking technique is a key factor in the yield of 3D ICs, because a faulty layer may spoil the whole 3D IC, and the good layers stacked with the faulty one are wasted. Although W2W bonding is the simplest technique, it may considerably reduce the yield because dies are not tested before bonding. Thus, D2D bonding is desirable unless the cost of a single die is very low.
Many-core processors usually consist of identical cores and memories, so the layers of a 3D many-core processor are often identical. Ideally, there is no difference in performance or power dissipation between dies, but in practice the characteristics of every die can never be the same due to process variations. Thus, die stack ordering makes a difference in various characteristics of 3D many-core processors, especially thermal characteristics. This design issue can be addressed in the floorplanning stage, and the result can be applied before the dies are stacked and bonded.
IEEE Electrical Design of Advanced Packaging and Systems Symposium EDAPS ( )
B. Architecture
We model a 4-layer 32-core many-core processor. Fig. 1 (a) illustrates a simple floorplan of the 3D many-core processor. The number of on-chip memories is the same as that of cores. Cores and memories can be either in the same layer or in different layers. Diverse layer configurations and their impact on thermal characteristics have already been studied in [6]. Generally, placing cores and memories in different layers is beneficial to thermal management: memories consume less power than cores, so the heat generated in core layers can be conducted through the cooler memory layers and dissipated more easily.
In this paper, we do not focus on the configuration of each layer, but on die stack ordering. In order to clearly show the effect of stack ordering by comparing it with that of layer configuration, our target architecture has two configurations. Each layer of the first configuration has eight cores and eight memories, as shown in Fig. 1 (b). Cores and memories are placed so that blocks of the same type are neither horizontally nor vertically adjacent. Each layer of the configuration in Fig. 1 (c) has either only cores or only memories. We name the configurations chess and pure, respectively.
The core is modeled on the SPARC core in Sun Microsystems' UltraSPARC T2 [7], which is manufactured in a 65 nm technology. The SPARC core is relatively small and has a simple structure, so it can properly model the core of a many-core processor. Cores and on-chip memories generally occupy most of the area in many-core architectures. Therefore, we do not model the other blocks, such as the memory controller, interconnect, etc., as illustrated in Fig. 1 (b) and (c).
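The two layer configurations can be sketched in a few lines of Python. This is only an illustration: the 4 x 4 block grid per layer, the letters "C" and "M", and all function names are assumptions made for the sketch, since the exact geometry of Fig. 1 is not reproduced here. The checker treats "vertically adjacent" as covering both in-plane neighbors and blocks stacked directly above one another.

```python
from typing import List

ROWS, COLS = 4, 4  # assumed 4 x 4 block grid per layer (16 blocks)

Layer = List[List[str]]


def chess_layer(bottom_left: str) -> Layer:
    """Checkerboard layer; bottom_left is 'C' (core) or 'M' (memory).

    Row 0 is taken to be the bottom row (an assumption of this sketch).
    """
    other = "M" if bottom_left == "C" else "C"
    return [[bottom_left if (r + c) % 2 == 0 else other for c in range(COLS)]
            for r in range(ROWS)]


def pure_layer(kind: str) -> Layer:
    """Layer made of a single block type."""
    return [[kind] * COLS for _ in range(ROWS)]


def chess_stack(layers: int = 4) -> List[Layer]:
    # Alternate the bottom-left type so stacked blocks differ; memory-first at the bottom.
    return [chess_layer("M" if z % 2 == 0 else "C") for z in range(layers)]


def pure_stack(layers: int = 4) -> List[Layer]:
    # Core dies on odd-numbered layers (1 and 3), memory dies on even-numbered layers.
    return [pure_layer("C" if z % 2 == 0 else "M") for z in range(layers)]


def has_adjacent_same_type(stack: List[Layer]) -> bool:
    """True if any two neighboring blocks (in-plane or across layers) share a type."""
    for z, layer in enumerate(stack):
        for r in range(ROWS):
            for c in range(COLS):
                t = layer[r][c]
                if c + 1 < COLS and layer[r][c + 1] == t:
                    return True
                if r + 1 < ROWS and layer[r + 1][c] == t:
                    return True
                if z + 1 < len(stack) and stack[z + 1][r][c] == t:
                    return True
    return False
```

Under these assumptions, both stacks contain 32 cores and 32 memories, and only the chess stack satisfies the no-adjacent-same-type property.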
C. Process Variation and Power Consumption
Process variation is exacerbated as technology scales down, which raises the inter-die and intra-die asymmetry in performance and power dissipation: the former is die-to-die (D2D) and the latter is within-die (WID) process variation. Process variation is a combination of systematic and random components. Systematic variation is caused by the lack of accuracy in manufacturing processes, so it shows spatial correlation; that is, transistors located close to each other have similar parameters. WID variation can be regarded as core-to-core (C2C) variation in many-core architectures. In many-core systems, one core is small enough that systematic differences within a core can be ignored, so intra-core variation is neglected.
Process variation results in deviations in the power dissipation of each core and memory in many-core architectures. McPAT [8] is a simulation framework that models the power, area, and timing of multi-core and many-core systems based on the architecture description, hardware activity, and process technology. The nominal dynamic power consumption of the SPARC core is estimated as 3.5 W at 1.4 GHz and 1.1 V, and that of the on-chip memory as 1.2 W. We intentionally change some process parameters so as to model the power consumed by cores and memories under process variations. We assume normally distributed D2D and C2C variations with σ/μ = 5 %. Using McPAT, we also model the area of the SPARC core and the memory. The area of the core is 14.5 mm², and that of the memory is 10.6 mm², which is a reasonable estimate considering that the SPARC core in the Rock processor at a 65 nm process is 14 mm². For the sake of simplicity in the floorplan, we set the areas of both cores and memories to 14 mm². The width and height of cores and memories are 4 mm and 3.5 mm, respectively.
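As a rough illustration of the variation model, the sketch below draws one D2D factor per die and one C2C factor per block from normal distributions with σ/μ = 5 % and scales the nominal power by both. This is a simplification and an assumption of the sketch: the paper perturbs process parameters inside McPAT, whereas the code applies multiplicative factors directly to the nominal power. The function name `sample_die` and the block labels are hypothetical.

```python
import random
from typing import List

CORE_POWER_W = 3.5    # nominal dynamic power of a core (value from the text)
MEM_POWER_W = 1.2     # nominal dynamic power of an on-chip memory (value from the text)
SIGMA_OVER_MU = 0.05  # sigma/mu = 5 % for both D2D and C2C variation


def sample_die(block_types: List[str], rng: random.Random,
               sigma: float = SIGMA_OVER_MU) -> List[float]:
    """Return one power value (W) per block of a die.

    block_types holds 'C' for a core or 'M' for a memory.
    Assumption: variation acts as a multiplicative factor on nominal power.
    """
    d2d = rng.gauss(1.0, sigma)      # one factor shared by the whole die
    powers = []
    for kind in block_types:
        nominal = CORE_POWER_W if kind == "C" else MEM_POWER_W
        c2c = rng.gauss(1.0, sigma)  # independent factor per block
        powers.append(nominal * d2d * c2c)
    return powers


# Example: one chess die with eight cores and eight memories.
die_powers = sample_die(["C", "M"] * 8, random.Random(0))
total_power_w = sum(die_powers)  # used later to rank dies as hot or cool
```

With `sigma=0` the function returns the nominal values, so a chess die totals 8 x 3.5 W + 8 x 1.2 W = 37.6 W.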
D. Thermal Simulation
HotSpot [9] estimates the temperature of a 2D microarchitecture from its floorplan, package information, and power consumption. We use the extended HotSpot [10], which is capable of modeling multiple stacked dies. Most parameters follow the default HotSpot configuration, and the parameters affected by 3D stacking are changed. We set the thickness of the silicon die to 0.1 mm and that of the interface material to 0.02 mm. The values of the parameters used are listed in Table 1. We assume that TSVs are evenly distributed on the die. The diameter of a TSV is 10 µm, and the center-to-center pitch is 20 µm. We assume that 2 % of the total die area is covered by TSVs. The change in resistivity due to TSVs was calculated in the same way as in [10].
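The effect of TSVs on thermal resistivity can be illustrated with a first-order, area-weighted parallel model. This is a common approximation and not necessarily the formula used in [10]; the function names and the material values in the example (a resistivity of 0.25 m·K/W for the base material and 0.0025 m·K/W for copper, that is, a conductivity of about 400 W/(m·K)) are illustrative assumptions rather than values taken from Table 1.

```python
import math


def tsv_cell_fraction(diameter_um: float, pitch_um: float) -> float:
    """Fraction of one pitch x pitch unit cell occupied by a circular TSV."""
    return math.pi * (diameter_um / 2.0) ** 2 / pitch_um ** 2


def effective_resistivity(rho_base: float, rho_tsv: float, tsv_fraction: float) -> float:
    """Area-weighted parallel combination of base material and TSV fill.

    Resistivities are in m*K/W; tsv_fraction is the share of die area covered by TSVs.
    Assumption: heat flows vertically through both materials in parallel.
    """
    if not 0.0 <= tsv_fraction <= 1.0:
        raise ValueError("tsv_fraction must be between 0 and 1")
    conductivity = tsv_fraction / rho_tsv + (1.0 - tsv_fraction) / rho_base
    return 1.0 / conductivity


# A 10 um TSV on a 20 um pitch fills pi/16 (about 19.6 %) of its unit cell.
cell_fraction = tsv_cell_fraction(10.0, 20.0)

# Illustrative values only: 2 % TSV coverage of a layer with rho = 0.25 m*K/W.
rho_eff = effective_resistivity(rho_base=0.25, rho_tsv=0.0025, tsv_fraction=0.02)
```

Even a 2 % coverage lowers the effective resistivity of a poorly conducting layer noticeably in this model, which is why the TSV assumption matters for the thermal results.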
III. PROCESS VARIATION-AWARE FLOORPLANNING
Based on the 3D many-core architectures described in the previous section, two types of dies are required for each layer configuration: for the chess configuration, dies with a core placed at the bottom left (D chess_c ) and dies with a memory placed at the bottom left (D chess_m ); for the pure configuration, dies composed only of cores (D pure_c ) and dies composed only of memories (D pure_m ). D pure_c is put on the bottom layer because it consumes more power than D pure_m . It is not necessary to fix D chess_m at the bottom layer, as both D chess_c and D chess_m are a combination of cores and memories, but we always put D chess_m at the bottom layer so as not to incur additional package costs. Through manufacturing test, the dies are characterized and their performance is identified. Rather than stacking dies without using this information, we use it to find the best die stacking order.
First, dies are categorized into two groups by their power consumption characteristics. If the 3D many-core processor targets high-performance computing, we can select four high-performance, power-hungry dies. However, power consumption may then soar, dramatically increasing the cooling cost and the reliability risk. To form our 32-core many-core processor, we choose two dies of each type: one from the high power consumption group and another from the low power consumption group. We call the former a hot die and the latter a cool die. Therefore, the number of possible stacking orders is four in this case. Intuitively, it is best to put the most power-consuming die on the bottom layer because it is the closest to the air flow. It is unnecessary to evaluate all four cases to choose the best order, because inner layers clearly have difficulty dissipating heat.
For the chess configuration, we put the hot D chess_m on layer 1 and the cool D chess_m on layer 3. The hot D chess_c is put on layer 4, which is close to the heat sink, and the cool D chess_c on layer 2. Similarly, the hot D pure_c , cool D pure_m , cool D pure_c , and hot D pure_m are stacked in this order for the pure configuration. The best stacking order differs according to the number of layers. In this paper, we only deal with 4-layer many-core processors, but the stacking order can easily be modified for other numbers of layers.
A 3D many-core processor is unlikely to have more than four layers according to the studies on the manufacturing cost of 3D many-core processors [2], unless the number of cores increases considerably. The cooling cost becomes unreasonable as the number of layers increases and more advanced processes are used. Therefore, no complicated methodology is required for die stack ordering. The best order is known from the number of layers before the manufacturing process begins, so it incurs no overhead in the manufacturing process. The evaluation results in the next section show that the improvement in the efficiency of thermal management from die stack ordering is larger than that from changing layer configurations.
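The count of four stacking orders follows from fixing the die type of each layer and permuting only the hot and cool die within each type (2 x 2 = 4). A minimal enumeration sketch is given below; the die labels and the function name `stacking_orders` are hypothetical and serve only to illustrate the counting argument.

```python
from itertools import permutations, product
from typing import Dict, List, Tuple


def stacking_orders(layer_types: List[str],
                    dies_by_type: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Enumerate all stacks (bottom to top) that respect the per-layer die type."""
    types = sorted(dies_by_type)
    # Layer indices available to each die type.
    slots = {t: [i for i, lt in enumerate(layer_types) if lt == t] for t in types}
    orders = []
    for combo in product(*(permutations(dies_by_type[t]) for t in types)):
        stack = [""] * len(layer_types)
        for t, perm in zip(types, combo):
            for slot, die in zip(slots[t], perm):
                stack[slot] = die
        orders.append(tuple(stack))
    return orders


# Chess configuration: memory-first dies on layers 1 and 3, core-first dies on 2 and 4.
chess_orders = stacking_orders(
    ["m", "c", "m", "c"],
    {"m": ["hot_m", "cool_m"], "c": ["hot_c", "cool_c"]},
)
```

For the four-layer case, `chess_orders` holds exactly four distinct stacks, one of which is the order chosen by the proposed approach.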
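The ordering rule described above (hot dies on the outer layers, cool dies on the inner layers, with the die type of each layer fixed) can be sketched as follows. The tuple layout, the function name `pv_stack_order`, and the power values in the example are hypothetical. The rule for breaking ties when there are more than four layers is an assumption of the sketch; it has only been checked against the two four-layer orders given in the text.

```python
from typing import List, Optional, Tuple

Die = Tuple[str, str, float]  # (name, type, measured power in W)


def pv_stack_order(layer_types: List[str], dies: List[Die]) -> List[str]:
    """Return die names from bottom layer to top layer.

    For each die type, the most power-consuming die goes to the slot of that
    type closest to either end of the stack, the next one to the next slot, etc.
    """
    n = len(layer_types)
    stack: List[Optional[str]] = [None] * n
    for die_type in sorted(set(layer_types)):
        slots = [i for i, lt in enumerate(layer_types) if lt == die_type]
        slots.sort(key=lambda i: min(i, n - 1 - i))  # outermost slots first
        candidates = sorted((d for d in dies if d[1] == die_type),
                            key=lambda d: d[2], reverse=True)  # hottest first
        if len(candidates) != len(slots):
            raise ValueError(f"need {len(slots)} dies of type {die_type!r}")
        for slot, die in zip(slots, candidates):
            stack[slot] = die[0]
    return [name for name in stack if name is not None]


# Hypothetical power values, for illustration only.
chess_dies = [("hot_m", "m", 38.9), ("cool_m", "m", 36.0),
              ("hot_c", "c", 39.2), ("cool_c", "c", 35.8)]
chess_order = pv_stack_order(["m", "c", "m", "c"], chess_dies)
# chess_order == ["hot_m", "cool_c", "cool_m", "hot_c"]
```

Passing `["c", "m", "c", "m"]` with pure-configuration dies yields the hot core die, cool memory die, cool core die, and hot memory die from bottom to top, which matches the order stated in the text.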
IV. EVALUATION
In this section, we discuss the results of our analysis. We modeled 64 D chess_m , 64 D chess_c , 64 D pure_m , and 64 D pure_c dies so that 64 4-layer many-core processors can be formed in total; 32 of the many-core processors have the chess configuration, and the remaining 32 have the pure configuration. The characteristics of the dies differ from each other as a result of modeling D2D process variations. The characteristics of cores and memories in a single die also differ as a result of modeling C2C process variations. To clearly show the effectiveness of the proposed floorplanning, we also analyzed random, which stacks dies in a random order, and worst, the stacking that results in the worst thermal characteristics by placing the most power-consuming dies in the inner layers. Fig. 2 shows the average peak steady state temperature of the 32 many-core processors for each of the chess and pure configurations. As expected, the pure configuration is overall more efficient for thermal management. In the chess configuration, the proposed process-variation-aware floorplan (PV) decreases the peak steady state temperature by 1.34 K over the random floorplan, and by 2.1 K over the worst floorplan on average. In the pure configuration, PV decreases the peak steady state temperature by 1.59 K over the random floorplan, and by 2.17 K over the worst floorplan on average. Thermal management aims to reduce thermal gradients as well as the peak temperature. The temperature gradients across all core and memory blocks in all layers are also evaluated, as shown in Fig. 3. The proposed PV reduces the thermal gradients between blocks compared with random and worst. In the chess configuration, PV decreases the average temperature gradient by 0.66 K over the random floorplan, and by 1.02 K over the worst floorplan. In the pure configuration, PV lowers the average temperature gradient by 0.49 K over the random floorplan, and by 0.88 K over the worst floorplan.
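The two metrics reported here can be computed from per-block steady state temperatures as in the sketch below. The definitions are assumptions made for illustration: the peak is taken as the maximum block temperature in the stack, and the temperature gradient as the difference between the hottest and coolest block across all layers. The paper does not spell out its exact definition of the gradient, and the function names are hypothetical.

```python
from typing import Callable, List

StackTemps = List[List[float]]  # one list of block temperatures (K) per layer


def peak_temperature(stack: StackTemps) -> float:
    """Highest block temperature over all layers of one processor."""
    return max(max(layer) for layer in stack)


def temperature_gradient(stack: StackTemps) -> float:
    """Assumed definition: hottest block minus coolest block over all layers."""
    flat = [t for layer in stack for t in layer]
    return max(flat) - min(flat)


def mean_over_processors(metric: Callable[[StackTemps], float],
                         processors: List[StackTemps]) -> float:
    """Average a per-processor metric over a set of processors."""
    return sum(metric(p) for p in processors) / len(processors)
```

Calling `mean_over_processors(peak_temperature, stacks)` on the 32 simulated stacks of one configuration would give the kind of average plotted in Fig. 2, and using `temperature_gradient` instead would give the counterpart for Fig. 3.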
These evaluations show that the benefit from introducing PV stack ordering is larger than that from changing the layer configuration from chess to pure. The average peak steady state temperature is 349.22 K when dies with the chess configuration are stacked in a random order. In this case, introducing the pure configuration reduces the average peak steady state temperature by 0.91 K, whereas introducing the PV floorplan reduces it by 2.1 K. When both layer configuration and die stack ordering are optimized, the average peak steady state temperature is decreased by 2.18 K.
The results shown in Fig. 2 and Fig. 3 are averages over the 32 many-core processors of each configuration. In order to analyze the results more precisely, the detailed peak steady state temperatures of two many-core processors are listed in Table 2. We selected two many-core processors whose peak steady state temperature is close to the average of the 32 many-core processors shown in Fig. 2. In the chess configuration, the random and worst floorplans result in higher temperatures on inner layers even though cores and memories are evenly placed. The odd-numbered layers in the pure configuration are core dies, so their peak steady state temperature is higher. PV properly orders the dies based on the power dissipation information, thereby successfully lowering the peak steady state temperature of layer 3.
V. CONCLUSIONS
One of the most critical issues in 3D many-core processors is thermal management. In this paper, thermal management is improved by expanding the scope of floorplanning to include die stack ordering. We examined the impact of die stack ordering on temperature under process variations and determined the best stacking order for 4-layer 32-core many-core processors. For a given number of layers in the target processor, the order can be easily determined based on the fact that the hottest die should be put on the bottom layer and cool dies on inner layers. The proposed floorplan can be introduced without overhead in the 3D IC manufacturing process.
