ABSTRACT We have previously evaluated the feasibility of a serial code accelerator core with 3-D DRAM stacked on the core operating at high frequencies. While operating at such high frequencies (>24 GHz), there are concerns with removing heat from the 3-D stack. We propose the use of thin diamond sheets, which have high thermal conductivity, as heat spreaders by bonding them close to the processor core substrate and memory stacks. We show, through thermal modeling using COMSOL finite-element analysis tools, the feasibility of diamond as an effective heat spreader in a processor-memory 3-D stack.
I. INTRODUCTION
Programmers put in a lot of effort to improve application performance by parallelizing code and running it on multiple cores. However, Amdahl's law on speedup [1] shows that there is a limit to which this approach can continue to speed up applications because of the presence of serial code that cannot be parallelized. The authors have been motivated in tackling this issue by attempting to accelerate the serial code by building a high clock rate core in a technology that facilitates high speed circuit designs [2] - [3] . The authors have thus focused on building heterogeneous multi-core chips using a combination of a high speed core (to execute serial code) and many low-power cores (to execute parallel regions of the code).
The high-speed core, to be built in SiGe BiCMOS technology [4], requires ultra-high bandwidth to the last level of cache to maintain a low CPI (cycles per instruction). The authors have pursued this by evaluating the stacking of 3D memory [5] on the high speed core. 3D technologies provide through-silicon vias (TSVs) for connectivity between chip layers, facilitating ultra-high bandwidth between layers.
There are concerns of heat dissipation in such a 3D stack especially with a core operating at high frequency. Non-removal of this heat could result in hot spots on the core layer causing heat to flow into the memory layers above. This ultimately can lead to increased clock skew, electromigration, and decreased device mobility. The authors have explored options to alleviate these concerns and propose using a thin diamond layer at the bottom of the processor memory stack to help in spreading the heat and facilitate its removal.
Section II reviews the SiGe BiCMOS technology, the hindrances it faces due to temperature, and ways to reduce them. Section III discusses the diamond material and the issues with using diamond. Section IV discusses the SoD 3D process. Section V reviews the Reduced Instruction Set Computer (RISC) architecture of interest for a fast serial code accelerator and its vulnerabilities to temperature. Sections VI and VII contain the analysis and simulations of the serial code accelerator and its 3-D memory stacking without and with diamond heat spreaders, respectively. Section VIII analyzes the results of the thermal simulations. Sections IX and X contain the discussion of possible future work and the conclusion, respectively.
II. SiGe BiCMOS
In developing a high speed serial code accelerator, one requires a fast switching device with high wire drivability. Such a device does not demand as aggressive a wire-shrinking approach as CMOS does, and in silicon there are only a few choices. Fortunately, one of them is the SiGe heterojunction bipolar transistor (HBT) technology [6].
The most important difference between the bipolar device and the FET is the output current flow. In the bipolar device, current flows vertically through the emitter, base, and collector, a geometry shown in Fig. 1 that lends itself to low on-resistance for the minimum-sized device. This same feature also means that this current can be quite high, which appeals to many designers because of its ability to drive wire capacitance. An important point to note is that the critical dimension for speed is the thickness of the base, as this sets the transit time, which is the ultimate limit on switching speed for an unloaded device. Lateral scaling primarily affects the current required to overcome device parasitics, and hence the primary benefit of lateral scaling is lower power. In spite of this, the current at the peak of the f_T vs. I_c curve has remained about 1 mA from generation to generation, which implies about 1 mW, not 1 µW, to operate at peak. This is the so-called constant current scaling law of Solomon and Tang [8]. By operating at a current below the peak in a very advanced kit with a significantly higher f_T, it becomes possible to lower that current substantially while still preserving the use of peak-f_T current for special circuit sections that need it. Currently, the highest reported f_T is 350 GHz [7] at room temperature, though theoretical predictions of 1 THz devices have been made by Harame [9]. For the work in this paper, the authors have used IBM's 130-nm 8HP design kit, for which the f_T of the minimum feature size SiGe HBT is 210 GHz, allowing CPU subcomponent circuits to reach up to 20 GHz [10]-[12].
FIGURE 1. SiGe HBT cross section (from Joseph et al. [7]).
One factor that hinders processor performance is device operating temperature. One can see from Fig. 2 that electron mobility decreases with increasing temperature. When operating at relatively low temperatures (100-200 K), lower acceptor and donor doping levels can lead to an increased mobility, but at temperatures close to normal operating conditions (350-400 K) and higher, doping levels become irrelevant.
FIGURE 2. Electron mobility vs. temperature for different doping levels [13].
An increase in temperature of just 50 K could lead to significant slowing down of all the devices located in hot spots. For memories, this is particularly bad because the entire memory then has to slow down to its slowest memory band speed, which is located right over these hot spots.
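To give a rough sense of scale for a 50 K rise, the sketch below uses a power-law temperature dependence of electron mobility. The exponent of 2.4 is a commonly quoted textbook approximation for lattice-scattering-limited electrons in lightly doped silicon; it is an assumption here and is not taken from Fig. 2 or from the design kit.

```python
# Illustrative only: power-law mobility model, mu(T) ~ T^(-exponent).
# The exponent 2.4 is an assumed textbook value for lightly doped Si,
# not a number reported in this paper.

def mobility_ratio(t_kelvin, t_ref=300.0, exponent=2.4):
    """Return mobility at t_kelvin relative to mobility at t_ref."""
    return (t_kelvin / t_ref) ** (-exponent)

if __name__ == "__main__":
    for t in (300.0, 350.0, 400.0):
        print(f"{t:.0f} K: {mobility_ratio(t):.2f} x the 300 K mobility")
```

Under this assumed model, a 50 K rise from 300 K costs roughly 30% of the mobility, which is consistent with the qualitative claim that devices in hot spots slow down significantly.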
However, solutions do exist for reducing the temperature of these hot spots. Liquid cooling of 3D chip stacks has been considered by some authors [14]-[17], but dry methods also have advantages, particularly for memory-over-processor 3D integration, where most of the heat is at the bottom of the stack and a primary heat sink can be located there.
Another factor that should be of serious concern is reliability of wires at higher temperatures due to electromigration. There are two aspects to a design that contribute to reliability without changing the materials: current density and temperature. If one scales a design to a smaller node then theoretically the current density should remain the same for the same operating frequency. However, if one is to increase operating frequency then current density will need to be increased. Given this, strategies for reducing temperature may need to be considered to improve reliability.
III. DIAMOND PROPERTIES
Single crystal diamond has the highest thermal conductivity of any solid material [18] , with a peak thermal conductivity of 2500 W/m-K at 300 K when isotopically pure [19] . Interestingly, it accomplishes this property with one of the lowest electrical conductivities known (i.e. thermal conduction is primarily by phonon transport). Consequently, the authors have studied the possibility of using such crystalline diamond films along with chemical vapor deposited (CVD) polycrystalline diamond [20] to replace some of the Si under the processor or between the processor and part of the memory stack.
CVD diamond has been deposited on wafers up to 12 inches in diameter [21] in pursuit of Silicon on Diamond (SOD) technology. Some of the successful 3D processes utilize SOI wafers for bonding and thinning [22] . Hence, the most likely migration path for incorporating diamond into the interior of the 3D stack would be by replacement of Silicon on Insulator (SOI) by SOD, which is envisioned to spread heat directly under high power density devices. SOI buried oxide is a bad conductor of heat. Although the presence of metal wires-in and vias-through the various oxide layers can mitigate this effect, it has been ignored in this calculation resulting in worst-case temperature calculations.
A wide range of vertical and horizontal thermal conductivities has been reported for CVD diamond, depending on the exact conditions of deposition. An early result at the low end of the range was only 59-74 W/m-K [23] in the out-of-plane direction. But this result depends greatly on the method of preparation of the films, particularly growth surface treatment, with some of the best results reported by Philip et al. [24]. Challenges also exist in measuring the conductivity and the thermal anisotropy of films with different in-plane and out-of-plane conductivities. Microscopic measurements have found individual grains to have 2200 W/m-K [28], close to bulk single crystal diamond. Degradation from this bulk value in-plane is ascribed to phonon scattering from grain boundaries and imperfections [29], whereas film quality at the nucleation layer affects out-of-plane conductivity. The conductivity is optimal near room temperature and can degrade significantly as temperature rises, aggravated possibly by thermal expansion coefficient differences between the diamond and the substrate. Therefore, to attain the optimum heat spreading effects discussed here, all other aspects of the thermal design need to target this near-room-temperature range.
In addition to the impact of diamond nucleation layers, even very clean interfaces between diamond and silicon, copper, or silicon dioxide exhibit Kapitza resistance [30], [31]. This is due to phonon impedance mismatches in the THz regime between different materials. The phonon impedance match between diamond (acoustic impedance 42 × 10⁶ N·s/m³) and copper (44 × 10⁶ N·s/m³) is unusually good. The phonon impedance match between diamond and silicon (19.64 × 10⁶ N·s/m³) is not. However, mitigation of the mismatch is possible with a quarter-wavelength matching layer of Ge (28 × 10⁶ N·s/m³), at least at the nominal wavelength of the most prominent phonon spectral peak, which implies a thickness on the order of 4 nm.
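The impedance figures quoted above can be put through the standard normal-incidence acoustic mismatch formulas as a quick plausibility check. This is only a sketch: it assumes a single-frequency, normal-incidence picture (function names are illustrative), whereas real Kapitza resistance involves the full phonon spectrum and interface quality.

```python
import math

# Acoustic impedances quoted in the text, in units of 1e6 N*s/m^3.
Z = {"diamond": 42.0, "copper": 44.0, "silicon": 19.64, "germanium": 28.0}

def transmission(z1, z2):
    """Energy transmission coefficient at normal incidence between two media."""
    return 4.0 * z1 * z2 / (z1 + z2) ** 2

def ideal_matching_impedance(z1, z2):
    """Impedance of an ideal quarter-wavelength matching layer (geometric mean)."""
    return math.sqrt(z1 * z2)

if __name__ == "__main__":
    print(f"diamond/copper  T = {transmission(Z['diamond'], Z['copper']):.4f}")
    print(f"diamond/silicon T = {transmission(Z['diamond'], Z['silicon']):.4f}")
    z_ideal = ideal_matching_impedance(Z["diamond"], Z["silicon"])
    print(f"ideal matching layer: {z_ideal:.2f}, Ge: {Z['germanium']:.2f}")
```

In this simplified model the diamond-copper interface transmits almost all incident energy, the diamond-silicon interface noticeably less, and the geometric mean of the diamond and silicon impedances falls close to the quoted Ge value, which is why Ge is a natural candidate for the matching layer.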
Silicon dioxide has a low thermal conductivity, so the interface of diamond to that material is less important for heat spreading than the transmission of the heat from the silicon layers to the diamond. Heat spreading is the main strategy explored here.
IV. SoD 3D PROCESS
Silicon-on-Diamond (SoD) wafers are silicon wafers where silicon has been grown on top of a layer of diamond. This is similar to Silicon-on-Insulator wafers, except that the insulator layer, typically oxide, is now replaced by diamond. Diamond can further be deposited on top of the wafer from methane plasma [35] under appropriate conditions and at temperatures low enough to not disturb the dopant distribution in the active silicon layer.
All wafers are SoD except for the bottom wafer (tier 1). Fig. 3 shows the SoD 3D process flow. For the first face-to-face bonding between two wafers [36], diamond is grown on top of the bottom wafer. A microscope then finds the bottom wafer alignment mark, with the top wafer pulled out. Then the bottom wafer is withdrawn and the top wafer alignment mark is measured. With this data, the positioning system, which is accurate to a fraction of a micron, repositions both wafers facing each other at a small distance. A set of thin separators called flags is then inserted around the rims of the two wafers, and an indenting probe descends onto the back of the top wafer, causing the two wafers to touch at their nominally aligned center points. The flags are then slowly pulled out, permitting the rest of the bowed wafers to roll together over their entire surface area. The two wafers are then clamped and oxide-bonded together.
FIGURE 3. SoD 3D process flow.
VOLUME 3, 2015
Following this, the back of the top wafer is thinned to less than a micron, stopping on a silicon layer near the end of the etch. The alignment marks are now visible through the thinned wafer, permitting masks to be aligned to the composite wafer pair and used to pattern where the backside via holes will be etched. The etching process through the composite wafer stack of Si, diamond, and SiO₂ will utilize a different chemistry for each layer. Once these vias are open, a metal deposition fills the vias, forming wafer-to-wafer interconnects. In this manner, a two-wafer stack is completed. For additional tiers, a new wafer is placed face-to-back with the composite wafer and the entire process is repeated.
Additionally, the bottom wafer in the stack can also be thinned at the end of the process and bonded to a thick diamond heat sink layer.
V. THE RISC ARCHITECTURE
The serial code accelerator is an 18 GHz 32-bit RISC processor with a targeted power of 40-46 W, depending on what I/O circuitry is included and how much memory is provided. It has about 25,000 HBTs for the arithmetic logic unit (ALU). It has a small HBT L0 cache for data and instructions, and a BiCMOS L1 cache, with a 3D memory-over-processor stack to mitigate the anticipated memory wall problems, at least for the footprint of code that can fit into that memory. For the intended message-processing computer for MPICH code acceleration [2], it is large enough. The current design is for a moderate seven-stage pipelined architecture.
The register file was identified as a potential bottleneck and was consequently pipelined between the dual-port address decoders and the word line driver to improve performance. The L0 instruction and data caches are similarly pipelined. Fabricated test results have yielded 18 GHz performance [12], and simulations have shown potentially higher performance of 24 GHz.
The 25,000 SiGe HBTs are organized into roughly 6,250 current-mode logic (CML) current trees [1] , [32] with an average power dissipation of 6.4 mW each, which converts to an average of 2.1 mA per tree; thus requiring slightly larger HBTs in some trees and smaller ones in others.
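The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only the quoted numbers; the supply voltage it derives is merely implied by dividing the per-tree power by the per-tree current and is not stated in the text.

```python
# Figures quoted in the text.
N_HBT = 25_000          # SiGe HBTs in the processor logic
N_TREES = 6_250         # CML current trees
P_TREE_W = 6.4e-3       # average power per tree, in watts
I_TREE_A = 2.1e-3       # average current per tree, in amperes

hbts_per_tree = N_HBT / N_TREES          # average HBTs per current tree
total_power_w = N_TREES * P_TREE_W       # total logic power
implied_supply_v = P_TREE_W / I_TREE_A   # implied voltage (assumption: P = V * I)

if __name__ == "__main__":
    print(f"HBTs per tree:  {hbts_per_tree:.1f}")
    print(f"Total power:    {total_power_w:.1f} W")
    print(f"Implied supply: {implied_supply_v:.2f} V")
```

The total comes to about 40 W, matching the low end of the targeted 40-46 W power range, with the remainder presumably attributable to I/O and memory.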
Nevertheless, the proportions of current trees and the power dissipation distribution per circuit block are known. This permits a thermal analysis of the basic processor. Fig. 4 shows the floor plan for the processor and its power dissipation per block. The 40 W die is 5 mm × 5 mm, resulting in a power density of 160 W/cm². The clock is to be distributed in an H-tree configuration throughout the chip. Each buffer is separated by a 300 µm transmission line. Fig. 5 shows skew vs. temperature for a single HBT clock buffer, normalized to 300 K. Simulations have indicated that temperature differences between boundaries contribute to skew by about 18 fs per K. While this may not appear significant, small amounts of skew over just a few degrees of temperature, accumulated along a long clock path, can have an alarming effect. For example, consider two variables that contribute to the overall skew: the variance in temperature (the temperature difference between the split clock paths) and the number of clock buffers that follow the split (the clock depth). If the clock signal splits before one path enters a high temperature gradient block in Fig. 4, a clock depth of 10 or even 15 is very likely. If the heat from the block is not well distributed throughout the chip, one can expect the temperature variance to easily reach 10 K. With this insight, a temperature variance of just 5 K and a clock depth of 10 can cause nearly 2% degradation in the performance of the 24 GHz clock, as shown in Table 1. At the other extreme, a temperature variance of 15 K and clock depths of 15 and 20 can have a detrimental effect, with clock degradation reaching nearly 12%.
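The power density and the skew percentages quoted above can be approximated with a short calculation. The sketch assumes that skew accumulates linearly (18 fs per K per buffer) and that the degraded clock period is the nominal period plus the accumulated skew; this model is an assumption chosen because it approximately reproduces the quoted 2% and 12% figures, and it may differ from how Table 1 was actually generated.

```python
SKEW_FS_PER_K = 18.0   # skew per kelvin per clock buffer, from the text
CLOCK_GHZ = 24.0       # target clock frequency, from the text

def power_density_w_per_cm2(power_w, side_mm):
    """Power density of a square die with the given side length."""
    area_cm2 = (side_mm / 10.0) ** 2
    return power_w / area_cm2

def skew_ps(delta_t_k, clock_depth):
    """Accumulated skew, assuming it adds linearly over buffers and kelvin."""
    return SKEW_FS_PER_K * delta_t_k * clock_depth / 1000.0

def clock_degradation(delta_t_k, clock_depth, clock_ghz=CLOCK_GHZ):
    """Fractional frequency loss if the period is stretched by the skew."""
    period_ps = 1000.0 / clock_ghz
    s = skew_ps(delta_t_k, clock_depth)
    return s / (period_ps + s)

if __name__ == "__main__":
    print(f"Power density: {power_density_w_per_cm2(40.0, 5.0):.0f} W/cm^2")
    for dt, depth in ((5, 10), (10, 15), (15, 15), (15, 20)):
        print(f"dT = {dt:2d} K, depth = {depth:2d}: "
              f"{100 * clock_degradation(dt, depth):.1f}% degradation")
```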
VI. THERMAL EVALUATION WITHOUT DIAMOND
The focus of this study is to gauge the impact of the diamond layers as heat spreaders. In doing so, the authors first establish a base case and investigate the thermal performance of state-of-the-art HBT devices. For this research, COMSOL Multiphysics thermal analysis tools [33] are used to find the steady-state temperature distribution on all processor structures using the standard meshing parameters and material properties shown in Tables 2 and 3, respectively. This work does not take into account temperature-dependent thermal conductivity or any boundary layer contact resistance; thus contact conductance is assumed to be infinite. To model the worst-case scenario, convection and radiation effects were neglected. It should be noted that in these simulations, size effects on thermal conductivity and Kapitza resistance effects were also neglected. It should also be noted that none of the simulations carried out in this paper take into account the additional thermal conductivity that would be provided by any Cu vertical 3D TSVs, or non-architecture thermal vias, inserted to improve vertical thermal conductivity.
A. CPU
Temperature distribution analysis was conducted on the aforementioned CPU assuming an 800 µm thick die on a 1 mm Cu heat spreader to be cooled to 300 K (on the bottom surface), as illustrated in Fig. 6 . Volumetric heat generation rate was applied on the top surface at the designated locations from Fig. 4 with a thickness of 2 µm and a total heat transfer rate of 39.5 W. Fig. 7 shows the temperature distribution for the 40 W RISC processor yielding hot spots near 318 K, and a 13 K variation. One can see that if proper cooling is used [34] (i.e. the copper heat sink is maintained at 300 K) the heat dissipation would appear manageable. In this paper all temperatures shown will be relative to this assumed boundary condition. Use of less ideal heat sinks with various thermal resistances to the ambient will introduce an increase in these predictions. 
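As a sanity check on the order of magnitude, the stack can be treated as one-dimensional conduction with the 40 W spread uniformly over the 5 mm × 5 mm die. This is only a rough sketch and not a substitute for the finite-element model: it ignores lateral spreading and the concentration of power in individual blocks. The silicon conductivity of 140 W/m-K is the value quoted later in the paper; the copper conductivity of 400 W/m-K is an assumed typical value, since Table 3 is not reproduced here.

```python
def one_d_rise_k(heat_flux_w_m2, layers):
    """Temperature rise across a series stack of (thickness_m, k_w_mk) layers."""
    return heat_flux_w_m2 * sum(t / k for t, k in layers)

# Uniform flux: 40 W over a 5 mm x 5 mm die.
FLUX_W_M2 = 40.0 / (5e-3 * 5e-3)

# Baseline stack: 800 um Si die on a 1 mm Cu spreader held at 300 K.
BASELINE = [
    (800e-6, 140.0),  # Si, conductivity as quoted in the paper
    (1e-3, 400.0),    # Cu, assumed typical conductivity
]

if __name__ == "__main__":
    rise = one_d_rise_k(FLUX_W_M2, BASELINE)
    print(f"Uniform-flux 1-D rise: {rise:.1f} K above the 300 K boundary")
```

The uniform-flux estimate is a rise of roughly 13 K. The simulated hot spots near 318 K are higher than this average, as expected when the power is concentrated in a few blocks rather than spread evenly.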
B. HBT DEVICE
However, the above analysis assumes that the heat is uniformly generated at the designated locations in Fig. 4 at the CPU layer. In practice, the heat is generated in individual HBT devices, which are buried in Si surrounded by a SiO₂ deep trench isolation (DTI) layer. SiO₂ has a much lower thermal conductivity than Si and lateral heat spreading is minimal, forcing most of the heat transfer to occur through the silicon pillar. This leads to a large local temperature rise at the transistor site.
This effect is illustrated by modeling an HBT as a 10 µm high stand-alone Si pillar separated from the room temperature copper plate by 100 µm of bulk Si, as shown in Fig. 8. For simplicity, the silicon oxide layer surrounding the Si pillar is removed from this calculation and the pillar is assumed to be adiabatically insulated except for the bottom side, which is in contact with the bulk Si. With an emitter strip of 0.1 µm² and a power dissipation of 0.5 to 1 mW for the HBT, the surface heat flux can be very large at 50,000-100,000 W/cm². However, the temperature increase above the bulk Si is in the range of 20-30 K. Hence, the overall increase in temperature, while still in the safe range, is approaching failure values. The effects of lateral heat spreading through the silicon oxide layer as well as radiation and convection losses were neglected here; hence it is expected that overall performance is approximated reasonably well. This 30 K rise relative to the bulk can be assumed to be "added on" to obtain a reasonable estimate of the junction temperature of the die.
C. CPU WITH SINGLE/DOUBLE-TIER MEMORY
Analysis was made of the thermal performance of a device with 3D memory bonded on the state-of-the-art HBT device. The structure of the chip is shown in Fig. 9. The memory tier contains 8 memory banks (a pair at each corner) dissipating 0.375 W each. The results in Fig. 10 show that the worst-case hot spot reaches 322.1 K with a temperature variance of about 12 K. Silicon has a thermal conductivity of about 140 W/m-K; hence it has limited heat dissipation capabilities. One approach to improve this is thinning the bulk Si. Fig. 11 shows the proposed CPU configuration with a 100 µm Si substrate and two tiers of memory. The temperature distribution for this configuration yields hot spots of 315 K and an 11 K variation, as seen in Fig. 12. One can see that the heat from the hottest blocks in the processor accumulates with the memory above, spreading to some extent into the bulk Si layer. Further improving this lateral heat spreading is expected to decrease the temperature of the hot spots. Hence, a strategy suggested here is to remove part of the bulk Si layer and replace it with high thermal conductivity synthetic diamond.
FIGURE 11. Physical dimensional structure of two 3D memory tiers on top of the CPU with a thinned 100 µm Si bulk and a 1 mm copper chuck.
VII. THERMAL EVALUATION WITH DIAMOND
Using the same conditions and assumptions from the previous section, simulations are conducted using diamond layers and further thinning of the bulk Si. By thinning and replacing part of this layer with synthetic diamond, which can have a thermal conductivity as large as 2000 W/m-K, the lateral heat spreading and the subsequent heat transfer to the copper plate is expected to improve.
A. CPU
Fig. 13 shows the proposed CPU configuration with a 50 µm Si substrate on 50 µm of diamond. It should be noted here that the thermal conductivity of synthetic diamond can vary widely and depends on the deposition method. However, representative values for the lower and upper bounds of thermal conductivity are 600 and 2000 W/m-K, respectively. Fig. 14 shows the computed temperature distribution, yielding a maximum hot spot temperature of 306.5 K and only a 3 K variation. Note the increased temperature in the substrate in the area surrounding the heat sources, which is an indication of improved lateral heat spreading.
B. CPU WITH SINGLE-TIER MEMORY
The next step is to inspect the performance of the 3D memory integrated with the new CPU device with the hybrid silicon-diamond substrate. The schematic of the structure is shown in Fig. 15. The lower and upper bounds for the thermal conductivity of diamond were taken as 600 and 2000 W/m-K, respectively. The results are shown in Figs. 16 and 17. The maximum temperature on the chip is now more than 10 K lower than that obtained for 3D memory implemented with the standard CPU architecture (Fig. 10), even when the lower bound of the thermal conductivity of diamond is used in the simulation. The outline of the thermal concentration of the CPU hot spots can be seen to be considerably dissipated laterally in Fig. 14, as in Figs. 16 and 17, which use diamond thermal conductivities of 600 and 2000 W/m-K, respectively. Specifically, the highest temperature in the top memory tier has fallen to only 307 K with a temperature variation of only 3 K. The decrease in the maximum temperature rise is attributed to the enhanced lateral heat spreading in the diamond layer.
FIGURE 15. Physical dimensional structure of a single 3D memory tier, 10 µm oxide, 10 µm diamond, CPU with a thinned 50 µm Si substrate, 50 µm CVD diamond layer, and a 1 mm copper chuck.
C. CPU WITH DOUBLE-TIER MEMORY
Finally, the last case studies the integration of two 3D memory stacks with the CPU, as shown in Fig. 18. The temperature distributions of this structure are shown in Figs. 19 and 20 for diamond thermal conductivities of 600 and 2000 W/m-K, respectively. The maximum temperature increase is about 12 K. Shown in Fig. 21, the temperature distribution at the Cu-diamond boundary has been computed using a diamond thermal conductivity of 2000 W/m-K. Results yield a temperature variance of 6 K at the diamond-Cu interface.
To explore this further, vertical slices were made near the hot spots of the stacked layers of two 3D tiers of memory and CPU, structurally detailed in Fig. 22. Figs. 23 and 24 illustrate the vertical temperature distribution at the same hot spot location using diamond thermal conductivities of 600 and 2000 W/m-K, respectively. Temperature distributions with both diamond thermal conductivities and without diamond are plotted vs. depth from the top of the stack down in Fig. 25. As expected, material with a lower thermal conductivity induces a steeper slope in the temperature distribution, ultimately resulting in a higher peak temperature. Even with diamond at its worst-case thermal conductivity, the peak operating temperature remains close to that obtained with the ideal-case diamond thermal conductivity.
VIII. ANALYSIS OF RESULTS
Simulation results with and without diamond are tabulated in Table 4. Peak temperatures include the 30 K increase due to the HBT device. One can see that as tiers of memory are added, peak temperatures appear to increase linearly. This suggests that a CPU with two tiers of memory and without diamond is likely to have a peak temperature of 356 K (83 °C), and that by aggressively thinning the Si substrate, the peak temperature can be reduced by 11 K to 345 K.
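The linear extrapolation described above can be reproduced from hot spot values quoted in Section VI. The sketch below assumes that the peak temperature is the simulated hot spot plus the 30 K HBT rise, and that the zero-tier and one-tier cases (318 K and 322.1 K on the 800 µm die) are the points being extrapolated; Table 4 itself is not reproduced here, so these inputs are an interpretation rather than the table's entries.

```python
HBT_RISE_K = 30.0  # local rise at the HBT, added to every simulated hot spot

# Simulated hot spots without diamond, keyed by number of memory tiers.
HOT_SPOT_K = {0: 318.0, 1: 322.1}
# Simulated hot spot for two tiers on the thinned 100 um Si substrate.
THINNED_TWO_TIER_HOT_SPOT_K = 315.0

def peak_k(hot_spot_k):
    """Peak (junction) temperature estimate for a simulated hot spot."""
    return hot_spot_k + HBT_RISE_K

def extrapolate_peak_k(tiers):
    """Linear extrapolation of the peak temperature from the 0- and 1-tier cases."""
    step = peak_k(HOT_SPOT_K[1]) - peak_k(HOT_SPOT_K[0])
    return peak_k(HOT_SPOT_K[0]) + step * tiers

def kelvin_to_celsius(t_k):
    return t_k - 273.15

if __name__ == "__main__":
    two_tier = extrapolate_peak_k(2)
    thinned = peak_k(THINNED_TWO_TIER_HOT_SPOT_K)
    print(f"Two tiers, no diamond: {two_tier:.1f} K "
          f"({kelvin_to_celsius(two_tier):.0f} C)")
    print(f"Two tiers, thinned Si: {thinned:.1f} K "
          f"(reduction {two_tier - thinned:.1f} K)")
```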
The most important comparison in the results is the temperature variance reduction from the use of diamond. While the variance remains roughly the same per additional stack of memory, variance has reduced by roughly 10 K with the use of diamond. With the addition of performance increase due to decreased peak temperature, reduction in temperature variance can be predicted to improve performance by as much as 8%. 
IX. FUTURE WORK
The present results are limited by the 20 GB of memory used in the computations. It is hoped that up to 16 layers of memory over the CPU will eventually be analyzed. Further work on thermally induced stress will be examined. Additional modeling of the impact of vertical vias and details of chilled liquid cooling will be explored. Finally, revisiting the analysis in IBM's 90-nm 9HP generation will be even more attractive.
X. CONCLUSION
After previously exploring the feasibility of a 3D chip stack consisting of a high speed CPU and stacked DRAM, concerns were raised regarding heat removal when operating at high frequencies. Inadequate heat removal leads to processor clock frequency degradation due to reduced mobility and temperature-dependent clock skew. With the use of COMSOL finite-element analysis tools, diamond has been shown to be a potentially effective heat spreader in a processor-memory 3D stack. Simulations have indicated an 11 K peak temperature reduction and a 10 K temperature variance reduction in the proposed CPU structure. This ultimately leads to an increase in clock performance of as much as 8% due to reduced clock skew alone.
ACKNOWLEDGMENT
Thanks are also given to Jerry Zimmer and Dwain Aldala at SP3, and to Professors Masashi Yamaguchi and Toh-Ming Lu at Rensselaer for extensive communications concerning Kapitza resistance mitigation. Finally, the authors would like to thank Rudolf Haring and Shurong Tian at IBM for their insight and feedback. The views presented in this paper are those of the authors.
Dr. Kraft is a member of Eta Kappa Nu, Tau Beta Pi, Sigma Xi, and the Machine Vision Association of SME. He is a technical reviewer of the IEEE TRANSACTIONS ON COMPONENTS, PACKAGING, AND MANUFACTURING TECHNOLOGY. He has two patents in computer vision for noncontact gauging, and has co-authored several publications in high-speed digital design with SiGe HBTs, vision inspection, phased array design, homomorphic signal processing, and control system design.
