Recent application and technology trends have brought a renaissance of processing-in-memory (PIM), which was envisioned decades ago. In particular, die-stacking and silicon interposer technologies enable the integration of memory, PIMs, and the host CPU in a single chip. Yet this integration substantially increases system power density, which can pose substantial thermal challenges to the feasibility of such systems. In this paper, we comprehensively study the thermal feasibility of integrated systems consisting of the host CPU, die-stacked DRAMs, and various types of PIMs. Whereas most previous thermal studies focus only on the memory stack, we investigate the thermal distribution of the whole processor-memory system. Furthermore, we examine the feasibility of various cooling solutions and the feasible scale of various PIM designs under given thermal and area constraints. Finally, we demonstrate system run-time thermal feasibility by executing two high-performance computing applications on PIM-based systems. Based on our experimental studies, we reveal a set of thermal implications for PIM-based system design and configuration.
INTRODUCTION
Processing-in-memory (PIM), also known as near-memory computing or near-data processing, builds on the basic idea of integrating computation directly in memory devices [31, 14, 17, 53, 37, 21, 11, 36, 50, 52]. After decades of dormancy, it has re-emerged in a new form due to recent application and technology trends. On the application front, in-memory databases [57, 60], web-scale applications [47, 15], high-performance computing [54, 10, 30, 62], and in-situ data processing such as scientific visualization for real-time analysis [38] manipulate increasingly large volumes of data in memory. Data movement between CPU and memory is becoming one of the major contributors to system energy consumption and performance degradation [64, 3, 46]. This motivates moving computation close to memory, where the working data is located. On the technology front, recent advances in 3D-stacked memory [48, 12, 34, 67, 68, 9] enable the stacking of a logic (silicon) die, implemented in a high-performance process technology, with one or more memory (e.g., DRAM) layers. The logic die offers sufficient silicon area and performance capability to implement various logic and computation functions, such as adders, memory copiers, CPU cores, and GPUs [13, 4]. Recent studies [19, 1, 2, 32, 51, 33, 65, 20, 4, 13] demonstrate that such integration technologies are likely to enable PIM in a practical manner.
One major concern in adopting PIMs with die-stacking memory is thermal feasibility. Prior studies demonstrated the thermal feasibility of integrating programmable PIMs with 3D-stacked memory [13]. However, most previous work focuses only on the thermal issues of the PIM-based memory stack itself; the thermal feasibility of the integrated system (the host CPU and the memory stack) remains largely unknown. Die-stacking memories are typically integrated with a host CPU on a silicon interposer [61, 35] in a single chip. This effectively reduces the footprint of the processor-memory system, but also increases its density. As such, the integration of memory, PIMs, and the host CPU can significantly increase system power density and impede heat dissipation, reducing the thermal feasibility of PIM-based designs. Studying the memory stack alone is insufficient to understand the thermal feasibility of the integrated system.
The thermal interaction between the integrated host CPU and the memory stack complicates the thermal analysis. Host CPU heat dissipation can heavily impact the temperature of the memory stack. With much higher logic density than the memory stack, the host CPU can dissipate much more heat than the memory stack. Therefore, the memory stack may be heated mainly by the host CPU rather than by its own power dissipation.
On the flip side, the memory stack can also impact the thermal constraint of the host CPU. Most high-end CPUs used by data-intensive applications can tolerate over 100°C [25], which is much higher than DRAM's typical operating temperature range (typically under 85°C). As a result, the integration of the memory stack can tighten the thermal constraint of the processor chip.
The goal of this paper is to investigate the thermal feasibility of a system that consists of the host CPU and the PIM-based memory stack. Toward this end, we perform a comprehensive thermal analysis with a variety of system configurations and applications. First, we demonstrate that large-scale programmable PIMs require commodity-server or high-end active cooling solutions (Table 4), even when we consider only the stand-alone memory stack without the thermal impact of the host CPU. Second, we investigate the thermal interaction between the host CPU and the PIM-based memory stack as a function of the distance between the two. Third, we explore the PIM design space by stretching the scale of a variety of PIM types under given thermal and area budgets. Finally, we demonstrate the thermal feasibility of PIM-based systems by evaluating the run-time thermal distribution with two high-performance applications. This paper makes the following contributions.
• We comprehensively investigate thermal feasibility of the entire integrated system consisting of the host CPU and PIM-based memory stack with various cooling solutions.
• We demonstrate the thermal interaction between the host CPU and the memory stack, and its impact on system thermal constraint and feasible cooling solutions.
• We explore the PIM design space under certain thermal and area constraints, and identify the key constraints to the scale of various types of PIMs.
• Based on our experimental study, we reveal a set of thermal implications for PIM-based system design and configuration.
BACKGROUND AND MOTIVATION
Thermal analysis for the integrated host CPU and memory as a system is essential to demonstrate the feasibility of PIM. However, this thermal analysis is challenging due to the complex interaction among system components. This section describes background on modern PIM techniques, processor-memory integration, and thermal management. We also motivate our thermal study by investigating the thermal constraints and challenges in PIM-based systems.
Processing In Die-Stacking Memory
Since its debut in the 1990s, PIM has been explored as a promising solution to accelerate data processing and improve host CPU utilization. But traditional PIM techniques were not widely adopted due to their cost, design complexity, and use case limitations. Interest in PIM has been reignited by recent application and technology advances.
Application Requirement. Many modern applications process large and heterogeneous data sets stored in memory of increasingly large capacity. Data movement between memory and CPU is becoming a critical system performance and energy bottleneck. In addition, a large portion of data processing merely involves simple arithmetic, data movement, and data duplication operations. These do not require the complex logic and computation power offered by the CPU.
Technology Feasibility of PIM. Modern PIM techniques may be built with various emerging technologies and architectures, such as 3D die-stacked memory [48, 22], nonvolatile memory [8], and automata processors [23]. We study a die-stacking-memory-based PIM design, which is one of the most promising PIM approaches being explored in both academia and industry [13, 4]. Today, die-stacking memories can stack several memory dies on top of a logic die. The logic die can accommodate sophisticated logic and computation functionality, as long as it is built in a true logic process. One such example is hybrid memory cube (HMC)-style die-stacking memory [48, 22]. Whereas initial versions of HMC merely offer simple logic at the base (e.g., NoC, I/O drivers, memory controllers, and simple arithmetic and atomic functions) in a large process node, the logic die has plenty of area to accommodate processor cores and accelerators when it adopts recent process nodes [48, 22].
Processor-Memory Integration. 3D die-stacked memories are electrically connected with the host CPU (also referred to as the host processor) side by side using a silicon interposer [61, 35, 56] (Figure 1). A silicon interposer is a passive substrate with through-silicon vias (TSVs) [61] and wires that interconnect multiple dies sitting side by side, which is often referred to as 2.5D integration. As such, systems that integrate a host CPU with 3D-stacked memory are referred to as 2.5D+3D integrated systems. Such technology provides orders of magnitude denser interconnections between the integrated host CPU and memory stack. It also dramatically reduces the distance among the host CPU, memory, and PIMs, significantly increasing the density of logic and memory components. Most previous work focuses on investigating the thermal feasibility of implementing PIMs in the memory stack [13, 55, 42, 49]. However, when the memory stack is integrated with the host CPU in a single package, the thermal feasibility of the processor-memory PIM system remains largely unexplored. In particular, the heat dissipated by the host CPU, memory, and PIMs can interact, so the thermal profile of the whole system is not straightforward to predict.
Thermal Issues with
The host CPU performs compute-intensive operations with large-scale, power-hungry logic and arithmetic components. The host CPU can operate at a much higher temperature than memory. Therefore, the heat dissipation of the host CPU can substantially impact the temperature of the memory stack, while the heat generated by the memory stack itself may be less significant. In addition, increasing the distance between the host CPU and the memory stack can mitigate such thermal impact. Therefore, our study evaluates the thermal impact of the host CPU on the memory stack as a function of the distance between the two.
The memory stack can also impact the thermal constraint of the host CPU due to the lower operating temperature range of memory. JEDEC stipulates that systems running beyond 85°C need to double the memory self-refresh rate [28]; the rate continues to double for approximately every 10°C beyond 85°C [41]. Doubling the refresh rate incurs much higher energy and performance overhead and is therefore especially undesirable. As a result, the processor-memory system needs to maintain a temperature lower than 85°C even though the host CPU can tolerate a much higher temperature [25]. Recent work [44, 45] shows that it is critical to study the thermal feasibility of the processor-memory system rather than the memory stack alone; yet these studies did not investigate scenarios with PIMs in the logic layer. As such, we investigate the impact of this thermal constraint on the feasibility of PIM-based system design.
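The refresh-rate rule above can be written as a small helper. This is a minimal sketch, assuming that the first doubling applies to the whole band just above 85°C and that each further doubling starts after another full 10°C; the function name and the exact band boundaries are illustrative assumptions, since the text only states "approximately every 10°C".

```python
import math

DRAM_LIMIT_C = 85.0  # threshold cited in the text
STEP_C = 10.0        # approximate doubling interval from the text (exact value assumed)


def refresh_rate_multiplier(temp_c: float) -> int:
    """Refresh-rate multiplier relative to the rate used at or below 85 C.

    Assumed banding: (85, 95] -> 2x, (95, 105] -> 4x, and so on.
    """
    if temp_c <= DRAM_LIMIT_C:
        return 1
    bands_above_limit = math.ceil((temp_c - DRAM_LIMIT_C) / STEP_C)
    return 2 ** bands_above_limit


if __name__ == "__main__":
    for t in (72.63, 85.0, 88.48, 96.0):
        print(f"{t:6.2f} C -> {refresh_rate_multiplier(t)}x refresh rate")
```

Under this assumed banding, any DRAM temperature above the 85°C limit already incurs at least a doubled refresh rate, which is why the rest of the analysis treats 85°C as a hard constraint.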
SYSTEM CONFIGURATIONS AND MODELING
We study a system that consists of a host CPU and a memory stack integrated on a silicon interposer, as depicted in Figure 1. The memory stack consists of DRAM dies and a logic die that incorporates PIMs. We investigate the thermal feasibility of the system with various types and configurations of PIMs. While a processor-memory integrated system can have multiple memory stacks, this work focuses on systems with one memory stack. Multiple memory stacks can interact with the host CPU in a similar manner and are therefore expected to have thermal feasibility similar to the single-stack case. We leave the investigation of multiple memory stacks as future work.
Host CPU Configurations
We model the host CPU with an architecture similar to Intel Xeon processors with four cores. Table 1 lists detailed architecture configurations and modeling parameters of the host CPU.
Memory Stack Configurations
We model an HMC-style memory stack, which has eight DRAM layers stacked on top of a logic die. The memory stack is divided into 16 vertical slices, also called vaults, as shown in Figure 1 (b). Each vault has its own independent TSV bus and vault controller [48]. This allows the vaults to operate in parallel, similar to independent channels in conventional DRAM-based memory systems [4].
DRAM Die Configurations. The memory stack has a total capacity of 4GB, which is similar to the latest implementations of die-stacking memory [27, 13, 48]. The HMC-style memory can adopt either DDR3 or DDR4 DRAMs. We employ DDR4, which is the latest DRAM technology.
Logic Die Configurations. We assume that the area of the logic die matches that of the DRAM dies. This is in line with the HMC designs [48, 22]. While this is not a hard constraint for the memory stack design, doing so can reduce manufacturing and vertical routing complexity. In addition, this area is sufficient to accommodate various PIMs, because PIM is intended to supplement the host CPU rather than implement full processing capability [13]. As shown in Figure 1 (c), the [16, 21, 59] .
In this work, we investigate the thermal feasibility of integrating 1) only fixed-function PIMs, 2) only programmable PIMs, and 3) heterogeneous PIMs that consist of both types. With fixed-function PIMs, we model various simple logic and arithmetic functions, e.g., adder, multiplier, AND, OR, XOR, shifting, dot product, memory copier, memory mover, compare-and-swap, fetch-and-add, test-and-set, sorting, and scatter-gather. Our power and area modeling (Section 3.3) shows that they can be classified into two categories: simple fixed-function PIMs and complex fixed-function PIMs. A typical simple fixed-function PIM can contain a multiplier, a divider, or a combination of an adder, a shifter, and a logical unit; a complex fixed-function PIM can contain floating-point units or multi-functional logic and arithmetic functions. Our thermal analysis abstracts these fixed-function PIMs as either simple or complex ones without distinguishing among individual functions. With programmable PIMs, we model in-order processing units (PUs) distributed across the 16 vaults. Each PU is modeled as an ARM Cortex-A9 core with a 2.0 GHz clock rate, but modified to have an in-order pipeline. Each PU is placed next to its home vault router to reduce routing complexity and improve performance.
Previous work shows that neither type of PIM is an obvious performance winner, given the variety of applications that are likely to benefit from PIMs [19, 1, 2, 32, 51, 33, 65, 20, 4, 13]. Therefore, we also examine the feasibility of a heterogeneous PIM design, which integrates both fixed-function and programmable PIMs in the logic layer.
We refer to the logic die components other than the PUs as fixed-function units, because most of them offer fixed functionality. These fixed-function units can include vault controllers, NoC, and fixed-function PIMs (if any). We also refer to the fixed-function units in each vault as a fixed-function unit group (FFUG). [26] . Table 2 illustrates the total maximum power and power breakdown of each DRAM layer. We calculate the area and power (leakage and peak dynamic) of the programmable PIM PUs using McPAT [39]. Each PU has an area of 1 mm² and a peak power of 0.96 W. We estimate the power and area of various types of fixed-function logic and computation units using Synopsys Design Compiler. To simplify thermal analysis, we categorize them into two classes: simple fixed-function PIMs have peak dynamic power and area below 0.01 W and 0.06 mm², respectively; complex ones have peak dynamic power and area larger than 0.015 W and 0.25 mm², respectively. In the case of heterogeneous PIMs, the area budget left for the FFUG (fixed-function PIMs, vault controller, and interconnects) in each vault is 3.25 mm² (4.25 mm² per vault excluding a 1 mm² PU). In particular, Figure 2 illustrates the floorplan of an example logic layer with heterogeneous PIMs, where CORE1-16 represent PUs and FIX1-16 represent FFUGs.
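The two-class abstraction above can be expressed as a short classifier. This is a minimal sketch using the thresholds stated in the text; the `FixedFunctionPIM` type, the `classify` function, the "unclassified" fallback for units that fall between the two classes, and the example power and area figures are all hypothetical and are not synthesis results from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedFunctionPIM:
    name: str
    peak_dynamic_power_w: float
    area_mm2: float


def classify(pim: FixedFunctionPIM) -> str:
    """Map a fixed-function PIM to the simple/complex classes from the text."""
    if pim.peak_dynamic_power_w < 0.01 and pim.area_mm2 < 0.06:
        return "simple"
    if pim.peak_dynamic_power_w > 0.015 and pim.area_mm2 > 0.25:
        return "complex"
    # The text does not say how units between the two classes are handled.
    return "unclassified"


if __name__ == "__main__":
    # Hypothetical example figures, for illustration only.
    units = [
        FixedFunctionPIM("adder+shifter", 0.004, 0.03),
        FixedFunctionPIM("floating-point unit", 0.05, 0.40),
    ]
    for u in units:
        print(u.name, "->", classify(u))
```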
Modeling and Evaluation Methodology
We also model run-time dynamic power in order to investigate the system dynamic thermal distribution when we execute various data-intensive applications. To model dynamic power of the host CPU, vault controllers, and PUs, we feed performance statistics into the McPAT [39] power model. The performance statistics are obtained by gem5 [7] simulation of our evaluated workloads. Dynamic power of fixed-function PIMs is estimated with the activity ratio based on the simulation.
Thermal Modeling. We use HotSpot 6.0 [66] to analyze the thermal distribution of the processor-memory system with various PIM designs. To evaluate system thermal distribution, we draw floorplans according to the die photos of the corresponding Xeon CPU and ARM cores. We then generate thermal maps by feeding the floorplans, power traces, and thermal configurations of system components into HotSpot [66]. Table 3 lists the key parameters in our memory stack thermal modeling.
We evaluate both peak steady-state and dynamic thermal distributions: the former reveals the thermal constraints introduced by various PIM designs; the latter shows the thermal feasibility of running various applications with PIMs. To evaluate peak steady-state thermal distribution, we employ the leakage and peak dynamic power obtained from McPAT [39] to generate power traces for various PIM designs. To evaluate dynamic thermal distributions, we simulate two HPC workloads with the gem5 simulator [7] and obtain performance statistics of various program phases. We annotate the application source code to identify the program phases. We then generate run-time power traces by feeding the performance statistics into McPAT [39].
We evaluate various cooling solutions to investigate the cost of required cooling to meet the system thermal constraint. Table 4 lists four cooling solutions employed in our study. They all employ heat sinks. The passive cooling only adopts a heat sink; the three active cooling solutions adopt heat sinks plus cooling fans with various cost-performance trade-offs as illustrated in Table 4 . Among them, the low-end active cooling adopts inexpensive consumer-level heat sinks and fans [13] . The convection thermal resistance values are calculated from specific cooling solutions. The costs of these cooling solutions are collected from various vendors.
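To build intuition for how a cooling solution's thermal resistance bounds the allowable power, the relation can be sketched with a first-order lumped model. This is not the HotSpot grid model used in this study: it assumes a single resistance between the die and ambient air, and the resistance value in the example is hypothetical rather than taken from Table 4.

```python
def power_budget_w(temp_limit_c: float, ambient_c: float, r_total_k_per_w: float) -> float:
    """Largest steady-state power that keeps the die at or below temp_limit_c.

    Lumped steady-state model (assumption): T_die = T_ambient + R_total * P.
    """
    if r_total_k_per_w <= 0:
        raise ValueError("thermal resistance must be positive")
    headroom_c = temp_limit_c - ambient_c
    return max(0.0, headroom_c / r_total_k_per_w)


if __name__ == "__main__":
    R_HYPOTHETICAL = 0.5  # K/W, illustrative only
    for ambient in (25.0, 45.0, 70.0):
        budget = power_budget_w(85.0, ambient, R_HYPOTHETICAL)
        print(f"ambient {ambient:.0f} C -> budget {budget:.1f} W")
```

Even in this simplified form, the budget shrinks linearly as ambient temperature rises and grows as the cooling solution's resistance falls.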
Similar to previous studies [13], to simplify the thermal modeling and analysis, we study only the logic die components that can have significant thermal impact, such as PUs, fixed-function PIM execution units, and buffers. Other hardware resources (e.g., memory controllers) have relatively low thermal impact. Therefore, we do not model them in detail, yet reserve a constant power budget for them in our thermal analysis.
THERMAL FEASIBILITY ANALYSIS

Memory Stack Only Analysis
We first investigate thermal distribution of the memory stack (no host CPU in the package) as a baseline. We examine the cases with both fixed and various ambient temperatures.
Our first set of experiments uses a constant ambient temperature (45°C, a typical ambient temperature inside computer cases [58, 45, 42]) and evaluates system temperature with various cooling solutions. Figure 3 shows the thermal map with an active cooling solution used for high-end servers. Note that the temperature of 86.89°C shown on the right-hand side of Figure 3 is the peak temperature of the whole memory stack. Yet only the peak temperature of the DRAM dies matters to thermal feasibility, because DRAM needs to operate below 85°C [28]. Therefore, we also analyze peak DRAM temperature based on the HotSpot text output (not shown in the thermal map). Our evaluation results demonstrate that the peak power of the logic die cannot exceed 5.16 W per vault in order to maintain DRAM temperature below 85°C. This power budget is sufficient for accommodating high-end programmable PIM computing capabilities. Similarly, we evaluate the impact of other cooling solutions. Commodity-server active cooling solutions can sustain up to 3 W per vault in the logic layer without violating the thermal constraint of DRAMs. This power budget is sufficient for accommodating the programmable PIMs and/or a large number of fixed-function PIMs. We investigate the feasible number of fixed-function PIMs in Section 4.3. However, neither passive nor low-end active cooling solutions can secure thermal feasibility for memory stacks with 16 programmable PIM PUs, which are required to accelerate data-intensive applications in some existing work [1, 4]. The peak temperature of DRAM dies can exceed 85°C, even if only the 16 PUs are running and the rest of the logic layer is idle.
Power Budget of FFUGs. A fixed ambient temperature does not always hold in practice.
To examine the impact of various ambient temperatures on the logic die power budget, we evaluate the thermal distribution by varying the ambient temperature from 25°C to 70°C across various cooling solutions. Figure 4 illustrates our results. Not surprisingly, the higher the ambient temperature, the lower the power budget. We do not show the power budget of each FFUG with passive cooling solutions, because the power budget is either very low (<0.1 W) or zero. The figure also shows that low-end active cooling solutions are infeasible when the ambient temperature exceeds 40°C. An ambient temperature of 70°C results in a very low power budget for FFUGs, even with active cooling solutions at the grade of commodity or high-end servers.
We investigate the thermal interaction between the host CPU and the memory stack when they are integrated on a silicon interposer in the same package. The physical proximity of the host CPU and the memory stack can exacerbate thermal issues and increase system power. As such, we expect to see an increase in the temperatures of both the host CPU and the memory stack when placing them close to each other.
Thermal Interaction Between the Host CPU and the Memory Stack Integrated with PIMs
So far, HotSpot [66] does not support 2.5D+3D integration. Therefore, we modify HotSpot [66] to model our processor-memory system as shown in Figure 5. Layer 0 is the interposer. Because it is a passive layer, we assume its power is zero or negligible. We place both the host CPU and the logic die of the memory stack in Layer 1; the remaining regions of the layer are dummy components, which are modeled as air and consume zero power. Layers 2 to 9 consist of DRAM dies and dummy components. We add these dummy components in each layer because HotSpot [66] requires all layers in a 3D stack to have identical dimensions.
Thermal Impact of the Host CPU on the Memory Stack as a Function of Distance. To investigate the thermal impact of the host CPU on the memory stack, we explore the scenario in which only the host processor is active and the memory stack is idle. Again, we set the ambient temperature to 45°C and employ high-end server active cooling solutions. Initially, we set the distance between the host processor and the memory stack to 10 mm. The 10 mm distance is impractical, but sufficiently long to illustrate the low temperature coupling between the host CPU and the memory stack [69]. As shown in Figure 6, although the host processor occupies only a portion of the package, it raises the temperature of the memory stack above the ambient temperature. As a result, the "effective ambient temperature" in the package rises to approximately 72°C. Peak temperatures of the host processor and DRAMs are 85.71°C and 72.63°C, respectively. Therefore, the evaluated system configuration is thermally feasible when the memory stack is idle.
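The dummy-component padding described above for the layered model can be sketched as a small geometric helper. This is a minimal illustration, not the authors' modified HotSpot code: it handles a layer containing a single die (as in Layers 2 to 9), the `Rect` type and `pad_layer` function are hypothetical names, and the layer and die dimensions in the example are assumed values chosen only so that the die area equals 68 mm².

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    name: str
    x: float  # left edge (mm)
    y: float  # bottom edge (mm)
    w: float  # width (mm)
    h: float  # height (mm)

    @property
    def area(self) -> float:
        return self.w * self.h


def pad_layer(layer_w: float, layer_h: float, die: Rect) -> List[Rect]:
    """Return zero-power dummy blocks that, together with `die`, tile the layer."""
    if die.x < 0 or die.y < 0 or die.x + die.w > layer_w or die.y + die.h > layer_h:
        raise ValueError("die must lie inside the layer")
    right_edge = die.x + die.w
    top_edge = die.y + die.h
    candidates = [
        Rect("dummy_left", 0.0, 0.0, die.x, layer_h),                  # full-height strip
        Rect("dummy_right", right_edge, 0.0, layer_w - right_edge, layer_h),
        Rect("dummy_bottom", die.x, 0.0, die.w, die.y),                # below the die
        Rect("dummy_top", die.x, top_edge, die.w, layer_h - top_edge),  # above the die
    ]
    # Drop degenerate blocks when the die touches a layer edge.
    return [r for r in candidates if r.w > 0 and r.h > 0]


if __name__ == "__main__":
    dram = Rect("dram_die", 20.0, 6.0, 8.5, 8.0)  # hypothetical placement, 68 mm^2
    for block in [dram] + pad_layer(30.0, 20.0, dram):
        print(f"{block.name:12s} {block.w:5.1f} x {block.h:5.1f} mm at ({block.x}, {block.y})")
```

Layer 1 holds two dies (the host CPU and the logic die), so it would need a more general decomposition than this single-die sketch provides.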
To further explore the thermal impact of the host CPU, we vary the distance between the host CPU and the memory stack (still idle) and examine the peak DRAM temperature. As shown in Figure 7, the peak DRAM temperature drops from 75.23°C to 72.63°C as the distance increases from 1 mm to 10 mm. In addition, we observe that the decline is close to linear (slightly slower than linear). Therefore, the thermal impact of the host CPU is roughly a linear function of the distance between the host CPU and the memory stack. However, once the distance exceeds 15 mm, the thermal impact becomes much more stable as the distance increases further.
Thermal Interaction Between the Host CPU and the Memory Stack. To study the thermal interaction, we also make the memory stack active (in its normal operating state). We set the distance between the host processor and the memory stack to 10 mm. We fix the FFUG power (with or without fixed-function PIMs) at 1 W, which is similar to the peak power of a PU. Figure 8 shows a thermal map in this scenario. Comparing it with Figure 6, we observe thermal interaction between the host CPU and the memory stack, in which the thermal impact of the host CPU on the memory stack dominates. The peak temperature of the host processor increases by 6.48°C, from 85.71°C to 92.19°C. The peak temperature of DRAMs is 88.48°C, an increase of 13.25°C. In this case, the DRAM temperature exceeds the 85°C thermal constraint even though the FFUG power is only 1 W; we need to further reduce the FFUG power budget to make the system design feasible. On the flip side, the memory stack constrains the choice of cooling solutions.
Most server CPUs can tolerate 92.19°C with passive or low-end active cooling solutions. However, the integration of the host CPU with memory dictates that high-end active cooling solutions must be used.
Power Budget of FFUGs. To examine the feasible power budget of each FFUG, we keep the distance between the host CPU and the memory stack at 10 mm and vary the ambient temperature and cooling solutions. With this configuration, none of the cooling solutions except the active cooler for high-end servers can keep the peak DRAM temperature under 85°C. Figure 9 shows the resulting power budgets with the active cooler for high-end servers.
Feasible Scale of PIMs Under Given Thermal and Area Budgets
To explore the design space of PIMs, we study the impact of thermal and area constraints on the scale of programmable, fixed-function, and heterogeneous PIMs. The maximum scale of PIMs is determined by the thermal or the area constraint, whichever is tighter.
Scale of Programmable PIMs. Given 68 mm² as the total area of the logic die, each vault has a 4.25 mm² area budget if we evenly divide the logic die among 16 vaults. This area budget can fit at most four PUs in each vault, assuming no FFUGs; this is impractical, but provides an extreme case for the scale of programmable PIMs. We investigate the thermal distribution with our previous experimental setup: 45°C ambient temperature, the active cooling solution for high-end servers, and a 10 mm distance between the host processor and the memory stack. Our results indicate that we need to keep the peak power of each PU below 0.4 W to avoid violating the thermal constraint of 85°C.
Scales of Fixed-Function and Heterogeneous PIMs. Table 5 illustrates the maximum feasible scales of fixed-function and heterogeneous PIMs in the whole logic die. With the heterogeneous PIM, we assume that each vault has only one PU. The table also indicates which of the area and thermal budgets is the tighter constraint. Determined by area and peak power, a typical simple fixed-function PIM can contain a multiplier, a divider, or a combination of an adder, a shifter, and a logical unit. Complex fixed-function PIMs can contain floating-point units or multi-functional logic/arithmetic functions. In addition to simple and complex fixed-function PIMs, we also calculate the feasible scale for heterogeneous PIMs with several representative functions, such as in-memory data movement and atomic operations.
The results show that the scale of fixed-function PIMs is subject to the area constraint. Up to 1168 simple fixed-function PIMs can fit in the logic layer without violating any constraint, but further increasing the number violates the area constraint. With a larger area per PIM, only up to 224 complex fixed-function PIMs can be placed in the logic layer before violating the area constraint. To study the design space of heterogeneous PIMs, we first place one PU in each vault in the logic die. We then place as many fixed-function PIMs as possible until either the thermal/power constraint or the area constraint is broken. Our results show that the scale of heterogeneous PIMs is most likely subject to the thermal/power constraint, because the PUs consume a large portion of the power budget. Only heterogeneous PIMs with complex fixed-function units are constrained by area, because complex fixed-function units, such as floating-point units, have relatively large area overhead.
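The "whichever constraint is tighter" rule used throughout this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Table 5: the per-unit power and area are the class boundary values from Section 3.3 used as stand-ins, the 0.5 W per-vault FFUG power budget is hypothetical, and vault controller and interconnect area is ignored. The `max_units` helper is an illustrative name.

```python
from fractions import Fraction
from typing import Tuple


def max_units(area_budget_mm2: str, unit_area_mm2: str,
              power_budget_w: str, unit_power_w: str) -> Tuple[int, str]:
    """Return (count, limiting constraint) for identical units in one vault.

    Decimal strings are parsed as exact fractions so that floor division
    is not affected by binary floating-point rounding.
    """
    by_area = Fraction(area_budget_mm2) // Fraction(unit_area_mm2)
    by_power = Fraction(power_budget_w) // Fraction(unit_power_w)
    if by_area < by_power:
        return by_area, "area"
    if by_power < by_area:
        return by_power, "thermal/power"
    return by_area, "both"


if __name__ == "__main__":
    vault_area = Fraction(68) / 16    # 4.25 mm^2 per vault, from the text
    ffug_area = str(vault_area - 1)   # "13/4" = 3.25 mm^2 after one 1 mm^2 PU
    budget_w = "0.5"                  # hypothetical per-vault FFUG power budget

    print("PUs per vault by area:", vault_area // Fraction(1))
    print("simple :", max_units(ffug_area, "0.06", budget_w, "0.01"))
    print("complex:", max_units(ffug_area, "0.25", budget_w, "0.015"))
```

With these assumed inputs, small low-power units run out of power budget first while large units run out of area first, which mirrors the qualitative trend reported above for heterogeneous PIMs.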
Run-time Thermal Analysis
In this section, we evaluate the thermal feasibility of PIM designs by executing data-intensive applications. The ambient temperature (45°C), cooling solution (active cooling used for high-end servers), and the distance between the host CPU and the memory stack (10 mm) are the same as in previous experiments. In addition, we employ a detailed floorplan of the host CPU, with details of each CPU core and the L1/L2/L3 caches. We modify the gem5 simulator [7] to collect performance statistics for each program phase on our processor-memory system with PIMs. The performance statistics include instruction counts, memory accesses, and cache misses. We then feed the performance statistics into McPAT [40] to estimate system dynamic power as described in Section 3.3.
Workload Characteristics. We investigate the CG and MG workloads from the NAS parallel benchmark suite [6].
CG uses the inverse power method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix with a random pattern of non-zeros. The computation of CG is dominated by a multiplication-addition operation represented as a = b + c * d. In many cases, a, b, c, and/or d in the operation are elements of specific vectors or matrices. These memory accesses come from indirect data references; they can be random and have poor data locality. CG accesses memory through indirect data references because it uses the compressed row storage (CRS) format to store sparse vectors and matrices. This memory access pattern is common in sparse linear algebra. Because of its poor data locality, traditional CPU-based computation can cause many cache misses and frequent data movement between the CPU and main memory. For CG to use fixed-function PIMs, we offload the primitive multiplication-addition operations to the PIMs. To use the programmable PIM, we offload the most computation-intensive loop (particularly the one in the conj_grad routine).
MG approximates the solution of a three-dimensional discrete Poisson equation using the V-cycle multi-grid method on a rectangular domain with periodic boundary conditions. In the V-cycle, the computation starts from the finest refinement level, goes down level by level toward the bottom, and then goes back up to the top. The V-cycle multi-grid method involves applying a set of stencil operations sequentially on the grids at each level of refinement [63].
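As a brief illustration of the CG access pattern described above, the following sketch shows a sparse matrix-vector multiply-accumulate over CRS arrays. It is not the conj_grad routine from the NAS benchmark; the function name and the small example matrix are hypothetical. The inner statement has the a = b + c * d form, and the index into `x` is itself loaded from `col_idx`, which is the indirect reference responsible for the poor locality.

```python
from typing import List, Sequence


def crs_matvec_accumulate(values: Sequence[float], col_idx: Sequence[int],
                          row_ptr: Sequence[int], x: Sequence[float],
                          y: Sequence[float]) -> List[float]:
    """Return y + A @ x for a sparse matrix A stored in CRS form."""
    out = list(y)
    for row in range(len(row_ptr) - 1):
        acc = out[row]
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # a = b + c * d, where d is fetched through an indirect index.
            acc = acc + values[k] * x[col_idx[k]]
        out[row] = acc
    return out


if __name__ == "__main__":
    # CRS encoding of the 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]].
    values = [4, 1, 3, 2, 5]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(crs_matvec_accumulate(values, col_idx, row_ptr, [1, 2, 3], [0, 0, 0]))
    # -> [7, 6, 17]
```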
The stencil operations occur in various execution phases, including restriction, prolongation, evaluation of the residual, and point relaxation. The stencil operations in MG often involve a 4-point stencil. To use fixed-function PIMs, we offload these stencil operations to the PIMs. To use the programmable PIM, we offload the major computation routines (particularly mg3P and resid).
Thermal Analysis Results. We perform four groups of experiments with these workloads on the system with heterogeneous PIMs.
The first experiment group is a baseline, in which all PIMs are idle and the host CPU executes the workloads. Figure 10 and Figure 11 show the results of running CG and MG, respectively. The peak DRAM runtime temperatures during CG and MG execution are 65.51°C and 68.28°C, respectively. Both are below the 85°C thermal constraint. We observe that the runtime dynamic power of the host CPU is also lower than its nominal TDP. This is reasonable, because TDP refers to the power that generates the maximum amount of heat that the cooling system is required to dissipate in typical operation. Thus, the thermal effect of the host CPU on the memory stack is weaker than it would be if the power of the host CPU were close to its TDP.
In the second experiment group, we disable all fixed-function PIMs and only allow the host CPU to offload operations to the programmable PIMs. Figure 12 and Figure 13 show the results of executing CG and MG, respectively. The peak DRAM temperatures during CG and MG execution are 69.16°C and 68.78°C, respectively. They are also lower than the 85°C thermal constraint. Therefore, using programmable PIMs to accelerate CG and MG is thermally feasible. Furthermore, we make three additional observations by comparing the results with the baseline. First, the peak temperature of the host CPU when running CG with the programmable PIM cores is higher than that when running CG with the host CPU alone.
This is due to the thermal interaction between the host CPU and the memory stack discussed previously. Second, the peak temperature of the host CPU when running MG with the programmable PIMs is slightly lower than that when running MG with the host CPU alone. This is the result of two opposing effects. On one hand, the thermal interaction between the host CPU and the memory stack increases the peak temperature of the host CPU. On the other hand, because power-hungry operations are offloaded to the PIM, the peak temperature of the host CPU is reduced. For MG, the temperature reduction due to operation offloading offsets the adverse thermal effect of the memory stack on the CPU. Third, because the run-time power of the programmable PIM cores is much lower than their peak power, the PIMs do not show high temperatures in the figures.
In the third experiment group, we disable all programmable PIM PUs and only allow the host CPU to offload operations to the fixed-function PIMs. The results are shown in Figure 14 and Figure 15. The peak DRAM temperatures during CG and MG execution are 68.78°C and 68.53°C, respectively. They are still below the 85°C thermal constraint, which confirms the thermal feasibility of using fixed-function PIMs to accelerate both workloads.
In the last experiment group, we enable all 16 programmable PIM PUs and the fixed-function PIMs, which together form a heterogeneous PIM configuration. The results are shown in Figure 16 and Figure 17. The peak DRAM temperatures during CG and MG execution are 69.59°C and 69.74°C, respectively. They are still lower than the 85°C thermal constraint. Thus, heterogeneous PIMs are also thermally feasible.
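To compare the four experiment groups at a glance, the short sketch below tabulates the peak DRAM temperatures reported above against the 85°C constraint and computes the remaining thermal margin. The temperatures are the values quoted in the text; the configuration labels and the helper names `thermal_margins` and `all_feasible` are illustrative assumptions, not part of the evaluation toolchain.

```python
DRAM_LIMIT_C = 85.0  # DRAM thermal constraint used throughout the evaluation

# Peak DRAM temperatures (degrees Celsius) as reported for each experiment group.
PEAK_DRAM_TEMP_C = {
    "baseline (host CPU only)": {"CG": 65.51, "MG": 68.28},
    "programmable PIMs only": {"CG": 69.16, "MG": 68.78},
    "fixed-function PIMs only": {"CG": 68.78, "MG": 68.53},
    "heterogeneous PIMs": {"CG": 69.59, "MG": 69.74},
}


def thermal_margins(peaks, limit=DRAM_LIMIT_C):
    """Return the headroom (limit minus peak) per configuration and workload."""
    return {
        config: {wl: round(limit - temp, 2) for wl, temp in workloads.items()}
        for config, workloads in peaks.items()
    }


def all_feasible(peaks, limit=DRAM_LIMIT_C):
    """True if every reported peak temperature stays below the limit."""
    return all(
        temp < limit for workloads in peaks.values() for temp in workloads.values()
    )


if __name__ == "__main__":
    for config, margins in thermal_margins(PEAK_DRAM_TEMP_C).items():
        print(f"{config}: CG margin {margins['CG']} C, MG margin {margins['MG']} C")
    print("all configurations feasible:", all_feasible(PEAK_DRAM_TEMP_C))
```

Under these reported values, every configuration keeps more than 15°C of headroom, and enabling all PIMs raises the peak DRAM temperature by only a few degrees over the baseline.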
CONCLUSIONS
PIM techniques are re-emerging in a new form, due to recent technology and application advances. Yet thermal constraints can be a caveat in adopting PIM. This paper investigates the thermal feasibility of integrating PIMs in the logic die of 3D-stacked memory. Unlike most previous work, which studies the thermal behavior of a standalone memory stack, we examine the thermal behavior of the whole system consisting of the host CPU and the memory stack. Based on comprehensive thermal analysis, we provide the following thermal implications for PIM-based system design and configuration:
1. Even for a standalone memory stack, a stack with large-scale programmable PIMs requires high-end or commodity-server cooling solutions to accommodate its peak thermal dissipation.
