ABSTRACT In this paper, we propose a thermal- and performance-aware address mapping (TPAMAP) for multi-channel 3-D DRAM systems. TPAMAP mitigates the thermal problem caused by the vertical stacking of active banks in different DRAM channels and supports mappings that trade off temperature against performance. In our experiments, TPAMAP reduces the peak temperature of the 3-D DRAM system by 1.1 °C to 12.3 °C. A cost function for the temperature and performance constraints is also proposed. Finally, we demonstrate the low cost of TPAMAP through a hardware implementation.
I. INTRODUCTION
THE three-dimensional integrated circuits (3D ICs) have been proposed to achieve better performance than traditional two-dimensional (2D) chips. 3D ICs use Through-Silicon Via (TSV) technology to provide vertical connectivity and to stack multiple heterogeneous intellectual properties (IPs) in one chip. Three-dimensional dynamic random access memory (3D DRAM) can also be realized by using TSVs to stack multiple DRAM dies in a 3D chip. 3D DRAM can overcome the bandwidth limitation of traditional 2D DRAM [1].
In recent works, different 3D DRAM architectures have been discussed to solve the aforementioned bandwidth problem, such as Wide I/O [16], [20], [21], Hybrid Memory Cube (HMC) [5], [17], and 3D stacked DDRx/LPDDRx DRAMs [18], [19], [3]. In Wide I/O [16], [20], [21], a single data rate (SDR) DRAM with 4 parallel channels and 512 I/O pins is designed to overcome the limited number of I/O pins in traditional 2D DRAM. In HMC [17], multiple DRAM dies are stacked on top of a logic die to form the 3D DRAM architecture. Through high-speed TSVs, the controller in the logic die can access the DRAM dies in parallel in the vertical direction. In [3], [18], [19], multiple LPDDR3 and DDR2 DRAM dies are stacked to achieve higher bandwidth.
Although 3D DRAM has the aforementioned advantages, it also suffers from thermal problems caused by the stacking of the ICs [2]. Stacking the DRAM dies in the 3D DRAM results in high power density and poor heat dissipation [6], [7]. To solve these thermal problems, many thermal management schemes have been proposed in the literature [3] - [7], [13], [14], [5], [21] - [24], [27], [28].
In the 3D DRAM system, the memory bandwidth is improved by multiple channels and a wide interface [8], [11], [12]. Different controller designs have been proposed to support 3D DRAM systems, such as the centralized controller [8], [25], [26] and the distributed controller [8], [5], [22]. The centralized controller handles the memory accesses to multiple channels in the 3D DRAM. It is usually designed for a dedicated 3D DRAM system, and its interconnection among multiple channels is complex. The distributed controller is flexible enough to support different numbers of active channels in the 3D DRAM system, but its memory bandwidth depends heavily on the address mapping. Thus, we consider the mapping for the distributed controller in this work.

According to the discussion above, both the thermal and performance problems must be considered for 3D DRAM systems. In this work, we propose a thermal- and performance-aware address mapping (TPAMAP) for multi-channel 3D DRAM systems. TPAMAP extends our previously proposed thermal-aware mapping in [13], which considers only the thermal-aware mapping for the 3D multi-channel DRAM system and is verified with random patterns. TPAMAP reduces the stacking of active banks in different DRAM channels as much as possible. In addition, TPAMAP considers the design constraints on temperature and performance.

Fig. 1 shows the design goal of TPAMAP. For the 3D DRAM, the mapping problem can be divided into two design aspects: high bandwidth and low temperature. For the high-bandwidth mapping, the highest memory bandwidth (BW) is achieved by maximizing the number of parallel memory channels in the multi-channel 3D DRAM system. The high-bandwidth mapping achieves a BW above the lower bound ($BW_{lb}$), but the temperature may also exceed the pre-defined thermal limit ($T_{limit}$), as P1 in Fig. 1. This is because the DRAM layer far from the heat sink (layer 0) may accumulate more heat than the others (layers 1, 2, and 3). For the low-temperature mapping, the lowest temperature in the 3D DRAM chip is achieved by moving most of the workload near the heat sink, as P2 in Fig. 1. The temperature in the 3D DRAM can then be lower than $T_{limit}$, but the BW may also be lower than $BW_{lb}$ because the workload is unbalanced across the DRAM layers. According to the temperature ($T$) from the thermal sensor and the bandwidth requirement (BW), TPAMAP finds a suitable mapping that satisfies the constraints $T < T_{limit}$ and $BW > BW_{lb}$, as shown in Fig. 1.
In this paper, our contributions are summarized as follows:
• TPAMAP is proposed to solve the thermal problems in the multi-channel 3D DRAM system and to find mappings that trade off $BW_{lb}$ against $T_{limit}$. In our experiments, we analyze TPAMAP under random patterns and real patterns of the x86-64 multicore architecture. The peak temperature is reduced by 1.1 °C to 12.3 °C compared to the related work in [6].
• A cost function for the temperature and performance constraints is also proposed to find a suitable mapping that satisfies the constraints $T < T_{limit}$ and $BW > BW_{lb}$. Two cases of $T_{limit}$ and $BW_{lb}$ are discussed in our experiments.
• To realize the hardware design of TPAMAP, a hardware design flow is defined. A Verilog generator is also proposed to generate the Verilog HDL from the DRAM trace file of a specific application. We also demonstrate the low cost of TPAMAP by using a cell-based design [30] with 90-nm CMOS technology and a Xilinx Spartan-6 FPGA design [31].

In Section II, we review related thermal management schemes for the 3D DRAM system. In Section III, the problem formulation and the DRAM model are introduced. In Section IV, the proposed TPAMAP is discussed. In Sections V and VI, the hardware overhead and the experimental results are discussed to demonstrate the features of TPAMAP. Section VII concludes the paper.
II. RELATED WORKS
In the literature, many thermal management schemes have been proposed to solve the thermal problems of 3D DRAM systems [5] - [7], [13], [14], [5], [21] - [24], [27], [28]. In this section, these schemes are reviewed.
A. THERMAL MANAGEMENTS FOR 3D DRAMS
In [21] and [22], refresh timing controls were applied to dynamically control the refresh period of the 3D DRAM system. When the maximal temperature is lower, the refresh rate is reduced, which decreases the refresh power and the peak temperature. In [23], [24], thermal management was applied to 3D stacked cache memory and multi-core systems. In [23], dynamic voltage and frequency scaling (DVFS) was applied to the processor cores, and the number of active memory banks was re-allocated to each core. In [24], DVFS was applied to the hybrid cache memory (SRAM/MRAM L2 cache) and the processor cores, improving both the throughput and the energy-delay product. By using DVFS, the peak temperature can be kept below the predefined thermal limit. In [5], real-time data compression was designed for the HMC. Compressing the read/write data in real time reduces the number of memory accesses and shortens the time spent throttling overheated memory regions. The power and peak temperature are also decreased.
In [6], [7], [13], [14], [27], [28], thermal-aware memory mappings were proposed for 3D stacked DRAMs to solve the thermal problem caused by the stacking of the DRAM dies. In [14], memory address management was proposed to move memory addresses from high-temperature banks to low-temperature ones. In [27], a temperature- and traffic-aware cache way placement was proposed for the 3D stacking of multi-core processors and cache. The cache ways are selected according to the temperature and the bus congestion to minimize the leakage power dissipation. In [28], a temperature-aware algorithm was proposed for a heterogeneous memory system that combines a DDRx DRAM with an HMC. By reallocating the memory accesses between the DDRx DRAM and the HMC, the hotspot is removed while the required performance is maintained. In [6], a thermal-aware mapping for 3D DRAM systems was proposed to reduce the peak temperature by avoiding the stacking of active banks in the DRAM. In [7], the thermal-aware mapping from [6] was applied to a 3D stacked memory and processor architecture to avoid accessing memories at vertically aligned locations. In our previous work [13], a thermal-aware mapping for the 3D multi-channel DRAM system was proposed. It reduces the thermal problem caused by the vertical stacking between two adjacent DRAM layers in different channels.

VOLUME 5, 2017
Among the aforementioned thermal management schemes, the refresh timing controls and the real-time data compression require extra hardware in the memory controller [5], [22]. For the thermal-aware mappings, only the DRAM addresses need to be remapped, so the hardware cost of the mapping circuits is low [6]. In [27], [28], the thermal- and performance-aware algorithms were designed for dedicated memory systems (the 3D stacking of multi-core processors and cache [27], and a heterogeneous memory system that combines a DDRx DRAM with an HMC [28]). For the memory mappings that target general 3D DDRx DRAM systems, only the thermal problem was considered in the related works [6], [7], [13], [14]. Therefore, we focus on the thermal-aware mapping for the 3D DRAM system and consider both the thermal and performance problems for multi-channel 3D DDRx systems.
Thermal-aware mappings must consider two features: 1) memory accesses and 2) vertical stacking. A thermal-aware mapping depends heavily on its application, because different applications may produce unbalanced memory accesses in the 3D DRAM. To balance the thermal distribution in the 3D DRAM, the mapping must re-allocate the memory accesses according to its target application. In the 3D DRAM architecture, the vertical stacking of active banks accumulates power, which in turn raises the temperature rapidly. The related mappings are introduced in Sections II.B and II.C.
B. DRAM MAPPING CONSIDERING MEMORY ACCESSES [6], [14]
In [6], [14], the authors analyzed the thermal behavior of a 3D multi-core system with stacked DRAM dies. Based on the thermal analysis, memory address management policies were proposed to map frequently accessed memory addresses from high-temperature memory banks to low-temperature ones. The results show that the proposed policy reduces the peak temperature and the thermal variation.
C. DRAM MAPPING CONSIDERING VERTICAL STACKING [6], [7]
In [6], the authors proposed a thermal-aware mapping (TAMAP) for 3D DRAM systems. TAMAP reduces the peak temperature by avoiding the stacking of active banks in the DRAM. The reason is that stacked active banks accumulate power in the sense amplifiers, which have a small area and a high power density.
TAMAP has also been applied in other works. In [7], the authors proposed a thermal-aware memory mapping for the 3D stacked memory and processor architecture, in which TAMAP [6] was applied to avoid accessing memories at vertically aligned locations.

According to the aforementioned works, the memory mapping must consider both the memory accesses and the vertical stacking. In 3D DRAM systems, directly applying TAMAP [6] to each DRAM channel avoids the thermal problem within each channel, but the active banks in different channels may still be stacked in the vertical direction. This is because the active banks of different channels are accessed simultaneously in multi-channel 3D DRAM systems. Therefore, the memory mapping must consider the vertical stacking of active banks between different channels. In this work, we extend the idea of TAMAP [6] to the multi-channel 3D DRAM system. Both the memory accesses and the vertical stacking are considered in the proposed TPAMAP.
III. PROBLEM FORMULATION AND DRAM MODELING
In Section III.A, the performance and thermal issues for the multi-channel 3D DRAM system are introduced. In Section III.B, the DRAM model in our 3D DRAM system is introduced.
A. PROBLEM FORMULATION
The thermal and performance issues of the multi-channel 3D DRAM system are discussed in two aspects: 1) memory accesses for the multi-channel DRAM systems and 2) memory accesses for the 3D DRAM architectures, as follows.
1) MEMORY ACCESSES FOR THE MULTI-CHANNEL DRAM SYSTEMS
In a modern DRAM chip, the data width is typically 4, 8, or 16 bits, as in the Micron DDR2 DRAM [11]. To support memory accesses wider than 16 bits, several DRAM dies must be active in parallel. Thus, the DRAM system is designed with multiple physical DRAM channels to increase the transfer speed. This architecture is called the multi-channel DRAM architecture, as in the example in Fig. 2(a). In Fig. 2(a), there are four channels (C0 to C3) in the DRAM architecture and one 8-bit DRAM die per channel. The memory access of the DRAM die in each channel is independent. If the processor requests 32-bit memory data through the memory controller, four memory dies from different channels must be active.

FIGURE 2. Problem formulation: (a) memory accesses for the multi-channel DRAM system and (b) memory accesses for the 3D DRAM architecture.
For the multi-channel DRAM system, the allocation of the memory accesses may seriously influence the memory bandwidth. If the memory accesses are classified as the accesses to the multi-channel 3D DRAM system (DRAM access, DA), to the channels (channel access, CA), and to the banks (bank access, BA), the relation among DA, CA, and BA can be defined as in (1):

$$DA = \sum_{i=0}^{NC-1} CA_i = \sum_{i=0}^{NC-1} \sum_{j=0}^{NB-1} BA_{ij} \tag{1}$$

In (1), the CA of channel i is $CA_i$ and the BA of bank j in channel i is $BA_{ij}$. NC is the number of channels in the multi-channel DRAM system, and NB is the number of memory banks in each channel. DA is thus the sum of the memory accesses of the channels ($CA_i$), which in turn is the sum of the memory accesses of the banks ($BA_{ij}$). The allocation of $CA_i$ and $BA_{ij}$ seriously influences the performance. A DRAM mapping can allocate the number of memory accesses in different channels and banks by remapping the memory addresses. Thus, the presence of multiple channels makes the design of a high-performance mapping a key performance issue.
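The relation in (1) can be checked numerically with a minimal sketch. The access counts below are illustrative values chosen for this example only, and the function names are hypothetical helpers rather than part of the proposed design.

```python
# BA[i][j]: number of accesses to bank j of channel i.
# The counts are made-up illustrative values, not measured data.
BA = [
    [10, 20, 30, 40],  # channel C0
    [5, 5, 5, 5],      # channel C1
    [0, 15, 0, 25],    # channel C2
    [50, 0, 0, 0],     # channel C3
]


def channel_accesses(ba):
    """CA_i is the sum of BA_ij over all banks j of channel i."""
    return [sum(row) for row in ba]


def dram_accesses(ba):
    """DA is the sum of CA_i over all channels i, as in (1)."""
    return sum(channel_accesses(ba))


CA = channel_accesses(BA)
DA = dram_accesses(BA)
print(CA, DA)  # [100, 20, 40, 50] 210
```

In this example, C0 carries almost half of all accesses, which is the kind of unbalanced CA distribution that a mapping can redistribute.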
2) MEMORY ACCESSES FOR THE 3D DRAM ARCHITECTURE
To implement the multi-channel DRAM system in the TSV 3D stacked architecture, several DRAM dies may be stacked in different layers, as in the example in Fig. 2(b). In the multi-channel DRAM system, parallel channels are used to increase the performance. However, parallel memory accesses in the 3D DRAM may cause a thermal issue, as in the case in Fig. 2(b), where the DRAM dies of different channels are stacked in the 3D chip. In this architecture, if banks in the DRAM dies of different channels are accessed at the same time, the active banks may be aligned in the vertical direction. This may cause a serious thermal problem because heat accumulates rapidly in the sense amplifiers [6]. In addition, different DRAM dies have different heat dissipation paths in the 3D DRAM. In Fig. 2(b), the banks in the top channel (C0) have a longer heat dissipation path than the ones in the other channels (C1, C2, and C3). The unbalanced paths between the channels and the heat sink are a design issue for the thermal-aware mapping. In the thermal-aware mapping, the DRAM accesses are re-allocated in the time and space domains to balance the heat dissipation across channels.
B. DRAM MODEL
In Section III.B, we introduce the DRAM model. The DRAM model is used to construct the 3D DRAM system and to demonstrate the features of the proposed mapping algorithm in the experiments. We adopt the DRAM model of the typical DRAM (TDRAM) in [6]. Fig. 3(a) shows the floorplan of TDRAM. For each memory access, the control and precharge circuits (CP) are active. If a bank is accessed, the related cell bank (Bank), sense amplifier (SA), and peripheral circuits (Peri) are active. Only one bank in a DRAM die can be active at a time because of the shared data bus (IO). Fig. 3(b) shows the power profile of TDRAM, which follows the setup in [6]. In TDRAM, the SA has a very high power density in a very small area. If active banks are stacked at the same location in the 3D DRAM, the SAs may cause a very high temperature.
According to the definition in Section III.A, the multi-channel DRAM system uses more DRAM channels to increase the DA, as in (1). However, the allocation of CAs and BAs to different channels influences the performance and temperature of the multi-channel DRAM system. Unbalanced CAs and BAs may result in the vertical stacking of active SAs and CPs in adjacent channels and raise the peak temperature in the multi-channel 3D DRAM system. In addition, unbalanced CAs degrade the performance. Hence, a thermal- and performance-aware mapping is needed to solve both the thermal and performance problems, as discussed in Section IV.
IV. THERMAL- AND PERFORMANCE-AWARE ADDRESS MAPPING (TPAMAP)
In Section IV, the proposed TPAMAP is introduced. Fig. 4 shows the key concept of TPAMAP. To increase the memory bandwidth of a multi-channel DRAM system, memory interleaving is applied to the DRAM dies in multiple channels [8]. Memory interleaving causes the DRAM dies to be accessed at the same time, as shown in Fig. 2(a). To support memory interleaving, the memory accesses to the DRAM dies in each channel must be controlled by separate memory controllers. Thus, directly applying TAMAP [6] to each channel avoids the vertical stacking of active banks within one channel, but the thermal problem caused by active banks that belong to different channels cannot be avoided, as shown in Fig. 4(a). First, the active banks with high BAs in C0 may accumulate heat, because the DRAM layers in C0 have a longer heat dissipation path than the layers in C1. Second, because the active banks are controlled individually in different channels, active banks with high BAs from different channels may be stacked in adjacent DRAM layers, as shown in Fig. 4(a). The stacking of active banks may raise the temperature rapidly. These thermal problems can be solved by remapping the banks to different locations in the 3D DRAM. However, the mapping algorithm also influences the parallelism of the DRAM channels in the multi-channel 3D DRAM. To solve both the thermal and performance problems, the design trade-off in TPAMAP is discussed in Section IV.C.
TPAMAP solves the aforementioned problems in two aspects: 1) inter-channel bank reordering (ICBR), which reduces the stacking of active banks with high BA in adjacent layers, as shown in Fig. 4(b); and 2) inter-channel bank swapping (ICBS), which swaps banks according to the design trade-offs between the thermal and bandwidth constraints, as in the example in Fig. 4(c). For the 4-channel 3D DRAM in Fig. 4(c), TPAMAP supports mappings with 4 distributions of CAs, $MAP_1$ to $MAP_4$. $CA_0$, $CA_1$, $CA_2$, and $CA_3$ are the CAs of channels C0, C1, C2, and C3. $MAP_1$ to $MAP_4$ concentrate the CAs in 1 to 4 channels by selecting {C3}, {C2,C3}, {C1,C2,C3}, and {C0,C1,C2,C3}, respectively. These channels are closer to the heat sink and dissipate heat better. The second column in Table 1 shows the conditions of these mappings. For an n-channel 3D DRAM system, equation (2) gives the general rule of $MAP_1$ to $MAP_n$, where the parameter x ranges from 1 to n. The CAs are concentrated in 1 to n channels by using $MAP_1$ to $MAP_n$.
Fig. 5 shows the flow chart of TPAMAP to realize $MAP_x$ (x in the range 1 to n) for the n-channel 3D DRAM system. To realize $MAP_x$, three cases of the parameter Iteration are executed: Iteration = 1, Iteration = 2 to x, and Iteration = x + 1. When Iteration = 1, step 1 of ICBS is executed. In step 1, the banks with high BA are remapped to the banks closest to the heat sink as far as possible. When Iteration is in the range 2 to x, step 2 of ICBS is executed. The goal of step 2 is to swap banks so that the CAs approach the best case in Table 1. When Iteration = x + 1, ICBR is executed to reduce the vertical stacking between adjacent DRAM layers in different channels. The number of iterations in the flow chart therefore depends on the parameter x.
For $MAP_1$ to $MAP_n$, step 1 of ICBS and ICBR must always be executed. Step 2 of ICBS is executed x - 1 times if x is larger than 1. For instance, step 2 of ICBS is not executed for $MAP_1$, whereas it runs 3 times for $MAP_4$. ICBS is described in Section IV.A and ICBR in Section IV.B. In Section IV.C, the cost function that determines the mapping under the bandwidth and temperature constraints is introduced.
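The channel selection and the iteration schedule described above can be summarized in a short sketch. It assumes, as in the 4-channel example, that channel C(n-1) is the one closest to the heat sink; the function names are hypothetical and only model the control flow, not the bank remapping itself.

```python
def active_channels(n, x):
    """Channels selected by MAP_x in an n-channel stack.

    Assumes C(n-1) is closest to the heat sink, as C3 is in the
    4-channel example.
    """
    if not 1 <= x <= n:
        raise ValueError("x must be in the range 1..n")
    return list(range(n - x, n))


def tpamap_schedule(x):
    """Sequence of steps executed to realize MAP_x."""
    steps = []
    for iteration in range(1, x + 2):
        if iteration == 1:
            steps.append("ICBS step 1")   # sort banks by BA
        elif iteration <= x:
            steps.append("ICBS step 2")   # swap banks based on CAs
        else:
            steps.append("ICBR")          # iteration == x + 1
    return steps


print(active_channels(4, 2))  # [2, 3]
print(tpamap_schedule(1))     # ['ICBS step 1', 'ICBR']
print(tpamap_schedule(4).count("ICBS step 2"))  # 3
```

The last line reproduces the statement that step 2 of ICBS runs 3 times for the mapping that uses all four channels.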
A. INTER-CHANNEL BANK SWAPPING (ICBS)
In inter-channel bank swapping (ICBS), the banks in the upper memory layers are swapped with the ones in the lower memory layers to redistribute the CAs and BAs for different mappings. Balanced CAs and BAs in each channel achieve higher performance, but may cause an unbalanced thermal distribution in the 3D DRAM system. Reducing the number of active channels decreases the peak temperature, but it also degrades the system performance. To find the trade-off between temperature and performance, ICBS must support mappings with the CAs concentrated in 1 to n channels, as in the example in Fig. 4(c).
1) SORTING THE BANKS WITH BA
In step 1, the banks with high BA are remapped to the channel closest to the heat sink, as C3 in Fig. 6(a). The goal is to remove as much heat as possible from the multi-channel 3D DRAM system. To realize this feature, ICBS sorts the banks according to their BAs. After that, the banks are remapped to different channels according to the BAs, so that the banks with high BAs are remapped to the channels close to the heat sink. Equation (3) shows the rules for remapping the banks in the n-channel 3D DRAM. Fig. 6(a) shows an example of the 4-channel 3D DRAM switching to $MAP_1$. Fig. 6(b) shows the pseudo code for step 1 of ICBS. First, the BAs of the NB banks in the NC channels are added to one list (BAlist) by Loop1 and Loop2. Second, the BAs in BAlist are sorted in decreasing order by the function Sort-Dec(), and the sorted results are stored in another list, OBAlist. In Loop3 and Loop4, the banks of channels NC - 1 down to 0 are filled according to the order of the sorted BAs in OBAlist.
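A minimal Python sketch of step 1 of ICBS, following the four loops described above, is given below. It assumes that every channel has the same number of banks and that the bank positions within a channel are filled in index order; the latter is a choice made for this sketch, and the function name and the returned remapping dictionary are hypothetical.

```python
def icbs_step1(ba):
    """Sort all banks by BA and fill channels NC-1 .. 0 in that order.

    ba[i][j] is the access count of bank j in channel i. Returns the
    remapped access matrix and a dict mapping each original
    (channel, bank) to its new (channel, bank).
    """
    nc, nb = len(ba), len(ba[0])
    # Loop1/Loop2: collect every BA together with its original location.
    ba_list = [(ba[i][j], (i, j)) for i in range(nc) for j in range(nb)]
    # Sort-Dec(): highest access count first (ties keep their order).
    oba_list = sorted(ba_list, key=lambda entry: entry[0], reverse=True)
    # Loop3/Loop4: fill channel NC-1 (nearest the heat sink) first.
    new_ba = [[0] * nb for _ in range(nc)]
    remap = {}
    k = 0
    for i in range(nc - 1, -1, -1):
        for j in range(nb):
            count, source = oba_list[k]
            new_ba[i][j] = count
            remap[source] = (i, j)
            k += 1
    return new_ba, remap


new_ba, remap = icbs_step1([[4, 1], [3, 2]])
print(new_ba)         # [[2, 1], [4, 3]]
print(remap[(0, 0)])  # (1, 0)
```

In this two-channel example, the two hottest banks end up in channel 1, which the sketch treats as the channel nearest the heat sink.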
2) SWAPPING THE BANKS BASED ON CAs
In step 2, $MAP_2$ to $MAP_n$ for the n-channel 3D DRAM are realized according to the CAs. Fig. 7(a), (b), and (c) show an example of switching the 4-channel 3D DRAM from $MAP_1$ to $MAP_3$. Fig. 7(a) shows the CAs of the 3D DRAM in $MAP_1$. First, $MAP_1$ is switched to $MAP_2$ by swapping banks between two channels, C2 and C3. In this step, $CA_3$ is reduced and $CA_2$ is increased, as in Fig. 7(b). The goal is to bring $CA_2$ and $CA_3$ close to their best case in $MAP_2$ (DA/2), as in the example in Table 1. The condition for swapping between two channels (channels x and y, where x < y) is shown in (4). The banks between C2 and C3 can be swapped if ($CA_2$ < DA/2) and ($CA_3$ > DA/2) are satisfied.
To achieve $MAP_3$, the banks are swapped among C1, C2, and C3. The goal is to bring $CA_1$, $CA_2$, and $CA_3$ close to the best case in $MAP_3$ ($CA_3 = CA_2 = CA_1$ = DA/3), which is shown in Table 1. $CA_1$ is increased, while $CA_2$ and $CA_3$ are decreased. Thus, the banks in C1 must be swapped iteratively with the ones in C2 and C3. The condition in (4) is also applied to swap the banks between the channel pairs {C1, C2} and {C1, C3}. Fig. 7(c) shows the example of $MAP_3$ for the 4-channel 3D DRAM. Fig. 7(d) shows the pseudo code for step 2 of ICBS. First, the BAs in two channels (Cx and Cy) are sorted in increasing and decreasing order by the functions Sort-Inc() and Sort-Dec(), as in Loop1 and Loop2. In Loop3 and Loop4, the BAs in Cx and Cy, $BA_{(x)(bx)}$ and $BA_{(y)(by)}$, are swapped by Switch() according to the conditions in (4). The swapped BAs are also added to a list, BPlist. The BAs in BPlist are not swapped again, which avoids unnecessary bank swapping between the two channels.
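The swapping procedure of step 2 can be sketched as follows. The swap condition is taken from the worked example ($CA_x$ below the per-channel target DA/k and $CA_y$ above it). The additional check that rejects a swap which would push $CA_x$ past the target is an assumption of this sketch, as are the function name and the representation of BPlist as a set.

```python
def icbs_step2(ba, x, y, k):
    """Swap banks between channels x and y (x < y) toward CA = DA / k.

    ba is modified in place. Returns the set of swapped
    (channel, bank) locations, playing the role of BPlist.
    """
    target = sum(sum(row) for row in ba) / k
    # Sort-Inc() on Cx and Sort-Dec() on Cy, expressed as index orders.
    order_x = sorted(range(len(ba[x])), key=lambda b: ba[x][b])
    order_y = sorted(range(len(ba[y])), key=lambda b: ba[y][b], reverse=True)
    bp_list = set()
    for by in order_y:                      # Loop3
        for bx in order_x:                  # Loop4
            ca_x, ca_y = sum(ba[x]), sum(ba[y])
            if not (ca_x < target and ca_y > target):
                return bp_list              # swap condition no longer holds
            if (x, bx) in bp_list or (y, by) in bp_list:
                continue                    # never swap a bank twice
            delta = ba[y][by] - ba[x][bx]
            if delta <= 0 or ca_x + delta > target:
                continue                    # assumption: avoid overshooting
            ba[x][bx], ba[y][by] = ba[y][by], ba[x][bx]   # Switch()
            bp_list.update({(x, bx), (y, by)})
            break
    return bp_list


ba = [[1, 2, 3, 4], [10, 20, 30, 40]]
icbs_step2(ba, x=0, y=1, k=2)
print(ba)                       # [[40, 2, 3, 10], [4, 20, 30, 1]]
print(sum(ba[0]), sum(ba[1]))   # 55 55
```

In this two-channel example, two swaps are enough to bring both channels to the best case DA/2 = 55.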
B. INTER-CHANNEL BANK REORDERING (ICBR)
For the thermal problem caused by the vertical stacking of active banks in different channels, directly exchanging the bank with the highest BA (the hot bank) with the one with the lowest BA (the cold bank) in the same DRAM layer may remove the thermal hotspot, but it may also create another hotspot at a different location. To solve this problem, ICBR is proposed in this work. ICBR considers all of the banks in the adjacent DRAM layers to reduce the vertical stacking of active banks in different channels. The main concept of ICBR is to reduce the stacking of banks with high BAs in adjacent DRAM layers.

In the DRAM system, multiple DRAM dies can be accessed at the same time by a given memory address to increase the data width and throughput. In each DRAM die, only one DRAM bank can be accessed at a time because of the shared hardware components and data bus [6]. Fig. 8(a) and Fig. 8(b) show an example of applying ICBR between two channels (C0 and C1) of the multi-channel 3D DRAM architecture. Each channel contains two DRAM layers, and two DRAM dies are located in the same DRAM layer. The memory controller requests 16-bit memory data, so two DRAM dies with 8-bit data width in one channel are active simultaneously. The numbers 0 to 7 represent the banks that are active simultaneously in the active DRAM dies of the same channel, which are called sets 0 to 7 [6]. Within each channel, the vertical stacking of active banks is avoided by TAMAP [6], so the active banks are not aligned in the vertical direction. For the vertical stacking of active banks between different channels, ICBR is applied, as in the example in Fig. 8(b).

To solve the thermal problem between two adjacent channels, the BAs of each set are collected. Then, the banks in one channel (C0) are sorted in decreasing order of their BAs, whereas the banks in the other channel (C1) are sorted in increasing order. After that, the stacking of the memory banks is defined as four pairs of sets: {0,7}, {2,5}, {3,4}, and {1,6}. According to these pairs, the sets in C1 are reordered: the locations of set 7 and set 6 are exchanged, and so are the locations of set 4 and set 5. The goal is to guarantee that each pair of sets is stacked in adjacent DRAM layers. Thus, the thermal problem caused by the vertical stacking is reduced.

Fig. 8(c) shows the pseudo code for ICBR. In Loop1, the banks in two adjacent channels, Cx and Cx+1, are sorted in increasing and decreasing order by Sort-Inc() and Sort-Dec(). The sorted results of Cx and Cx+1 are added to two lists, xlist and ylist. From the order of the elements in xlist and ylist, the pairs of sets, {$xlist_{(0)}$, $ylist_{(0)}$} to {$xlist_{(NB-1)}$, $ylist_{(NB-1)}$}, are determined by Pair-of-sets(xlist, ylist). After that, the BAs in Cx+1 ($BA_{(x+1)(0)}$ to $BA_{(x+1)(NB-1)}$) are reordered according to Pair-of-sets(xlist, ylist). Thus, the stacking of highly accessed banks between Cx and Cx+1 is reduced and the thermal problem is mitigated.
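The pairing idea behind ICBR can be sketched in Python as follows. The sketch assumes that position p in channel Cx is stacked directly against position p in channel Cx+1, which simplifies the floorplan of Fig. 8; the function name and the returned placement dictionary are hypothetical.

```python
def icbr(ba_x, ba_y):
    """Reorder the sets of channel Cx+1 against those of channel Cx.

    ba_x[p] and ba_y[p] are the access counts at stacked position p
    (assumed to be vertically adjacent). Returns the reordered counts
    for Cx+1 and a dict mapping each old position in Cx+1 to its new one.
    """
    nb = len(ba_x)
    # Sort-Inc() on Cx and Sort-Dec() on Cx+1, expressed as position lists.
    xlist = sorted(range(nb), key=lambda p: ba_x[p])
    ylist = sorted(range(nb), key=lambda p: ba_y[p], reverse=True)
    new_y = [0] * nb
    placement = {}
    # Pair-of-sets(): the k-th coldest set of Cx faces the k-th hottest of Cx+1.
    for px, py in zip(xlist, ylist):
        new_y[px] = ba_y[py]
        placement[py] = px
    return new_y, placement


ba_c0 = [50, 10, 30, 20]
ba_c1 = [40, 5, 25, 15]
new_c1, placement = icbr(ba_c0, ba_c1)
print(new_c1)                                      # [5, 40, 15, 25]
print(max(a + b for a, b in zip(ba_c0, ba_c1)))    # 90 before reordering
print(max(a + b for a, b in zip(ba_c0, new_c1)))   # 55 after reordering
```

In this example, the largest combined access count of any stacked pair drops from 90 to 55, which illustrates why pairing hot sets with cold sets reduces the accumulated power at a single location.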
C. PERFORMANCE-AND THERMAL-AWARE DESIGN TRADE-OFFS
For the multi-channel 3D DRAM system, the number of active channels influences the system power, latency, and temperature. Increasing the number of active channels reduces the latency, but raises the power and the temperature. The temperature rises because the power of the active banks accumulates in the vertical direction. Thus, we focus on the design trade-offs of enabling different numbers of active channels under the thermal limit ($T_{limit}$) and the lower bound of the bandwidth ($BW_{lb}$). The goal is to find a suitable mapping that satisfies the constraints $T < T_{limit}$ and $BW > BW_{lb}$. To find such a mapping under the bandwidth and temperature constraints, designers can define the cost function according to their application. In this paper, we consider a cost function with the constraints $T_{limit}$ and $BW_{lb}$. For a mapping with temperature T and bandwidth BW, the cost is calculated as in (5) and (6). If T is less than or equal to $T_{limit}$ and BW is greater than or equal to $BW_{lb}$, the cost is calculated as in (5):
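If T is higher than $T_{limit}$ or BW is less than $BW_{lb}$, the mapping does not satisfy our design goal. Thus, the cost is defined as infinity (INF), as in (6):

$$\text{cost} = \text{INF}, \quad \text{if } T > T_{limit} \text{ or } BW < BW_{lb} \tag{6}$$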
According to (5) and (6), the mapping with higher BW and lower T can achieve lower cost. For the cases in our experiments, we analyze the cost for different parallel channels. The detailed simulation results are shown in Section VI.D.
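The selection of a mapping by cost can be illustrated with a short sketch. Since the cost function may be defined by the designer for each application, the expression used for the feasible case below is only a hypothetical placeholder that decreases with higher BW and lower T; it should not be read as the form of (5). The temperatures and bandwidths in the example are made-up values.

```python
import math


def mapping_cost(t, bw, t_limit, bw_lb):
    """Cost of a mapping with temperature t and bandwidth bw."""
    if t > t_limit or bw < bw_lb:
        return math.inf                 # infeasible mapping, as in (6)
    # Hypothetical placeholder for the feasible case: lower T and
    # higher BW both reduce the cost.
    return t / t_limit + bw_lb / bw


def select_mapping(candidates, t_limit, bw_lb):
    """Return the name of the cheapest feasible mapping, or None."""
    costs = {name: mapping_cost(t, bw, t_limit, bw_lb)
             for name, (t, bw) in candidates.items()}
    best = min(costs, key=costs.get)
    return best if math.isfinite(costs[best]) else None


# Illustrative (temperature in degrees C, bandwidth in GB/s) pairs.
candidates = {"MAP1": (70.0, 2.0), "MAP2": (75.0, 4.0), "MAP4": (95.0, 8.0)}
print(select_mapping(candidates, t_limit=85.0, bw_lb=3.0))  # MAP2
```

Here MAP1 violates the bandwidth bound and MAP4 violates the thermal limit, so both receive an infinite cost and MAP2 is the only feasible choice.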
V. HARDWARE ARCHITECTURE DESIGN
In Section V, the hardware architecture design of TPAMAP is introduced. TPAMAP can be embedded in the memory controller to control the memory accesses of the 3D DRAM system. TPAMAP remaps the channel and bank addresses among different channels in the 3D DRAM system. We also demonstrate the low cost of TPAMAP in our hardware implementation. Fig. 9 shows the block diagram of the 4-channel 3D DRAM system containing TPAMAP, which is integrated into the memory controller. TPAMAP remaps the memory address according to the concentration of the CAs and BAs. The inputs of TPAMAP are Badr, Cadr, and sch. Badr and Cadr are the 4-bit bank address and the 2-bit channel address of the 4-channel 3D DRAM. The 4-bit bank address is required to identify the banks in each channel. sch is a 2-bit control signal that selects one of the mappings ($MAP_1$, $MAP_2$, $MAP_3$, and $MAP_4$). According to sch, Badr and Cadr are remapped to Badr' and Cadr'. The remapping table of the BAs and CAs can be implemented with read-only memories (ROMs), as shown in Fig. 9.
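A behavioral model of this remapping can be sketched in Python. The sketch flattens the mappings into a single 256-entry table addressed by {sch, Cadr, Badr}, and it assumes that sch = 0 selects the first mapping; both choices are assumptions of this sketch, as are the function names and the identity default for unlisted addresses.

```python
def build_rom(remap_tables):
    """Build a 256-entry table addressed by {sch[1:0], Cadr[1:0], Badr[3:0]}.

    remap_tables[sch] maps (Cadr, Badr) to (Cadr', Badr'); addresses
    that are not listed are left unchanged (assumption of this sketch).
    """
    rom = [0] * 256
    for sch in range(4):
        for cadr in range(4):
            for badr in range(16):
                new_c, new_b = remap_tables[sch].get((cadr, badr), (cadr, badr))
                rom[(sch << 6) | (cadr << 4) | badr] = (new_c << 4) | new_b
    return rom


def remap(rom, sch, cadr, badr):
    """Look up (Cadr', Badr') for the given mapping select and address."""
    word = rom[(sch << 6) | (cadr << 4) | badr]
    return word >> 4, word & 0xF


# Illustrative table: under the first mapping, bank 3 of C0 and
# bank 0 of C3 exchange places; the other mappings are identity.
tables = [{(0, 3): (3, 0), (3, 0): (0, 3)}, {}, {}, {}]
rom = build_rom(tables)
print(remap(rom, 0, 0, 3))  # (3, 0)
print(remap(rom, 1, 0, 3))  # (0, 3)
```

Each table entry holds 6 bits (2 bits for Cadr' and 4 bits for Badr'), which is consistent with the small storage that a ROM-based remapping table needs.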
Fig. 10(a) shows the proposed hardware design flow of TPAMAP, a static analysis tool that realizes the hardware design of TPAMAP for a pre-defined application. The BAs and CAs are analyzed from the DRAM trace file of the application. The DRAM trace file is obtained by recording the DRAM accesses in the full-system simulation tool MARSSx86 [15]. Based on the BAs and CAs, the remapping table for different numbers of active channels is constructed. In the remapping table, the bank and channel addresses (Badr and Cadr) are remapped to Badr' and Cadr' for the different mappings ($MAP_1$ to $MAP_4$). After that, the proposed Verilog generator generates the Verilog HDL that describes the architecture of TPAMAP with the remapping table, as in the example in Fig. 9. The remapping table is realized in ROMs, so the hardware overhead of TPAMAP is low. Finally, EDA tools are used to implement the design and verify the low cost of TPAMAP. In our experiments, a cell-based design and an FPGA design are used to verify TPAMAP. The cell-based design is realized with the CIC EDA cloud [30] in 90-nm CMOS technology; the FPGA design is realized on a Xilinx Spartan-6 LX45 FPGA [31].
Fig. 10(b) shows the flow chart of the proposed Verilog generator. The Verilog generator supports multi-channel 3D DRAM with different numbers of banks (NB) and channels (NC). Thus, the first step is to generate one multiplexer (MUX) according to NB and NC. In the second step, the ROMs are generated to realize the different mappings (MAP 1 ∼ MAP n) according to Badr' and Cadr'. In this work, the cell-based design and the FPGA design are applied to verify the low-cost feature of TPAMAP. In the cell-based design, we apply CIC EDA cloud [30] with 90-nm CMOS technology to realize the hardware design; the area results are summarized in Table 2. A gate count of 232 ∼ 593 is required to implement TPAMAP for different input patterns. In the FPGA design, we apply a Xilinx Spartan-6 LX45 FPGA [31] to verify the hardware design of TPAMAP. 23 ∼ 54 LUTs are required to realize TPAMAP in the Spartan-6 FPGA, as shown in Table 3. The traffic patterns in these cases are introduced in Section VI.
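The two generator steps above (one MUX sized by NB and NC, then one ROM per mapping) can be sketched in Python as a function that emits Verilog text. This is a minimal illustration under stated assumptions, not the generator used in this work: the module name `tpamap`, the port names, the `rom<k>` signal names, and the placeholder mappings in `EXAMPLE_MAPS` are all hypothetical, and the emitted Verilog has not been synthesized.

```python
from typing import Callable, List, Tuple

# A mapping takes (cadr, badr) and returns (cadr', badr').
Mapping = Callable[[int, int], Tuple[int, int]]


def bits(n: int) -> int:
    """Address width needed to index n items (at least 1 bit)."""
    return max(1, (n - 1).bit_length())


def generate_tpamap_verilog(nb: int, nc: int, mappings: List[Mapping]) -> str:
    bw, cw, sw = bits(nb), bits(nc), bits(len(mappings))
    aw = cw + bw  # width of the concatenated {cadr, badr} word
    lines = [
        "module tpamap (",
        f"    input  wire [{sw - 1}:0] sch,",
        f"    input  wire [{cw - 1}:0] cadr,",
        f"    input  wire [{bw - 1}:0] badr,",
        f"    output reg  [{cw - 1}:0] cadr_out,",
        f"    output reg  [{bw - 1}:0] badr_out",
        ");",
    ]
    for k in range(len(mappings)):
        lines.append(f"    reg [{aw - 1}:0] rom{k};")

    # Step 1: one MUX, sized by NB and NC, that selects a ROM output by sch.
    lines.append("    always @(*) begin")
    lines.append("        case (sch)")
    for k in range(len(mappings)):
        lines.append(f"            {sw}'d{k}: {{cadr_out, badr_out}} = rom{k};")
    lines.append("            default: {cadr_out, badr_out} = {cadr, badr};")
    lines.append("        endcase")
    lines.append("    end")

    # Step 2: one combinational ROM per mapping.
    for k, mapping in enumerate(mappings):
        lines.append("    always @(*) begin")
        lines.append("        case ({cadr, badr})")
        for c in range(nc):
            for b in range(nb):
                c2, b2 = mapping(c, b)
                src = (c << bw) | b
                dst = (c2 << bw) | b2
                lines.append(f"            {aw}'d{src}: rom{k} = {aw}'d{dst};")
        lines.append(f"            default: rom{k} = {{cadr, badr}};")
        lines.append("        endcase")
        lines.append("    end")

    lines.append("endmodule")
    return "\n".join(lines) + "\n"


# Placeholder mappings (channel rotation by k); not the paper's tables.
EXAMPLE_MAPS: List[Mapping] = [
    (lambda c, b, k=k: ((c + k) % 4, b)) for k in range(4)
]
```

Calling `generate_tpamap_verilog(16, 4, EXAMPLE_MAPS)` produces a module with a 2-bit `sch`, a 2-bit `cadr`, and a 4-bit `badr`, matching the port widths described for the 4-channel system.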
VI. EXPERIMENTS
In Section VI, we analyze the thermal and performance results obtained with the proposed TPAMAP. To calculate the thermal and performance results for the 3D multi-channel DRAM system, DRAMSim2 [9], HotSpot 5.02 [10], and MARSS×86 [15] are combined in our simulation methodology. DRAMSim2 [9] is applied as the DRAM system simulator. HotSpot 5.02 [10] is used for the thermal simulation, with parameters based on the setup in [6]. For the real patterns, MARSS×86 [15] is applied to run the cycle-accurate full-system simulation of the x86-64 multi-core architecture. The real patterns from the x86-64 multi-core architecture are applied to analyze the features of TPAMAP. We also consider the scheduling of multiple memory accesses from multiple processor cores. An 8-tier 4-channel 3D DRAM is simulated in our experiments. Each DRAM tier contains two dies, and each channel contains 4 dies, which are located in two DRAM tiers. Table 4 summarizes the system parameters in the experiments.
In our experiments, the random patterns and the real patterns are applied to verify the proposed TPAMAP. In the random patterns, we consider the distributions of the CAs and BAs to demonstrate the features of TPAMAP. For the CAs, the cases of balanced channel accesses (BCA) and unbalanced channel accesses (UCA1 and UCA2) are considered. The target 3D DRAM has four channels (C0, C1, C2, and C3), and the ratios of the channel accesses are shown in Fig. 11(a). For the BAs, four cases of unbalanced BAs (UBA1, UBA2, UBA3, and UBA4) are discussed. Each DRAM die has four banks (B0, B1, B2, and B3), and the BAs are shown in Fig. 11(b). By combining the three CA cases with the four BA cases, we obtain 12 cases of memory access distributions in our experiments. The detailed experiments with the random patterns are introduced in Sections VI.A and VI.B. The experiment with the real pattern is discussed in Section VI.C.
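As a small illustration of how the 12 random-pattern cases arise and how one such pattern could be drawn, the Python sketch below enumerates the CA/BA combinations and samples (channel, bank) pairs from weighted distributions. The function `random_accesses` and any weights passed to it are hypothetical; the actual access ratios are those of Fig. 11 and are not reproduced here.

```python
import itertools
import random
from typing import List, Sequence, Tuple

CA_CASES = ["BCA", "UCA1", "UCA2"]
BA_CASES = ["UBA1", "UBA2", "UBA3", "UBA4"]

# 3 channel-access cases x 4 bank-access cases = 12 combinations.
ALL_CASES = list(itertools.product(CA_CASES, BA_CASES))


def random_accesses(
    channel_weights: Sequence[float],
    bank_weights: Sequence[float],
    count: int,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Draw `count` (channel, bank) pairs from the given weighted distributions.

    The weights are inputs supplied by the caller; this sketch does not
    know the ratios used in the paper's figures.
    """
    rng = random.Random(seed)  # seeded so that a run can be repeated
    channels = rng.choices(range(len(channel_weights)),
                           weights=channel_weights, k=count)
    banks = rng.choices(range(len(bank_weights)),
                        weights=bank_weights, k=count)
    return list(zip(channels, banks))
```

For example, `random_accesses([1, 1, 1, 1], [4, 2, 1, 1], 1000)` draws channels with equal probability (the balanced case) and banks with an invented unbalanced ratio.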
A. RANDOM PATTERNS WITH BALANCED CAs AND UNBALANCED BAs
In Section VI.A, the random patterns with the balanced CAs (BCA) and the unbalanced BAs (UBA1 ∼ UBA4) are applied to verify TPAMAP, as shown in Fig. 11(a) and (b). BCA means that the memory accesses are allocated to the different channels with equal probability. Fig. 12(a) shows the thermal results for the different cases. The mappings (MAP 1, MAP 2, MAP 3, and MAP 4) concentrate the CAs onto 1, 2, 3, and 4 active channels, respectively. These mappings result in trade-offs between temperature and bandwidth. In our experiments, TAMAP [6] is compared with the proposed TPAMAP. In TPAMAP, the active banks in the upper DRAM layers can be remapped to the lower layers by ICBS, and the vertical stacking of active banks in adjacent layers is reduced by ICBR. ICBS and ICBR are introduced in Sections IV.A and IV.B. Hence, the peak temperature with TPAMAP is lower than that with TAMAP [6], as shown in Fig. 12(a). Compared with TAMAP [6], the peak temperature is reduced by 1.1 °C ∼ 12.3 °C for MAP 1 ∼ MAP 4. Fig. 12(b) shows the bandwidth obtained with TAMAP [6] and TPAMAP. Compared with TAMAP [6], the bandwidth is reduced by 0% ∼ 53.4% for MAP 1 ∼ MAP 4. The result shows that TPAMAP can support the trade-offs between bandwidth and temperature. Increasing the number of active channels in a mapping raises both the temperature and the bandwidth.
B. RANDOM PATTERNS WITH UNBALANCED CAs AND UNBALANCED BAs
In Section VI.B, the random patterns with two cases of the unbalanced CAs (UCA1 and UCA2) and the unbalanced BAs (UBA1 ∼ UBA4) are also applied to demonstrate the advantages of TPAMAP. The unbalanced CAs and the unbalanced BAs are shown in Fig. 11(a) and Fig. 11(b). UCA1 and UCA2 mean that the memory accesses are allocated to the different channels with unequal probability. The thermal results and bandwidth are shown in Fig. 13(a) and Fig. 13(b). The trends are similar to the ones in Fig. 12(a) and Fig. 12(b). TPAMAP has a lower peak temperature than TAMAP [6] at the expense of a bandwidth reduction. Compared with TAMAP [6], the peak temperature is reduced by 2.6 °C ∼ 11.4 °C and the bandwidth is reduced by 0% ∼ 54.9% for MAP 1 ∼ MAP 4.
FIGURE 13. TAMAP [6] and TPAMAP are compared under the random patterns with two cases of the unbalanced CAs (UCA1 and UCA2) and the unbalanced BAs (UBA1 ∼ UBA4): (a) Thermal result and (b) memory bandwidth.
C. REAL PATTERNS WITH MARSS×86 [15]
In Section VI.C, the real patterns from the x86-64 multi-core architecture are applied to analyze the features of TPAMAP. MARSS×86 [15] is applied to run the cycle-accurate full-system simulation, which proceeds in two steps. First, different software benchmarks (Ubuntu 11.04, Parsec 2.1, Splash2, etc.) are executed to simulate the memory accesses to the 3D DRAMs in real cases. In MARSS×86, the virtual memory model of QEMU (soft-MMU) handles the memory management between the virtual memory and the physical memory. The memory accesses are recorded in memory traces. After that, the memory traces are fed to the memory simulator, DRAMSim2 [9], to analyze the temperature and bandwidth of the multi-channel 3D DRAM system under different address mapping and DRAM mapping algorithms. To support the memory interleaving between different channels, the address mapping described in [8] is applied.
In a multi-core system, multiple processor cores can request DRAM accesses at the same time. To demonstrate this feature in a real application, a multi-core system containing four processor cores is considered in our simulation. Each processor core executes one individual application, as shown in Table 5. In addition, we assume that the 3D DRAM system is shared by the processor cores. First-in-first-service (FIFS) scheduling [6] is applied to arbitrate the memory accesses from the different processor cores. The thermal results and bandwidth are shown in Fig. 15(a) and Fig. 15(b). The trends are similar to the ones in Fig. 12 and Fig. 13. Compared with TAMAP [6], the peak temperature is reduced by 1.6 °C ∼ 5.9 °C and the bandwidth is reduced by 1.9% ∼ 21.6% for MAP 1 ∼ MAP 4.
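The first-in-first-service arbitration of requests from several cores can be pictured with the following Python sketch, which merges per-core request streams into one service order by arrival time. It is an assumption-laden illustration rather than the scheduler of [6]: the `Request` fields and the `fifs_order` function are hypothetical, and breaking ties between simultaneous arrivals by the lower core index is a choice made here.

```python
import heapq
from typing import List, NamedTuple


class Request(NamedTuple):
    arrival: int   # cycle at which the request reaches the controller
    core: int      # index of the issuing processor core
    address: int   # physical address of the access


def fifs_order(per_core: List[List[Request]]) -> List[Request]:
    """Merge per-core streams into a single first-in-first-service order.

    Each per-core list must already be sorted by arrival time. Tuples
    compare field by field, so equal arrival times are resolved by the
    lower core index (an assumption of this sketch).
    """
    return list(heapq.merge(*per_core))


streams = [
    [Request(0, 0, 0x10), Request(5, 0, 0x20)],   # core 0
    [Request(2, 1, 0x30), Request(5, 1, 0x40)],   # core 1
]
served = fifs_order(streams)
```

In this example the service order by (arrival, core) is (0, 0), (2, 1), (5, 0), (5, 1); the last two requests arrive in the same cycle and are ordered by core index.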
D. TRADE-OFFS FOR THE BANDWIDTH AND TEMPERATURE
In Section VI.D, we consider the trade-offs between bandwidth and temperature for the 4-channel 8-layer 3D DRAM using TPAMAP. TPAMAP supports the mappings with the design trade-offs and reduces the effect of the vertical stacking in different channels. Thus, the design trade-offs must be discussed under the constraints of bandwidth and temperature. Equations (5) and (6), introduced in Section IV.C, give the cost function of this work. The goal is to find the mapping with the lowest cost. In Table 6, we set the predefined thermal limit (T limit) and the lower bound of the bandwidth (BW lb) to 76 °C and 1.5 GB/s to calculate the cost function. Table 6 shows the results of the cost function for the cases in the random patterns (BCA, UCA1, and UCA2) and the real pattern (real). The entries with a gray background mark the mappings with the lowest cost. For instance, MAP 3 is applied for the cases (UBA1 ∼ UBA4) in BCA because its cost is the lowest. In Table 7, we consider another design case with T limit and BW lb set to 72 °C and 1.2 GB/s. In comparison with the results in Table 6, the lower T limit and BW lb change the selected mappings. In Table 7, MAP 4 is not selected in any case because the cases applying MAP 4 have a higher temperature than T limit. For the cases of UBA4 in the random patterns (BCA, UCA1, and UCA2), no feasible mapping can be found because the temperatures are higher than T limit in these cases. This means that the constraints of T limit and BW lb must be relaxed: a higher T limit and a lower BW lb are required to find a suitable mapping.
TABLE 6. The cost for the cases in the random patterns (BCA, UCA1, and UCA2) and the real pattern (real) with the conditions of T limit = 76 °C and BW lb = 1.5 GB/s.
TABLE 7. The cost for the cases in the random patterns (BCA, UCA1, and UCA2) and the real pattern (real) with the conditions of T limit = 72 °C and BW lb = 1.2 GB/s.
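The selection procedure discussed in this subsection can be sketched in Python as a constrained minimum: mappings that violate T limit or BW lb are discarded, and the remaining mapping with the lowest cost is chosen. This sketch does not reproduce Equations (5) and (6); it assumes the cost values are computed elsewhere and that the constraints are checked separately as T ≤ T limit and BW ≥ BW lb. The `Result` type, the `select_mapping` function, and every number in `example` are invented for illustration and are not taken from Tables 6 and 7.

```python
from typing import Dict, NamedTuple, Optional


class Result(NamedTuple):
    peak_temp_c: float     # simulated peak temperature in degrees Celsius
    bandwidth_gbs: float   # simulated bandwidth in GB/s
    cost: float            # value of the cost function, computed elsewhere


def select_mapping(
    results: Dict[str, Result], t_limit: float, bw_lb: float
) -> Optional[str]:
    """Return the feasible mapping with the lowest cost, or None if none fits."""
    feasible = {
        name: r
        for name, r in results.items()
        if r.peak_temp_c <= t_limit and r.bandwidth_gbs >= bw_lb
    }
    if not feasible:
        return None  # the constraints would have to be relaxed
    return min(feasible, key=lambda name: feasible[name].cost)


# Invented values, for illustration only.
example = {
    "MAP 1": Result(68.0, 1.0, 0.9),
    "MAP 2": Result(71.0, 1.6, 0.6),
    "MAP 3": Result(74.0, 2.2, 0.4),
    "MAP 4": Result(78.0, 2.8, 0.3),
}
```

With these invented values, limits of 76 °C and 1.5 GB/s leave MAP 2 and MAP 3 feasible and select MAP 3, while limits of 72 °C and 1.2 GB/s leave only MAP 2. A thermal limit below every simulated temperature returns `None`, which corresponds to the situation where the constraints must be relaxed.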
VII. CONCLUSION
In this work, we proposed TPAMAP to address the thermal and bandwidth problems in multi-channel 3D DRAM systems. TPAMAP applies Inter-channel Bank Swapping (ICBS) and Inter-channel Bank Reordering (ICBR) to adjust the channel accesses (CAs) and bank accesses (BAs) in multi-channel 3D DRAM systems. TPAMAP also supports mappings with design trade-offs between bandwidth and temperature and reduces the vertical stacking of the active banks in different channels. In our experiments, the peak temperature is reduced by 1.1 °C ∼ 12.3 °C with some bandwidth degradation.
We also proposed a cost function to find a suitable mapping under the design constraints. In the hardware design, we demonstrated the low-cost feature of TPAMAP: a gate count of 232 ∼ 593 and 23 ∼ 54 LUTs are required to implement TPAMAP in the cell-based design and the FPGA design, respectively. In the future, we will apply TPAMAP to address both the thermal and performance problems of different 3D DRAM architectures.
