ABSTRACT Computer servers are equipped with an increasing number of memory modules each with more capacity, making main-memory systems now the second most energy-consuming component trailing only processors in big-memory servers. These big-memory servers and their main-memory systems should offer high energy efficiency. In pursuit of energy-efficient main-memory systems, prior work exploited mobile low power double data rate (LPDDR) devices' advantages (lower power than DDR devices) while attempting to surmount their limitations (longer latency, lower bandwidth, or both). However, we show that such main-memory architectures (based on the latest LPDDR4 devices) are no longer effective and even hurt overall energy efficiency of servers by 49% on memory-intensive workloads compared with ones based on DDR4 devices. The reason is that the power consumption of modern DDR4 devices has substantially reduced by adopting the strength of mobile and graphics memory whereas LPDDR4 has focused more on increasing data transfer rates while sacrificing energy efficiency; the power consumption of DDR4 devices can significantly vary across manufacturers in this analysis. Moreover, exploring new energy-saving features of DDR4 devices in depth, we show that activating these features often hurts overall energy efficiency of servers because of their performance penalties. Subsequently, we propose a simple but effective scheme that adaptively exploits DRAM power-down modes and hence improves the system energy-delay product by 4.0%.
To build more energy-efficient main-memory systems for big-memory servers, memory architectures exploiting LPDDR devices were proposed [31] , [48] . This was because LPDDR devices, which mainly target mobile computing, consumed much lower power than DDR devices (but at the cost of longer latency and lower bandwidth). However, in this paper we first exhibit that such main-memory architectures do not make the entire server system more energy-efficient compared to the ones based on the latest DDR devices (i.e., DDR4) in most usage scenarios anymore. This is due to the fact that both LPDDR and DDR devices have evolved over generations and the power consumption of current DDR4 devices has substantially decreased by adopting the strength of mobile and graphics memory, whereas the latest LPDDR4 has focused more on increasing data transfer rates while sacrificing energy efficiency.
In particular, we show that DDR4 is far more energy-efficient than DDR3 because it is manufactured with finer-pitch technology and adopts various advanced circuit-level techniques to aggressively reduce static power consumption. This leads to a smaller relative power consumption gap between DDR4 and LPDDR4 (39%) than between DDR3 and LPDDR2 (77%). Moreover, during this analysis, we also discover that the static power consumption of DRAM devices varies substantially across DRAM manufacturers (up to 2.2×), so system designers may choose more energy-efficient DDR4 devices for big-memory servers; the total power consumption of DDR4 devices from one manufacturer is 16-28% lower than that of devices from two other manufacturers.
Subsequently, we present an in-depth analysis of new energy-saving features offered by modern DDR4 devices and demonstrate that they (e.g., data bus inversion (DBI)) often hurt the overall energy efficiency of big-memory servers [5] because they incur performance penalties. This underscores the importance of offering energy-saving technologies that do not incur notable performance penalties. We also propose a simple but effective scheme that exploits DRAM power-down modes adaptively, which improves the energy-delay product (EDP) of a simulated big-memory system with eight energy-efficient DDR4 ranks per channel by 4.0% on memory-intensive multi-programmed workloads.
In summary, we make the following key contributions:
• As opposed to prior proposals based on LPDDR2 devices, we show that main-memory architectures exploiting the advantages of LPDDR4 devices do not make big-memory servers more energy-efficient than the ones based on DDR4.
• In the course of exhibiting why DDR4 devices are not energy-inefficient any more compared to LPDDR4 devices, we expose that static power consumption of DDR4 devices notably varies across manufacturers.
• We present an in-depth analysis on new energy-saving features supported by contemporary DDR4 devices and show that those features are mostly not effective when we consider overall system-level energy efficiency.
• We enhance energy-saving features of DDR4 devices to improve the energy efficiency of big-memory servers and evaluate their impacts on system performance and energy efficiency.
The rest of this paper is organized as follows. Section II provides the pertinent details of DRAM organization, operations, and power breakdown. Section III summarizes recent progress in energy-efficient main-memory designs and shows that the energy efficiency of DDR4 has improved significantly compared to previous generations. Section IV analyzes energy efficiency and performance trade-offs of contemporary DRAM devices and their energy-saving features. Section V proposes schemes which improve main-memory energy efficiency without notably compromising performance. Section VI describes our experimental setup. Section VII comprehensively evaluates the performance and energy efficiency of various DRAM devices and power-saving schemes. Section VIII concludes this study.
II. BACKGROUND
A. DRAM ORGANIZATION AND OPERATION
Main memory DDRx DRAM devices are organized to achieve high capacity and bandwidth with reasonable latency and energy efficiency under a stringent cost constraint [23] ( Figure 1 ). Mobile LPDDRx [19] and graphics GDDRx [17] are organized similarly, but the former focuses more on energy efficiency whereas the latter emphasizes high data transfer rates per device. A modern DDR4 DRAM die stores 4Gb or 8Gb of data, consists of 16 banks, and has 4 (×4) or 8 (×8) data pins typically, each transferring data at the rates equal to or above 1.6Gbps. Each bank has a 2D array of DRAM cells, where a cell consists of an access transistor and a capacitor. In order to achieve high area efficiency, cells in a bank share wires and peripheral circuitry of both control and datapath. As a DRAM bank comprises hundreds of millions of cells, the number of cells connected to a wordline (WL) or a bitline (BL) becomes too excessive, and both BLs and WLs are structured hierarchically. For datapath, each bank has dozens of rows of BL sense amplifiers (BLSAs) and there exist global datalines that span the entire height of the bank. Because the number of global datalines per bank, which is equal to the number of bits transferred per read/write transaction, is much smaller than the row (page) size of a bank, (de)multiplexers called local datalines exist per row of BLSAs.
This sharing of wires and circuitry goes beyond a DRAM bank boundary. All banks in a DRAM die work independently except that they share datapath and control wires. One or more DRAM dies are packaged in a DRAM device. Multiple dies stacked in a DRAM device are connected by through-silicon vias (TSVs) or wire-bonding pads. Several dies across DRAM devices are grouped and operate in tandem receiving the same command and address signals, constituting a rank. A memory controller and multiple ranks are connected through a single memory channel, where command, address, and data signals are transferred. One or more ranks of DRAM devices are placed together on a module. The number of datapath wires in a memory channel is 64 in a modern dual-inline memory module (DIMM) excluding optional 8-bit wires for error checking and correction (ECC).
Popular DRAM devices, such as DDR3 [18], DDR4 [19], LPDDR4 [21], and GDDR5 [17], access data through a sequence of commands. To access data in a bank, the row including the data should first be latched to the corresponding BLSAs using an activate (ACT) command. After tRCD since ACT is issued, a read (RD) or write (WR) command can be issued to specify the column location within the latched row, and it takes tCL (tWL) to have the first data popped out of (shipped to) the device for RD (WR) and takes tCCD_S to transfer a burst of data. Data in the selected cells are destroyed during row activation, and hence should be restored to keep the value, taking tRAS. WR needs time to update the data in the corresponding DRAM cells, defined as write recovery time or tWR. Once data are restored or updated, the bank can receive a precharge (PRE) command to deactivate the BLSAs and to precharge BLs to be ready for subsequent activate commands, taking tRP. tRAS+tRP constitutes a DRAM cycle time called tRC. BLSAs that hold a row specified by ACT are called a row buffer of the bank. ACT/PRE are row commands whereas RD/WR are column commands. The row (page) size of a DDR4 rank is 8KB.
A DRAM bank operates at a much lower clock frequency (its cycle time is defined as tCCD_L) than the transfer rate of a data signal (around 2.4Gbps, which is 2b/tCK, in the latest DDR4 devices). Therefore, the internal datapath of a bank is much (8×) wider than the datapath width of a DRAM device, determining the burst length. For example, a ×8 DDR4 device has 64 global datalines per bank. Because tCCD_L is still larger than 8 × tCK/2, the 16 banks of a DDR4 device are divided into 4 bank groups where data transfers to and from different bank groups can occur consecutively in time, determining tCCD_S = 4tCK.
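The relation between the internal datapath width, the burst length, and tCCD_S can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original evaluation; the function names are hypothetical, and the only inputs taken from the text are the 64 global datalines of a ×8 device, the 2 bits per pin per tCK signaling, and tRC = tRAS + tRP.

```python
def tck_ns(rate_gbps: float) -> float:
    """Clock period in ns, given that each pin moves 2 bits per tCK."""
    return 2.0 / rate_gbps


def burst_length(global_datalines: int, device_width: int) -> int:
    """Beats needed to move one internal access over the device pins."""
    if global_datalines % device_width != 0:
        raise ValueError("global datalines must be a multiple of the width")
    return global_datalines // device_width


def burst_duration_tck(bl: int) -> float:
    """Two beats are transferred per clock, so a burst lasts BL/2 tCK."""
    return bl / 2


def row_cycle_ns(t_ras_ns: float, t_rp_ns: float) -> float:
    """tRC is the sum of the restore time and the precharge time."""
    return t_ras_ns + t_rp_ns


if __name__ == "__main__":
    bl = burst_length(global_datalines=64, device_width=8)
    print("burst length:", bl)
    print("burst duration in tCK:", burst_duration_tck(bl))
    print("tCK at 2.4Gbps (ns):", round(tck_ns(2.4), 3))
```

For the ×8 DDR4 case this gives a burst length of 8 and a burst duration of 4 tCK, which matches the tCCD_S = 4tCK stated above.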
B. BREAKING DOWN DRAM POWER DISSIPATION
DRAM dissipates most power in the following components: data read/write including inter-device signal transfers, activate/precharge to latch stored data in DRAM row buffers, refresh to retain values in leaky DRAM cells, and standby power from the DRAM internal units including the delay-locked loop (DLL) that tracks the phase of the master clock from a memory controller, input/output buffers, and peripheral circuits [16], [45]. We can classify these components by whether they consume power regardless of data transfer activities; refresh and standby can be categorized as static, whereas the activate, precharge, read, and write components are dynamic. These dynamic and static power values are presented in DRAM datasheets using IDD specifications. For example, IDD2N specifies the current of a device when it has no active pages and stays in a standby mode. A DRAM in a standby mode (e.g., IDD3N) can receive any command, whereas if it is in a power-down mode (e.g., IDD3P), the device must exit the mode to process normal commands, such as ACT, PRE, RD, and WR. A device consumes less static power in a power-down mode than in a standby mode.
The energy efficiency of a DRAM device has been improved substantially over time. The dynamic energy of main-memory DRAM in a system depends on the frequency and characteristics of memory accesses, such as the ratio of row commands to column commands (δ), whereas its static power is influenced by the memory capacity of the system and its state, such as temperature and the average number of active banks. Both dynamic energy and static power are heavily influenced by operating voltages (the lower, the better) and fabrication technology (the narrower, the better). Figure 2 shows the key latency, dynamic energy (pJ/b), and static power (mW/Gb) values over multiple generations of ×4 DDR and ×32 LPDDR devices from manufacturer A (see footnote 1). We assume that a device operates at 85 °C.
The generations and per-pin data transfer rates are denoted by (LP)DDRg-S, where g is generation and S is data transfer rate. We use δ of 0.27, the average over memory intensive SPEC CPU2006 applications reported in [24] . ACT/PRE energy is proportional to δ. We paired DDR and LPDDR devices that were/are popular at similar years. DDR3L stands for DDR3 with lower operating voltage (VDD, 1.5V for DDR3 vs. 1.35V for DDR3L).
As VDD decreases and finer-pitch fabrication technologies are introduced over generations, both dynamic energy and static power have been improved steadily. LPDDR devices consume much lower static power than DDR devices of the same generations at a given capacity as LPDDR uses transistors with higher threshold voltage, which leak less but also operate slower. Besides, LPDDR adopts more aggressive power gating techniques for internal datapath. These all make LPDDR achieve substantially lower leakage power than DDR, but at the cost of higher latency values. For example, tRC of DDR4 is 45.3ns whereas that of LPDDR4 is 60ns. Also, the primary timing parameters of DDR are reduced over time whereas those of LPDDR are growing.
III. COMPARISON OF MODERN DRAM DEVICES
We first show that numerous energy saving techniques explored for DRAM-based main memory compromise performance. Therefore, it is critical to quantify their trade-offs using popular effectiveness metrics, such as system-level energy consumption and energy-delay product (EDP), as each technique has different degrees of impact on DRAM static/dynamic power. We then revisit the ideas of exploiting mobile LPDDR devices instead of mainstream DRAM devices, which were reasonable when they were proposed, but are not anymore.
A. RECENT PROGRESSES IN IMPROVING THE ENERGY EFFICIENCY OF MAIN-MEMORY SYSTEMS
The bandwidth and latency of main-memory systems, which significantly affect the overall system performance and thus energy efficiency, are strongly dependent on the service order of memory requests. That is, sequential accesses to different rows within a bank lead to high latency and cannot be pipelined, whereas accesses to different banks or different words within a single row have low latency and can be pipelined. Therefore, memory requests can be scheduled (out of order) to maximize consecutive accesses to the same row in a bank or to different banks, which can greatly improve the performance of main-memory systems [37]. With such a scheduling technique, increasing the number of banks allows a memory system to service more memory requests in parallel. This entails lower memory access latency (and thus higher system energy efficiency) but also incurs notably higher implementation cost. To cost-effectively support more parallel memory accesses, the multiple sub-arrays constituting a modern DRAM bank have been exploited [25], [42], [51]. The sub-arrays of a bank share a few global peripheral structures, but they can operate independently in most parts. Thus, different components of the bank access latencies of multiple requests can be overlapped as long as the requests head to different sub-arrays within the same bank, effectively facilitating more parallel/pipelined memory accesses to each bank.
In modern DRAM, a row is typically comprised of a large number of cells. Consequently, activating and precharging a row consume significant energy. When accesses to DRAM exhibit high spatial locality, the high energy cost of activation/precharge can be amortized. However, DRAM accesses by many-core processors lack spatial locality, and the ensuing frequent row activations and precharges lead to significant energy inefficiency. Thus, various DRAM architectures have been devised to activate and precharge fewer cells of a row (i.e., lower energy per activation) without incurring high implementation cost [44], [49], [51], [53].
As the data transfer rate steadily goes up, DRAM I/O energy has become another significant contributor to DRAM total energy. As DRAM I/O energy is also strongly data-dependent (e.g., on the number of zeros or ones driven to the data bus), simply counting the number of zeros (or ones) to be placed on the data bus and inverting the bit values if there are more zeros (or ones) can reduce DRAM I/O energy [4]. Besides, more bits per device lead to more energy consumption as DRAM cells should be refreshed periodically to retain their states. Because not all the DRAM cells require the same refresh frequency, various selective refresh techniques have been explored [6], [9], [34], [52].
Providing memory systems with high energy efficiency and proportionality is critical for datacenter servers because they impact cost and scalability. Past DDR DRAM focused more on high bandwidth and capacity, and was not highly optimized for energy efficiency/proportionality. To offer highly energy-efficient and -proportional memory systems for datacenter servers, the use of mobile DRAM, which was optimized for energy efficiency at the cost of increased latency and reduced bandwidth, has been proposed [31], [48]. However, these studies did not fully consider the latency penalties listed in Figure 2, while assuming timing parameters in favor of LPDDR2 devices [48]; we further elaborate on these in Section III-B. Also, although various low-power modes are supported by modern DRAM, they are too slow to be used by memory systems for datacenter servers, and DRAM architectures supporting fast-transition low-power modes have been investigated [32]. There have been studies to categorize data by their hotness (access frequency) and to allocate/migrate them to a few ranks [26], [47] for better exploiting low-power modes, which are orthogonal to this paper.
Lastly, even if some of the aforementioned techniques improve system energy efficiency by reducing average memory access latency values, many of the DRAM static or dynamic power saving techniques impact system performance negatively. Moreover, the degrees of power saving and performance degradation heavily depend on the material-, circuit-, and architecture-level techniques of both CPU and DRAM devices. Therefore, the effectiveness of specific techniques should be carefully quantified through popular metrics, such as system energy and energy-delay product (EDP), in present and future systems, as the ideas that were valid once in the past, might not be compelling anymore.
B. DDR4 IS NOT ENERGY INEFFICIENT ANY MORE
We re-examine prior works to assess the effectiveness of utilizing low power mobile (LPDDRx) DRAM devices. Both BOOM [48] and Malladi et al. [31] advocated using unmodified LPDDR devices (LPDDR2 in their studies). LPDDR2 devices had lower per-pin data transfer rate (0.8Gbps) compared to that of DDR3 devices (1.6Gbps) with superior (lower) dynamic energy and static power values as shown in Figure 2 . Malladi et al. [31] reduce main-memory bandwidth accordingly by using LPDDR2 instead of DDR3, and reported substantial savings in both energy and total cost of ownership (TCO) on datacenter applications. Instead, BOOM [48] groups more pins to constitute a rank, increases the per-pin data transfer rate between a memory controller and modules by having a buffer chip per module, and further improves energy efficiency by leveraging rank subsetting [1] which trades higher access latency with more ranks (tailored to better exploit bank-level parallelism) and smaller row buffers.
In contrast to Malladi et al. [31] and BOOM [48], we evaluate a modified version of LPDDR due to the following reasons. First, the per-pin data transfer rate of the latest LPDDR4 devices is no longer lower than that of DDR4 devices of the same generation. LPDDR4-3200 devices are currently on the market, whereas DDR4-2400 is close to the fastest DDR4 devices, except ones from a few overclocking vendors. Therefore, the idea of utilizing more pins per rank in BOOM is not directly applicable. Second, an LPDDRx device has a wide datapath (×16 or ×32) whereas most DIMMs are equipped with ×4 or ×8 DDRx devices. The two aforementioned reasons make it difficult, if not impossible, to achieve the same degree of reliability without substantially sacrificing DRAM capacity with these wide datapath devices even through several proposed techniques [1], [31], [48]. Third, the much better I/O energy efficiency of LPDDRx originates from the better signal integrity of mobile systems as only a few DRAM devices are connected to a memory controller through a bus with a distance of up to a few millimeters. Therefore, buffer chips are a must for LPDDR-based memory modules such as BOOM, which increases access latency and power, whereas DDRx-based memory modules can dispense with buffer chips when the number of banks per memory channel is low. Fourth, the burst length of LPDDR4 is 16 whereas that of DDR4 is 8. A longer burst length hurts the performance of applications with low spatial locality in memory accesses. We model the modified version of LPDDR4, which we call LPDDR4' hereafter, as follows; basically, LPDDR4' uses the material- and circuit-level technologies of LPDDR4 (except I/O) and adopts the micro-architectural features of DDR4, such as datapath width, row buffer size, and burst length.
Meanwhile, the energy efficiency, especially the static power, of mainstream DDRx devices has been improved substantially over time. Figure 3 shows the power breakdown of DDR2/3/3L/4 DRAM ranks that are sold as of September 2017. We collected the values from three major DRAM manufacturers, distinguished by A [33], B [38], and C [40]. Multiple columns from the same manufacturer for a single DRAM standard stand for different revisions (e.g., process generations), resulting in better energy efficiency over time. We report the static power of 8 ranks connected in a channel, and reflect the I/O power accordingly; except for DDR2, we use load-reduced DIMM (LRDIMM) to connect 8 ranks, which increases I/O power due to the buffer chips in LRDIMM. ×4 devices are used. The capacity of a DDR2 device is 2Gb, which is the maximum size being sold, whereas that of other devices is 4Gb. We assume that each device transfers data at its highest rate and the ratio of ACT over RD commands (δ) is 0.27, the value used in Section II-B as well.
FIGURE 3. Power breakdown of DDR2/3/3L/4 DRAM ranks sold as of September 2017 from three major manufacturers (A, B, and C). We downloaded datasheets from DRAM vendors' web pages, which are publicly available. We report the static power of 8 ranks connected in a channel, reflecting the I/O power accordingly.
We make the following key observations from Figure 3. First, supply voltage levels decrease as newer standards are introduced (1.8V/1.5V/1.35V/1.2V for DDR2/3/3L/4) and hence DDR4 is the most energy efficient, reinforcing the observations made from Figure 2. Second, material-, fabrication-, and circuit-level technologies cause huge variation in power within and across DRAM manufacturers. This variation is more prominent for the static power of DDR4 devices; a device from A consumes more than twice the static power compared to those from B and C. Multiple factors contribute to this huge difference. For example, delay-locked loops (DLLs) in DRAM are traditionally implemented using analog circuits, occupying a considerable fraction of the static power of DDR2/3 devices. The introduction of digital DLLs, which enable a DLL to be turned off most of the time and turned on just periodically to re-calibrate reference clock phases [27], is conjectured to have led to a substantial reduction in the DLL power of certain manufacturers.
VOLUME 6, 2018
These material-, fabrication-, and circuit-level evolutions narrow the gap between the current DDR4 and LPDDR4' devices. As shown in Figure 3 and Table 3 , the DDR4 devices from A and B vendors consume 2.3× and 1.4× more static power than the LPDDR4' device. We test six configurations using the following combinations; 2 and 8 ranks per memory channel, DDR4 from A, B, and LPDDR4'. The baseline is DDR4 from B. The reference CMP configuration is specified in Section VI. The performance penalties of using LPDDR4' instead of DDR4 on the CMP for memory-intensive multiprogrammed workloads are 28% and 34% for 2 and 8 rank cases, respectively (details in Section VII). This means that LPDDR4' is more energy efficient than DDR4 only for high-capacity (8 ranks per channel) servers equipped with power-hungry DDR4 from A. Even this configuration is more efficient than the one using LPDDR4' in system-level EDP.
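To see why a large slowdown is hard to amortize in EDP, note that EDP is energy times delay, i.e., average power times delay squared. The following is a minimal sketch of this arithmetic, not the evaluation methodology of the paper. It assumes that a "performance penalty of 28%" corresponds to an execution time ratio of 1.28 and that power_ratio refers to average whole-system power; both are assumptions, and the function names are hypothetical.

```python
def relative_edp(power_ratio: float, time_ratio: float) -> float:
    """EDP of a candidate normalized to a baseline.

    EDP = energy * delay = (power * delay) * delay, so the normalized
    value is power_ratio * time_ratio ** 2.
    """
    return power_ratio * time_ratio ** 2


def break_even_power_ratio(time_ratio: float) -> float:
    """Largest average power ratio for which the candidate's EDP does
    not exceed the baseline's, given its execution time ratio."""
    return 1.0 / time_ratio ** 2


if __name__ == "__main__":
    # Assumed reading of the 28% and 34% penalties as time ratios.
    for time_ratio in (1.28, 1.34):
        print(time_ratio, round(break_even_power_ratio(time_ratio), 3))
```

Under these assumptions, a system that runs 1.28× longer must draw roughly 39% less average power just to match the baseline EDP, which static DRAM power savings alone can rarely deliver at the system level.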
IV. ENERGY EFFICIENCY AND PERFORMANCE TRADE-OFFS OF MODERN MAIN-MEMORY DEVICES
In this section, we assess the primary energy saving techniques introduced at the latest DDR4 devices and propose novel techniques to better exploit DRAM power-down modes and data bus inversion (DBI).
A. SAVING DATA TRANSFER ENERGY WITH DBI/TSV
Data bus inversion (DBI), which has been used for graphics memory [17], is introduced to mainstream DDR4 DRAM. There are three components consuming energy in transferring data between CPU and DRAM devices. First, the drivers of a transmitter and the on-die termination (ODT) resistor of a receiver consume DC energy ($E_{DC}$). Second, AC energy ($E_{AC}$) is consumed while the data bus toggles between ones and zeros. $E_{DC}$ is inversely proportional to the channel resistance, whereas $E_{AC}$ is proportional to the data transfer rate, the channel capacitance, and the bus toggling rate. Typically, high voltage (VDDQ) represents data one and ground represents data zero. As DDR4 adopts a pseudo open drain (POD) interface (Figure 4(a)), it consumes DC power only when transferring data zero. The last is the energy consumed by the components within DRAM devices ($E_{INT}$), such as inter-bank/global/local datalines, which is mostly the same regardless of the value being transferred, whereas the first two I/O components are data-value dependent. Therefore, total data transfer energy ($E_{TR}$; see footnote 2) is represented by $E_{TR} = \gamma_{DC} E_{DC} + \gamma_{AC} E_{AC} + E_{INT}$, where $\gamma_{DC}$ is the probability of sending value zeros and $\gamma_{AC}$ is the probability of consecutive data being toggled. When random data values are transferred, both $\gamma_{DC}$ and $\gamma_{AC}$ are 0.5.
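The transfer energy model above can be written as a one-line function. The following is a minimal sketch; the function name is hypothetical and the energy values in the example are placeholders in arbitrary units, not measurements from this study.

```python
def transfer_energy(e_dc: float, e_ac: float, e_int: float,
                    gamma_dc: float = 0.5, gamma_ac: float = 0.5) -> float:
    """E_TR = gamma_DC * E_DC + gamma_AC * E_AC + E_INT.

    gamma_dc: probability of sending a zero (DC energy is spent only on
              zeros with a pseudo open drain interface).
    gamma_ac: probability that consecutive data values toggle.
    The defaults of 0.5 correspond to random data.
    """
    for gamma in (gamma_dc, gamma_ac):
        if not 0.0 <= gamma <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return gamma_dc * e_dc + gamma_ac * e_ac + e_int


if __name__ == "__main__":
    # Placeholder energies in arbitrary units (assumed, not from the paper).
    random_data = transfer_energy(e_dc=2.0, e_ac=1.0, e_int=0.5)
    fewer_zeros = transfer_energy(e_dc=2.0, e_ac=1.0, e_int=0.5,
                                  gamma_dc=0.4, gamma_ac=0.45)
    print(random_data, fewer_zeros)
```

Only the first two terms shrink when the data encoding sends fewer zeros or toggles less often; the internal term is unaffected.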
1) BENEFITS OF DBI
In a DDR4-2400 device, the data I/O consumes 46% of total dynamic power when it transfers data at peak bandwidth (Figure 3). Therefore, reducing data I/O energy can be as crucial as saving DRAM static power, especially for microservers with just a few ranks per memory channel. DBI in a DDR4 DRAM device counts the number of zeros in a group of bits on a data bus and flips the bits if zeros are a majority, reducing the frequency of zero signals. The size of a DBI group is equal to the datapath width of a DRAM device in DDR4 (e.g., 8 bits for ×8 devices). DBI reduces both the portion of zero values (lower $\gamma_{DC}$) and the frequency of data toggling (lower $\gamma_{AC}$). Through a Monte Carlo simulation which transfers a million random numbers, we observed that both $\gamma_{DC}$ and $\gamma_{AC}$ decrease with DBI (see Figure 4(b)). As the size of a DBI group decreases, both probability values further decrease. Between the AC and DC components, $\gamma_{DC}$ values are more sensitive to the DBI group size.
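A Monte Carlo experiment of this kind can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the simulator used for Figure 4(b): the function name, the seed, and the default of 100,000 beats (rather than a million) are choices made here, the inversion rule flips a group only when zeros are a strict majority, and the extra DBI pin is excluded from the counts.

```python
import random


def simulate_dbi(group_bits: int = 8, beats: int = 100_000,
                 use_dbi: bool = True, seed: int = 0):
    """Estimate (gamma_DC, gamma_AC) on the data pins for random data.

    gamma_DC: fraction of transferred bits that are zero.
    gamma_AC: fraction of bit positions that toggle between two
              consecutive beats.
    """
    rng = random.Random(seed)
    mask = (1 << group_bits) - 1
    zeros = 0
    toggles = 0
    prev = None
    for _ in range(beats):
        word = rng.getrandbits(group_bits)
        ones = bin(word).count("1")
        # Flip the group when zeros are a strict majority (assumed rule).
        if use_dbi and (group_bits - ones) > group_bits / 2:
            word ^= mask
            ones = group_bits - ones
        zeros += group_bits - ones
        if prev is not None:
            toggles += bin(word ^ prev).count("1")
        prev = word
    gamma_dc = zeros / (beats * group_bits)
    gamma_ac = toggles / ((beats - 1) * group_bits)
    return gamma_dc, gamma_ac


if __name__ == "__main__":
    print("no DBI      :", simulate_dbi(use_dbi=False))
    print("DBI, 8 bits :", simulate_dbi(group_bits=8))
    print("DBI, 4 bits :", simulate_dbi(group_bits=4))
```

Without DBI both estimates stay close to 0.5. With DBI both drop, and the zero probability drops more than the toggle probability, which is in line with the trends described above.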
2) ENERGY SAVINGS BY DBI CONSIDERING ITS COST
However, this reduced probability does not directly translate to the equivalent degree of DRAM energy saving because the cost of transferring information about whether data values are flipped or not should also be considered. The additional DBI pin consumes both DC and AC energy. Figure 4(c) shows the DRAM dynamic energy breakdown with this overhead considered for the cases of 2 and 8 ranks per channel. With few (two) ranks in the channel, $E_{DC}$ is much higher than $E_{AC}$. The cost of the DBI pin is amortized as DRAM datapath width increases, but its benefit diminishes for larger datapath widths, making DBI more efficient in data transfer energy for ×8 and ×16 devices. Combined with the fact that more pins in CPU induce higher cost premium, DDR4 does not support DBI for ×4 devices. Even if a ×8 device saves data transfer energy by 4.1%, the latency penalty of DBI and the resulting performance degradation should be considered carefully. DBI increases tCL, the read command to first data out time, by 3tCK. As system performance is most sensitive to tCL among the timing parameters of DRAM, a small improvement in transfer energy can be negated by additional energy consumed due to increased execution time, as shown in Figure 7 for the two-rank cases.
3) IMPACT OF MODULE TYPES
Registered DIMM (RDIMM) repeats command/address signals with a buffer. RDIMM is a must for servers because the number of attached DRAM devices per channel often surpasses several dozens, even reaching a few hundreds. When a channel has more ranks, the signal integrity of data I/Os worsens as well, forcing the channel to operate at lower data transfer rates. Load-Reduced DIMM (LRDIMM) has data buffer (DB) chips placed between its DRAM devices and a memory controller outside of the module. These DB chips reduce the channel load seen by both the controller and DRAM devices and hence increase data transfer rates compared to the modules without them [20]. Adding data buffers increases all three components of data transfer energy as data I/Os are repeated and a data buffer itself consumes energy internally for re-timing signals regardless of the repeated values. For example, for a channel with 2 DIMMs and 4 ranks per DIMM (8 ranks total), $E_{AC}$ increases by several times compared to the two-rank case, reflecting the deteriorated signal integrity (Figure 4). Therefore, compared to the two-rank case, the absolute amount of energy saved in the DC/AC energy components is increased. The static power consumed by the DB chips is much lower than that of DRAM chips and is not presented in Figure 4(c).
Recently, TSV-RDIMM [36] was introduced as an alternative to LRDIMM. It 3D-stacks multiple (4 or 8) DDR4 dies and packages them as a single chip. Each chip has a master die, which serves the role of a data buffer as well. This buffering increases tCL of TSV-RDIMM by 2tCK, which is equal to the overhead due to the data buffer in LRDIMM. However, TSV-RDIMM has the following advantages. First, data buffers repeat signals at a package level whereas the master die repeats signals to/from TSVs through micro-bumps. Package-level repeating consumes more power because pads and bumps have higher impedance values than TSVs and micro-bumps. From the data I/O perspective, TSV-RDIMM makes the cost of an eight-rank configuration the same as the two-rank case without data buffers, becoming more energy efficient than LRDIMM. Second, only one DLL and one set of I/O buffers are needed per package, amortizing their power overheads. Third, because all the dies within a package are locked to the same clock, tRTRS within the package is 0tCK. This is useful because a server memory channel typically has several ranks and non-zero tRTRS values hurt random access performance.
B. SAVING STANDBY POWER BY EXPLOITING POWER-DOWN MODES
Instead of adopting the material-and circuit-level techniques of LPDDR4 which incur high latency penalty but provide insufficient power saving, we pay more attention to the energy saving techniques introduced at DDR4. We first exploit the power-down (PD) mode, which can save DRAM static power. Big-memory servers have several DRAM ranks per memory channel. Because only one rank can be involved in data transfer at any given time on a channel, we should put these remaining ranks in a PD mode as often as possible with a minimal performance impact. Although static power has decreased substantially on recent DDR3L/4, it is still above half of the total DRAM power when eight ranks are populated in a channel. When systems do not utilize main memory at peak bandwidth, static power saving is even more important.
A DDR4 device enters and exits a conventional PD mode by toggling its CKE pin. Entering a PD mode deactivates the input/output (I/O) buffers and power-gates internal datapath (inter-bank and global datalines) of a DRAM device. Compared to a device in precharge standby mode (i.e., all banks stay precharged but are ready to accept any command (cf. I DD2N in Table 1 )), one in precharge PD mode (where all banks also stay precharged but cannot accept any command except PD exit) consumes 40% less power in DDR4 made by B (cf. I DD2P in Table 1 ). A device in a PD mode has following constraints. First, once entered, it should stay in the PD mode for a certain time period (tCKE, 5ns for DDR4-2400). Second, a device needs to wait for tXP (6ns for DDR4-2400) to receive any valid command after receiving the PD exit command. If DLL is frozen to save more static power, a device needs more time than tXP to receive a RD command as DLL must be locked again, called slowexit mode (tXARD/tXPDLL for DDR2/3). Due to improvement in DLL circuitry, however, DDR4 does not support the slow-exit mode as DLL power has decreased substantially. For example, as shown in Figure 3 , DDR4-2400 from B consumes 0.52W for DLL whereas DDR3-1600 from A does as much as 3.7W for DLL, a substantial shrink considering higher data transfer rate.
DDR4 supports an alternative static power saving scheme called command address latency (CAL). CAL turns off the I/O buffers of a device by default and turns them on only when a command is issued to the device. Because the I/O buffers must be on to receive any valid command, CAL exploits the CS (chip select) pin to notify the device a few cycles ahead of a normal command (tCAL, 5tCK for DDR4-2400). Therefore, CAL increases the latency of every command by tCAL, but in turn allows a DRAM device to stay in a low-power state as long as possible. This is in contrast to the conventional toggle-based PD mode, which imposes a latency penalty only on the first command after a PD exit but burdens the memory controller with explicitly deciding when to enter the PD mode. Besides, to facilitate a short tCAL (i.e., smaller than tCKE+tXP), CAL does not power-gate peripheral circuitry, entailing less power saving than the conventional PD mode (I DD2NL vs. I DD2P ).
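The latency comparison between CAL and the toggle-based PD mode can be checked with a short worked calculation. The sketch below uses only the DDR4-2400 values quoted in the text (tCAL = 5tCK, tCKE = 5ns, tXP = 6ns) and assumes the usual double-data-rate relation, i.e., a 2400Mbps per-pin rate corresponds to a 1200MHz clock. The variable names are illustrative.

```python
# Worked arithmetic for DDR4-2400 using the timing values quoted in the text.
# Assumption: double data rate, so the clock runs at half the transfer rate.
DATA_RATE_MTPS = 2400.0                 # mega-transfers per second per pin
CLOCK_MHZ = DATA_RATE_MTPS / 2.0        # two transfers per clock cycle
T_CK_NS = 1000.0 / CLOCK_MHZ            # clock period in nanoseconds

T_CAL_NS = 5 * T_CK_NS                  # tCAL = 5 tCK (per-command penalty of CAL)
T_CKE_NS = 5.0                          # minimum residency in the PD mode
T_XP_NS = 6.0                           # wait after PD exit before a valid command
PD_ENTER_EXIT_NS = T_CKE_NS + T_XP_NS   # tCKE + tXP for one PD enter/exit

print(f"tCK       = {T_CK_NS:.3f} ns")
print(f"tCAL      = {T_CAL_NS:.3f} ns")
print(f"tCKE+tXP  = {PD_ENTER_EXIT_NS:.1f} ns")
print(f"ratio     = {PD_ENTER_EXIT_NS / T_CAL_NS:.2f}x")
```

Under these assumptions tCAL is roughly 4.2ns, which is well below the 11ns of tCKE+tXP; the trade-off is that CAL pays its penalty on every command, whereas the PD mode pays only after a PD exit.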
V. IMPROVING EFFICIENCY WITHOUT PERFORMANCE DROP: EXPLOITING POWER-DOWN ADAPTIVELY
There have been proposals to exploit the power-down (PD) mode for saving DRAM static power, but with limited success. Entering/exiting the PD mode for every command causes excessive performance degradation due to the tCKE and tXP constraints explained in Section IV-B. Hur and Lin [11] suggested enforcing a rank to stay in a standby mode at least for a specified period (time-out) utilizing a per-rank counter, which resets on every command to the rank. Even if the counter expires, the rank does not enter the PD mode if there is any pending request to the rank in the memory controller. However, details of specifying its duration are missing in [11] . Ahn et al. [1] suggested making a DRAM rank enter a PD mode when all the banks are at the precharge state. Although reasonable, it is applicable only to the closed-page management scheme, not even considering more recent adaptive schemes [13] , [22] .
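The time-out rule attributed to [11] above can be sketched as a simple predicate. This is a minimal illustration, not the implementation of [11]: the `RankState` fields and the function name `should_enter_pd` are hypothetical, and the idle time is assumed to be tracked by a per-rank counter that resets on every command to the rank.

```python
from dataclasses import dataclass


@dataclass
class RankState:
    """Hypothetical per-rank bookkeeping kept by the memory controller."""
    idle_ns: float = 0.0        # time since the last command to this rank
    pending_requests: int = 0   # requests queued in the controller for this rank
    in_power_down: bool = False


def should_enter_pd(rank: RankState, timeout_ns: float) -> bool:
    """Fixed time-out rule: enter PD only after the rank has been idle for at
    least timeout_ns and the controller holds no pending request for it."""
    if rank.in_power_down:
        return False
    if rank.pending_requests > 0:
        return False            # counter expired, but work is waiting
    return rank.idle_ns >= timeout_ns


# Example with an illustrative 100 ns time-out.
print(should_enter_pd(RankState(idle_ns=150.0), timeout_ns=100.0))
print(should_enter_pd(RankState(idle_ns=150.0, pending_requests=1), timeout_ns=100.0))
```

The second call shows the case described in the text: the counter has expired, yet the rank stays in standby because a request is pending.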
In this paper, we propose a simple but effective scheme to better exploit the PD mode with a minimal performance penalty. It adaptively changes the time-out value (λ) of Hur and Lin [11] based on the access history of a rank ( Figure 5 ). This per-rank epoch-based scheme counts the number of PD exits in each epoch. If the count is above a certain threshold (θ hi ), it means that the rank enters the PD state too hastily, and hence the scheme increases λ by a fixed step at the next epoch. If the count is below a lower threshold (θ lo ), it is likely that the rank exits the PD state too slowly, so the scheme decreases λ by the same step at the next epoch. Otherwise, λ stays unchanged. λ changes within the range of (λ min , λ max ). These rules are based on the following observations. Because there exists correlation between memory access patterns over time, adaptive memory scheduling policies [13] , [14] are effective and so is this history-based PD management scheme. When a rank is busy serving requests, it is unlikely that the rank has no pending request. When it is mostly idle, it is better to stay in a PD mode. In both cases, the rank enters/exits the PD mode infrequently, and it is better not to increase λ. By making θ hi larger than θ lo , we can make λ oscillate less frequently. If λ goes up or down too far, it cannot return to its optimal value quickly when the memory access pattern changes. Therefore, the range of λ (λ min , λ max ) is required. The implementation cost of the proposed scheme, called Ad-PD, is low. In addition to [11] , one more register is needed per rank to hold the PD exit count and one register per channel to set an epoch. Throughout the extensive simulation, we empirically set (epoch interval, λ min , λ max , θ lo , θ hi , step) as (20us, 30ns, 2us, 2, 5, 10ns) on the CMP system specified in Section VI. Unlike the proposal in [11] , which uses a fixed and predetermined λ, our scheme changes it dynamically over time.
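The epoch-based update rule of Ad-PD can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, the default parameters are the empirical values quoted above, the bounds on λ are treated as inclusive, and the initial λ (which the text does not specify) is supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class AdPdParams:
    """Defaults follow the empirical settings quoted in the text."""
    lam_min_ns: float = 30.0     # lower bound of the time-out
    lam_max_ns: float = 2000.0   # upper bound of the time-out (2 us)
    theta_lo: int = 2            # below this many PD exits: shrink the time-out
    theta_hi: int = 5            # above this many PD exits: grow the time-out
    step_ns: float = 10.0        # adjustment applied once per epoch


class AdPdRank:
    """Per-rank state: the current time-out and a PD-exit counter."""

    def __init__(self, params: AdPdParams, initial_lam_ns: float):
        self.p = params
        # Assumption: the starting time-out is clamped into the allowed range.
        self.lam_ns = min(max(initial_lam_ns, params.lam_min_ns), params.lam_max_ns)
        self.pd_exits = 0

    def on_pd_exit(self) -> None:
        """Called whenever the rank leaves the PD mode."""
        self.pd_exits += 1

    def end_epoch(self) -> float:
        """Apply the update rule at an epoch boundary and reset the counter."""
        if self.pd_exits > self.p.theta_hi:
            self.lam_ns += self.p.step_ns      # PD entered too hastily
        elif self.pd_exits < self.p.theta_lo:
            self.lam_ns -= self.p.step_ns      # few PD exits: shorten the time-out
        self.lam_ns = min(max(self.lam_ns, self.p.lam_min_ns), self.p.lam_max_ns)
        self.pd_exits = 0
        return self.lam_ns


rank = AdPdRank(AdPdParams(), initial_lam_ns=100.0)
for _ in range(6):            # six PD exits in one epoch exceeds theta_hi
    rank.on_pd_exit()
print(rank.end_epoch())       # time-out grows by one step
print(rank.end_epoch())       # no PD exits in this epoch: time-out shrinks by one step
```

The gap between the two thresholds acts as a dead band: an epoch with between two and five PD exits leaves the time-out unchanged, which is what keeps λ from oscillating on every epoch.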
Our proposal is also different from RAMZzz [47] in that RAMZzz collects the histogram of idle periods (interval between two commands) over a much longer epoch (in the order of dozens of milliseconds) to adjust λ. While Ad-PD has much lower hardware complexity (a counter) compared to RAMZzz (80KB storage) per rank, Ad-PD tracks changes in memory access behaviors more nimbly and effectively improves the energy efficiency of the evaluated system.
VI. EXPERIMENTAL SETUP
We simulated a chip-multiprocessor (CMP) system with DDR4 from two manufacturers (A and B) and the modified LPDDR4 (LP4') to evaluate their performance and energy efficiency on multi-programmed and multi-threaded workloads; DDR4 from C and B have similar power consumption. Table 2 tabulates the default parameters of the simulated system. DRAM timing, dynamic energy, and static power values are listed in Table 3 . Each ECC DIMM uses ×4 DRAM devices, where their per-pin data transfer rate is 2400Mbps. LP4' was modeled following the methodology described in Sections II-B and III-B; its VDD is 1.1V, the same as LPDDR4, whereas it uses the I/O of DDR4 and its (datapath width, page size) are (×4, 512 bits) instead of (×16, 2,048 bits). RDIMMs are used for the 2-rank configuration, and LRDIMMs or TSV-RDIMMs are used for the 8-rank configuration. Dynamic energy and static power of B-TSV are estimated based on B with the overheads (e.g., additional data transfer energy through TSVs) detailed in [36] applied. A modified version of McPAT [29] was used for modeling a CMP fabricated at the 14nm technology, where the processor dissipates 25W at idle. We modified McSimA+ [2] to support the various power-down (PD) modes including CAL and the adaptive PD scheme.
FIGURE 6. Relative IPC (higher is better) and EDP (lower is better) as well as power breakdown of multi-programmed and multi-threaded workloads on the simulated chip multiprocessor systems with DDR4 from A, B, B with TSV-RDIMM (B-TSV), and LPDDR4' (LP4'). We set B as baseline for a given application and ranks per channel.
[Table caption fragment: ... [36] from the manufacturer B, while LP4' is the modified LPDDR4. P standby is the standby power of one DRAM rank.]
The SPEC CPU2006 [10] benchmark suite was used for multi-programmed workloads. We used Simpoint [39] to identify and use the most representative simulation point of each application, which consists of 100M instructions. We categorized the SPEC applications based on their memory accesses per kilo-instruction values and composed two mixes based on their memory bandwidth demands; mix-high consists of two instances of mcf, milc, leslie3d, soplex, GemsFDTD, libquantum, and lbm, and one instance of omnetpp and sphinx3; mix-blend selects 16 applications randomly and assigns one instance each to cores from perlbench, bzip2, gobmk, dealII, bwaves, zeusmp, sjeng, h264ref, astar, xalancbmk, mcf, milc, GemsFDTD, lbm, omnetpp, and sphinx3. We reported aggregate IPC for multi-programmed workloads as they closely tracked the weighted speedup [41] values. For multi-threaded workloads, we ran the regions of interest of MICA [30] (a key-value store), fluidanimate in PARSEC [7] , and LU in SPLASH-2X [46] . MICA is configured to run at the exclusive read/write and full LRU mode with evenly distributed 128B keys and 1024B values. LU and fluidanimate use simlarge datasets.
VII. EVALUATION
We evaluate the performance (IPC) and energy efficiency (energy-delay product (EDP)) of exploiting low-power mobile DRAM technologies, 3D stacking, various power-down modes for static power saving, and data bus inversion (DBI) for dynamic energy saving using multi-programmed and multi-threaded workloads on the simulated CMP systems. Figure 6 shows the relative IPC, EDP, and power breakdown of the workloads with DDR4 from A, B, B with TSV-RDIMM (B-TSV), and LPDDR4' (LP4'). We make the following key observations. First, the system with more power-efficient DDR4 from B is superior to A in EDP (lower is better) over the tested workloads. B and A have the same timing values, so the more energy-efficient device yields the better EDP. When 8 ranks are populated per channel, this gap in EDP is wider than on the 2-rank systems because A dissipates larger static power from DRAM devices with 8 ranks per channel. Second, compared to the system with less power-efficient DDR4 from A, the system with LP4' is consistently worse in EDP except for the multi-threaded workload LU on the 8-rank systems. Even though LP4' dissipates lower power than A and B, it performs worse than DDR4 due to larger timing parameter values. The impact of the lower performance of LP4' on EDP is larger than the difference in power consumption. Therefore, LP4' does not provide better EDP than DDR4 from either A or B. Third, lowering DRAM dynamic energy by utilizing TSV-RDIMM is effective. B-TSV consumes less power than the already energy-efficient B. TSV-RDIMM is more effective on the 8-rank configuration because the other DRAM devices are augmented with data buffer (DB) chips to retain the data transfer rate at the worse signal integrity, which increases DRAM static power noticeably. Moreover, TSV-RDIMM performs better as well because it experiences no tRTRS penalty in memory channel ownership changes among ranks that are stacked together.
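The reason a performance loss can outweigh a power saving in EDP follows from the definition of the metric. Since energy is average power times delay, EDP equals power times delay squared, so delay enters quadratically. The sketch below illustrates this; it assumes a fixed amount of work at a fixed clock frequency so that delay is inversely proportional to IPC, the function name `relative_edp` is hypothetical, and the example ratios are illustrative values rather than results from Figure 6.

```python
def relative_edp(power_ratio: float, ipc_ratio: float) -> float:
    """EDP of a candidate system relative to a baseline.

    EDP = energy * delay = (power * delay) * delay = power * delay**2.
    Assumption: same instruction count and clock, so delay scales as 1 / IPC.
    """
    delay_ratio = 1.0 / ipc_ratio
    return power_ratio * delay_ratio ** 2


# Illustrative only: 5% lower system power but 10% lower IPC.
print(f"{relative_edp(power_ratio=0.95, ipc_ratio=0.90):.3f}")
```

With these illustrative ratios the relative EDP exceeds 1, i.e., the candidate is worse than the baseline despite its lower power, which mirrors the qualitative argument made for LP4' above.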
The evaluated power-down (PD) schemes are effective in reducing DRAM power consumption, among which CAL dissipates the smallest power, whereas the proposed adaptive PD scheme (Ad-PD) achieves the best (lowest) EDP. CAL makes a DRAM device stay in a PD mode (I DD2NL ) most frequently as it turns the input/output buffers on only when commands are delivered to the device, reducing the DRAM power most. However, CAL makes every DRAM command experience an additional delay of tCAL. PD enforces a rank to stay at a standby mode where it consumes more static power but can receive commands without latency penalty anytime the corresponding controller has at least a pending request to the rank [1] . This reduces average access latency for memory intensive workloads, such as mix-high, compared to CAL as PD experiences the PD exit penalty less frequently. On the contrary, PD suffers from PD enter/exit latency penalty, which is several times higher than tCAL, on applications with light/medium bandwidth demands (e.g., mix-blend and LU). Therefore, there is no clear winner between PD and CAL in EDP. Ad-PD performs better than PD as the former suffers less frequently from the hasty PD entries, and results in better EDP for the tested applications. Ad-PD is better than B-TSV, CAL, and PD in EDP on mix-high by 4.0%, 4.0%, and 3.9%, and on LU of SPLASH-2X by 6.6%, 4.1%, and 6.9%.
[Figure caption fragment: ... [11] (Hur-PD), and B-TSV with the proposed PD scheme (Ad-PD).]
Exploiting DBI techniques is not effective in EDP with the latency penalty specified in the DDR4 standard. We used the two-rank configurations as DRAM dynamic power takes a larger portion of total system power with fewer populated ranks. Even though the DDR4 standard does not define DBI for ×4 devices, we assume that DBI is implemented with an additional DBI pin per DRAM device and denote it by DBI. Recently, Song and Ipek [43] proposed More is Less, which utilizes a bandwidth-inefficient but energy-efficient DBI code when channels are lightly loaded, and another bandwidth-efficient but less-energy-efficient code for heavily loaded cases. To understand the upper bound of its energy savings, we model the I/O energy of the energy-efficient (3-LWC [43] , only up to three zeros in an 8-bit group of data burst) code and the bandwidth penalty of the bandwidth-efficient (MiLC [43] , burst length 10 instead of 8) code, and denote it by DBI-MiL. The latency penalty in tCL is 3tCK for both DBI and DBI-MiL. As shown in Figure 7 (right), DBI and DBI-MiL achieve higher (worse) EDP values than the baseline, while TSV-RDIMM without DBI (TSV) performs best. Both lower DRAM power, but the performance penalty due to increased tCL outweighs the power saving, leading to worse EDP. DBI-MiL further reduces DRAM I/O energy compared to DBI, but its longer burst length exacerbates the performance loss for memory-intensive workloads such as mix-high, whereas I/O energy saving takes a small portion of system energy for applications with medium to low memory bandwidth demands.
VIII. CONCLUSION
Mainstream DRAMs for servers/desktops have adopted the advantages of fabrication technologies, circuit techniques, and microarchitectures used by popular graphics or mobile DRAMs. Based on this observation, we demonstrated that the prior proposal applying mobile DRAMs to big-memory servers becomes ineffective due to insufficient energy saving over performance penalty that increases the energy consumption of other system components such as CPU. Thus, we paid more attention to other energy saving techniques introduced by the latest DDR4. In particular, we found that the data transfer energy saving by data bus inversion (DBI) does not overcome the energy overhead induced by performance penalty, whereas exploiting power-down (PD) modes pays off the cost of PD entrance/exit latencies as it reduces DRAM standby power, a major portion of DRAM power consumption for big-memory servers. Subsequently, we proposed a simple but effective PD scheme and improved system-level energy-delay product by 4.0% over the default PD schemes on memory-intensive multi-programmed workloads. Lastly, we analyzed and quantified the benefits of combining our proposals with TSV-RDIMM on performance and energy efficiency for big-memory servers.
