This paper summarizes the idea of Tiered-Latency DRAM (TL-DRAM), which was published in HPCA 2013 [73], and examines the work's significance and future potential. The capacity and cost-per-bit of DRAM have historically scaled to satisfy the needs of increasingly large and complex computer systems. However, DRAM latency has remained almost constant, making memory latency the performance bottleneck in today's systems. We observe that the high access latency is not intrinsic to DRAM, but a trade-off made to decrease the cost per bit. To mitigate the high area overhead of DRAM sensing structures, commodity DRAMs connect many DRAM cells to each sense amplifier through a wire called a bitline. These bitlines have a high parasitic capacitance due to their long length, and this bitline capacitance is the dominant source of DRAM latency. Specialized low-latency DRAMs use shorter bitlines with fewer cells, but have a higher cost-per-bit due to greater sense amplifier area overhead.

To achieve both low latency and low cost per bit, we introduce Tiered-Latency DRAM (TL-DRAM). In TL-DRAM, each long bitline is split into two shorter segments by an isolation transistor, allowing one of the two segments to be accessed with the latency of a short-bitline DRAM without incurring a high cost per bit. We propose mechanisms that use the low-latency segment as a hardware-managed or software-managed cache. Our evaluations show that our proposed mechanisms improve both performance and energy efficiency for both single-core and multiprogrammed workloads.
Tiered-Latency DRAM has inspired several other works on reducing DRAM latency with little to no architectural modification [20, 21, 22, 24, 37, 38, 68, 72, 116, 117, 118].
Problem: High DRAM Latency
Primarily due to its low cost per bit, DRAM has long been the substrate of choice for architecting main memory subsystems. In fact, DRAM's cost per bit has been decreasing at a rapid rate as DRAM process technology scales to integrate ever more DRAM cells into the same die area. As a result, each successive generation of DRAM has enabled increasingly larger-capacity main memory subsystems at low cost.
In stark contrast to the continued scaling of cost per bit, the latency of DRAM has remained almost constant. During the same 11-year interval in which DRAM's cost per bit decreased by a factor of 16, DRAM latency (as measured by the tRCD and tRC timing constraints)¹ decreased by only 30.5% and 26.3% [6, 47], respectively, as shown in Figure 1. From the perspective of the processor, an access to DRAM takes hundreds of cycles, time during which the processor may be stalled, waiting for DRAM [3, 34, 48, 92, 93, 96]. This wasted time due to stalling on DRAM leads to large performance degradation.

Figure 1: Change in DRAM capacity and latency over time [6, 47, 100, 111]. Reproduced from [73].
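To make the gap between the two trends concrete, the following sketch converts the figures quoted above (a 16x cost-per-bit reduction, and 30.5% and 26.3% reductions in tRCD and tRC, over 11 years) into equivalent yearly rates. It assumes a constant compound rate of improvement across the interval, which is a simplification made here for illustration and not a claim from the original paper; the function and variable names are hypothetical.

```python
# Figures quoted in the text for the same 11-year interval.
YEARS = 11
COST_PER_BIT_FACTOR = 16      # cost per bit decreased by a factor of 16
T_RCD_REDUCTION = 0.305       # tRCD decreased by 30.5%
T_RC_REDUCTION = 0.263        # tRC decreased by 26.3%


def annual_reduction(remaining_fraction: float, years: int = YEARS) -> float:
    """Yearly fractional reduction that compounds to `remaining_fraction`.

    Assumption (not from the paper): the improvement rate is constant
    from year to year.
    """
    return 1.0 - remaining_fraction ** (1.0 / years)


cost_per_bit_rate = annual_reduction(1.0 / COST_PER_BIT_FACTOR)
t_rcd_rate = annual_reduction(1.0 - T_RCD_REDUCTION)
t_rc_rate = annual_reduction(1.0 - T_RC_REDUCTION)

if __name__ == "__main__":
    print(f"cost per bit: {cost_per_bit_rate:.1%} lower per year")
    print(f"tRCD:         {t_rcd_rate:.1%} lower per year")
    print(f"tRC:          {t_rc_rate:.1%} lower per year")
```

Under this assumption, cost per bit improves roughly an order of magnitude faster per year than either timing constraint.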
Key Observations and Our Goal
Bitline: Dominant Source of Latency. In DRAM, each bit is represented as electrical charge in a capacitor-based cell. The small size of this capacitor necessitates the use of an auxiliary structure, called a sense amplifier, to (1) detect the small amount of charge held by the cell and (2) amplify it to a full digital logic value. A sense amplifier is approximately one hundred times larger than a cell [107]. To amortize their large size, each sense amplifier is connected to many DRAM cells through a wire called a bitline.² Every bitline has an associated parasitic capacitance, whose value is proportional to the length of the bitline. Unfortunately, the parasitic capacitance slows down DRAM operation for two reasons. First, it increases the latency of the sense amplifiers. When the parasitic capacitance is large, a cell cannot quickly create a voltage perturbation on the bitline that can be easily detected by the sense amplifier. Second, the capacitance increases the latency of charging and precharging the bitlines. Although the cell and the bitline must be restored to their quiescent voltages during and after an access to a cell, such a procedure takes much longer when the parasitic capacitance of the bitline is large. Due to these two reasons, and based on a detailed latency breakdown discussed in Section 3.1 of our HPCA 2013 paper [73], we conclude that long bitlines are the dominant source of DRAM latency [44, 72, 73, 90, 91, 122].

¹ The overall DRAM latency can be decomposed into individual DRAM timing constraints. Two of the most important timing constraints are tRCD (row-to-column delay) and tRC (row-cycle time).

² We refer the reader to our prior works for a detailed background on DRAM architecture and operation [21, 22, 23, 24, 37, 38, 54, 56, 57, 58, 59, 60, 68, 69, 71, 72, 73, 75, 76, 99, 103, 116, 117].
Latency vs. Cost Trade-Off. The bitline length is a key design parameter that exposes the important trade-off between latency and die size (cost). Short bitlines (i.e., bitlines connected to only a few cells) constitute a small electrical load (parasitic capacitance), which leads to low latency. However, they require more sense amplifiers for a given DRAM capacity (Figure 2a), which leads to a large die size. In contrast, long bitlines have high latency and a small die size (Figure 2b). As a result, neither of these two approaches can optimize for both latency and cost per bit.

Figure 3 shows the trade-off between DRAM latency and die size by plotting the latency (tRCD and tRC) and the die size for different values of cells per bitline. Existing DRAM architectures are either (1) optimized for die size (e.g., commodity DDR3 [86, 111]) and are thus low cost but high latency; or (2) optimized for latency (e.g., RLDRAM [85], FCRAM [112]) and are thus low latency but (very) high cost.

The goal of our HPCA 2013 paper [73] is to design a new DRAM architecture to approximate the best of both worlds (i.e., low latency and low cost), based on our key observation that long bitlines are the dominant source of DRAM latency.
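The shape of this trade-off can be sketched with a first-order model built only from two statements made earlier: a sense amplifier is roughly one hundred times larger than a cell, and bitline capacitance is proportional to bitline length. The model below is an illustrative assumption, not the area or latency model used in the paper; it ignores all other die-area contributors, and the function names are hypothetical.

```python
# From the text: a sense amplifier is approximately 100x larger than a cell.
SENSE_AMP_AREA_IN_CELLS = 100


def area_per_bit(cells_per_bitline: int) -> float:
    """Array area per stored bit, in units of one cell's area.

    First-order assumption: each bitline needs one sense amplifier, so its
    area is amortized over the cells that share the bitline.
    """
    return 1.0 + SENSE_AMP_AREA_IN_CELLS / cells_per_bitline


def relative_bitline_capacitance(cells_per_bitline: int, reference: int = 512) -> float:
    """Parasitic capacitance relative to a `reference`-cell bitline.

    Uses the text's statement that capacitance is proportional to length.
    """
    return cells_per_bitline / reference


if __name__ == "__main__":
    for cells in (32, 64, 128, 256, 512):
        print(cells, area_per_bit(cells), relative_bitline_capacitance(cells))
```

Even in this crude model, shortening the bitline lowers the electrical load linearly while the per-bit area grows, which is the tension that TL-DRAM targets.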
Tiered-Latency DRAM
To achieve the latency advantage of short bitlines and the cost advantage of long bitlines, we propose the Tiered-Latency DRAM (TL-DRAM) architecture, which is shown in Figures 2c and 4a. The key idea of TL-DRAM is to divide the long bitline into two shorter segments using an isolation transistor: the near segment (connected directly to the sense amplifier) and the far segment (connected through the isolation transistor). The primary role of the isolation transistor is to electrically decouple the two segments from each other. This changes the effective bitline length (and also the effective bitline capacitance) as seen by the cell and sense amplifier. Correspondingly, the latency to access a cell also changes, albeit differently depending on whether the cell is in the near or the far segment.
When accessing a cell in the near segment, the isolation transistor is turned off, disconnecting the far segment (Figure 4b). Since the cell and the sense amplifier see only the reduced bitline capacitance of the shortened near segment, they can drive the bitline voltage more easily. As a result, the bitline voltage is restored more quickly, and, thus, the latency (tRC) for the near segment is significantly reduced. On the other hand, when accessing a cell in the far segment, the isolation transistor is turned on to connect the entire length of the bitline to the sense amplifier. In this case, the isolation transistor acts like a resistor inserted between the two segments (Figure 4c) and limits how quickly charge flows to the far segment. Because the far segment capacitance is charged more slowly, it takes longer for the far segment voltage to be restored, and, thus, the latency (tRC) is increased for cells in the far segment.
Sensitivity to Segment Length. The lengths of the two segments are determined by where the isolation transistor is placed on the bitline. Assuming that the number of cells per bitline is fixed at 512 cells, the near segment length can range from as short as a single cell to as long as 511 cells. We perform circuit-level simulations to determine how the latency of each segment changes with the number of cells in the segment. Figures 5a and 5b plot the latencies of the near and far segments as a function of their length, respectively. For reference, the rightmost bars in each figure are the latencies of an unsegmented long bitline whose length is 512 cells. From these figures, we draw three conclusions. First, the shorter the near segment, the lower its latencies (tRCD and tRC). This is expected since a shorter near segment has a lower effective bitline capacitance, allowing it to be driven to target voltages more quickly. Second, the longer the far segment, the lower the far segment's tRCD. Recall from our previous discussion that the far segment's tRCD depends on how quickly the near segment (not the far segment) can be driven. A longer far segment implies a shorter near segment (lower capacitance), which is why tRCD decreases for the far segment. Third, the shorter the far segment, the smaller its tRC. The far segment's tRC is determined by how quickly it reaches the full voltage (VDD or 0). Regardless of the length of the far segment or the near segment, the current that trickles into it through the isolation transistor does not change significantly. Therefore, a shorter far segment (lower capacitance) reaches the full voltage more quickly.
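The three conclusions above can be restated as a small qualitative sketch that records which segment's capacitance each timing parameter tracks. It reports relative capacitances only, not latencies in nanoseconds: it is not a substitute for the circuit-level simulations, it omits the resistance of the isolation transistor, and the function names are hypothetical.

```python
CELLS_PER_BITLINE = 512  # fixed bitline length assumed in the text


def segment_capacitances(near_cells: int) -> tuple[float, float]:
    """Relative capacitance of the (near, far) segments.

    Assumes capacitance proportional to segment length, normalized to the
    unsegmented 512-cell bitline.
    """
    if not 1 <= near_cells <= CELLS_PER_BITLINE - 1:
        raise ValueError("near segment must hold between 1 and 511 cells")
    far_cells = CELLS_PER_BITLINE - near_cells
    return near_cells / CELLS_PER_BITLINE, far_cells / CELLS_PER_BITLINE


def latency_drivers(near_cells: int) -> dict[str, float]:
    """Capacitance that each timing parameter tracks, per the three conclusions."""
    near_c, far_c = segment_capacitances(near_cells)
    return {
        "near_tRCD": near_c,  # conclusion 1: shorter near segment, lower latency
        "near_tRC": near_c,
        "far_tRCD": near_c,   # conclusion 2: set by how fast the near segment is driven
        "far_tRC": far_c,     # conclusion 3: set by the far segment's own capacitance
    }


if __name__ == "__main__":
    for near in (32, 128, 256):
        print(near, latency_drivers(near))
```

Moving the isolation transistor toward the sense amplifier therefore helps the near segment and the far segment's tRCD, at the expense of the far segment's tRC.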
Latency Analysis (Circuit Evaluation). We model TL-DRAM in detail using SPICE simulations. Simulation parameters are mostly derived from a publicly available 55nm DDR3 2Gb process technology file [107], which includes information such as cell and bitline capacitances and resistances, physical floorplanning, and transistor dimensions. Transistor device characteristics were derived from [98] and scaled to agree with [107]. Figures 6 and 7 show the bitline voltages during activation and precharging, respectively. The x-axis origin (time 0) in the two figures corresponds to when the subarray receives the ACTIVATE or PRECHARGE command, respectively. In addition to the voltages of the segmented bitline (near and far segments), the figures also show the voltages of two unsegmented bitlines (short and long) for reference.
First, during an access to a cell in the near segment (Figure 6a), the far segment is disconnected and is floating (hence its voltage is not shown). The bitline starts at 0.5 VDD. Due to the reduced bitline capacitance of the near segment, its voltage increases almost as quickly as the voltage of a short bitline (the two curves overlap) during sensing and amplification. Since the near segment voltage reaches 0.75 VDD and VDD (the threshold and restored states, respectively) quickly, its tRCD and tRAS, respectively, are significantly reduced compared to a long bitline. Second, during an access to a cell in the far segment (Figure 6b), we can indeed verify that the voltages of the near and the far segments increase at different rates due to the resistance of the isolation transistor, as previously explained. Compared to a long bitline, while the near segment voltage reaches 0.75 VDD more quickly, the far segment voltage reaches VDD more slowly. As a result, tRCD for the far segment is reduced while its tRAS is increased.
While precharging the bitline after accessing a cell in the near segment (Figure 7a), the near segment reaches 0.5 VDD quickly due to the smaller capacitance, almost as quickly as the short bitline (the two curves overlap). On the other hand, precharging the bitline after accessing a cell in the far segment (Figure 7b) takes longer compared to the long-bitline baseline. As a result, tRP is reduced for the near segment and increased for the far segment.
Summary (Latency, Power, and Die-Area). Table 1 summarizes the latency, power, and die area characteristics of TL-DRAM compared to short-bitline and long-bitline DRAMs, estimated using circuit-level SPICE simulation [98] and power/area models from Rambus [107]. Compared to commodity DRAM (long bitlines), which incurs high latency (tRC) for all cells, TL-DRAM offers significantly reduced latency (tRC) for cells in the near segment, while increasing the latency for cells in the far segment due to the additional resistance of the isolation transistor. In DRAM, a large fraction of the power is consumed by the bitlines. Since the near segment in TL-DRAM has a lower capacitance, it also consumes less power. On the other hand, accessing the far segment requires toggling the isolation transistors, leading to increased power consumption. Mainly due to additional isolation transistors, TL-DRAM increases die area by 3% compared to commodity DRAM. Section 4 of our HPCA 2013 paper [73] includes detailed circuit-level analyses of TL-DRAM, along with detailed area, latency, and power estimations.
Leveraging TL-DRAM
TL-DRAM enables the design of many new memory management policies that exploit the asymmetric latency characteristics of the near and the far segments. Section 5 of our HPCA 2013 paper [73] describes four mechanisms that take advantage of TL-DRAM. Here, we describe two approaches in particular.
In the first approach, the memory controller uses the near segment as a hardware-managed cache for the far segment. In our HPCA 2013 paper [73], we discuss three policies for managing the near segment cache. The three policies differ in deciding when a row in the far segment is cached into the near segment and when the row is evicted. In addition, we propose a new data transfer mechanism (Inter-Segment Data Transfer) that efficiently migrates data between the segments by taking advantage of the fact that the bitline is a bus connected to the cells in both segments. By using this technique, the data from the source row can be transferred to the destination row over the bitlines at very low latency (an additional 4ns over tRC). Furthermore, this Inter-Segment Data Transfer happens exclusively within a DRAM bank without utilizing the DRAM channel, allowing concurrent accesses to other banks.
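As a rough illustration of the bookkeeping such a hardware-managed cache implies, the sketch below models a memory controller that tracks which far-segment rows currently reside in the near segment. It uses a plain least-recently-used replacement rule as a stand-in; this is an assumption made for illustration and is not one of the three policies evaluated in the paper. The class name, its fields, and the choice to cache every missed row are all hypothetical.

```python
from collections import OrderedDict


class NearSegmentCache:
    """Toy model of near-segment caching for one subarray (illustrative only)."""

    def __init__(self, near_rows: int) -> None:
        if near_rows < 1:
            raise ValueError("near segment must hold at least one row")
        self.near_rows = near_rows
        self._cached: OrderedDict[int, None] = OrderedDict()  # far-row id, in LRU order
        self.hits = 0
        self.misses = 0
        self.transfers = 0  # inter-segment data transfers issued on fills

    def access(self, far_row: int) -> bool:
        """Record an access; return True if it is served from the near segment."""
        if far_row in self._cached:
            self._cached.move_to_end(far_row)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self._cached) >= self.near_rows:
            # Evict the least recently used row. A real design would also
            # have to write a modified row back to the far segment.
            self._cached.popitem(last=False)
        self._cached[far_row] = None
        self.transfers += 1  # copy the row into the near segment over the bitlines
        return False


if __name__ == "__main__":
    cache = NearSegmentCache(near_rows=2)
    print([cache.access(row) for row in (1, 2, 1, 3, 2)])
```

Because each fill stays inside the bank, a model like this would charge the transfer latency to the bank only, leaving the channel free for other banks.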
In the second approach, the near segment capacity is exposed to the OS, enabling the OS to use the full DRAM capacity. We propose two concrete mechanisms, one where the memory controller uses an additional layer of indirection to map frequently-accessed pages to the near segment, and another where the OS uses static/dynamic profiling to directly map frequently-accessed pages to the near segment.
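A minimal sketch of the profiling-based variant is shown below: given a trace of page accesses, it selects the most frequently accessed pages for placement in the near segment. Ranking pages purely by access count is an assumption made here for illustration; the paper's actual profiling mechanism is not reproduced, and the function name and trace format are hypothetical.

```python
from collections import Counter
from typing import Iterable


def pick_near_segment_pages(access_trace: Iterable[int], near_capacity_pages: int) -> set[int]:
    """Return the page numbers to map to the near segment.

    Pages are ranked by access count in the profiled trace, and the top
    `near_capacity_pages` are selected.
    """
    counts = Counter(access_trace)
    return {page for page, _ in counts.most_common(near_capacity_pages)}


if __name__ == "__main__":
    trace = [7, 7, 7, 3, 3, 9]
    print(pick_near_segment_pages(trace, near_capacity_pages=2))
```

A static profile would run this once before execution, while a dynamic one would recompute the set periodically and migrate pages whose membership changes.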
In both approaches, the accesses to pages that are mapped to the near segment are served faster and with lower power than in conventional DRAM, resulting in improved system performance and energy efficiency.
We refer the reader to Section 5 of our HPCA 2013 paper [73] for a full description of use cases for TL-DRAM. Note that a very wide variety of techniques developed for cache management [105, 115, 119, 120, 132] can be adopted to manage the near segment in TL-DRAM.
Performance and Power Evaluation
Section 8 of our HPCA 2013 paper [73] provides a detailed evaluation of all of the above approaches to leverage TL-DRAM. Here, we present the evaluation results for only the first approach, in which the near segment is used as a hardware-managed cache managed under our best policy (Benefit-Based Caching), to demonstrate the advantages of our TL-DRAM substrate.
Methodology. To evaluate our mechanism, we use Ramulator [56, 110] , an open-source DRAM simulator, which is integrated into an in-house processor simulator. The released version of Ramulator [110] provides a model for TL-DRAM, which we hope future works use and build upon. A detailed methodology can be found in Section 7 of our HPCA 2013 paper [73] .
Performance & Power Analysis. Figure 8 shows the average performance improvement and power efficiency of our proposed mechanism over the baseline with conventional DRAM, on 1-, 2-, and 4-core systems. As described in Section 3, the access latency and power consumption are significantly lower for near segment accesses, but higher for far segment accesses, compared to accesses in a conventional DRAM. We observe that a large fraction (over 90% on average) of requests hit in the rows cached in the near segment, thereby accessing the near segment with low latency and low power consumption. As a result, TL-DRAM achieves significant performance improvements of 12.8%/12.3%/11.0%, and power savings of 23.6%/26.4%/28.6% in 1-/2-/4-core systems, respectively.

Sensitivity to Near Segment Capacity. The number of rows in the near segment presents a trade-off, since increasing the near segment's size increases its capacity but also increases its access latency. Figure 9 shows the performance improvement of our proposed mechanisms over the baseline as we vary the near segment size. Initially, performance improves as the number of rows in the near segment increases, since more data can be cached. However, increasing the number of rows in the near segment beyond 32 reduces the performance benefit due to the increased capacitance and hence the higher near segment access latencies.

Other Results. In our HPCA 2013 paper [73], we provide a detailed analysis of how timing parameters and power consumption vary when varying the near segment length (Sections 4 and 6.3 of [73], respectively). We also provide a comprehensive evaluation of the mechanisms we build on top of the TL-DRAM substrate for both single- and multi-core systems (Section 8 of [73]).
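The role of the near-segment hit rate in the performance analysis above can be made explicit with a simple weighted-average model. The latency values used below are placeholders chosen for illustration and are not measurements from the paper; the model also ignores the cost of inter-segment transfers and any queuing effects. The function names are hypothetical.

```python
def average_latency(hit_rate: float, t_near: float, t_far: float) -> float:
    """Expected access latency when a fraction `hit_rate` of requests hit the near segment."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * t_near + (1.0 - hit_rate) * t_far


def break_even_hit_rate(t_near: float, t_far: float, t_baseline: float) -> float:
    """Hit rate at which the average latency equals that of conventional DRAM.

    Assumes t_near < t_baseline < t_far, as described in the text.
    """
    return (t_far - t_baseline) / (t_far - t_near)


if __name__ == "__main__":
    # Placeholder latencies in nanoseconds (hypothetical, not from the paper).
    T_NEAR, T_FAR, T_BASELINE = 25.0, 65.0, 50.0
    print(average_latency(0.9, T_NEAR, T_FAR))
    print(break_even_hit_rate(T_NEAR, T_FAR, T_BASELINE))
```

With a hit rate above the break-even point, the slower far-segment accesses are outweighed by the faster near-segment ones, which is consistent with the high hit rate reported above translating into a net gain.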
Related Work
To our knowledge, our HPCA 2013 paper [73] is the first to i) enable latency heterogeneity in DRAM without significantly increasing the DRAM cost per bit, and ii) propose hardware/software mechanisms that leverage this latency heterogeneity to improve system performance. We make the following major contributions.
A Cost-Efficient Low-Latency DRAM. Based on the key observation that long internal wires (bitlines) are the dominant source of DRAM latency, our HPCA 2013 paper [73] proposes a new DRAM architecture called Tiered-Latency DRAM (TL-DRAM). To our knowledge, this is the first work to enable low-latency DRAM without significantly increasing the DRAM cost per bit. By adding a single isolation transistor to each bitline, we carve out a region within a DRAM chip, called the near segment, which is fast and energy-efficient. This comes at a modest overhead of a 3% increase in DRAM die area. While there are two prior approaches to reduce DRAM latency (using short bitlines [85, 112], adding an SRAM cache in DRAM [32, 36, 39, 142]), both of these approaches significantly increase die area due to additional sense amplifiers or additional area for an SRAM cache, as we evaluate in our full paper [73]. Compared to these prior approaches, TL-DRAM is a much more cost-effective architecture for achieving low latency.
There are many recent works that reduce overall memory access latency by modifying DRAM, the DRAM-controller interface, and DRAM controllers. These works enable more parallelism and bandwidth [22, 60, 71, 116], reduce refresh counts [50, 51, 52, 53, 75, 76, 103, 134], accelerate bulk operations [23, 114, 116, 117, 118], accelerate computation in the logic layer of 3D-stacked DRAM [1, 2, 7, 8, 33, 35, 40, 41, 55, 77, 101, 141], enable better communication between CPU and other devices through DRAM [69], leverage process variation and temperature dependency in DRAM [20, 21, 24, 70, 72], leverage design-induced variation in DRAM [68], leverage DRAM access patterns [37, 38, 123], reduce write-related latencies by better designing DRAM and DRAM control policies [26, 66, 113], and reduce overall queuing latencies in DRAM by better scheduling memory requests [29, 30, 31, 34, 42, 43, 49, 58, 59, 65, 87, 88, 89, 94, 95, 121, 126, 127, 133]. Our proposal is orthogonal to all of these approaches and can be applied in conjunction with them to achieve greater latency and energy benefits.
Inter-Segment Data Transfer. By implementing latency heterogeneity within a DRAM subarray, TL-DRAM enables efficient data transfer between the fast and slow segments by utilizing the bitlines as a wide bus. This mechanism takes advantage of the fact that both the source and destination cells share the same bitlines. Furthermore, this inter-segment migration happens only within a DRAM bank and does not utilize the DRAM channel, thereby allowing concurrent accesses to other banks over the channel. This inter-segment data transfer enables fast and efficient movement of data within DRAM, which in turn enables efficient ways of taking advantage of latency heterogeneity.
Other works that leverage latency heterogeneity in DRAM do not usually provide any efficient mechanism for inter-segment data migration between different latency segments. For example, Son et al. [124] propose a low-latency DRAM architecture that has different, fast (short bitline) and slow (long bitline) subarrays in DRAM. This approach provides a significant benefit only if latency-critical data is already allocated to the low-latency regions (the low-latency subarrays). Therefore, the overall memory system performance is very sensitive to the page placement policy, and the system cannot easily adapt to changes in the latency criticality of pages. In contrast, our new inter-segment data transfer mechanism enables efficient relocation of pages, leading to efficient dynamic page placement and relocation based on the dynamically determined latency criticality of each page. Several more recent works [23, 114, 116, 117] take advantage of our concept of an inter-segment data transfer mechanism to perform page copy/initialization and bulk bitwise operations completely within a DRAM chip.
Potential Long-Term Impact
Tolerating High DRAM Latency by Enabling New Layers in the Memory Hierarchy. Today, there is a large latency cliff between the on-chip last level cache and off-chip DRAM, leading to a large performance fall-off when applications start missing in the last level cache. By introducing an additional fast layer (the near segment) within the DRAM itself, TL-DRAM smooths this latency cliff.
Note that many recent works add a DRAM cache or create heterogeneous main memories [25, 28, 62, 63, 74, 81, 82, 83, 102, 106, 108, 109, 138, 140] to smooth the latency cliff between the last level cache and a longer-latency non-volatile main memory, e.g., phase-change memory [62, 63, 64, 83, 84, 104, 106, 137, 139], STT-MRAM [61, 83, 97, 135], or RRAM/memristors [27, 125, 136], or to exploit the advantages of multiple different types of memories to optimize for multiple metrics. Our approach is similar at a high level (i.e., it reduces the latency cliff at low cost by taking advantage of heterogeneity), yet we introduce the new low-latency layer within DRAM itself instead of adding a completely separate device. Tiered-Latency DRAM can also be used as a fast DRAM cache.
Applicability to Future Memory Devices. We show the benefits of TL-DRAM's asymmetric latencies. Considering that most memory devices adopt a similar cell organization (i.e., a two-dimensional cell array and row/column bus connections), our approach of reducing the electrical load of connecting to a bus (bitline) to achieve low access latency can be applicable to other memory devices. Furthermore, the idea of performing inter-segment data transfer can also potentially be applied to other memory devices, regardless of the memory technology. For example, we believe it is promising to examine similar approaches for emerging memory technologies like phase-change memory [62, 63, 64, 83, 84, 104, 106, 137, 139], STT-MRAM [61, 83, 97, 135], or RRAM/memristors [27, 125, 136], as well as NAND flash memory technology [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 78, 79, 80, 81].
New Research Opportunities. The TL-DRAM substrate creates new opportunities by enabling mechanisms that can leverage the latency heterogeneity offered by the substrate. We briefly describe three directions, but we believe that there are many new possibilities.
• New ways of leveraging TL-DRAM: TL-DRAM is a substrate that can be utilized for many applications. Although we describe two major ways of leveraging TL-DRAM in our HPCA 2013 paper [73] , we believe there are more ways to leverage the TL-DRAM substrate both in hardware and software. For instance, new mechanisms could be devised to detect data that is latency critical (e.g., data that causes many threads to become serialized [31, 45, 46, 130, 131] or data that belongs to threads that are more latency-sensitive or important [4, 5, 29, 58, 59, 65, 67, 126, 127, 128, 129, 133] ) or could become latency critical in the near future and allocate/prefetch such data into the near segment.
• Opening up new design spaces with multiple tiers: TL-DRAM can be easily extended to have multiple latency tiers by adding more isolation transistors to the bitlines, providing more latency asymmetry. Our HPCA 2013 paper [73] provides an analysis of the latency of a TL-DRAM design with three tiers, showing the spread in latency for three tiers. This enables new mechanisms both in hardware and software that can allocate data appropriately to different tiers based on their access characteristics such as locality, criticality, priority, etc.
• Inspiring new ways of architecting latency heterogeneity within DRAM: To our knowledge, TL-DRAM is the first to enable latency heterogeneity within DRAM by significantly modifying the existing DRAM architecture. We believe that this could inspire research on other possible ways of architecting latency heterogeneity within DRAM [20, 21, 24, 37, 38, 68, 70, 72] or other memory devices. Note that works published after our HPCA 2013 paper clearly pursue the promising direction it proposed [20, 21, 24, 37, 38, 68, 70, 72, 116].
