This article summarizes key results of our work on experimental characterization and analysis of latency variation and latency-reliability trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016 [24], and examines the work's significance and future potential. Our work is motivated by the need to reduce long DRAM latency, which is a critical performance bottleneck in current systems. DRAM access latency is defined by three fundamental operations that take place within the DRAM cell array: (i) activation of a memory row, which opens the row to perform accesses; (ii) precharge, which prepares the cell array for the next memory access; and (iii) restoration of the row, which restores the values of cells in the row that were destroyed due to activation. There is significant latency variation for each of these operations across the cells of a single DRAM chip due to irregularity in the manufacturing process. As a result, some cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation.
The goal of this work is to (i) experimentally characterize and understand the latency variation across cells within a DRAM chip for these three fundamental DRAM operations, and (ii) develop new mechanisms that exploit our understanding of the latency variation to reliably improve performance. To this end, we comprehensively characterize 240 DRAM chips from three major vendors, and make six major new observations about latency variation within DRAM. Notably, we find that (i) there is large latency variation across the cells for each of the three operations; (ii) variation characteristics exhibit significant spatial locality: slower cells are clustered in certain regions of a DRAM chip; and (iii) the three fundamental operations exhibit different reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. The key idea of FLY-DRAM is to exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations. Our evaluations show that FLY-DRAM improves the performance of a wide range of applications by 13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors' real DRAM chips, in a simulated 8-core system.
We have open-sourced the data from our research online. We hope the characterization and analysis we provide open up new research directions for both researchers and practitioners in computer architecture and systems.
Introduction
Over the past few decades, the long latency of memory has been a critical bottleneck in system performance. Increasing core counts, the emergence of more data-intensive and latency-critical applications, and increasingly limited bandwidth in the memory system are together leading to higher memory latency. Thus, low-latency memory operation is now even more important to improving overall system performance [30, 55, 93, 101, 102, 105, 143] .
The latency of a memory request is predominantly defined by the timings of three fundamental operations: (1) activation, which "opens" a row of DRAM cells to access stored data, (2) precharge, which "closes" an activated row, and (3) restoration, which restores the charge level of each DRAM cell in a row to prevent data loss. The latencies of these three DRAM operations, as defined by vendor specifications, have not improved significantly in the past 18 years, as depicted in Figure 1. This is especially true when we compare latency improvements to the capacity (128×) and bandwidth (20×) improvements [23] that commodity DRAM chips experienced in the past 18 years. In fact, the activation and precharge latencies increased from 2013 to 2015, when DDR DRAM transitioned from the third generation (12.5ns for DDR3-1600J [51]) to the fourth generation (14.06ns for DDR4-2133P [53]). As the latencies specified by vendors have not decreased over time, memory latency remains a critical system performance bottleneck in many modern applications, such as big data workloads [28] and Google's warehouse-scale workloads [55].
Figure 1: DRAM latency trends over time [50, 51, 53, 97]. Adapted from [24].
Motivation
In this work, we observe that the three fundamental DRAM operations can actually complete with a much lower latency for many DRAM cells than the vendor specification, because there is inherent latency variation present across the DRAM cells within a DRAM chip. This is a result of manufacturing process variation, which causes the sizes and strengths of cells to be different, thus making some cells faster and other cells slower to be accessed reliably [85]. The speed gap between the fastest and slowest DRAM cells is getting worse [20, 107], as the technology node continues to scale down to sub-20nm feature sizes. Unfortunately, instead of optimizing the latency specifications for the common case, DRAM vendors use a single set of standard access latencies, called timing parameters, which provide reliable operation guarantees for the worst case (i.e., the slowest cells), to maximize manufacturing yield.
We experimentally demonstrate that significant latency variation is present across DRAM cells in 240 DDR3 DRAM chips from three major vendors, and that a large fraction of cells can be read reliably even if the activation/restoration/precharge latencies are reduced significantly. By repeatedly testing these DRAM chips, we observe that access latency variation exhibits spatial locality within DRAM: slower cells cluster in certain regions of a DRAM chip. In Section 4, we propose a new mechanism, called FLY-DRAM, which exploits the lower latencies of DRAM regions with faster cells by introducing heterogeneous timing parameters into the memory controller. By analyzing and exploiting the latency variation that exists in DRAM cells, we can greatly reduce the DRAM access latency.
We discuss our major experimental observations in Section 3. For a detailed discussion on all of our observations, we refer the reader to our SIGMETRICS 2016 paper [24] .
Latency Variation Analysis
To capture the effect of latency variation in modern DDR3 DRAM chips, we tune the timing parameters that control the amount of time taken for each of the fundamental DRAM operations. We developed an FPGA-based DRAM testing platform [43] that allows us to precisely control the timing parameter values and the tested DRAM location (i.e., banks, rows, and columns). A photo of the platform is shown in Figure 2. Using this platform, we characterize latency variation on a total of 30 DDR3 DRAM modules (or DIMMs), comprising 240 DRAM chips from three major vendors. Each chip has a 1Gb density. Thus, each of our DIMMs has a 1GB capacity. Table 1 lists the relevant information about the tested DRAM modules. Unless otherwise specified, we test modules at an ambient temperature of 20±1℃. For results using higher temperatures, we refer the reader to Section 4.5 of our SIGMETRICS 2016 paper [24].
Table 1: Main properties of the tested DIMMs. Reproduced from [24].
In this section, we present a short summary of our key results on varying the activation, precharge, and restoration latencies, which are controlled by the tRCD, tRP, and tRAS timing parameters, respectively. For more details on the experimental results and observations, see Sections 4-6 of our SIGMETRICS 2016 paper [24] .
Behavior of Timing Errors
We analyze the variation in the latencies of activation, precharge, and restoration by operating DRAM at multiple reduced latencies for each of these operations. Faster cells are not affected by the reduced timings, and can be accessed reliably without any errors; however, slower cells cannot be read reliably with reduced latencies for the three operations, leading to bit flips. In this work, we define a timing error as a bit flip in a cell that occurs due to a reduced-latency access, and characterize timing errors incurred by the three DRAM operations.
Our experiments yield several new observations on the behavior of timing errors. When we reduce the three latencies, we observe that each latency exhibits a different level of impact on the inherently-slower cells. Lowering the activation latency (tRCD) affects only the cells (data) read in the first accessed cache line, but not the subsequently read cache lines from the same row. This is mainly due to two reasons. First, a READ command accesses only its corresponding sense amplifiers, without accessing the other columns. Hence, a READ's effect is isolated to its target cache line. Second, by the time a subsequent READ is issued to the same activated row, a sufficient amount of time has already passed for the row buffer to fully sense and latch in the row data. In contrast, lowering the restoration (tRAS) or precharge (tRP) latencies affects all cells in the activated row (see Section 5 of our SIGMETRICS 2016 paper [24] for a detailed explanation). Lowering these latencies affects the entire row because these commands operate at the row level, and they directly affect the restoration and sensing of all cells in the row.
We also find that the number of timing errors introduced is very sensitive to reducing the activation or precharge latency, but not that sensitive to reducing the restoration latency. We conclude that different levels of mitigation are required to address the timing errors that result from lowering each of the different DRAM operation latencies, and that reducing restoration latency to the lowest levels allowed by our infrastructure does not introduce timing errors in our experiments (see Section 6 in our SIGMETRICS 2016 paper [24]).
Timing Error Distribution
We briefly present the distribution of activation and precharge errors collected from all of the tests conducted on every DIMM. Figure 3 shows the box plots of the bit error rate (BER) observed on every DIMM as activation latency (tRCD) varies. The BER is defined as the fraction of bits with errors due to reducing tRCD in the total population of tested bits. In other words, the BER represents the fraction of cells that cannot operate reliably under the specified shortened latency. The box plot shows the maximum and minimum BER of all of our tested DIMMs as whiskers, and the box shows the quartiles of the distribution. In addition, we show all observation points for each specific tRCD/tRP value by overlaying them on top of their corresponding box plot. Each point shows a BER collected from one round of tests on one DIMM with a specific data pattern and tRCD value. For box plots showing the BER distribution when the precharge latency (tRP) is reduced, see Figure 12 in the original paper [24]. We make two observations from the BER distributions when reducing tRCD or tRP.
First, at tRCD or tRP values of 12.5ns and 10ns, we observe no timing errors on any DIMM due to reduced activation or precharge latency. This shows that the tRCD/tRP latencies of the slowest cells in our tested DIMMs likely fall between 7.5 and 10ns, which are lower than the value provided in the vendor specifications (13.125ns). DRAM vendors use the extra latency as a guardband to provide additional protection against process variation.
Second, there exists a large BER variation among DIMMs at tRCD of 7.5ns, and the BER variation becomes smaller as the tRCD or tRP value decreases. The number of fast cells that can operate at tRCD=7.5ns or tRP=7.5ns varies significantly across different DIMMs. These results demonstrate that there exists significant latency variation among and within DIMMs, as not all of the cells exhibit timing errors at 7.5ns.
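As a minimal illustration of the BER definition used in this section, the following Python sketch compares a written data pattern against the data read back at a reduced latency and reports the fraction of flipped bits. The function name and the example buffers are hypothetical and are not taken from our testing infrastructure.

```python
def bit_error_rate(expected: bytes, observed: bytes) -> float:
    """Fraction of tested bits that differ between written and read-back data."""
    if len(expected) != len(observed):
        raise ValueError("buffers must have the same length")
    if not expected:
        raise ValueError("no bits were tested")
    # XOR exposes flipped bits; count the 1s in each XORed byte.
    flipped = sum(bin(e ^ o).count("1") for e, o in zip(expected, observed))
    return flipped / (8 * len(expected))


# Hypothetical example: 4 bytes (32 bits) written, 3 bits flipped on read-back.
written = bytes([0xFF, 0x00, 0xAA, 0x55])
read_back = bytes([0xFE, 0x00, 0xAB, 0x15])
ber = bit_error_rate(written, read_back)
```

In this sketch, one BER value would correspond to one plotted point: one round of tests on one DIMM with one data pattern and one timing value.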
Spatial Locality of Timing Errors
In this section, we investigate the location and distribution of timing errors within a DIMM when the activation or precharge latencies are reduced. Figure 4 shows the probability of every cache line (64B) in one bank of a specific DIMM observing at least 1 bit of error with reduced activation latency (Figure 4a) or precharge latency (Figure 4b). See [24] for additional results. The x-axis and y-axis indicate the cache line number and row number (in thousands), respectively. In our tested DIMMs, the row size is 8KB, comprising 128 cache lines. The main observation is that timing errors due to reducing activation or precharge latency are not distributed uniformly across locations within this DIMM. Timing errors tend to cluster at certain regions of cache lines. For the remaining cache lines, we observe that they do not exhibit timing errors due to reduced latency throughout the experiments. We observe similar characteristics in other DIMMs: timing errors concentrate within certain spatial regions of memory.
We hypothesize that the spatial locality of timing errors is caused by the locality of variation in the fabrication process during manufacturing. Certain cache line locations can end up with less robust components, such as weaker sense amplifiers, weaker cells, or higher-resistance bitlines.
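The per-cache-line error probability discussed in this section can be sketched as follows. This is an illustrative Python fragment, not our actual analysis code: the error-record format (round, row, byte offset within the row) and the function name are assumptions, while the 8KB row and 64B cache line sizes come from the text above.

```python
from collections import defaultdict

ROW_BYTES = 8 * 1024          # row size stated in the text
CACHE_LINE_BYTES = 64         # cache line size stated in the text
LINES_PER_ROW = ROW_BYTES // CACHE_LINE_BYTES


def cache_line_error_probability(errors, num_rounds):
    """Map (row, cache line) to the fraction of test rounds with >= 1 bit error.

    errors: iterable of (round_id, row, byte_offset_in_row), one per erroneous byte
            (a hypothetical record format).
    """
    failing_rounds = defaultdict(set)
    for round_id, row, byte_offset in errors:
        line = byte_offset // CACHE_LINE_BYTES
        failing_rounds[(row, line)].add(round_id)
    return {loc: len(rounds) / num_rounds for loc, rounds in failing_rounds.items()}


# Hypothetical records from 4 test rounds.
observed = [(0, 5, 70), (0, 5, 100), (1, 5, 64), (3, 9, 8191)]
prob = cache_line_error_probability(observed, num_rounds=4)
```

Cache lines that never appear in the result observed no errors in any round; in a heat map such as Figure 4 they would correspond to a probability of zero.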
Other Characterization Results
We briefly summarize our other observations on the effects of reducing timing parameters. First, we analyze the number of timing errors that occur when DRAM access latencies are reduced, and experimentally demonstrate that most of the erroneous cache lines have a single-bit error, with only a small fraction of cache lines experiencing more than one bit flip (see Section 4.7 of our SIGMETRICS 2016 paper [24]). We conclude, therefore, that using simple error-correcting codes (ECC) can correct most of these errors, thereby enabling lower latency for many inherently slower cells (see Section 4.8 of our SIGMETRICS 2016 paper [24] for a detailed analysis of ECC).
Second, we find that the stored data pattern in cells affects access latency variation. Certain patterns lead to more timing errors than others. For example, the bit value 1 can be read significantly more reliably at a reduced access latency than the bit value 0 (see Section 4.4 of our SIGMETRICS 2016 paper [24]). This observation is similar to the data pattern dependence observation made for retention times of DRAM cells [57, 58, 59, 60, 86, 110].
Third, we find no clear correlation between temperature and variation in cell access latency. We believe that it is not essential for latency reduction techniques that exploit such variation to be aware of the operating temperature (Section 4.5 in [24]).
Exploiting Latency Variation
Based on our extensive experimental characterization and new observations on latency-reliability trade-offs in modern DRAM chips, we propose a new hardware mechanism, called Flexible-LatencY DRAM (FLY-DRAM), to reduce DRAM latency for better system performance. FLY-DRAM exploits the key observations that (i) different cells can operate reliably at different DRAM latencies, and (ii) there is a strong correlation between the location of a cell and the lowest latency at which the cell can operate reliably. The key idea of FLY-DRAM is to (i) categorize the DRAM cells into fast and slow regions, (ii) expose this categorization to the memory controller, and (iii) reduce overall DRAM latency by accessing the fast regions with a lower latency.
The FLY-DRAM memory controller (i) loads the latency profiling results [24] into on-chip SRAM at system boot time, (ii) looks up the profiled latency for each memory request based on its memory address, and (iii) applies the corresponding latency to the request. By reducing the values of tRCD, tRAS, and tRP for some memory requests, FLY-DRAM improves overall system performance. In addition, we also propose an OS page allocator design that exploits the latency variation in DRAM to improve system performance (see Section 7.2 of our paper [24]).
There are two key design challenges of FLY-DRAM. The first challenge is determining the fraction of fast cells within a DRAM chip and the innate access latency of the fast cells. Since DRAM vendors have detailed information on their DRAM chips from the DRAM post-production tests, DRAM vendors can embed the latency profiling results in the Serial Presence Detect (SPD) circuitry (a ROM present in each DIMM) [52]. The memory controller can read the profiling results from the SPD circuitry during DRAM initialization, and apply the correct latency for each DRAM region.
The second design challenge is limiting the storage overhead of the latency profiling results. Recording the shortest latency for each cache line can incur a large storage overhead. Fortunately, the storage overhead can be reduced based on a new observation of ours. As discussed in Section 3.3, timing errors typically concentrate at certain DRAM regions. Therefore, FLY-DRAM records the shortest latency at the granularity of DRAM regions (i.e., a group of adjacent cache lines, rows, or banks). One can imagine using more sophisticated structures, such as Bloom filters [6], to provide finer-grained latency information within a reasonable storage overhead, as shown in prior work on variable DRAM refresh intervals [87, 115].
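The region-granularity lookup described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the FLY-DRAM implementation: the class and field names are hypothetical, regions are assumed to be groups of adjacent rows, only two latency classes are modeled, and the timing values reuse the 13.125ns specification value and the 7.5ns reduced value mentioned in Section 3 purely as examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timings:
    tRCD_ns: float
    tRP_ns: float


# Illustrative values only (assumption): specification value vs. one reduced value.
STANDARD = Timings(tRCD_ns=13.125, tRP_ns=13.125)
FAST = Timings(tRCD_ns=7.5, tRP_ns=7.5)


class RegionLatencyTable:
    """Hypothetical per-region profile: one bit of state per region (slow or fast)."""

    def __init__(self, rows_per_region, slow_regions):
        self.rows_per_region = rows_per_region
        self.slow_regions = frozenset(slow_regions)

    def lookup(self, row):
        """Return the timings to apply to a request that targets the given row."""
        region = row // self.rows_per_region
        return STANDARD if region in self.slow_regions else FAST


# Hypothetical profile: regions of 512 rows, with region 3 marked as slow.
table = RegionLatencyTable(rows_per_region=512, slow_regions={3})
timings_for_fast_row = table.lookup(100)
timings_for_slow_row = table.lookup(1536)
```

Coarser regions shrink the table but force more fast cells to use the standard latency whenever any cell in their region is slow, which is the storage versus benefit trade-off discussed above.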
Summary of Results
We evaluate FLY-DRAM on an 8-core system with a wide variety of workloads by using Ramulator [64, 120], a cycle-level open-source DRAM simulator developed by our research group. A , D 7 B , and D 2 C correspond to latency profiles collected from three real DIMMs. Our SIGMETRICS 2016 paper [24] describes these real-DRAM profiles in more detail.
For these three DIMMs, FLY-DRAM improves system performance significantly, by 17.6%, 13.3%, and 19.5% on average across all 40 workloads. This is because FLY-DRAM reduces tRCD, tRP, and tRAS by 42.8%, 42.8%, and 25%, respectively, for a large fraction of cache lines. In particular, DIMM D 2 C, in which 99% of cells operate reliably at low tRCD and tRP, performs within 1% of the upper-bound performance (19.7% on average), which is obtained by operating all DRAM cells at low tRCD and tRP. We conclude that FLY-DRAM is an effective mechanism to improve system performance by exploiting the widespread latency variation present across DRAM cells.
Figure 5: System performance improvement of FLY-DRAM for various DIMMs. Reproduced from [24].
As we show in our SIGMETRICS 2016 paper [24] , FLY-DRAM can take advantage of an intelligent DRAM-aware page allocator that allocates frequently used and latency-critical pages in fast DRAM regions. We leave the detailed design and evaluation of such an allocator to future work.
Related Work
To our knowledge, this is the first work to (i) provide a detailed experimental characterization and analysis of latency variation for three major DRAM operations (tRCD, tRP, and tRAS) across different cells within a DRAM chip, (ii) demonstrate that a reduction in latency for each of these fundamental operations has a different impact on different cells, (iii) show that access latency variation exhibits spatial locality, (iv) demonstrate that the error rate due to reduced latencies is correlated with the stored data pattern but not conclusively correlated with temperature, and (v) propose mechanisms that take advantage of variation within a DRAM chip to improve system performance. We discuss the most closely related works here.
DRAM Latency Variation
Adaptive-Latency DRAM (AL-DRAM) also characterizes and exploits DRAM latency variation, but does so at a much coarser granularity [79]. That work experimentally characterizes latency variation across different DRAM chips under different operating temperatures. AL-DRAM sets a uniform operation latency for the entire DIMM. In contrast, our work characterizes latency variation within each chip, at the granularity of individual DRAM cells. Our mechanism, FLY-DRAM, can be combined with AL-DRAM to further improve performance. A recent work by Lee et al. [76] also observes latency variation within DRAM chips. The work analyzes the variation that is due to the circuit design of DRAM components, which it calls design-induced variation. Furthermore, it proposes a new profiling technique to identify the lowest DRAM latency without introducing errors. In this work, we provide the first detailed experimental characterization and analysis of the general latency variation phenomenon within real DRAM chips. Our analysis is broad and is not limited to design-induced variation. Our proposal for exploiting latency variation, FLY-DRAM, can employ Lee et al.'s new profiling mechanism [76] to identify additional latency variation regions for reducing access latency.
Chandrasekar et al. study the potential of reducing some DRAM timing parameters [21] . Similar to AL-DRAM, this work observes and characterizes latency variation across DIMMs, whereas our work studies variation across cells within a DRAM chip.
DRAM Error Studies
There are several studies that characterize various errors in DRAM. Many of these works observe how specific factors affect DRAM errors, analyzing the impact of temperature [32, 79] and hard errors [48]. Other works have conducted studies of DRAM error rates in the field, studying failures across a large sample size [84, 95, 123, 132, 133]. There are also works that have studied errors through controlled experiments, investigating errors due to retention time [43, 57, 58, 59, 60, 86, 110, 115], disturbance from neighboring DRAM cells [65, 101], latency variation across/within DRAM chips [21, 76, 78, 79], and supply voltage [26]. None of these works study errors due to latency variation across the cells within a DRAM chip, which we extensively characterize in our work.
DRAM Latency Reduction
Several types of commodity DRAM (Micron's RL-DRAM [98] and Fujitsu's FCRAM [122] ) provide low latency at the cost of high area overhead [68, 81] . Many prior works (e.g., [22, 25, 45, 68, 81, 88, 101, 102, 106, 125, 127, 128, 131, 150] ) propose various architectural changes within DRAM chips to reduce latency. In contrast, FLY-DRAM does not require any changes to a DRAM chip. Other works [44, 75, 124, 129, 130] reduce DRAM latency by changing the memory controller, and FLY-DRAM is complementary to them.
ECC DRAM
Many memory systems incorporate ECC DIMMs, which store information used to correct data during a read operation. Prior work (e.g., [39, 54, 60, 63, 83, 140, 142, 145, 146]) proposes more flexible or more powerful ECC schemes for DRAM. While these ECC mechanisms are designed to protect against faults using standard DRAM timings, we show that they also have the potential to correct timing errors that occur due to reduced DRAM latencies. A recent work by Lee et al. [76] exploits this observation and uses ECC to correct errors that occur due to reduced latency in DRAM.
Other Latency Reduction Mechanisms
Various prior works [1, 2, 3, 5, 7, 8, 25, 31, 33, 34, 35, 36, 38, 40, 42, 46, 47, 56, 62, 69, 92, 109, 111, 112, 114, 125, 126, 128, 129, 134, 139, 149] examine processing in memory to reduce DRAM latency. Other prior works propose memory scheduling techniques [4, 37, 49, 66, 67, 74, 99, 100, 103, 104, 135, 136, 137, 138, 141], which generally reduce the latency of accessing DRAM. Our analyses and techniques can be combined with these works to enable further low-latency operation.
Significance
Our SIGMETRICS 2016 paper [24] presents a new experimental characterization and analysis of latency variation in modern DRAM chips. In this section, we describe the potential impact that our study can have on the research community and industry.
Potential Research Impact
Our paper develops a new way of using manufactured DRAM chips: accessing different regions of memory using each region's inherent latency instead of a homogeneous fixed standard latency for all regions of memory. We show that (i) there is significant latency variation within a DRAM chip, and (ii) it is possible to exploit the variation with simple mechanisms. We believe one key impact of our paper is demonstrating the effectiveness of designing memory optimizations based on real-world characterization. We expect that this same principle can be used to craft new memory architectures for both existing and future memory technologies, such as SRAM, PCM [71, 72, 73, 116, 117, 147, 148], STT-MRAM [27, 41, 70], or RRAM [144].
Our work exposes several opportunities for both operating systems and hardware to further optimize for memory access latency. We have open-sourced our raw characterization data, to allow other researchers to further analyze and build on our work [120]. Other researchers can find many other ways to take advantage of the insights and the characterization data we provide. Our FLY-DRAM implementation is also available as part of the open-source release of Ramulator [64, 119].
ECC to Reduce Latency. In our paper, we analyze the distribution of timing errors (due to reduced latency) at the granularity of data beats, as conventional error-correcting codes (ECC) work at the same granularity. Our data shows that many of the erroneous data beats experience only a single-bit error, while the majority of the data beats contain no errors. Therefore, this creates an opportunity for applying ECC to correct timing errors. We also envision an opportunity for applying ECC to only certain regions of DRAM, which takes advantage of the spatial locality of timing errors exposed by our work. Lee et al. [76] provide examples of the use of ECC to reduce latency further, but they apply ECC globally to the entire DRAM chip. We believe a significant opportunity exists in customizing ECC to latency errors and different DRAM reliability issues.
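To make the data-beat granularity concrete, the following Python sketch counts flipped bits per data beat and reports what fraction of the erroneous beats contain exactly one flipped bit, i.e., the beats a single-error-correcting code could repair. It is an illustration only: the 64-bit beat width, the function names, and the example data are assumptions, not measurements from our study.

```python
from collections import Counter

BEAT_BYTES = 8  # assumption: a 64-bit data beat


def beat_error_histogram(expected: bytes, observed: bytes) -> Counter:
    """Count data beats by their number of flipped bits."""
    if len(expected) != len(observed) or len(expected) % BEAT_BYTES:
        raise ValueError("buffers must have equal length, a multiple of the beat size")
    hist = Counter()
    for i in range(0, len(expected), BEAT_BYTES):
        e = int.from_bytes(expected[i:i + BEAT_BYTES], "little")
        o = int.from_bytes(observed[i:i + BEAT_BYTES], "little")
        hist[bin(e ^ o).count("1")] += 1
    return hist


def single_bit_fraction(hist: Counter) -> float:
    """Fraction of erroneous beats that have exactly one flipped bit."""
    erroneous = sum(n for flips, n in hist.items() if flips > 0)
    return hist[1] / erroneous if erroneous else 1.0


# Hypothetical 64B cache line (8 beats) written as all zeros.
expected = bytes(64)
observed = bytearray(64)
observed[0] = 0x01    # beat 0: one flipped bit
observed[9] = 0x03    # beat 1: two flipped bits
observed[16] = 0x80   # beat 2: one flipped bit
hist = beat_error_histogram(expected, bytes(observed))
fraction = single_bit_fraction(hist)
```

Computing the same histogram separately per DRAM region would indicate where region-specific ECC, as envisioned above, is most useful.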
Data Pattern Dependence. We find that timing errors caused by reducing activation latency are dependent on the stored data pattern. Reading bit 1 is significantly more reliable than reading bit 0 at reduced activation latencies. This asymmetric sensing strength can potentially be a good direction for studying DRAM reliability. Currently, DRAM commonly employs data bus inversion [53] as an encoding scheme to reduce toggle rate on the data bus, thereby saving channel power [113]. Similar encoding techniques can be developed to reduce the number of 0s and increase the overall number of 1s in data. We believe that developing asymmetric data encodings or ECC mechanisms that favor 1s over 0s is a promising research direction to improve DRAM reliability.
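One possible shape of such an encoding is sketched below in Python. It is a hypothetical, inversion-style scheme loosely analogous to data bus inversion, not a mechanism we propose or evaluate: each byte with more 0s than 1s is stored inverted, together with a one-bit flag that records the inversion. The function names are ours.

```python
def encode_favoring_ones(data: bytes):
    """Invert every byte that has more 0s than 1s; return (encoded bytes, flags)."""
    encoded = bytearray()
    flags = []
    for byte in data:
        invert = bin(byte).count("1") < 4   # fewer than four 1s means 0s dominate
        encoded.append(byte ^ 0xFF if invert else byte)
        flags.append(invert)
    return bytes(encoded), flags


def decode_favoring_ones(encoded: bytes, flags) -> bytes:
    """Undo the inversion using the stored flags."""
    return bytes(b ^ 0xFF if inv else b for b, inv in zip(encoded, flags))


def count_ones(buf: bytes) -> int:
    return sum(bin(b).count("1") for b in buf)


# Hypothetical example data.
data = bytes([0x00, 0x0F, 0xFF, 0x01])
encoded, flags = encode_favoring_ones(data)
restored = decode_favoring_ones(encoded, flags)
```

With this rule, every stored byte holds at least four 1s. The flag bits add one bit of storage per byte and would themselves need to be read reliably, so a practical design would have to weigh that overhead against the reduction in 0s.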
DRAM-Aware Page Allocator. We developed a hardware mechanism (FLY-DRAM) that exploits latency variation to improve system performance in a software-transparent manner. Researchers can take better advantage of the variation by exposing the different latency regions to the software stack. In our SIGMETRICS 2016 paper [24], we discuss the potential of a DRAM-aware page allocator in the OS (Section 7.2), which can improve FLY-DRAM performance by intelligently mapping more frequently-accessed application pages to faster DRAM regions. We believe that the key idea of enabling the OS to allocate pages based on the accessed memory region's latency can be applied to other types of memory characteristics (e.g., energy efficiency or voltage [26, 29]) without needing to modify the architecture.
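The mapping policy behind such an allocator can be sketched as follows. This Python fragment is a static, simplified illustration under our own assumptions (known per-page access counts, a fixed list of frames already classified as fast or slow, hypothetical names throughout); a real OS allocator would have to track access frequency online and migrate pages.

```python
def assign_pages(access_counts, fast_frames, slow_frames):
    """Map the most frequently accessed pages to fast frames first.

    access_counts: dict of page id -> access count (hypothetical profile).
    fast_frames, slow_frames: lists of free physical frame numbers.
    """
    if len(access_counts) > len(fast_frames) + len(slow_frames):
        raise ValueError("not enough free frames")
    frames = list(fast_frames) + list(slow_frames)
    # Hottest pages first; ties are broken by page id for determinism.
    hottest_first = sorted(access_counts, key=lambda p: (-access_counts[p], p))
    return dict(zip(hottest_first, frames))


# Hypothetical workload: three pages, two fast frames, one slow frame.
counts = {"A": 900, "B": 15, "C": 300}
mapping = assign_pages(counts, fast_frames=[0, 1], slow_frames=[7])
```

Here the two hottest pages land in the fast frames and the coldest page in the slow frame, so most accesses would be served at the reduced latency.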
Applicability to Other Memory Technologies. In this work, we focus on characterizing only DRAM technology. A class of emerging memory technology is non-volatile memory (NVM), which has the capability of retaining data even when the memory is not powered. Since the memory organization of NVM mostly resembles that of DRAM [71, 96, 147], we believe that our characterization and optimization can be extended to different types of NVMs, such as PCM [71, 72, 73, 116, 117, 147, 148], STT-MRAM [27, 41, 70], or NAND flash memory [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 89, 90, 91], to further enhance their reliability or performance.
Long-Term Impact on Industry
High main memory latency remains a problem for many modern applications, such as in-memory databases (e.g., Redis [121] , MemSQL [94] , TimesTen [108] ), Spark, Google's datacenter workloads [28, 55] , and many mobile and interactive workloads. We propose two simple ideas that exploit latency variation in existing DRAM chips. Both can be adopted relatively easily in the processor architecture (i.e., the memory controller) or in the OS.
In addition to improving memory access latency, reducing the latency of the three fundamental DRAM operations also increases the effective memory bandwidth. To fully utilize the available memory bandwidth, memory controllers would have to maximize the number of READ or WRITE commands. However, due to interference between access streams within and across applications, memory controllers need to constantly open and close rows by issuing ACTIVATE and PRECHARGE commands due to an increasing number of bank conflicts [44, 68]. These commands increase the queuing latency of accesses (READ and WRITE), thus decreasing the effective memory bandwidth utilization.
As pin count is limited and increasing bus frequency is becoming more difficult (due to signal integrity issues [29]), our work offers a new alternative to help improve bandwidth utilization. By reducing the latency of DRAM operations, which fall on the critical path of DRAM access time, more accesses per second are allowed, thereby improving the overall effective bandwidth. Furthermore, improving latency and effective bandwidth also leads to lower memory energy consumption due to reduced execution time and fewer active cycles.
All these benefits (e.g., reduced latency, increased bandwidth, and reduced energy) will become much more important as applications become more data-intensive and systems become more energy-constrained in the foreseeable future [102, 105].
In conclusion, we believe that in the longer term, the idea of leveraging variation in different characteristics (e.g., latency, reliability) inside memory chips will become more beneficial for both the software and hardware industry. For example, by making the CPU aware of variation behavior in memory devices, memory vendors have an incentive to sell memory with larger variation at a lower price, allowing system designers to lower costs with a small amount of additional logic in hardware. Many other opportunities to improve system performance, energy, and cost abound, which we hope future work can build upon and exploit.
Conclusion
This paper provides the first experimental study that comprehensively characterizes and analyzes the latency variation within modern DRAM chips for three fundamental DRAM operations (activation, precharge, and restoration). We find that significant latency variation is present across DRAM cells in all 240 of our tested DRAM chips, and that a large fraction of cache lines can be read reliably even if the activation/restoration/precharge latencies are reduced significantly. Consequently, exploiting the latency variation in DRAM cells can greatly reduce the DRAM access latency. Based on the findings from our experimental characterization, we propose and evaluate a new mechanism, FLY-DRAM (Flexible-LatencY DRAM), which reduces DRAM latency by exploiting the inherent latency variation in DRAM cells. FLY-DRAM reduces DRAM latency by categorizing the DRAM cells into fast and slow regions, and accessing the fast regions with a reduced latency. We demonstrate that FLY-DRAM can greatly reduce DRAM latency, leading to significant system performance improvements on a variety of workloads.
We conclude that it is promising to understand and exploit the inherent latency variation within modern DRAM chips. We hope that the experimental characterization, analysis, and optimization techniques presented in this paper will enable the development of other new mechanisms that exploit the latency variation within DRAM to improve system performance and perhaps reliability.
