Towards Design and Analysis For High-Performance and Reliable SSDs
NAND Flash-based Solid State Disks (SSDs) have many attractive technical merits, such as low power consumption, light weight, shock resistance, tolerance of hotter operating regimes, and extraordinarily high performance for random read access. These merits have made SSDs immensely popular and widely employed in environments ranging from portable devices and personal computers to large data centers and distributed data systems.
However, current SSDs still suffer from several critical inherent limitations, such as the inability to update data in place, asymmetric read and write performance, slow garbage collection, limited endurance, and degraded write performance with the adoption of MLC and TLC techniques. To alleviate these limitations, we propose optimizations at both the external application layer and the SSDs' internal layer. Because SSDs strike a good compromise between performance and price, they are widely deployed as second-level caches between DRAM and hard disks to boost system performance. Due to special properties of SSDs such as internal garbage collection and limited lifetime, optimizations designed for traditional cache devices like DRAM and SRAM may not work consistently for SSD-based caches. Therefore, at the application layer, our work focuses on integrating these special properties of SSDs into the design of SSD caches. Our work also alleviates the increased Flash write latency and ECC complexity that come with MLC and TLC technologies by analyzing real-world workloads.
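One way the SSD properties above can enter cache design is through admission control: because a flash write is slower than a flash read and consumes endurance, a block should only be admitted when its expected read benefit repays the write. The following is a hypothetical minimal sketch of such a cost test; the policy, function names, and latency numbers are illustrative assumptions, not the dissertation's actual method.

```python
# Toy SSD-aware cache admission: admit a block to the SSD cache only when
# its expected read-hit benefit outweighs the cost of writing it to flash.
# All latency numbers below are illustrative assumptions, not measurements.

SSD_READ_US = 100    # assumed flash read latency (microseconds)
SSD_WRITE_US = 500   # assumed flash write latency (asymmetric: writes cost more)
HDD_READ_US = 5000   # assumed hard-disk read latency

def should_admit(expected_future_hits: float) -> bool:
    """Admit if hits saved from the HDD repay the flash write (and its wear)."""
    benefit = expected_future_hits * (HDD_READ_US - SSD_READ_US)
    cost = SSD_WRITE_US  # one flash write per admission; GC/wear would add more
    return benefit > cost

print(should_admit(0.05))  # cold block: 0.05 * 4900 us = 245 us < 500 us
print(should_admit(2.0))   # hot block: 2.0 * 4900 us = 9800 us > 500 us
```

A real policy would also account for garbage-collection amplification and remaining endurance, which is exactly where the SSD-specific properties discussed above matter.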
A DRAM-Based Neural Network Accelerator Architecture for Binary Neural Networks
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Computer Science and Engineering, College of Engineering, 2021. 2. Advisor: ์ ์น์ฃผ.
In convolutional neural network applications, most computation occurs in the multiply-accumulate operations of the convolution and fully-connected layers. From the hardware perspective (i.e., in the gate-level circuits), these operations are performed as many dot products between feature-map and kernel vectors. Since the feature map and kernel have matrix form, the vectors converted from 3D or 4D matrices are reused many times across matrix multiplications. As DNN throughput increases, the power consumption and performance bottleneck caused by data movement become more critical. More importantly, power consumption due to off-chip memory accesses dominates total power, since an off-chip memory access consumes several hundred times more power than a computation. Accelerator throughput is on the order of several hundred GOPS to several TOPS, but memory bandwidth is only 25.6 or 34 GB/s (with DDR4 or LPDDR4, respectively).
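The gap between those two numbers can be made concrete with a back-of-the-envelope roofline: the memory-bound throughput ceiling is bandwidth times arithmetic intensity. The sketch below is my illustration, and the 1 op/byte intensity and 400 GOPS peak are assumptions, not figures from the dissertation.

```python
# Back-of-the-envelope roofline: the memory-bound throughput ceiling equals
# bandwidth (GB/s) * arithmetic intensity (ops/byte). An intensity of
# 1 op/byte models a layer with no on-chip data reuse (assumed).

def memory_bound_gops(bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Upper bound on sustained GOPS when memory traffic is the limiter."""
    return bandwidth_gb_s * ops_per_byte

peak_gops = 400.0  # assumed compute peak, i.e. "several hundred GOPS"
for bw in (25.6, 34.0):  # DDR4 / LPDDR4 figures quoted in the abstract
    ceiling = memory_bound_gops(bw, ops_per_byte=1.0)
    print(f"{bw} GB/s -> at most {ceiling:.1f} GOPS "
          f"({ceiling / peak_gops:.0%} of the {peak_gops:.0f} GOPS peak)")
```

Without data reuse or reduced precision, the memory system caps sustained throughput at a small fraction of the compute peak, which is the bottleneck the following paragraph addresses.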
Both the data-movement power and the performance bottleneck can be mitigated by reducing the network size and/or the volume of data moved. Among such algorithms, quantization is widely used. Binary Neural Networks (BNNs) reduce precision dramatically, down to 1 bit. Their accuracy is much lower than that of FP16 networks, but it is continuously improving through ongoing research. On the dataflow-control side, redundant data movement can be reduced by increasing data reuse. These two methods are widely applied in accelerators because they require no additional computation during inference.
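The arithmetic that binarization enables can be sketched concretely: with ±1 values packed into machine words, a dot product collapses to XNOR plus popcount. This is the standard BNN identity dot(a, b) = n − 2·popcount(a XOR b); the minimal sketch below is mine, not code from the dissertation.

```python
# Binary dot product via XOR + popcount: with weights/activations in {-1, +1}
# encoded as bits {0, 1}, dot(a, b) = n - 2 * popcount(a XOR b), because each
# differing bit position contributes -1 and each matching position +1.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed as n-bit ints."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

def pack(vec):
    """Encode a {-1,+1} vector as a bit pattern (+1 -> 1, -1 -> 0)."""
    bits = 0
    for v in vec:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, -1, -1]
reference = sum(x * y for x, y in zip(a, b))
print(binary_dot(pack(a), pack(b), len(a)), reference)  # prints: 0 0
```

One wide XOR plus a population count replaces n multiply-accumulates, which is why BNNs map so naturally onto bitwise in-memory operations.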
In this dissertation, I present 1) a DRAM-based accelerator architecture and 2) a DRAM refresh method that mitigates the performance loss caused by DRAM refresh. The two methods are orthogonal, so they can be integrated into the same DRAM chip and operate independently.
First, we propose a DRAM-based accelerator architecture capable of massive, large vector dot-product operations. In the field of CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that exploit the cell-array structure of memory for vector dot products are being actively studied. Since DRAM stores all of the neural network data, it is well placed to reduce the amount of data transferred. The proposed architecture operates by utilizing the basic operations of DRAM, without changing the cell-array structure.
The second method reduces the performance degradation and power consumption caused by DRAM refresh. Because DRAM cannot be read or written while a periodic refresh is in progress, system performance decreases. The proposed refresh method tests the retention characteristics inside the DRAM chip during self-refresh and lengthens the refresh period according to those characteristics. Since it operates independently inside the DRAM, it can be applied to any system that uses DRAM, including deep neural network accelerators.
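The cost this method attacks can be quantified as the fraction of time a rank is blocked by refresh, tRFC/tREFI. The timing values below are typical DDR4 datasheet numbers used as assumptions here, not figures reported in this dissertation.

```python
# Fraction of time a DRAM rank is unavailable due to refresh: tRFC / tREFI.
# Values are typical DDR4 8Gb numbers (assumed), not from this dissertation.

T_REFI_NS = 7_800  # average interval between refresh commands (normal temp.)
T_RFC_NS = 350     # time one refresh command blocks the rank

def refresh_overhead(t_refi_ns: float, t_rfc_ns: float = T_RFC_NS) -> float:
    """Fraction of time lost to refresh for a given refresh interval."""
    return t_rfc_ns / t_refi_ns

base = refresh_overhead(T_REFI_NS)
relaxed = refresh_overhead(4 * T_REFI_NS)  # rows profiled as strong retainers
print(f"baseline overhead: {base:.1%}")    # ~4.5% of time spent refreshing
print(f"4x longer period:  {relaxed:.1%}") # ~1.1%
```

Lengthening the refresh period for rows whose measured retention allows it shrinks this overhead proportionally, which is the mechanism exploited above.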
We also surveyed system integration with a software stack so that the in-DRAM accelerator can be used from a deep-learning framework. As a result, we expect to be able to control in-DRAM accelerators with the memory-controller implementation method verified in the previous experiment. We have also added a performance-simulation function for the in-DRAM accelerator to PyTorch. When running a neural network in PyTorch, it reports the computation latency and data-movement latency occurring in each layer that runs on the in-DRAM accelerator, which is a significant advantage for predicting hardware performance while co-designing the network.
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Background
2.1 Neural Network Operation
2.2 Data Movement Overhead
2.3 Binary Neural Networks
2.4 Computing-in-Memory
2.5 Memory Bottleneck due to Refresh
Chapter 3 In-DRAM Neural Network Accelerator
3.1 Backgrounds
3.1.1 DRAM hierarchy
3.1.2 DRAM Basic Operation
3.1.3 DRAM Commands with Timing Parameters
3.1.4 Bit-wise Operation in DRAM
3.2 Motivations
3.3 Proposed architecture
3.3.1 Operation Examples of Row Operator
3.3.2 Convolutions on DRAM Chip
3.4 Data Flow
3.4.1 Input Broadcasting in DRAM
3.4.2 Input Data Movement With M2V
3.4.3 Internal Data Movement With SiD
3.4.4 Data Partitioning for Parallel Operation
3.5 Experiments
3.5.1 Performance Estimation
3.5.2 Configuration of In-DRAM Accelerator
3.5.3 Improving the Accuracy of BNN
3.5.4 Comparison with the Existing Works
3.6 Discussion
3.6.1 Performance Comparison with ASIC Accelerators
3.6.2 Challenges of The Proposed Architecture
3.7 Conclusion
Chapter 4 Reducing DRAM Refresh Power Consumption by Runtime Profiling of Retention Time and Dual-row Activation
4.1 Introduction
4.2 Background
4.3 Related Works
4.4 Observations
4.5 Solution overview
4.6 Runtime profiling
4.6.1 Basic Operation
4.6.2 Profiling Multiple Rows in Parallel
4.6.3 Temperature, Data Backup and Error Check
4.7 Dual-row Activation
4.8 Experiments
4.8.1 Experimental Setup
4.8.2 Refresh Period Improvement
4.8.3 Power Reduction
4.9 Conclusion
Chapter 5 System Integration
5.1 Integrate The Proposed Methods
5.2 Software Stack
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
Strong, thorough, and efficient memory protection against existing and emerging DRAM errors
Memory protection is necessary to ensure the correctness of data in the presence of unavoidable faults. As such, large-scale systems typically employ Error Correcting Codes (ECC) to trade redundant storage and bandwidth for increased reliability. Single Device Data Correction (SDDC) ECC mechanisms are required to meet the reliability demands of servers and large-scale systems by tolerating even severe faults that disable an entire memory chip. In the future, however, stronger memory protection will be required due to increasing levels of system integration, shrinking process technology, and growing transfer rates. The energy efficiency of memory protection is also important, as DRAM already consumes a significant fraction of the system energy budget. This dissertation develops a novel set of ECC schemes to provide strong, safe, flexible, and thorough protection against existing and emerging types of DRAM errors. This research also reduces the energy consumption of such protection while only marginally impacting performance. First, this dissertation develops Bamboo ECC, a technique with stronger-than-SDDC correction and very safe detection capabilities (≥ 99.999994% of data errors of any severity are detected). Bamboo ECC changes the ECC layout based on frequent DRAM error patterns; it can correct concurrent errors from multiple devices and all but eliminates the risk of silent data corruption. Bamboo ECC also provides flexible configurations that enable more adaptive graceful-downgrade schemes, in which the system continues to operate correctly even after severe chip faults, albeit at a reduced capacity to protect against future faults. These strength, safety, and flexibility advantages translate to a significantly more reliable memory subsystem for future exascale computing. This dissertation then focuses on emerging error types arising from scaling process technology and increasing data bandwidth.
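One plausible reading of the quoted detection figure: 1 − 99.999994% = 6 × 10⁻⁸, which is close to 2⁻²⁴, the aliasing probability of 24 well-mixed check bits against a random error. This interpretation is mine, not a claim made by the dissertation.

```python
# The abstract's ">= 99.999994% of data errors detected" leaves an undetected
# fraction of about 6e-8, close to 2**-24 -- the chance a random error pattern
# aliases past 24 independent check bits. Reading it this way is an assumption.

undetected = 1 - 0.99999994
alias_prob = 2 ** -24

print(f"undetected fraction: {undetected:.2e}")
print(f"2**-24:              {alias_prob:.2e}")
print(f"relative gap:        {abs(undetected - alias_prob) / alias_prob:.1%}")
```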
As DRAM process technology scales down below 10 nm, DRAM cells are becoming more vulnerable to errors from an imperfect manufacturing process. At the same time, DRAM signal transfers are becoming more susceptible to timing and electrical noise as DRAM interfaces keep increasing signal transfer rates and decreasing I/O voltage levels. With individual DRAM chips growing more vulnerable to errors, industry and academia have proposed mechanisms to tolerate these emerging error types; yet they are inefficient, because they rely on multiple levels of redundancy for cell errors and on ad-hoc schemes with suboptimal protection coverage for transmission errors. Active Guardband ECC and All-Inclusive ECC make systematic use of ECC and existing mechanisms to provide thorough end-to-end protection without requiring redundancy beyond what is common today. Finally, this dissertation targets the energy efficiency of memory protection. Frugal ECC combines ECC with fine-grained compression to provide versatile and energy-efficient protection. Frugal ECC compresses main memory at cache-block granularity, using any leftover space to store ECC information. Frugal ECC allows more energy-efficient memory configurations while maintaining SDDC protection. Its tailored compression scheme minimizes insufficiently compressed blocks and results in acceptable performance overhead. The strong, thorough, and efficient protection described in this dissertation may allow for more aggressive design of future computing systems with larger integration, finer process technology, higher transfer rates, and better energy efficiency.
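The compress-to-fund-ECC idea behind Frugal ECC can be sketched in a few lines. Here `zlib` stands in for the paper's tailored fine-grained compressor, and the 8-byte ECC footprint is an illustrative assumption; a real design would use a hardware-friendly compressor and a fallback path for blocks that do not shrink enough.

```python
import zlib

BLOCK = 64     # cache-block granularity in bytes
ECC_BYTES = 8  # assumed ECC footprint per block (illustrative)

def fits_inline(block: bytes) -> bool:
    """True if the compressed block leaves room for ECC within 64 bytes.
    Insufficiently compressed blocks would need a separate fallback path."""
    assert len(block) == BLOCK
    return len(zlib.compress(block)) + ECC_BYTES <= BLOCK

compressible = bytes(64)           # all zeros: compresses very well
incompressible = bytes(range(64))  # no repetition: zlib overhead dominates
print(fits_inline(compressible))   # True: ECC is stored "for free"
print(fits_inline(incompressible))
```

The fraction of blocks for which `fits_inline` fails is what the dissertation's tailored compression scheme is designed to minimize.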
High-Performance Energy-Efficient and Reliable Design of Spin-Transfer Torque Magnetic Memory
In this dissertation, new computing paradigms, architectures, and design philosophies are proposed and evaluated for adopting STT-MRAM technology as a highly reliable, energy-efficient, and fast memory. For this purpose, a novel cross-layer framework spanning from the cell level all the way up to the system and application levels has been developed. Within this framework, reliability issues are modeled accurately with appropriate fault models at different abstraction levels in order to analyze the overall failure rate of the entire memory and its Mean Time To Failure (MTTF), while also considering temperature and process-variation effects. Design-time, compile-time, and run-time solutions are provided to address the challenges associated with STT-MRAM. The effectiveness of the proposed solutions is demonstrated in extensive experiments that show significant improvements over state-of-the-art solutions, i.e., lower-power, higher-performance, and more reliable STT-MRAM designs.
Adaptation in Standard CMOS Processes with Floating Gate Structures and Techniques
We apply adaptation to ordinary circuits and systems to achieve high-performance, high-quality results. Mismatch in manufactured VLSI devices has been the main limiting factor in quality for many analog and mixed-signal designs. Traditional compensation methods are generally costly; a few examples include enlarging the device size, averaging signals, and laser trimming. By applying floating-gate adaptation to standard CMOS circuits, we demonstrate that we can trim CMOS comparator offset to a precision of 0.7 mV, reduce CMOS image-sensor fixed-pattern noise power by a factor of 100, and achieve 5.8 effective number of bits (ENOB) in a 6-bit flash analog-to-digital converter (ADC) operating at 750 MHz.
The adaptive circuits generally exhibit special features in addition to improved performance. These features are generally beyond the capabilities of traditional CMOS design approaches, and they open exciting opportunities for novel circuit designs. Specifically, the adaptive comparator can store an accurate arbitrary offset, the image sensor can be set up to memorize previously captured scenes like a human retina, and the ADC can be configured to adapt to the incoming analog signal distribution and perform an efficient signal conversion that minimizes distortion and maximizes output entropy.
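The 5.8-ENOB figure maps to a signal-to-noise-and-distortion ratio through the standard ADC relation ENOB = (SINAD − 1.76 dB) / 6.02 dB. The formula is standard; applying it to this converter is my illustration, not a computation from the source.

```python
# Standard ADC relation between effective number of bits (ENOB) and the
# signal-to-noise-and-distortion ratio (SINAD, in dB):
#   ENOB = (SINAD_dB - 1.76) / 6.02

def sinad_from_enob(enob: float) -> float:
    return 6.02 * enob + 1.76

def enob_from_sinad(sinad_db: float) -> float:
    return (sinad_db - 1.76) / 6.02

ideal_6bit = sinad_from_enob(6.0)  # 37.88 dB for a perfect 6-bit ADC
achieved = sinad_from_enob(5.8)    # 36.68 dB for the reported 5.8 ENOB
print(f"ideal 6-bit SINAD: {ideal_6bit:.2f} dB, at 5.8 ENOB: {achieved:.2f} dB")
```

In other words, the adaptive 6-bit flash ADC sits about 1.2 dB below the ideal quantization-noise limit at 750 MHz.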
Understanding and Improving the Latency of DRAM-Based Memory Systems
Over the past two decades, the storage capacity and access bandwidth of main
memory have improved tremendously, by 128x and 20x, respectively. These
improvements are mainly due to the continuous technology scaling of DRAM
(dynamic random-access memory), which has been used as the physical substrate
for main memory. In stark contrast with capacity and bandwidth, DRAM latency
has remained almost constant, reducing by only 1.3x in the same time frame.
Therefore, long DRAM latency continues to be a critical performance bottleneck
in modern systems. Increasing core counts, and the emergence of increasingly
more data-intensive and latency-critical applications further stress the
importance of providing low-latency memory access.
In this dissertation, we identify three main problems that contribute
significantly to long latency of DRAM accesses. To address these problems, we
present a series of new techniques. Our new techniques significantly improve
both system performance and energy efficiency. We also examine the critical
relationship between supply voltage and latency in modern DRAM chips and
develop new mechanisms that exploit this voltage-latency trade-off to improve
energy efficiency.
The key conclusion of this dissertation is that augmenting DRAM architecture
with simple and low-cost features, and developing a better understanding of
manufactured DRAM chips together lead to significant memory latency reduction
as well as energy efficiency improvement. We hope and believe that the proposed
architectural techniques and the detailed experimental data and observations on
real commodity DRAM chips presented in this dissertation will enable
development of other new mechanisms to improve the performance, energy
efficiency, or reliability of future memory systems.