
    Data retention in MLC NAND flash memory: Characterization, optimization, and recovery


    Towards Design and Analysis For High-Performance and Reliable SSDs

    NAND flash-based solid state disks (SSDs) have many attractive technical merits, such as low power consumption, light weight, shock resistance, tolerance of hotter operating regimes, and extraordinarily high performance for random read access, which make SSDs immensely popular and widely employed in many environments, including portable devices, personal computers, large data centers, and distributed data systems. However, current SSDs still suffer from several critical inherent limitations, such as the inability to update in place, asymmetric read and write performance, slow garbage collection, limited endurance, and degraded write performance with the adoption of MLC and TLC techniques. To alleviate these limitations, we propose optimizations both at the outside application layer and at the SSDs' internal layer. Because SSDs are a good compromise between performance and price, they are widely deployed as a second-layer cache between DRAM and hard disks to boost system performance. Due to special properties of SSDs such as internal garbage collection and limited lifetime, optimizations designed for traditional cache devices like DRAM and SRAM might not work consistently for SSD-based caches. Therefore, at the application layer, our work focuses on integrating these special properties of SSDs into the optimization of SSD caches. Our work also alleviates the increased flash write latency and ECC complexity introduced by MLC and TLC technologies by analyzing real-world workloads.
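    To make the application-layer direction concrete, the sketch below shows one way SSD-specific properties could be folded into cache management: a write-aware admission filter that caches a block only after repeated reads, so that short-lived data does not consume flash write endurance or trigger extra garbage collection. This is a minimal hypothetical policy written for illustration (the class name, threshold, and LRU structure are assumptions), not the dissertation's actual algorithm.

```python
from collections import OrderedDict, defaultdict

class WriteAwareSSDCache:
    """Hypothetical SSD cache that admits a block only after it has been
    read `admit_threshold` times, trading some hit rate for fewer flash writes."""

    def __init__(self, capacity_blocks, admit_threshold=2):
        self.capacity = capacity_blocks
        self.admit_threshold = admit_threshold
        self.cache = OrderedDict()           # block_id -> data, kept in LRU order
        self.read_counts = defaultdict(int)  # accesses seen for not-yet-cached blocks
        self.flash_writes = 0                # rough proxy for endurance consumed

    def read(self, block_id, fetch_from_hdd):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)   # LRU hit: refresh recency
            return self.cache[block_id]
        data = fetch_from_hdd(block_id)        # miss: fetch from the slow tier
        self.read_counts[block_id] += 1
        if self.read_counts[block_id] >= self.admit_threshold:
            self._admit(block_id, data)
        return data

    def _admit(self, block_id, data):
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used block
        self.cache[block_id] = data
        self.flash_writes += 1                 # each admission costs a flash write
```

    Raising admit_threshold trades some hit rate for fewer flash writes; the broader point is that such knobs have to be tuned with the SSD's internal behavior (garbage collection, limited lifetime) in mind.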

    ์ด์ง„ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ๋ฅผ ์œ„ํ•œ DRAM ๊ธฐ๋ฐ˜์˜ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ ๊ฐ€์†๊ธฐ ๊ตฌ์กฐ

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ์œ ์Šน์ฃผ.In the convolutional neural network applications, most computations occurred by the multiplication and accumulation of the convolution and fully-connected layers. From the hardware perspective (i.e., in the gate-level circuits), these operations are performed by many dot-products between the feature map and kernel vectors. Since the feature map and kernel have the matrix form, the vector converted from 3D, or 4D matrices is reused many times for the matrix multiplications. As the throughput of the DNN increases, the power consumption and performance bottleneck due to the data movement become a more critical issue. More importantly, power consumption due to off-chip memory accesses dominates total power since off-chip memory access consumes several hundred times greater power than the computation. The accelerators' throughput is about several hundred GOPS~several TOPS, but Memory bandwidth is less than 25.6 or 34 GB/s (with DDR4 or LPDDR4). By reducing the network size and/or data movement size, both data movement power and performance bottleneck problems are improved. Among the algorithms, Quantization is widely used. Binary Neural Networks (BNNs) dramatically reduce precision down to 1 bit. The accuracy is much lower than that of the FP16, but the accuracy is continuously improving through various studies. With the data flow control, there is a method of reducing redundant data movement by increasing data reuse. The above two methods are widely applied in accelerators because they do not need additional computations in the inference computation. In this dissertation, I present 1) a DRAM-based accelerator architecture and 2) a DRAM refresh method to improve performance reduction due to DRAM refresh. Both methods are orthogonal, so can be integrated into the DRAM chip and operate independently. First, we proposed a DRAM-based accelerator architecture capable of massive and large vector dot product operation. In the field of CNN accelerators to which BNN can be applied, a computing-in-memory (CIM) structure that utilizes a cell-array structure of Memory for vector dot product operation is being actively studied. Since DRAM stores all the neural network data, it is advantageous to reduce the amount of data transfer. The proposed architecture operates by utilizing the basic operation of the DRAM. The second method is to reduce the performance degradation and power consumption caused by DRAM refresh. Since the DRAM cannot read and write data while performing a periodic refresh, system performance decreases. The proposed refresh method tests the refresh characteristics inside the DRAM chip during self-refresh and increases the refresh cycle according to the characteristics. Since it operates independently inside DRAM, it can be applied to all systems using DRAM and is the same for deep neural network accelerators. We surveyed system integration with a software stack to use the in-DRAM accelerator in the DL framework. As a result, it is expected to control in-DRAM accelerators with the memory controller implementation method verified in the previous experiment. Also, we have added the performance simulation function of in-DRAM accelerator to PyTorch. When running a neural network in PyTorch, it reports the computation latency and data movement latency occurring in the layer running in the in-DRAM accelerator. 
    Being able to predict hardware performance while co-designing the network is a significant advantage.
    In convolutional neural network (CNN) applications, most of the computation consists of the multiply-and-accumulate operations of the convolution and fully-connected layers. At the gate-logic level, these are executed as a large number of vector dot products that repeatedly reuse the input and kernel vectors. For deep neural network computation, a large number of small units capable of simple operations is better suited than general-purpose compute units. Once an accelerator's throughput rises beyond a certain point, its performance is limited by the data transfers required for the computation. The energy consumed to move data off-chip from memory is several hundred times larger than the energy used for computation in the processing units, and while compute throughput reaches several hundred giga- to several tera-operations per second, memory can transfer only tens of gigabytes per second. Reducing the amount of transferred data addresses both the power and the performance problems caused by data movement. Among algorithmic approaches, quantizing the network data to represent it with low precision is widely used; binary neural networks (BNNs) push the precision down to 1 bit. Although network accuracy is lower than with 16-bit precision, it continues to improve through a variety of studies. Architecturally, reusing transferred data reduces repeated transfers of the same data. Both methods can be applied during inference without extra computation and are therefore widely adopted in accelerators. This dissertation proposes a DRAM-based accelerator architecture and a technique that mitigates the performance loss caused by DRAM refresh. The two methods can be integrated into a single DRAM chip and operate independently. The first is a DRAM-based accelerator capable of massive vector dot-product operations. In the field of CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that exploit the memory cell array for vector dot products are being actively studied. In particular, since DRAM holds all of the neural network data, it is well placed to reduce data transfer. We propose a method that computes using the basic operations of DRAM without changing the DRAM cell-array structure. The second method improves performance degradation and power consumption by extending the DRAM refresh period. Whenever DRAM performs a refresh, data cannot be read or written, so the performance of the system or the accelerator drops. We propose testing the retention characteristics inside the DRAM chip and extending the refresh period accordingly. Because it operates independently inside the DRAM, it can be applied to any system that uses DRAM, including deep neural network accelerators. We also surveyed system integration, including the software stack, so that the proposed accelerator can be used easily from widely used deep learning frameworks such as PyTorch.
๊ฒฐ๊ณผ์ ์œผ๋กœ, ๊ธฐ์กด์˜ TVM compiler์™€ FPGA๋กœ ๊ตฌํ˜„ํ•˜๋Š” TVM/VTA ๊ฐ€์†๊ธฐ์—, DRAM refresh ์‹คํ—˜์—์„œ ๊ฒ€์ฆ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ์™€ ์ปค์Šคํ…€ ์ปดํŒŒ์ผ๋Ÿฌ๋ฅผ ์ถ”๊ฐ€ํ•˜๋ฉด in-DRAM ๊ฐ€์†๊ธฐ๋ฅผ ์ œ์–ดํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์œผ๋กœ ๊ธฐ๋Œ€๋œ๋‹ค. ์ด์— ๋”ํ•˜์—ฌ, in-DRAM ๊ฐ€์†๊ธฐ์™€ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ์˜ ์„ค๊ณ„ ๋‹จ๊ณ„์—์„œ ์„ฑ๋Šฅ์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋„๋ก, ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๊ธฐ๋Šฅ์„ PyTorch์— ์ถ”๊ฐ€ํ•˜์˜€๋‹ค. PyTorch์—์„œ ์‹ ๊ฒฝ๋ง์„ ์‹คํ–‰ํ•  ๋•Œ, DRAM ๊ฐ€์†๊ธฐ์—์„œ ์‹คํ–‰๋˜๋Š” ๊ณ„์ธต์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๊ณ„์‚ฐ ๋Œ€๊ธฐ ์‹œ๊ฐ„ ๋ฐ ๋ฐ์ดํ„ฐ ์ด๋™ ์‹œ๊ฐ„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.Abstract i Contents viii List of Tables x List of Figures xiv Chapter 1 Introduction 1 Chapter 2 Background 6 2.1 Neural Network Operation . . . . . . . . . . . . . . . . 6 2.2 Data Movement Overhead . . . . . . . . . . . . . . . . 7 2.3 Binary Neural Networks . . . . . . . . . . . . . . . . . 10 2.4 Computing-in-Memory . . . . . . . . . . . . . . . . . . 11 2.5 Memory Bottleneck due to Refresh . . . . . . . . . . . . 13 Chapter 3 In-DRAM Neural Network Accelerator 16 3.1 Backgrounds . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.1 DRAM hierarchy . . . . . . . . . . . . . . . . . 18 3.1.2 DRAM Basic Operation . . . . . . . . . . . . . 21 3.1.3 DRAM Commands with Timing Parameters . . . 22 3.1.4 Bit-wise Operation in DRAM . . . . . . . . . . 25 3.2 Motivations . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 Proposed architecture . . . . . . . . . . . . . . . . . . . 30 3.3.1 Operation Examples of Row Operator . . . . . . 32 3.3.2 Convolutions on DRAM Chip . . . . . . . . . . 39 3.4 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.4.1 Input Broadcasting in DRAM . . . . . . . . . . 44 3.4.2 Input Data Movement With M2V . . . . . . . . . 47 3.4.3 Internal Data Movement With SiD . . . . . . . . 49 3.4.4 Data Partitioning for Parallel Operation . . . . . 52 3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 56 3.5.1 Performance Estimation . . . . . . . . . . . . . 56 3.5.2 Configuration of In-DRAM Accelerator . . . . . 58 3.5.3 Improving the Accuracy of BNN . . . . . . . . . 60 3.5.4 Comparison with the Existing Works . . . . . . . 62 3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.6.1 Performance Comparison with ASIC Accelerators 67 3.6.2 Challenges of The Proposed Architecture . . . . 70 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 72 Chapter 4 Reducing DRAM Refresh Power Consumption by Runtime Profiling of Retention Time and Dualrow Activation 74 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 74 4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . 77 4.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . 78 4.4 Observations . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5 Solution overview . . . . . . . . . . . . . . . . . . . . . 88 4.6 Runtime profiling . . . . . . . . . . . . . . . . . . . . . 93 4.6.1 Basic Operation . . . . . . . . . . . . . . . . . . 93 4.6.2 Profiling Multiple Rows in Parallel . . . . . . . . 96 4.6.3 Temperature, Data Backup and Error Check . . . 96 4.7 Dual-row Activation . . . . . . . . . . . . . . . . . . . . 98 4.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 102 4.8.1 Experimental Setup . . . . . . . . . . . . . . . . 103 4.8.2 Refresh Period Improvement . . . . . . . . . . . 107 4.8.3 Power Reduction . . . . . . . . . . . . . . . . . 110 4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 
116 Chapter 5 System Integration 118 5.1 Integrate The Proposed Methods . . . . . . . . . . . . . 118 5.2 Software Stack . . . . . . . . . . . . . . . . . . . . . . 121 Chapter 6 Conclusion 129 Bibliography 131 ๊ตญ๋ฌธ์ดˆ๋ก 153Docto
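    The core arithmetic that such a BNN computing-in-memory design evaluates with bit-wise operations can be illustrated with the standard XOR-and-popcount identity for ±1 vectors: with +1 encoded as bit 1 and -1 as bit 0, dot(x, w) = N - 2 * popcount(x XOR w). The sketch below is a generic software illustration of that identity, not the proposed in-DRAM row-operator design; packing the vectors into Python integers merely stands in for the row-wise bit operations described in the dissertation.

```python
import random

def pack_bits(vec):
    """Pack a ±1 vector into an integer bitmask (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def bnn_dot(x_bits, w_bits, n):
    """Binary dot product via XOR + popcount: dot = n - 2 * popcount(x ^ w)."""
    return n - 2 * bin(x_bits ^ w_bits).count("1")

# Sanity check against the ordinary dot product.
n = 64
x = [random.choice((-1, 1)) for _ in range(n)]
w = [random.choice((-1, 1)) for _ in range(n)]
assert bnn_dot(pack_bits(x), pack_bits(w), n) == sum(a * b for a, b in zip(x, w))
```

    Because only bit-wise logic and a population count are needed, this formulation is a natural fit for accelerators that operate on whole memory rows at a time.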

    High-Performance Energy-Efficient and Reliable Design of Spin-Transfer Torque Magnetic Memory

    In this dissertation, new computing paradigms, architectures, and design philosophies are proposed and evaluated for adopting STT-MRAM technology as a highly reliable, energy-efficient, and fast memory. For this purpose, a novel cross-layer framework spanning from the cell level all the way up to the system and application levels has been developed. In this framework, reliability issues are modeled accurately with appropriate fault models at different abstraction levels in order to analyze the overall failure rate of the entire memory and its Mean Time To Failure (MTTF), while also accounting for temperature and process-variation effects. Design-time, compile-time, and run-time solutions are provided to address the challenges associated with STT-MRAM. The effectiveness of the proposed solutions is demonstrated in extensive experiments that show significant improvements over state-of-the-art solutions, i.e., lower-power, higher-performance, and more reliable STT-MRAM designs.
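    As a minimal illustration of the kind of bottom-up reliability roll-up such a cross-layer framework performs, the sketch below assumes independent, exponentially distributed cell failures, sums the per-cell failure rates into an array-level rate, and takes its reciprocal as the MTTF. The per-cell rate, array size, and temperature derating factor are made-up example values; the dissertation's actual fault models are considerably more detailed.

```python
def array_mttf_hours(cell_failure_rate_per_hour, num_cells, temperature_factor=1.0):
    """MTTF of a memory array assuming independent exponential cell failures.

    cell_failure_rate_per_hour: lambda of a single cell (assumed value).
    temperature_factor: multiplicative derating for elevated temperature (assumed).
    """
    array_rate = cell_failure_rate_per_hour * num_cells * temperature_factor
    return 1.0 / array_rate

# Example with assumed numbers: 1 Gb array, per-cell rate 1e-15 /h, 1.5x at high temperature.
print(array_mttf_hours(1e-15, 2**30, temperature_factor=1.5))  # roughly 6e5 hours
```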

    Adaptation in Standard CMOS Processes with Floating Gate Structures and Techniques

    We apply adaptation to ordinary circuits and systems to achieve high-performance, high-quality results. Mismatch in manufactured VLSI devices has been the main limiting factor on quality in many analog and mixed-signal designs. Traditional compensation methods are generally costly; a few examples include enlarging device sizes, averaging signals, and laser trimming. By applying floating-gate adaptation to standard CMOS circuits, we demonstrate that we can trim CMOS comparator offset to a precision of 0.7 mV, reduce CMOS image sensor fixed-pattern noise power by a factor of 100, and achieve 5.8 effective number of bits (ENOB) in a 6-bit flash analog-to-digital converter (ADC) operating at 750 MHz. The adaptive circuits generally exhibit special features in addition to improved performance. These features are generally beyond the capabilities of traditional CMOS design approaches, and they open exciting opportunities for novel circuit designs. Specifically, the adaptive comparator can store an accurate arbitrary offset, the image sensor can be set up to memorize previously captured scenes like a human retina, and the ADC can be configured to adapt to the incoming analog signal distribution and perform an efficient conversion that minimizes distortion and maximizes output entropy.
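    The comparator-offset trimming result can be pictured as a simple adaptation loop: with both inputs held at the same voltage, the comparator's decision reveals the sign of the residual offset, and the stored correction is stepped so as to reduce that residual until the decision begins to alternate. The behavioral model below uses made-up numbers (a 20 mV intrinsic offset and a 0.5 mV update step) and is not the floating-gate programming circuit itself; it only shows why the achievable precision is bounded by the adaptation step size.

```python
def trim_offset(intrinsic_offset_v, step_v=0.0005, max_iters=1000):
    """Behavioral model: step a stored correction until it cancels the offset.

    intrinsic_offset_v: the comparator's built-in offset (assumed, e.g. 20 mV).
    step_v: correction granularity per adaptation cycle (assumed 0.5 mV).
    Returns the residual offset after trimming.
    """
    correction = 0.0
    for _ in range(max_iters):
        residual = intrinsic_offset_v - correction
        # With both inputs shorted, the comparator "sees" only the residual offset;
        # its decision tells us which direction to move the stored correction.
        correction += step_v if residual > 0 else -step_v
    return intrinsic_offset_v - correction

print(abs(trim_offset(0.020)))  # residual bounded by roughly one step (~0.5 mV)
```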

    Understanding and Improving the Latency of DRAM-Based Memory Systems

    Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, decreasing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts and the emergence of more data-intensive and latency-critical applications further stress the importance of providing low-latency memory access. In this dissertation, we identify three main problems that contribute significantly to the long latency of DRAM accesses. To address these problems, we present a series of new techniques that significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency. The key conclusion of this dissertation is that augmenting the DRAM architecture with simple, low-cost features, together with developing a better understanding of manufactured DRAM chips, leads to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques, along with the detailed experimental data and observations on real commodity DRAM chips presented in this dissertation, will enable the development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.
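    To make the "latency has barely improved" claim concrete, a row-buffer miss still pays roughly tRP + tRCD + CL (precharge, activate, then column access) before data appears, and those parameters have shrunk very little across DRAM generations. The sketch below just adds up these timings; the DDR3-1600 and DDR4-2400 values are typical datasheet figures assumed for illustration, not measurements from the dissertation.

```python
def row_miss_latency_ns(tRP, tRCD, CL, tCK):
    """Access latency (ns) for a row-buffer miss: precharge + activate + column access."""
    return (tRP + tRCD + CL) * tCK

# Assumed typical timings (in clock cycles) and clock periods (ns).
ddr3_1600 = row_miss_latency_ns(tRP=11, tRCD=11, CL=11, tCK=1.25)   # ~41 ns
ddr4_2400 = row_miss_latency_ns(tRP=16, tRCD=16, CL=16, tCK=0.833)  # ~40 ns
print(ddr3_1600, ddr4_2400)  # despite much higher bandwidth, the access latency is similar
```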