57 research outputs found

    상변화 메모리 시스템의 간섭 오류 완화 및 RMW 성능 향상 기법

    Get PDF
    학위논문(박사) -- 서울대학교대학원 : 공과대학 전기·정보공학부, 2021.8. 이혁재.Phase-change memory (PCM) announces the beginning of the new era of memory systems, owing to attractive characteristics. Many memory product manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied to various circumstances; it is not simply limited to an extra-scale database. For example, PCM has a low standby power due to its non-volatility; hence, computation-intensive applications or mobile applications (i.e., long memory idle time) are suitable to run on PCM-based computing systems. Despite these fascinating features of PCM, PCM is still far from the general commercial market due to low reliability and long latency problems. In particular, low reliability is a painful problem for PCM in past decades. As the semiconductor process technology rapidly scales down over the years, DRAM reaches 10 nm class process technology. In addition, it is reported that the write disturbance error (WDE) would be a serious issue for PCM if it scales down below 54 nm class process technology. Therefore, addressing the problem of WDEs becomes essential to make PCM competitive to DRAM. To overcome this problem, this dissertation proposes a novel approach that can restore meta-stable cells on demand by levering two-level SRAM-based tables, thereby significantly reducing the number WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that originally requires hundreds of read ports on SRAM. The second problem of PCM is a long-latency compared to that of DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, the different unit size from the general-purpose processor cache line further degrades the system performance due to the introduction of a read-modify-write (RMW) module. Since there has never been any research related to RMW in a PCM-based memory system, this dissertation proposes a novel architecture to enhance the overall system performance and reliability of a PCM-based memory system having an RMW module. The proposed architecture enhances data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of command types is proposed to enhance performance notably. Another problem is the absence of a full simulation platform for PCM. While the announced features of the PCM-related product (i.e., Intel Optane) are scarce due to confidential issues, all priceless information can be integrated to develop an architecture simulator that resembles the available product. To this end, this dissertation tries to scrape up all available features of modules in a PCM controller and implement a dedicated simulator for future research purposes.상변화 메모리는(PCM) 매력적인 특성을 통해 메모리 시스템의 새로운 시대의 시작을 알렸다. 많은 메모리 관련 제품 제조업체(예 : 인텔, SK 하이닉스, 삼성)가 관련 제품 개발에 박차를 가하고 있다. PCM은 단순히 대규모 데이터베이스에만 국한되지 않고 다양한 상황에 적용될 수 있다. 예를 들어, PCM은 비휘발성으로 인해 대기 전력이 낮다. 따라서 계산 집약적인 애플리케이션 또는 모바일 애플리케이션은(즉, 긴 메모리 유휴 시간) PCM 기반 컴퓨팅 시스템에서 실행하기에 적합하다. PCM의 이러한 매력적인 특성에도 불구하고 PCM은 낮은 신뢰성과 긴 대기 시간으로 인해 여전히 일반 산업 시장에서는 DRAM과 다소 격차가 있다. 특히 낮은 신뢰성은 지난 수십 년 동안 PCM 기술의 발전을 저해하는 문제다. 반도체 공정 기술이 수년에 걸쳐 빠르게 축소됨에 따라 DRAM은 10nm 급 공정 기술에 도달하였다. 이어서, 쓰기 방해 오류 (WDE)가 54nm 등급 프로세스 기술 아래로 축소되면 PCM에 심각한 문제가 될 것으로 보고되었다. 따라서, WDE 문제를 해결하는 것은 PCM이 DRAM과 동등한 경쟁력을 갖추도록 하는 데 있어 필수적이다. 이 문제를 극복하기 위해 이 논문에서는 2-레벨 SRAM 기반 테이블을 활용하여 WDE 수를 크게 줄여 필요에 따라 준 안정 셀을 복원할 수 있는 새로운 접근 방식을 제안한다. 또한, 원래 SRAM에서 수백 개의 읽기 포트가 필요한 대체 정책을 구현하기 위해 새로운 랜덤 기반의 기법을 제안한다. PCM의 두 번째 문제는 DRAM에 비해 지연 시간이 길다는 것이다. 특히 PCM은 더 큰 트랜잭션 단위를 채택하여 단위시간 당 데이터 처리량 향상을 도모한다. 그러나 범용 프로세서 캐시 라인과 다른 유닛 크기는 읽기-수정-쓰기 (RMW) 모듈의 도입으로 인해 시스템 성능을 저하하게 된다. PCM 기반 메모리 시스템에서 RMW 관련 연구가 없었기 때문에 본 논문은 RMW 모듈을 탑재 한 PCM 기반 메모리 시스템의 전반적인 시스템 성능과 신뢰성을 향상하게 시킬 수 있는 새로운 아키텍처를 제안한다. 제안된 아키텍처는 추가 스토리지 리소스를 도입하지 않고도 데이터 재사용성을 향상시킨다. 또한, 성능 향상을 위해 명령 유형과 관계없이 명령을 병합하는 새로운 작업을 제안한다. 또 다른 문제는 PCM을 위한 완전한 시뮬레이션 플랫폼이 부재하다는 것이다. PCM 관련 제품(예 : Intel Optane)에 대해 발표된 정보는 대외비 문제로 인해 부족하다. 하지만 알려져 있는 정보를 적절히 취합하면 시중 제품과 유사한 아키텍처 시뮬레이터를 개발할 수 있다. 이를 위해 본 논문은 PCM 메모리 컨트롤러에 필요한 모든 모듈 정보를 활용하여 향후 이와 관련된 연구에서 충분히 사용 가능한 전용 시뮬레이터를 구현하였다.1 INTRODUCTION 1 1.1 Limitation of Traditional Main Memory Systems 1 1.2 Phase-Change Memory as Main Memory 3 1.2.1 Opportunities of PCM-based System 3 1.2.2 Challenges of PCM-based System 4 1.3 Dissertation Overview 7 2 BACKGROUND AND PREVIOUS WORK 8 2.1 Phase-Change Memory 8 2.2 Mitigation Schemes for Write Disturbance Errors 10 2.2.1 Write Disturbance Errors 10 2.2.2 Verification and Correction 12 2.2.3 Lazy Correction 13 2.2.4 Data Encoding-based Schemes 14 2.2.5 Sparse-Insertion Write Cache 16 2.3 Performance Enhancement for Read-Modify-Write 17 2.3.1 Traditional Read-Modify-Write 17 2.3.2 Write Coalescing for RMW 19 2.4 Architecture Simulators for PCM 21 2.4.1 NVMain 21 2.4.2 Ramulator 22 2.4.3 DRAMsim3 22 3 IN-MODULE DISTURBANCE BARRIER 24 3.1 Motivation 25 3.2 IMDB: In Module-Disturbance Barrier 29 3.2.1 Architectural Overview 29 3.2.2 Implementation of Data Structures 30 3.2.3 Modification of Media Controller 36 3.3 Replacement Policy 38 3.3.1 Replacement Policy for IMDB 38 3.3.2 Approximate Lowest Number Estimator 40 3.4 Putting All Together: Case Studies 43 3.5 Evaluation 45 3.5.1 Configuration 45 3.5.2 Architectural Exploration 47 3.5.3 Effectiveness of the Replacement Policy 48 3.5.4 Sensitivity to Main Table Configuration 49 3.5.5 Sensitivity to Barrier Buffer Size 51 3.5.6 Sensitivity to AppLE Group Size 52 3.5.7 Comparison with Other Studies 54 3.6 Discussion 59 3.7 Summary 63 4 INTEGRATION OF AN RMW MODULE IN A PCM-BASED SYSTEM 64 4.1 Motivation 65 4.2 Utilization of DRAM Cache for RMW 67 4.2.1 Architectural Design 67 4.2.2 Algorithm 70 4.3 Typeless Command Merging 73 4.3.1 Architectural Design 73 4.3.2 Algorithm 74 4.4 An Alternative Implementation: SRC-RMW 78 4.4.1 Implementation of SRC-RMW 78 4.4.2 Design Constraint 80 4.5 Case Study 82 4.6 Evaluation 85 4.6.1 Configuration 85 4.6.2 Speedup 88 4.6.3 Read Reliability 91 4.6.4 Energy Consumption: Selecting a Proper Page Size 93 4.6.5 Comparison with Other Studies 95 4.7 Discussion 97 4.8 Summary 99 5 AN ALL-INCLUSIVE SIMULATOR FOR A PCM CONTROLLER 100 5.1 Motivation 101 5.2 PCMCsim: PCM Controller Simulator 103 5.2.1 Architectural Overview 103 5.2.2 Underlying Classes of PCMCsim 104 5.2.3 Implementation of Contention Behavior 108 5.2.4 Modules of PCMCsim 109 5.3 Evaluation 116 5.3.1 Correctness of the Simulator 116 5.3.2 Comparison with Other Simulators 117 5.4 Summary 119 6 Conclusion 120 Abstract (In Korean) 141 Acknowledgment 143박

    상변화 메모리 시스템의 간섭 오류 완화 및 RMW 성능 향상 기법

    Get PDF
    학위논문(박사)--서울대학교 대학원 :공과대학 전기·정보공학부,2021. 8. 이혁재.Phase-change memory (PCM) announces the beginning of the new era of memory systems, owing to attractive characteristics. Many memory product manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied to various circumstances; it is not simply limited to an extra-scale database. For example, PCM has a low standby power due to its non-volatility; hence, computation-intensive applications or mobile applications (i.e., long memory idle time) are suitable to run on PCM-based computing systems. The second problem of PCM is a long-latency compared to that of DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, the different unit size from the general-purpose processor cache line further degrades the system performance due to the introduction of a read-modify-write (RMW) module. Since there has never been any research related to RMW in a PCM-based memory system, this dissertation proposes a novel architecture to enhance the overall system performance and reliability of a PCM-based memory system having an RMW module. The proposed architecture enhances data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of command types is proposed to enhance performance notably.Despite these fascinating features of PCM, PCM is still far from the general commercial market due to low reliability and long latency problems. In particular, low reliability is a painful problem for PCM in past decades. As the semiconductor process technology rapidly scales down over the years, DRAM reaches 10 nm class process technology. In addition, it is reported that the write disturbance error (WDE) would be a serious issue for PCM if it scales down below 54 nm class process technology. Therefore, addressing the problem of WDEs becomes essential to make PCM competitive to DRAM. To overcome this problem, this dissertation proposes a novel approach that can restore meta-stable cells on demand by levering two-level SRAM-based tables, thereby significantly reducing the number WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that originally requires hundreds of read ports on SRAM.Another problem is the absence of a full simulation platform for PCM. While the announced features of the PCM-related product (i.e., Intel Optane) are scarce due to confidential issues, all priceless information can be integrated to develop an architecture simulator that resembles the available product. To this end, this dissertation tries to scrape up all available features of modules in a PCM controller and implement a dedicated simulator for future research purposes

    Improving Phase Change Memory (PCM) and Spin-Torque-Transfer Magnetic-RAM (STT-MRAM) as Next-Generation Memories: A Circuit Perspective

    Get PDF
    In the memory hierarchy of computer systems, the traditional semiconductor memories Static RAM (SRAM) and Dynamic RAM (DRAM) have already served for several decades as cache and main memory. With technology scaling, they face increasingly intractable challenges like power, density, reliability and scalability. As a result, they become less appealing in the multi/many-core era with ever increasing size and memory-intensity of working sets. Recently, there is an increasing interest in using emerging non-volatile memory technologies in replacement of SRAM and DRAM, due to their advantages like non-volatility, high device density, near-zero cell leakage and resilience to soft errors. Among several new memory technologies, Phase Change Memory (PCM) and Spin-Torque-Transfer Magnetic-RAM (STT-MRAM) are most promising candidates in building main memory and cache, respectively. However, both of them possess unique limitations that preventing them from being effectively adopted. In this dissertation, I present my circuit design work on tackling the limitations of PCM and STT-MRAM. At bit level, both PCM and STT-MRAM suffer from excessive write energy, and PCM has very limited write endurance. For PCM, I implement Differential Write to remove large number of unnecessary bit-writes that do not alter the stored data. It is then extended to STT-MRAM as Early Write Termination, with specific optimizations to eliminate the overhead of pre-write read. At array level, PCM enjoys high density but could not provide competitive throughput due to its long write latency and limited number of read/write circuits. I propose a Pseudo-Multi-Port Bank design to exploit intra-bank parallelism by recycling and reusing shared peripheral circuits between accesses in a time-multiplexed manner. On the other hand, although STT-MRAM features satisfactory throughput, its conventional array architecture is constrained on density and scalability by the pitch of the per-column bitline pair. I propose a Common-Source-Line Array architecture which uses a shared source-line along the row, essentially leaving only one bitline per column. For these techniques, I provide circuit level analyses as well as architecture/system level and/or process/device level discussions. In addition, relevant background and work are thoroughly surveyed and potential future research topics are discussed, offering insights and prospects of these next-generation memories

    Bit-Flip Aware Data Structures for Phase Change Memory

    Get PDF
    Big, non-volatile, byte-addressable, low-cost, and fast non-volatile memories like Phase Change Memory are appearing in the marketplace. They have the capability to unify both memory and storage and allow us to rethink the present memory hierarchy. An important draw-back to Phase Change Memory is limited write-endurance. In addition, Phase Change Memory shares with other Non-Volatile Random Access Memories an asym- metry in the energy costs of writes and reads. Best use of Non-Volatile Random Access Memories limits the number of times a Non-Volatile Random Access Memory cell changes contents, called a bit-flip. While the future of main memory is still unknown, we should already start to create data structures for them in order to shape the future era. This thesis investigates the creation of bit-flip aware data structures.The thesis first considers general ways in which a data structure can save bit- flips by smart overwrites and by using the exclusive-or of pointers. It then shows how a simple content dependent encoding can reduce bit-flips for web corpora. It then shows how to build hash based dictionary structures for Linear Hashing and Spiral Storage. Finally, the thesis presents Gray counters, close to bit-flip optimal counters that even enable age- based wear leveling with counters managed by the Non-Volatile Random Access Memories themselves instead of by the Operating Systems

    Towards Design and Analysis For High-Performance and Reliable SSDs

    Get PDF
    NAND Flash-based Solid State Disks have many attractive technical merits, such as low power consumption, light weight, shock resistance, sustainability of hotter operation regimes, and extraordinarily high performance for random read access, which makes SSDs immensely popular and be widely employed in different types of environments including portable devices, personal computers, large data centers, and distributed data systems. However, current SSDs still suffer from several critical inherent limitations, such as the inability of in-place-update, asymmetric read and write performance, slow garbage collection processes, limited endurance, and degraded write performance with the adoption of MLC and TLC techniques. To alleviate these limitations, we propose optimizations from both specific outside applications layer and SSDs\u27 internal layer. Since SSDs are good compromise between the performance and price, so SSDs are widely deployed as second layer caches sitting between DRAMs and hard disks to boost the system performance. Due to the special properties of SSDs such as the internal garbage collection processes and limited lifetime, traditional cache devices like DRAM and SRAM based optimizations might not work consistently for SSD-based cache. Therefore, for the outside applications layer, our work focus on integrating the special properties of SSDs into the optimizations of SSD caches. Moreover, our work also involves the alleviation of the increased Flash write latency and ECC complexity due to the adoption of MLC and TLC technologies by analyzing the real work workloads

    Cost effective technology applied to domotics and smart home energy management systems

    Get PDF
    Premio extraordinario de Trabajo Fin de Máster curso 2019/2020. Máster en Energías Renovables DistribuidasIn this document is presented the state of art for domotics cost effective technologies available on market nowadays, and how to apply them in Smart Home Energy Management Systems (SHEMS) allowing peaks shaving, renewable management and home appliance controls, always in cost effective context in order to be massively applied. Additionally, beyond of SHEMS context, it will be also analysed how to apply this technology in order to increase homes energy efficiency and monitoring of home appliances. Energy management is one of the milestones for distributed renewable energy spread; since renewable energy sources are not time-schedulable, are required control systems capable of the management for exchanging energy between conventional sources (power grid), renewable sources and energy storage sources. With the proposed approach, there is a first block dedicated to show an overview of Smart Home Energy Management Systems (SMHEMS) classical architecture and functional modules of SHEMS; next step is to analyse principles which has allowed some devices to become a cost-effective technology. Once the technology has been analysed, it will be reviewed some specific resources (hardware and software) available on marked for allowing low cost SHEMS. Knowing the “tools” available; it will be shown how to adapt classical SHEMS to cost effective technology. Such way, this document will show some specific applications of SHEMS. Firstly, in a general point of view, comparing the proposed low-cost technology with one of the main existing commercial proposals; and secondly, developing the solution for a specific real case.En este documento se aborda el estado actual de la domótica de bajo coste disponible en el mercado actualmente y cómo aplicarlo en los sistemas inteligentes de gestión energética en la vivienda (SHEMS) permitiendo el recorte de las puntas de demanda, gestión de energías renovables y control de electrodomésticos, siempre en el contexto del bajo coste, con el objetivo de lograr la máxima difusión de los SHEMS. Adicionalmente, más allá del contexto de la tecnología SHEMS, se analizará cómo aplicar esta tecnología para aumentar la eficiencia energética de los hogares y para la supervisión de los electrodomésticos. La gestión energética es uno de los factores principales para lograr la difusión de las energías renovables distribuidas; debido a que las fuentes de energía renovable no pueden ser planificadas, se requieren sistemas de control capaces de gestionar el intercambio de energía entre las fuentes convencionales (red eléctrica de distribución), energías renovables y dispositivos de almacenamiento energético. Bajo esta perspectiva, este documento presenta un primer bloque en el que se exponen las bases de la arquitectura y módulos funcionales de los sistemas inteligentes de gestión energética en la vivienda (SHEMS); el siguiente paso será analizar los principios que han permitido a ciertos dispositivos convertirse en dispositivos de bajo coste. Una vez analizada la tecnología, nos centraremos en los recursos (hardware y software) existentes que permitirán la realización de un SHEMS a bajo coste. Conocidas las “herramientas” a nuestra disposición, se mostrará como adaptar un esquema SHEMS clásico a la tecnología de bajo coste. Primeramente, comparando de modo genérico la tecnología de bajo coste con una de las principales propuestas comerciales de SHEMS, para seguidamente desarrollar la solución de bajo coste a un caso específico real

    High Performance On-Chip Interconnects Design for Future Many-Core Architectures

    Get PDF
    Switch-based Network-on-Chip (NoC) is a widely accepted inter-core communication infrastructure for Chip Multiprocessors (CMPs). With the continued advance of CMOS technology, the number of cores on a single chip keeps increasing at a rapid pace. It is highly expected that many-core architectures with more than hundreds of processor cores will be commercialized in the near future. In such a large scale CMP system, NoC overheads are more dominant than computation power in determining overall system performance. Also, for modern computational workloads requiring abundant thread level parallelism (TLP), NoC design for highly-parallel, many-core accelerators such as General Purpose Graphics Processing Units (GPGPUs) is of primary importance in harnessing the potential of massive thread- and data-level parallelism. In these contexts, it is critical that NoC provides both low latency and high bandwidth within limited on-chip power and area budgets. In this dissertation, we explore various design issues inherent in future many-core architectures, CMPs and GPGPUs, to achieve both high performance and power efficiency. First, we deal with issues in using a promising next generation memory technology, Spin-Transfer Torque Magnetic RAM (STT-MRAM), for NoC input buffers in CMPs. Using a high density and low leakage memory offers more buffer capacities with the same die footprint, thus helping increase network throughput in NoC routers. However, its long latency and high power consumption in write operations still need to be addressed. Thus, we propose a hybrid design of input buffers using both SRAM and STT-MRAM to hide the long write latency efficiently. Considering that simple data migration in the hybrid buffer consumes more dynamic power compared to SRAM, we provide a lazy migration scheme that reduces the dynamic power consumption of the hybrid buffer. Second, we propose the first NoC router design that uses only STT-MRAM, providing much larger buffer space with less power consumption, while preserving data integrity. To hide the multicycle writes, we employ a multibank STT-MRAM buffer, a virtual channel with multiple banks where every incoming flit is seamlessly pipelined to each bank alternately. Our STT-MRAM design has aggressively reduced the retention time, resulting in a significant reduction in the latency and power overheads of write operations. To ensure data integrity against inadvertent bit flips from the thermal fluctuation during the given retention time, we propose a cost-efficient dynamic buffer refresh scheme combined with Error Correcting Codes (ECC) to detect and correct data corruption. Third, we present schemes for bandwidth-efficient on-chip interconnects in GPGPUs. GPGPUs place a heavy demand on the on-chip interconnect between the many cores and a few memory controllers (MCs). Thus, traffic is highly asymmetric, impacting on-chip resource utilization and system performance. Here, we analyze the communication demands of typical GPGPU applications, and propose efficient NoC designs to meet those demands

    Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field

    Get PDF
    Plane-wave decomposition by sparse recovery is a reliable and accurate technique for plane-wave decomposition which can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate the plane-wave decomposition by sparse recovery. The method consists of two main algorithms which are spherical Fourier transformation (SFT) and sparse recovery. Comparing the two algorithms, the sparse recovery is the most computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform. Then the multithreaded computing platform could be fully utilized for the sparse recovery. On the other hand, implementing the SFT on an FPGA helps to flexibly integrate the microphones and improve the portability of the microphone array. For implementing the SFT on an FPGA, we develop a scalable FPGA design model that enables the quick design of the SFT architecture on FPGAs. The model considers the number of microphones, the number of SFT channels and the cost of the FPGA and provides the design of a resource optimized and cost-effective FPGA architecture as the output. Then we investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of modifying the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms. We introduce novel sparse-recovery techniques which use non-uniform dictionaries to improve the performance of the sparse recovery on a parallel architecture
    corecore