
282 research outputs found

Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories

Author: Antognetti P.
Arafa M.
Arjomand M.
Bhattacharyya A.
Blagodurov S.
Cao Y.
Chang Y.-M.
Cho B.-H.
Das A.
Das A.
Dray C.
Goda A.
Huang Y.
Jayasena N. S.
Kang U.
Kim Y.
Lee D.
Mallik A.
Mutlu O.
Mutlu O.
Pourshirazi B.
Qureshi M. K.
Qureshi M. K.
Redaelli A.
Rixner S.
Sandhu B. S.
Seong N. H.
Seshadri V.
Srinivasan J.
Stuecheli J.
Yoon H.
Yue J.
Zhang L.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 10/05/2020
Field of study

Modern computing systems are embracing hybrid memory comprising DRAM and non-volatile memory (NVM) to combine the best properties of both memory technologies, achieving low latency, high reliability, and high density. A prominent characteristic of DRAM-NVM hybrid memory is that NVM access latency is much higher than DRAM access latency. We call this inter-memory asymmetry. We observe that parasitic components on a long bitline are a major source of high latency in both DRAM and NVM, and a significant factor contributing to high-voltage operations in NVM, which impact their reliability. We propose an architectural change, where each long bitline in DRAM and NVM is split into two segments by an isolation transistor. One segment can be accessed with lower latency and operating voltage than the other. By introducing tiers, we enable non-uniform accesses within each memory type (which we call intra-memory asymmetry), leading to performance and reliability trade-offs in DRAM-NVM hybrid memory. We extend existing NVM-DRAM OS in three ways. First, we exploit both inter- and intra-memory asymmetries to allocate and migrate memory pages between the tiers in DRAM and NVM. Second, we improve the OS's page allocation decisions by predicting the access intensity of a newly-referenced memory page in a program and placing it in a matching tier during its initial allocation. This minimizes page migrations during program execution, lowering the performance overhead. Third, we propose a solution to migrate pages between the tiers of the same memory without transferring data over the memory channel, minimizing channel occupancy and improving performance. Our overall approach, which we call MNEME, to enable and exploit asymmetries in DRAM-NVM hybrid tiered memory improves both performance and reliability for both single-core and multi-programmed workloads. Comment: 15 pages, 29 figures, accepted at ACM SIGPLAN International Symposium on Memory Managemen
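The allocation step in this abstract (predict a new page's access intensity, then place it in a matching tier) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the tier names, their fastest-to-slowest ordering, the threshold values, and the fall-back-to-slower-tier rule are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical tier names, ordered fastest to slowest. The abstract describes
# a lower-latency and a higher-latency bitline segment in each of DRAM and
# NVM; the labels and the ordering across memory types are assumptions.
TIERS = ["dram_near", "dram_far", "nvm_near", "nvm_far"]


@dataclass
class TierState:
    capacity_pages: int
    used_pages: int = 0

    def has_room(self) -> bool:
        return self.used_pages < self.capacity_pages


def choose_tier(predicted_accesses: int, thresholds: list[int],
                tiers: dict[str, TierState]) -> str:
    """Pick the fastest tier whose threshold the prediction meets,
    falling back to slower tiers when the preferred one is full."""
    # thresholds[i] is the minimum predicted access count for TIERS[i];
    # the slowest tier accepts everything, so it has no threshold.
    preferred = len(TIERS) - 1
    for i, minimum in enumerate(thresholds):
        if predicted_accesses >= minimum:
            preferred = i
            break
    for name in TIERS[preferred:]:
        if tiers[name].has_room():
            tiers[name].used_pages += 1
            return name
    raise MemoryError("no free page in the preferred tier or any slower tier")
```

Placing a page well at allocation time is what lets an approach like this avoid later migrations; the sketch leaves out the predictor itself and any migration logic.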

arXiv.org e-Print Archive

Crossref

Increasing Off-Chip Bandwidth and Mitigating Dark Silicon via Switchable Pins

Author: Chen Shaoming
Publication venue: LSU Digital Commons
Publication date: 01/01/2016
Field of study

Off-chip memory bandwidth has been considered one of the major limiting factors for processor performance, especially for multi-cores and many-cores. Conventional processor design allocates a large portion of off-chip pins to deliver power, leaving a small number of pins for processor signal communication. We observed that, in some cases, the processor requires much less power than can be supplied during memory-intensive stages. In this work, we propose a dynamic pin switch technique to alleviate the bandwidth limitation issue. The technique dynamically exploits the surplus power-delivery pins in memory-intensive phases and uses them to provide extra bandwidth for program execution, thus significantly boosting performance. We also explore its performance benefit in the era of phase-change memory (PCM) and prove that the technique can be applied beyond DRAM-based memory systems. On the other hand, the end of Dennard scaling has led to a large number of inactive or significantly under-clocked transistors on modern chip multiprocessors in order to comply with the power budget and prevent the processors from overheating. This so-called “dark silicon” is one of the most critical constraints that will hinder scaling with Moore’s Law in the future. While advanced cooling techniques, such as liquid cooling, can effectively decrease the chip temperature and alleviate the power constraints, the peak performance, determined by the maximum number of transistors that are allowed to switch simultaneously, is still confined by the number of power pins on the chip package. In this paper, we propose a novel mechanism to power up the dark silicon by dynamically switching a portion of I/O pins to power pins when off-chip communications are less frequent. By enabling extra cores or increasing processor frequency, the proposed strategy can significantly boost performance compared with traditional designs.
Switchable pins can also increase inter-socket bandwidth, which is one of the performance bottlenecks. Multi-socket computer systems are popular in workstations and servers. However, they suffer from the relatively low bandwidth of inter-socket communication, especially for massively parallel workloads that generate many inter-socket requests for synchronization and remote memory accesses. The inter-socket traffic puts heavy pressure on the underlying network that fully connects all processors, whose bandwidth is limited by pin resources. Given this constraint, we propose to dynamically increase the inter-socket bandwidth, trading it off against lower off-chip memory bandwidth when the system has heavy inter-socket communication but few off-chip memory accesses. The design increases the physical bandwidth of inter-socket communication by switching the function of pins from off-chip memory accesses to inter-socket communication

Louisiana State University

Stacked Memory Architecture for Improved Performance and Capacity

Author: 이석한
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2019
Field of study

Doctoral thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), 2019. 2. 안정호.
The advance of DRAM manufacturing technology is slowing down, whereas the density and performance needs of DRAM continue to increase. This demand has motivated the industry to explore emerging non-volatile memory (e.g., 3D XPoint) and high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase the density at the cost of longer latency, lower bandwidth, or both, it is essential to use them with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that page transfers to fast memory often block memory channels from servicing memory requests from applications for a long period. This in turn significantly increases the high-percentile response time of latency-sensitive applications. In this thesis, we propose a high-density managed DRAM architecture, dubbed 3D-XPath, for applications demanding both low latency and high memory capacity. 3D-XPath DRAM stacks conventional DRAM dies with the high-density DRAM dies explored in this thesis and connects these DRAM dies with 3D-XPath. In particular, 3D-XPath allows unused memory channels to service memory requests from applications when the primary channels that should handle those requests are blocked by page transfers, considerably reducing the high-percentile response time. This can also improve the throughput of applications that frequently copy memory blocks between kernel and user memory spaces. Our evaluation shows that 3D-XPath DRAM decreases the high-percentile response time of latency-sensitive applications by ∼30% while improving the throughput of I/O-intensive applications by ∼39%, compared with DRAM without 3D-XPath. Recent computer systems are evolving toward the integration of more CPU cores into a single socket, which requires higher memory bandwidth and capacity.
Increasing the number of channels per socket is a common solution to the bandwidth demand, and to better utilize these additional channels, the data bus width is reduced and the burst length is increased. However, the longer burst length increases DRAM access latency. On the memory capacity side, process scaling has been the answer for decades, but cell capacitance now limits how small a cell can be. 3D-stacked memory addresses this problem by stacking dies on top of one another. We made a key observation on a real multicore machine: the multiple memory controllers are not always fully utilized on the SPEC CPU 2006 rate benchmark. To bring these idle channels into play, we propose a memory channel sharing architecture that boosts the peak bandwidth of one memory channel and reduces the burst latency on 3D-stacked memory. With channel sharing, the total performance on multi-programmed and multi-threaded workloads improved by up to 4.3% and 3.6%, respectively, and the average read latency was reduced by up to 8.22% and 10.18%.
Advances in DRAM manufacturing technology are slowing, while the demands on DRAM density and performance continue to grow. These demands have led to the emergence of new non-volatile memory (e.g., 3D-XPoint) and high-density DRAM (e.g., Managed asymmetric latency DRAM Solution). Because these high-density memory technologies increase density at the cost of longer latency, lower bandwidth, or both, their performance is poor, so they are typically used together with a small-capacity fast memory (e.g., conventional DRAM) into which hot pages are swapped. During this swapping, page transfers to the fast memory prevent ordinary application memory requests from being serviced for long periods, which greatly increases the percentile response time of latency-sensitive applications and increases the standard deviation of the response time. To address this problem, this thesis proposes 3D-XPath, a high-density managed DRAM architecture, for applications that require low latency and high memory capacity. A DRAM integrating 3D-XPath stacks slow, high-density DRAM dies together with conventional DRAM dies in a single chip, and the DRAM dies are connected through the proposed 3D-XPath hardware. 3D-XPath allows hot-page swaps to be handled over lightly used memory channels without blocking application memory requests while the swaps are in progress, improving the percentile response time of data-intensive applications. The proposed hardware structure can additionally improve the throughput of applications that frequently copy memory blocks between the OS kernel and user space. Compared with DRAM without 3D-XPath, 3D-XPath DRAM can improve the throughput of I/O-intensive applications by up to 39% while reducing the high-percentile response time of latency-sensitive applications by up to 30%. Recent computer systems are also evolving toward integrating more CPU cores, which require more memory bandwidth and capacity, into a single socket. Increasing the number of channels per socket is a common answer to the bandwidth demand, and recent DRAM interfaces reduce the data bus width and increase the burst length to make better use of the added channels. However, the longer burst length increases DRAM access latency. In addition, modern applications demand more memory capacity. Process scaling has been used to increase memory capacity for decades, but at process nodes below 20 nm it is difficult to raise memory density further through scaling, so stacked memory is used to increase capacity instead. In this setting, we observed that when SPEC CPU 2006 applications run across the cores of a real, recent multicore machine, the system's memory controllers are not always fully utilized. To put these idle channels to use, this thesis proposes a memory channel sharing architecture, and hardware blocks for it, that raises the peak bandwidth of a single memory channel and reduces the burst latency of 3D-stacked memory. With channel sharing, the performance of multi-programmed and multi-threaded applications improved by 4.3% and 3.6%, respectively, and the average read latency decreased by 8.22% and 10.18%.
Contents
Abstract
Contents
List of Figures
List of Tables
Introduction
1.1 3D-XPath: High-Density Managed DRAM Architecture with Cost-effective Alternative Paths for Memory Transactions
1.2 Boosting Bandwidth – Dynamic Channel Sharing on 3D Stacked Memory
1.3 Research contribution
1.4 Outline
3D-stacked Heterogeneous Memory Architecture with Cost-effective Extra Block Transfer Paths
2.1 Background
2.1.1 Heterogeneous Main Memory Systems
2.1.2 Specialized DRAM
2.1.3 3D-stacked Memory
2.2 HIGH-DENSITY DRAM ARCHITECTURE
2.2.1 Key Design Challenges
2.2.2 Plausible High-density DRAM Designs
2.3 3D-STACKED DRAM WITH ALTERNATIVE PATHS FOR MEMORY TRANSACTIONS
2.3.1 3D-XPath Architecture
2.3.2 3D-XPath Management
2.4 EXPERIMENTAL METHODOLOGY
2.5 EVALUATION
2.5.1 OLDI Workloads
2.5.2 Non-OLDI Workloads
2.5.3 Sensitivity Analysis
2.6 RELATED WORK
Boosting bandwidth – Dynamic Channel Sharing on 3D Stacked Memory
3.1 Background: Memory Operations
3.1.1 Memory Controller
3.1.2 DRAM column access sequence
3.2 Related Work
3.3 CHANNEL SHARING ENABLED MEMORY SYSTEM
3.3.1 Hardware Requirements
3.3.2 Operation Sequence
3.4 Analysis
3.4.1 Experiment Environment
3.4.2 Performance
3.4.3 Overhead
CONCLUSION
REFERENCES
Abstract in Korean
Docto

SNU Open Repository and Archive

Aging-Aware Request Scheduling for Non-Volatile Main Memory

Author: Arjomand M.
Balaji A.
Balaji A.
Balaji A.
Bolchini C.
Bucek J.
Burr G. W.
Chandrasekar K.
Das A.
Das A.
Das A.
Das A.
Das A.
Das A.
Das A.
Das A.
Das A.
David H.
Gao R.
Hassan H.
Hisamoto D.
Huang L.
Jiang L.
K.
Khan S.
Kim J.
Kim J. S.
Kim J. S.
Kim J. S.
Kim Y.
Kim Y.
Kim Y.
Kim Y.
Kraak D.
Kültürsay E.
Lalam A.
Lee B.
Lee B.
Lee B.
Lee D.
Lee D.
Lu Y.
Mallik A.
Mandelman J. A.
Meza J.
Meza J.
Meza J.
Meza J.
Mutlu O.
Mutlu O.
Mutlu O.
Mutlu O.
Nesbit K. J.
Patel M.
Pelley S.
Poremba M.
Qureshi M. K.
Qureshi M. K.
Qureshi M. K.
Qureshi M. K.
Rixner S.
Sadasivam S. K.
Seshadri V.
Song S.
Song S.
Song S.
Song S.
Song S.
Song S.
Srinivasan J.
Subramanian L.
Subramanian L.
Titirsha T.
Titirsha T.
Usui H.
Wong H.-S. P.
Xia F.
Xiong F.
Yavits L.
Yilmaz C.
Yoon H.
Yoon H.
Zhang J.
Zhao J.
Zuravleff W. K.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/11/2020
Field of study

Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity and low-cost main memory. Elevated operating voltages of NVM accelerate the aging of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device scaling increases power density and temperature, which further accelerates aging, challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural technique to mitigate the circuit aging-related problems of NVM-based main memory. HEBE is built on three contributions. First, we propose a new analytical model that can dynamically track the aging in the peripheral circuitry of each memory bank based on the bank's utilization. Second, we develop an intelligent memory request scheduler that exploits this aging model at run time to de-stress the peripheral circuitry of a memory bank only when its aging exceeds a critical threshold. Third, we introduce an isolation transistor to decouple parts of a peripheral circuit operating at different voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations independently and off the critical path of memory read and write accesses, improving performance. We evaluate HEBE with workloads from the SPEC CPU2017 benchmark suite. Our results show that HEBE significantly improves both performance and lifetime of NVM-based main memory. Comment: To appear in ASP-DAC 202
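The scheduling idea in this abstract (track per-bank aging from utilization, de-stress only past a threshold) can be illustrated with a toy model. This is a sketch under stated assumptions, not HEBE itself: the linear aging model, the reset-to-zero recovery, the class and parameter names, and all constants are hypothetical. The sketch also keeps de-stress on the request path for simplicity, which is the cost the abstract's isolation transistor is meant to avoid.

```python
from dataclasses import dataclass


@dataclass
class Bank:
    aging: float = 0.0       # accumulated stress, arbitrary units
    busy_until: int = 0      # cycle at which the bank is free again


class AgingAwareScheduler:
    """Toy scheduler: de-stress a bank only once its tracked aging
    crosses a threshold. All constants are illustrative placeholders."""

    def __init__(self, num_banks, threshold=10.0, stress_per_access=1.0,
                 access_cycles=4, destress_cycles=20):
        self.banks = [Bank() for _ in range(num_banks)]
        self.threshold = threshold
        self.stress_per_access = stress_per_access
        self.access_cycles = access_cycles
        self.destress_cycles = destress_cycles

    def issue(self, bank_id, now):
        """Serve one request; return the cycle at which it completes."""
        bank = self.banks[bank_id]
        start = max(now, bank.busy_until)
        if bank.aging >= self.threshold:
            # Assumed recovery model: de-stress resets tracked aging to zero
            # and delays the request by the de-stress latency.
            start += self.destress_cycles
            bank.aging = 0.0
        # Assumed aging model: stress grows linearly with bank utilization.
        bank.aging += self.stress_per_access
        bank.busy_until = start + self.access_cycles
        return bank.busy_until
```

With a threshold of 2.0, the third back-to-back request to the same bank pays the de-stress delay, while lightly used banks never do.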

arXiv.org e-Print Archive

Crossref


A Statistical View of Architecture Design

Author: Deng Zhaoxia
Publication venue: eScholarship, University of California
Publication date: 01/01/2017
Field of study

Computer architectures are becoming more and more complicated to meet the continuously increasing demand on performance, security and sustainability from applications. Many factors exist in the design and engineering space of various components and policies in the architectures, and it is not intuitive how these factors interact with each other and how they affect architecture behavior. Automatically seeking the best architectures for specific applications and requirements is even more challenging. Meanwhile, architecture design needs to deal with more and more non-determinism from lower-level technologies. Emerging technologies inherently exhibit statistical properties, such as the wearout phenomenon in NEMs, PCM, ReRAM, etc. Due to manufacturing and processing variations, there also exists variability among different devices or within the same device (e.g., different cells on the same memory chip). Hence, to better understand and control architecture behavior, we introduce the statistical perspective of architecture design: by specifying the architectural design goals and the desired statistical properties, we guide the architecture design with these statistical properties and exploit a series of techniques to achieve these properties.
In the first part of the thesis, we introduce Herniated Hash Tables. Our architectural design goal is that the hash table implementation is highly scalable in both storage efficiency and performance, while the desired statistical property is to achieve as good storage efficiency and performance as with uniform distributions given non-uniform distributions across hash buckets. Herniated Hash Tables exploit multi-level phase change memory (PCM) to expand storage in place for each hash bucket to accommodate asymmetrically chained entries. The organization, coupled with an addressing and prefetching scheme, also improves performance significantly by creating more memory parallelism.
In the second part of the thesis, we introduce Lemonade from Lemons, harnessing device wearout to create limited-use security architectures. The architectural design goal is to create hardware security architectures that resist attacks by statistically enforcing an upper bound on hardware uses, and consequently attacks. The desired statistical property is that the system-level minimum and maximum uses can be guaranteed with high probabilities despite device-level variability. We introduce techniques for architecturally controlling these bounds and explore the cost in area, energy and latency of using these techniques to achieve system-level usage targets given device-level wearout distributions.
In the third part of the thesis, we demonstrate Memory Cocktail Therapy: A General, Learning-Based Framework to Optimize Dynamic Tradeoffs in NVMs. Limited write endurance and long latencies remain the primary challenges of building practical memory systems from NVMs. Researchers have proposed a variety of architectural techniques to achieve different tradeoffs between lifetime, performance and energy efficiency; however, no individual technique can satisfy requirements for all applications and different objectives. Our architectural design goal is that NVM systems can achieve optimal tradeoffs for specific applications and objectives, and the statistical goal is that the selected NVM configuration is nearly optimal. Memory Cocktail Therapy uses machine learning techniques to model the architecture behaviors in terms of all the configurable parameters based on a small number of sample configurations. Then, it selects the optimal configuration according to user-defined objectives, which leads to the desired tradeoff between performance, lifetime and energy efficiency

eScholarship - University of California

Exploiting managed language semantics to optimize for hardware heterogeneity

Author: Akram Shoaib
Publication venue
Publication date: 01/01/2019
Field of study

Ghent University Academic Bibliography

Computational Sprinting: Exceeding Sustainable Power in Thermally Constrained Systems

Author: Raghavan Arun
Publication venue: ScholarlyCommons
Publication date: 01/01/2013
Field of study

Although process technology trends predict that transistor sizes will continue to shrink for a few more generations, voltage scaling has stalled and thus future chips are projected to be increasingly more power hungry than previous generations. Particularly in mobile devices, which are severely cooling constrained, it is estimated that the peak operation of a future chip could generate heat ten times faster than the device can sustainably vent. However, many mobile applications do not demand sustained performance; rather they comprise short bursts of computation in response to sporadic user activity. To improve responsiveness for such applications, this dissertation proposes computational sprinting, in which a system greatly exceeds sustainable power margins (by up to 10×) to provide up to a few seconds of high-performance computation when a user interacts with the device. Computational sprinting exploits the material property of thermal capacitance to temporarily store the excess heat generated when sprinting. After sprinting, the chip returns to sustainable power levels and dissipates the stored heat when the system is idle. This dissertation: (i) broadly analyzes thermal, electrical, hardware, and software considerations to assess the feasibility of engineering a system which can provide the responsiveness of a platform with 10× higher sustainable power within today's cooling constraints, (ii) leverages existing sources of thermal capacitance to demonstrate sprinting on a real system today, and (iii) identifies the energy-performance characteristics of sprinting operation to determine runtime sprint pacing policies
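The role of thermal capacitance described above can be made concrete with a back-of-the-envelope bound. The sketch assumes a single lumped thermal capacitance and a constant sustainable venting rate, which is a simplification and not the dissertation's model; the function name and all parameters are hypothetical.

```python
def max_sprint_seconds(sprint_power_w, sustainable_power_w,
                       thermal_capacitance_j_per_k, temp_headroom_k):
    """Upper bound on sprint duration under a lumped thermal model.

    Heat that cannot be vented accumulates at (sprint - sustainable) watts
    and must fit in capacitance * headroom joules.
    """
    excess_w = sprint_power_w - sustainable_power_w
    if excess_w <= 0:
        return float("inf")  # at or below sustainable power: no time limit
    return thermal_capacitance_j_per_k * temp_headroom_k / excess_w
```

For example, with made-up values of 10 W sprint power, 1 W sustainable power, 4.5 J/K of capacitance and 10 K of headroom, the bound is 45 J / 9 W = 5 s, which is in line with the "few seconds" scale the abstract mentions.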

ScholarlyCommons@Penn

Design Guidelines for High-Performance SCM Hierarchies

Author: Bugnion Edouard
Daglis Alexandros
Falsafi Babak
Picorel Javier
Pnevmatikatos Dionisios
Sutherland Mark
Ustiugov Dmitrii
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 07/03/2019
Field of study

With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements. Comment: Published at MEMSYS'1

arXiv.org e-Print Archive

Crossref