6 research outputs found

    Adaptive Resource Management Techniques for High Performance Multi-Core Architectures

    Reducing the average memory access time is crucial for improving the performance of applications executing on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Previous work has proposed techniques for partitioning shared resources (e.g. cache and bandwidth) and for prefetch throttling, with the goal of mitigating contention and reducing or hiding average memory access time. Cache partitioning in multi-core architectures is challenging due to the need to determine cache allocations with low computational overhead and to place the partitions in a locality-aware manner. Low computational overhead is important in order to scale to large core counts. Previous work on multi-resource management has proposed coordinately managing a subset of these techniques: cache partitioning, bandwidth partitioning and prefetch throttling. However, coordinated management of all three techniques opens up new trade-offs and interactions that can be leveraged for better performance. This thesis contributes two resource management techniques: a resource manager for scalable cache partitioning, and a multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching. The scalable cache partitioning technique uses a distributed and asynchronous partitioning algorithm that works together with a flexible NUCA enforcement mechanism to give locality-aware placement of data and to support fine-grained partitions. The algorithm adapts quickly to application phase changes. The distributed nature of the algorithm, together with its low computational complexity, enables the solution to be implemented in hardware and to scale to large core counts.
The multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching is designed using the results of our in-depth characterisation of the entire SPEC CPU2006 suite. The solution consists of three local resource management techniques that, together with a coordination mechanism, provide allocations that take the inter-resource interactions and trade-offs into account. Our evaluation shows that the distributed cache partitioning solution performs within 1% of the best known centralized solution, which cannot scale to large core counts. The solution improves performance by 9% and 16% on average on 16- and 64-core multi-core architectures, respectively, compared to a shared last-level cache. The multi-resource management technique gives a performance increase of 11%, on average, over the state-of-the-art, and improves performance by 50% compared to the baseline 16-core multi-core without cache partitioning, bandwidth partitioning and prefetch throttling.
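As a toy illustration of the distributed idea, consider per-core agents that trade cache ways with their neighbors whenever the neighbor gains more from an extra way than the owner loses by giving one up. The miss curves and the pairwise-trade rule below are our assumptions for the sketch, not the thesis's algorithm:

```python
# Toy sketch of distributed, asynchronous cache partitioning.
# misses(w, a) is a hypothetical per-core miss curve; real hardware would
# measure this with utility monitors.

def misses(ways, a):
    return a / (ways + 1)

def trade(alloc, curves):
    """One round of asynchronous pairwise trades between neighboring cores:
    move one way whenever the receiver gains more misses than the giver loses."""
    changed = False
    for i in range(len(alloc) - 1):
        for src, dst in ((i, i + 1), (i + 1, i)):
            if alloc[src] == 0:
                continue
            gain = misses(alloc[dst], curves[dst]) - misses(alloc[dst] + 1, curves[dst])
            loss = misses(alloc[src] - 1, curves[src]) - misses(alloc[src], curves[src])
            if gain > loss:
                alloc[src] -= 1
                alloc[dst] += 1
                changed = True
                break  # re-evaluate this neighbor pair next round
    return changed

alloc = [4, 4, 4, 4]          # cache ways per core, equal initial split
curves = [100, 10, 400, 40]   # hypothetical per-core workload intensities
while trade(alloc, curves):   # each trade strictly lowers total misses,
    pass                      # so the loop terminates
print(alloc, sum(alloc))
```

Because every trade strictly reduces total misses, the process converges without any global coordinator, which is the property that makes a hardware implementation at large core counts plausible.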

    Cooperative cache scrubbing

    Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10% to 60% of DRAM writes are useless, because the data on these lines are dead: the program is guaranteed never to read them again. Furthermore, reading memory only to immediately zero-initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
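The effect is easy to see in a toy write-allocate cache model (our illustration, not the paper's simulator): a dead-line scrub drops a dirty line without a DRAM writeback, and a zero-line install skips the DRAM fill of data that is about to be overwritten:

```python
# Toy model of cooperative cache scrubbing; line counts are arbitrary.

class Cache:
    def __init__(self):
        self.lines = {}        # addr -> dirty flag
        self.dram_writes = 0
        self.dram_reads = 0

    def write(self, addr):
        if addr not in self.lines:
            self.dram_reads += 1   # write-allocate: fill the line from DRAM
        self.lines[addr] = True    # line is now dirty

    def scrub_dead(self, addr):
        # Hypothetical dead-line scrub: the memory manager knows the line is
        # dead, so the dirty copy is invalidated without a DRAM writeback.
        self.lines.pop(addr, None)

    def zero_line(self, addr):
        # Hypothetical zero-line install: allocate a zeroed line directly,
        # skipping the useless DRAM read.
        self.lines[addr] = True

    def flush(self):
        self.dram_writes += sum(1 for dirty in self.lines.values() if dirty)
        self.lines.clear()

baseline, scrubbed = Cache(), Cache()
for addr in range(100):
    baseline.write(addr)           # normal zero-initialization path
    scrubbed.zero_line(addr)       # scrubbing path
for addr in range(60):             # say 60% of lines die before reuse
    scrubbed.scrub_dead(addr)
baseline.flush(); scrubbed.flush()
print(baseline.dram_reads, baseline.dram_writes)   # 100 100
print(scrubbed.dram_reads, scrubbed.dram_writes)   # 0 40
```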

    Last-Level Cache Partitioning via Memory Virtual Channels

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2023. Jangwoo Kim. Ensuring fairness or providing isolation between multiple workloads with distinct characteristics that are collocated on a single, shared-memory system is a challenge. Recent multicore processors provide last-level cache (LLC) hardware partitioning to give hardware support for isolation, with the cache partitioning often specified by the user. While more LLC capacity often results in higher performance, in this dissertation we identify that a workload allocated more LLC capacity can exhibit worse performance in real-machine experiments, which we refer to as MiW (more is worse). Through various controlled experiments, we identify that another workload with less LLC capacity causes more frequent LLC misses. That workload stresses the main memory system shared by both workloads and degrades the performance of the former workload even if LLC partitioning is used (a balloon effect). To resolve this problem, we propose virtualizing the data path of main memory controllers and dedicating memory virtual channels (mVCs) to each group of applications, grouped for LLC partitioning. mVCs can further fine-tune the performance of groups by differentiating buffer sizes among mVCs. They can reduce total system cost by executing latency-critical and throughput-oriented workloads together on shared machines, whose performance criteria could otherwise be achieved only on dedicated machines. Experiments on a simulated chip multiprocessor show that our proposals effectively eliminate the MiW phenomenon, hence providing additional opportunities for workload consolidation in a datacenter.
Our case study demonstrates potential machine-count savings of 21.8% with mVC in a scenario that would otherwise violate a service level objective (SLO).
Contents: 1. Introduction (Research Contributions; Outline). 2. Background (Cache Hierarchy and Policies; Cache Partitioning; Benchmarks: Working Set Size, Top-down Analysis, Profiling Tools). 3. More-is-Worse Phenomenon (More LLC Leading to Performance Drop; Synthetic Workload Evaluation; Impact on Latency-critical Workloads; Workload Analysis; The Root Cause of the MiW Phenomenon; Limitations of Existing Solutions: Memory Bandwidth Throttling, Fairness-aware Memory Scheduling). 4. Virtualizing Memory Channels (Memory Virtual Channel (mVC); mVC Buffer Allocation Strategies; Evaluation: Experimental Setup, Reproducing Hardware Results, Mitigating MiW through mVC, Evaluation on Four Groups, Potentials for Operating Cost Savings with mVC). 5. Related Work (Component-wise QoS/Fairness for Shared Resources; Holistic Approaches to QoS/Fairness; MiW on Recent Architectures). 6. Conclusion (Discussion; Future Work). Bibliography. Abstract in Korean.
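A minimal sketch of the mVC idea, assuming per-group FIFO buffers and a round-robin scheduler (both are our simplifications, not the dissertation's controller design): with a single shared buffer, a bursty group starves a latency-critical one, while dedicated virtual channels preserve its service rate:

```python
# Toy memory-controller model: shared request buffer vs. per-group
# memory virtual channels (mVCs).

from collections import deque

def service(queues, cycles):
    """Round-robin scheduler: serve one request per cycle from the next
    non-empty group queue."""
    done = {g: 0 for g in queues}
    order = deque(queues)
    for _ in range(cycles):
        for _ in range(len(order)):
            g = order[0]
            order.rotate(-1)
            if queues[g]:
                queues[g].popleft()
                done[g] += 1
                break
    return done

# Shared data path: 90 bursty requests queue up ahead of 10 latency-critical ones.
shared = service({"all": deque(["bursty"] * 90 + ["latency"] * 10)}, 50)
# mVCs: each group gets its own buffer on the virtualized data path.
mvc = service({"bursty": deque(["r"] * 90), "latency": deque(["r"] * 10)}, 50)
print(shared)   # the burst monopolizes all 50 service cycles
print(mvc)      # the latency-critical group completes all 10 requests
```

In the shared case the latency-critical group receives no service within the window; with per-group channels it finishes all of its requests, which is the isolation property the abstract describes.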

    ANALYTICAL MODEL FOR CHIP MULTIPROCESSOR MEMORY HIERARCHY DESIGN AND MANAGEMENT

    Continued advances in circuit integration technology have ushered in the era of chip multiprocessor (CMP) architectures, as further scaling of the performance of conventional wide-issue superscalar processor architectures remains hard and costly. CMP architectures take advantage of Moore's Law by integrating more cores in a given chip area rather than a single fast yet larger core. They achieve higher performance with multithreaded workloads. However, CMP architectures pose many new memory hierarchy design and management problems that must be addressed. For example, how many cores and how much cache capacity must we integrate in a single chip to obtain the best possible throughput? Which is more effective, allocating more cache capacity or more memory bandwidth to a program? This thesis research develops simple yet powerful analytical models to study two new memory hierarchy design and resource management problems for CMPs. First, we consider the chip area allocation problem of maximizing chip throughput. Our model focuses on the trade-off between the number of cores, cache capacity, and cache management strategies. We find that different cache management schemes demand different area allocations to cores and cache to achieve their maximum performance. Second, we analyze the effect of cache capacity partitioning on the bandwidth requirement of a given program. Furthermore, our model considers how bandwidth allocation to different co-scheduled programs affects the individual programs' performance. Since the CMP design space is large, and simulating even one design point under various workloads would be extremely time-consuming, the conventional simulation-based research approach quickly becomes ineffective. We anticipate that our analytical models will provide practical tools for CMP designers and correctly guide their design efforts at an early design stage. Furthermore, our models will allow them to better understand potentially complex interactions among key design parameters.
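The flavor of such an area-allocation model can be sketched in a few lines. The constants and the power-law miss curve below are illustrative assumptions, not the thesis's actual model; the point is only that throughput first rises with core count and then falls as cores crowd out the cache:

```python
# Toy cores-vs-cache area trade-off for a fixed chip area.

def throughput(n_cores, chip_area=64.0, core_area=1.0,
               cpi_base=1.0, penalty=200.0, m0=0.05):
    cache_mb = chip_area - n_cores * core_area      # area left for the LLC
    if cache_mb <= 0:
        return 0.0
    # Assumed power-law miss curve per core as its cache share shrinks.
    miss_rate = m0 * (cache_mb / n_cores) ** -0.5
    per_core_ipc = 1.0 / (cpi_base + miss_rate * penalty)
    return n_cores * per_core_ipc                   # aggregate chip throughput

best = max(range(1, 64), key=throughput)
print(best, round(throughput(best), 2))
```

Sweeping the closed-form model over all feasible core counts takes microseconds, which is exactly why such models are useful for early design-space exploration where simulation of each point is infeasible.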

    Shared Resource Management for Non-Volatile Asymmetric Memory

    Non-volatile memory (NVM), such as Phase-Change Memory (PCM), is a promising energy-efficient candidate to replace DRAM. It is desirable because of its non-volatility, good scalability and low idle power. NVM nevertheless faces important challenges: writes are much slower and more power-hungry than reads, and write bandwidth is much lower than read bandwidth. A hybrid main memory architecture, consisting of a large NVM and a small DRAM, may become a solution for architecting NVM as main memory. Adding an extra layer of cache mitigates the drawbacks of NVM writes. However, writebacks from the last-level cache (LLC) might still (a) overwhelm the limited NVM write bandwidth and stall the application, (b) shorten lifetime and (c) increase energy consumption. Effectively utilizing shared resources, such as the last-level cache and the memory bandwidth, is crucial to achieving high performance on multi-core systems. No existing cache or bandwidth allocation scheme exploits the read/write asymmetry property, which is fundamental to NVM. This thesis considers that asymmetry when partitioning the cache and memory bandwidth for NVM systems, and proposes three writeback-aware schemes to manage these resources. First, a runtime mechanism, Writeback-aware Cache Partitioning (WCP), is proposed to partition the shared LLC among multiple applications. Unlike past partitioning schemes, WCP considers the reduction in cache misses as well as in writebacks. Second, a new runtime mechanism, Writeback-aware Bandwidth Partitioning (WBP), partitions NVM service cycles among applications. WBP uses a bandwidth partitioning weight to reflect the importance of writebacks (in addition to LLC misses) to bandwidth allocation. A companion Dynamic Weight Adjustment scheme dynamically selects the weight to maximize system performance.
Third, Unified Writeback-aware Partitioning (UWP) partitions the last-level cache and the memory bandwidth cooperatively. UWP can further improve system performance by considering the interaction between cache partitioning and bandwidth partitioning. The three proposed schemes improve system performance by exploiting the unique read/write asymmetry property of NVM.
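The core of the writeback-aware bandwidth idea can be sketched as a weighted demand calculation. The linear weighting and the numbers below are our illustration; the thesis selects the weight dynamically (Dynamic Weight Adjustment) rather than fixing it:

```python
# Sketch of writeback-aware bandwidth partitioning: NVM service cycles are
# split in proportion to misses plus weighted writebacks.

def bandwidth_shares(misses, writebacks, w):
    """Share of NVM service cycles per application. A weight w > 1 reflects
    NVM writes being slower and more costly than reads."""
    demand = [m + w * wb for m, wb in zip(misses, writebacks)]
    total = sum(demand)
    return [d / total for d in demand]

misses = [100, 100]
writebacks = [10, 90]          # app 1 is writeback-heavy
print(bandwidth_shares(misses, writebacks, w=1))
print(bandwidth_shares(misses, writebacks, w=4))  # heavier write weighting
```

With equal miss counts, a miss-only scheme (w = 0) would split bandwidth evenly; raising the weight shifts service cycles toward the writeback-heavy application, which is the asymmetry the abstract argues existing schemes ignore.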