6 research outputs found

    Adaptive Resource Management Techniques for High Performance Multi-Core Architectures

    Reducing the average memory access time is crucial for improving the performance of applications executing on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Previous work has proposed techniques for partitioning shared resources (e.g. cache and bandwidth) and for prefetch throttling, with the goal of mitigating contention and reducing or hiding average memory access time. Cache partitioning in multi-core architectures is challenging due to the need to determine cache allocations with low computational overhead and to place the partitions in a locality-aware manner. Low computational overhead is important in order to scale to large core counts. Previous work on multi-resource management has proposed coordinately managing a subset of these techniques: cache partitioning, bandwidth partitioning and prefetch throttling. However, coordinated management of all three techniques opens up new trade-offs and interactions that can be leveraged for better performance. This thesis contributes two resource management techniques: a resource manager for scalable cache partitioning, and a multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching. The scalable cache partitioning technique uses a distributed and asynchronous partitioning algorithm that works together with a flexible NUCA enforcement mechanism to give locality-aware placement of data and to support fine-grained partitions. The algorithm adapts quickly to application phase changes. The distributed nature of the algorithm, together with its low computational complexity, enables the solution to be implemented in hardware and to scale to large core counts.
The multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching is designed using the results of our in-depth characterisation of the entire SPEC CPU2006 suite. The solution consists of three local resource management techniques that, together with a coordination mechanism, provide allocations that take the inter-resource interactions and trade-offs into account. Our evaluation shows that the distributed cache partitioning solution performs within 1% of the best known centralized solution, which cannot scale to large core counts. The solution improves performance by 9% and 16% on average on 16- and 64-core multi-core architectures, respectively, compared to a shared last-level cache. The multi-resource management technique gives a performance increase of 11%, on average, over the state-of-the-art, and improves performance by 50% compared to the baseline 16-core multi-core without cache partitioning, bandwidth partitioning and prefetch throttling.
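As a toy illustration of the distributed idea, consider per-core agents that trade cache ways with their neighbors whenever the neighbor gains more from an extra way than the owner loses by giving one up. The miss curves and the pairwise-trade rule below are our assumptions for the sketch, not the thesis's algorithm:

```python
# Toy sketch of distributed, asynchronous cache partitioning.
# misses(w, a) is a hypothetical per-core miss curve; real hardware would
# measure this with utility monitors.

def misses(ways, a):
    return a / (ways + 1)

def trade(alloc, curves):
    """One round of asynchronous pairwise trades between neighboring cores:
    move one way whenever the receiver gains more misses than the giver loses."""
    changed = False
    for i in range(len(alloc) - 1):
        for src, dst in ((i, i + 1), (i + 1, i)):
            if alloc[src] == 0:
                continue
            gain = misses(alloc[dst], curves[dst]) - misses(alloc[dst] + 1, curves[dst])
            loss = misses(alloc[src] - 1, curves[src]) - misses(alloc[src], curves[src])
            if gain > loss:
                alloc[src] -= 1
                alloc[dst] += 1
                changed = True
                break  # re-evaluate this neighbor pair next round
    return changed

alloc = [4, 4, 4, 4]          # cache ways per core, equal initial split
curves = [100, 10, 400, 40]   # hypothetical per-core workload intensities
while trade(alloc, curves):   # each trade strictly lowers total misses,
    pass                      # so the loop terminates
print(alloc, sum(alloc))
```

Because every trade strictly reduces total misses, the process converges without any global coordinator, which is the property that makes a hardware implementation at large core counts plausible.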

    Cooperative cache scrubbing

    Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10% to 60% of DRAM writes are useless, because the data on these lines are dead: the program is guaranteed never to read them again. Furthermore, reading memory only to immediately zero-initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
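The effect is easy to see in a toy write-allocate cache model (our illustration, not the paper's simulator): a dead-line scrub drops a dirty line without a DRAM writeback, and a zero-line install skips the DRAM fill of data that is about to be overwritten:

```python
# Toy model of cooperative cache scrubbing; line counts are arbitrary.

class Cache:
    def __init__(self):
        self.lines = {}        # addr -> dirty flag
        self.dram_writes = 0
        self.dram_reads = 0

    def write(self, addr):
        if addr not in self.lines:
            self.dram_reads += 1   # write-allocate: fill the line from DRAM
        self.lines[addr] = True    # line is now dirty

    def scrub_dead(self, addr):
        # Hypothetical dead-line scrub: the memory manager knows the line is
        # dead, so the dirty copy is invalidated without a DRAM writeback.
        self.lines.pop(addr, None)

    def zero_line(self, addr):
        # Hypothetical zero-line install: allocate a zeroed line directly,
        # skipping the useless DRAM read.
        self.lines[addr] = True

    def flush(self):
        self.dram_writes += sum(1 for dirty in self.lines.values() if dirty)
        self.lines.clear()

baseline, scrubbed = Cache(), Cache()
for addr in range(100):
    baseline.write(addr)           # normal zero-initialization path
    scrubbed.zero_line(addr)       # scrubbing path
for addr in range(60):             # say 60% of lines die before reuse
    scrubbed.scrub_dead(addr)
baseline.flush(); scrubbed.flush()
print(baseline.dram_reads, baseline.dram_writes)   # 100 100
print(scrubbed.dram_reads, scrubbed.dram_writes)   # 0 40
```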

    Last-Level Cache Partitioning via Memory Virtual Channels

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2023. Jangwoo Kim. Ensuring fairness or providing isolation between multiple workloads with distinct characteristics that are collocated on a single, shared-memory system is a challenge. Recent multicore processors provide last-level cache (LLC) hardware partitioning to give hardware support for isolation, with the cache partitioning often specified by the user. While more LLC capacity often results in higher performance, in this dissertation we identify that a workload allocated more LLC capacity can exhibit worse performance in real-machine experiments, which we refer to as MiW (more is worse). Through various controlled experiments, we identify that another workload with less LLC capacity causes more frequent LLC misses. That workload stresses the main memory system shared by both workloads and degrades the performance of the former workload even if LLC partitioning is used (a balloon effect). To resolve this problem, we propose virtualizing the data path of main memory controllers and dedicating memory virtual channels (mVCs) to each group of applications, grouped for LLC partitioning. mVCs can further fine-tune the performance of groups by differentiating buffer sizes among mVCs. They can reduce total system cost by executing latency-critical and throughput-oriented workloads together on shared machines, whose performance criteria could otherwise be achieved only on dedicated machines. Experiments on a simulated chip multiprocessor show that our proposals effectively eliminate the MiW phenomenon, hence providing additional opportunities for workload consolidation in a datacenter.
Our case study demonstrates potential machine-count savings of 21.8% with mVC in a scenario that would otherwise violate a service level objective (SLO).
Contents: 1. Introduction (Research Contributions; Outline). 2. Background (Cache Hierarchy and Policies; Cache Partitioning; Benchmarks: Working Set Size, Top-down Analysis, Profiling Tools). 3. More-is-Worse Phenomenon (More LLC Leading to Performance Drop; Synthetic Workload Evaluation; Impact on Latency-critical Workloads; Workload Analysis; The Root Cause of the MiW Phenomenon; Limitations of Existing Solutions: Memory Bandwidth Throttling, Fairness-aware Memory Scheduling). 4. Virtualizing Memory Channels (Memory Virtual Channel (mVC); mVC Buffer Allocation Strategies; Evaluation: Experimental Setup, Reproducing Hardware Results, Mitigating MiW through mVC, Evaluation on Four Groups, Potentials for Operating Cost Savings with mVC). 5. Related Work (Component-wise QoS/Fairness for Shared Resources; Holistic Approaches to QoS/Fairness; MiW on Recent Architectures). 6. Conclusion (Discussion; Future Work). Bibliography. Abstract in Korean.
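A minimal sketch of the mVC idea, assuming per-group FIFO buffers and a round-robin scheduler (both are our simplifications, not the dissertation's controller design): with a single shared buffer, a bursty group starves a latency-critical one, while dedicated virtual channels preserve its service rate:

```python
# Toy memory-controller model: shared request buffer vs. per-group
# memory virtual channels (mVCs).

from collections import deque

def service(queues, cycles):
    """Round-robin scheduler: serve one request per cycle from the next
    non-empty group queue."""
    done = {g: 0 for g in queues}
    order = deque(queues)
    for _ in range(cycles):
        for _ in range(len(order)):
            g = order[0]
            order.rotate(-1)
            if queues[g]:
                queues[g].popleft()
                done[g] += 1
                break
    return done

# Shared data path: 90 bursty requests queue up ahead of 10 latency-critical ones.
shared = service({"all": deque(["bursty"] * 90 + ["latency"] * 10)}, 50)
# mVCs: each group gets its own buffer on the virtualized data path.
mvc = service({"bursty": deque(["r"] * 90), "latency": deque(["r"] * 10)}, 50)
print(shared)   # the burst monopolizes all 50 service cycles
print(mvc)      # the latency-critical group completes all 10 requests
```

In the shared case the latency-critical group receives no service within the window; with per-group channels it finishes all of its requests, which is the isolation property the abstract describes.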

    ANALYTICAL MODEL FOR CHIP MULTIPROCESSOR MEMORY HIERARCHY DESIGN AND MANAGEMENT

    Continued advances in circuit integration technology have ushered in the era of chip multiprocessor (CMP) architectures, as further scaling of the performance of conventional wide-issue superscalar processor architectures remains hard and costly. CMP architectures take advantage of Moore's Law by integrating more cores in a given chip area rather than a single fast yet larger core. They achieve higher performance with multithreaded workloads. However, CMP architectures pose many new memory hierarchy design and management problems that must be addressed. For example, how many cores and how much cache capacity must we integrate in a single chip to obtain the best possible throughput? Which is more effective, allocating more cache capacity or more memory bandwidth to a program? This thesis research develops simple yet powerful analytical models to study two new memory hierarchy design and resource management problems for CMPs. First, we consider the chip area allocation problem of maximizing chip throughput. Our model focuses on the trade-off between the number of cores, cache capacity, and cache management strategies. We find that different cache management schemes demand different area allocations to cores and cache to achieve their maximum performance. Second, we analyze the effect of cache capacity partitioning on the bandwidth requirement of a given program. Furthermore, our model considers how bandwidth allocation to different co-scheduled programs affects the individual programs' performance. Since the CMP design space is large, and simulating even one design point under various workloads would be extremely time-consuming, the conventional simulation-based research approach quickly becomes ineffective. We anticipate that our analytical models will provide practical tools for CMP designers and correctly guide their design efforts at an early design stage. Furthermore, our models will allow them to better understand potentially complex interactions among key design parameters.
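The flavor of such an area-allocation model can be sketched in a few lines. The constants and the power-law miss curve below are illustrative assumptions, not the thesis's actual model; the point is only that throughput first rises with core count and then falls as cores crowd out the cache:

```python
# Toy cores-vs-cache area trade-off for a fixed chip area.

def throughput(n_cores, chip_area=64.0, core_area=1.0,
               cpi_base=1.0, penalty=200.0, m0=0.05):
    cache_mb = chip_area - n_cores * core_area      # area left for the LLC
    if cache_mb <= 0:
        return 0.0
    # Assumed power-law miss curve per core as its cache share shrinks.
    miss_rate = m0 * (cache_mb / n_cores) ** -0.5
    per_core_ipc = 1.0 / (cpi_base + miss_rate * penalty)
    return n_cores * per_core_ipc                   # aggregate chip throughput

best = max(range(1, 64), key=throughput)
print(best, round(throughput(best), 2))
```

Sweeping the closed-form model over all feasible core counts takes microseconds, which is exactly why such models are useful for early design-space exploration where simulation of each point is infeasible.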

    Shared Resource Management for Non-Volatile Asymmetric Memory

    Non-volatile memory (NVM), such as Phase-Change Memory (PCM), is a promising energy-efficient candidate to replace DRAM. It is desirable because of its non-volatility, good scalability and low idle power. NVM nevertheless faces important challenges: writes are much slower and more power-hungry than reads, and write bandwidth is much lower than read bandwidth. A hybrid main memory architecture, consisting of a large NVM and a small DRAM, may become a solution for architecting NVM as main memory. Adding an extra layer of cache mitigates the drawbacks of NVM writes. However, writebacks from the last-level cache (LLC) might still (a) overwhelm the limited NVM write bandwidth and stall the application, (b) shorten lifetime and (c) increase energy consumption. Effectively utilizing shared resources, such as the last-level cache and the memory bandwidth, is crucial to achieving high performance on multi-core systems. No existing cache or bandwidth allocation scheme exploits the read/write asymmetry property, which is fundamental to NVM. This thesis considers that asymmetry when partitioning the cache and memory bandwidth for NVM systems, and proposes three writeback-aware schemes to manage these resources. First, a runtime mechanism, Writeback-aware Cache Partitioning (WCP), is proposed to partition the shared LLC among multiple applications. Unlike past partitioning schemes, WCP considers the reduction in cache misses as well as in writebacks. Second, a new runtime mechanism, Writeback-aware Bandwidth Partitioning (WBP), partitions NVM service cycles among applications. WBP uses a bandwidth partitioning weight to reflect the importance of writebacks (in addition to LLC misses) to bandwidth allocation. A companion Dynamic Weight Adjustment scheme dynamically selects the weight to maximize system performance.
Third, Unified Writeback-aware Partitioning (UWP) partitions the last-level cache and the memory bandwidth cooperatively. UWP can further improve system performance by considering the interaction between cache partitioning and bandwidth partitioning. The three proposed schemes improve system performance by exploiting the unique read/write asymmetry property of NVM.
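The core of the writeback-aware bandwidth idea can be sketched as a weighted demand calculation. The linear weighting and the numbers below are our illustration; the thesis selects the weight dynamically (Dynamic Weight Adjustment) rather than fixing it:

```python
# Sketch of writeback-aware bandwidth partitioning: NVM service cycles are
# split in proportion to misses plus weighted writebacks.

def bandwidth_shares(misses, writebacks, w):
    """Share of NVM service cycles per application. A weight w > 1 reflects
    NVM writes being slower and more costly than reads."""
    demand = [m + w * wb for m, wb in zip(misses, writebacks)]
    total = sum(demand)
    return [d / total for d in demand]

misses = [100, 100]
writebacks = [10, 90]          # app 1 is writeback-heavy
print(bandwidth_shares(misses, writebacks, w=1))
print(bandwidth_shares(misses, writebacks, w=4))  # heavier write weighting
```

With equal miss counts, a miss-only scheme (w = 0) would split bandwidth evenly; raising the weight shifts service cycles toward the writeback-heavy application, which is the asymmetry the abstract argues existing schemes ignore.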