
    Fault- and Yield-Aware On-Chip Memory Design and Management

    Ever-decreasing device sizes cause more frequent hard faults, which have become a serious burden to processor design and yield management. This problem is particularly pronounced in on-chip memory, which consumes up to 70% of a processor's total chip area. Traditional circuit-level techniques, such as redundancy and error correction codes, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution for building reliable on-chip memory in future processor environments. Our approach has two parts: a design framework and architectural techniques for on-chip memory structures. The design framework provides important architectural evaluation metrics such as yield, area, and performance based on low-level defect and process variation parameters, so processor architects can quickly evaluate their designs in terms of yield, area, and performance. With the framework, we develop architectural yield enhancement solutions for on-chip memory structures including the L1 cache, L2 cache, and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which shows that there will be many more L2 caches than compute cores in a chip. We propose efficient utilization techniques for these excess caches, and evaluation results show that excess caches significantly improve the overall performance of CMPs.
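    As a rough illustration of the kind of yield estimate such a framework might produce, the sketch below computes the yield of a memory array from a per-cell defect density using a simple Poisson fault model with spare rows. The array geometry, defect density, and redundancy count are illustrative assumptions, not parameters taken from this work.

        import math

        def row_fault_prob(cells_per_row: int, defect_density_per_cell: float) -> float:
            """Probability that a row contains at least one faulty cell (Poisson model)."""
            return 1.0 - math.exp(-defect_density_per_cell * cells_per_row)

        def array_yield(rows: int, cells_per_row: int,
                        defect_density_per_cell: float, spare_rows: int) -> float:
            """Yield of a memory array that can repair up to spare_rows faulty rows."""
            p = row_fault_prob(cells_per_row, defect_density_per_cell)
            # The array is usable if the number of faulty rows does not exceed the spares.
            return sum(math.comb(rows, k) * (p ** k) * ((1 - p) ** (rows - k))
                       for k in range(spare_rows + 1))

        if __name__ == "__main__":
            # Illustrative numbers only: a 512 x 512 cell array with 2 spare rows.
            y = array_yield(rows=512, cells_per_row=512,
                            defect_density_per_cell=1e-6, spare_rows=2)
            print(f"estimated array yield: {y:.4f}")

    A fuller framework would combine such per-structure yield estimates with area and performance models across caches and directory memory.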

    Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems

    This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
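    To make the idea of WCET-sensitivity-aware cache partitioning concrete, the sketch below greedily hands out LLC ways to whichever task's estimated WCET benefits most from an extra way. The per-task WCET curves, the way granularity, and the greedy policy are assumptions for illustration; this is not the TCPS algorithm itself.

        from typing import Dict, List

        def greedy_way_allocation(wcet: Dict[str, List[float]], total_ways: int) -> Dict[str, int]:
            """Hand out cache ways one at a time to the task whose WCET
            drops the most from receiving an extra way.

            wcet[task][w] is the task's estimated WCET when it owns w ways
            (index 0 = no dedicated ways); these curves would come from static
            WCET analysis and are plain inputs here.
            """
            alloc = {task: 0 for task in wcet}

            def benefit(task: str) -> float:
                w, curve = alloc[task], wcet[task]
                return curve[w] - curve[w + 1] if w + 1 < len(curve) else 0.0

            for _ in range(total_ways):
                best = max(alloc, key=benefit)
                if benefit(best) <= 0.0:
                    break  # no task gains anything from further ways
                alloc[best] += 1
            return alloc

        # Illustrative WCET curves (milliseconds) for three tasks on an 8-way LLC.
        curves = {
            "task_a": [10.0, 8.0, 7.0, 6.8, 6.7, 6.7, 6.7, 6.7, 6.7],
            "task_b": [5.0, 4.9, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8],
            "task_c": [20.0, 15.0, 12.0, 10.0, 9.0, 8.5, 8.2, 8.0, 7.9],
        }
        print(greedy_way_allocation(curves, total_ways=8))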

    Efficient concurrent data structure access parallelism techniques for increasing scalability

    Multi-core processors have revolutionised the way data structures are designed by bringing parallelism to mainstream computing. Key to exploiting the hardware parallelism available in multi-core processors are concurrent data structures. However, some concurrent data structure abstractions are inherently sequential and incapable of harnessing the parallel performance of multi-core processors. Designing and implementing concurrent data structures that do harness hardware parallelism is challenging because of the requirements of correctness, efficiency, and practicability under various application constraints. In this thesis, our research contribution is towards improving concurrent data structure access parallelism to increase data structure performance. We propose new design frameworks that improve the access parallelism of existing concurrent data structure designs, as well as new concurrent data structure designs with significant performance improvements. To give insight into the interplay between hardware and concurrent data structure access parallelism, we also give a detailed analysis and model performance scalability with varying parallelism.

    In the first part of the thesis, we focus on data structure semantic relaxation. By relaxing the semantics of a data structure, a bigger design space is unveiled that allows weaker synchronization and more useful parallelism. Investigating new data structure designs capable of trading semantics for better performance in a monotonic way is a major challenge in the area, and we address this challenge algorithmically. We present an efficient, lock-free, concurrent data structure design framework for out-of-order semantic relaxation, introducing a new two-dimensional algorithmic design that uses multiple instances of a given data structure to improve access parallelism (a toy illustration of the multi-instance idea follows this abstract).

    In the second part of the thesis, we propose an efficient priority queue that improves access parallelism by reducing the number of synchronization points per operation. Priority queues are fundamental abstract data types, often used to manage limited resources in parallel systems. Typical parallel priority queue implementations are based on heaps or skip lists, and in recent literature skip lists have been shown to be the most efficient design choice for implementing priority queues. Though numerous intricate implementations of skip-list-based queues have been proposed, their performance is constrained by the high number of global atomic updates per operation and by high memory consumption, both of which are proportional to the number of sub-lists in the queue. We therefore propose an alternative approach for designing lock-free linearizable priority queues that significantly improves memory efficiency and throughput by reducing the number of global atomic updates and the memory consumption compared to skip-list-based queues. To achieve this, our new design combines two structures, a search tree and a linked list, forming what we call a Tree Search List Queue (TSLQueue).

    Subsequently, we analyse and introduce a model for lock-free concurrent data structure access parallelism. The major impediment to scaling concurrent data structures is memory contention when accessing shared data structure access points, which leads to thread serialisation and hinders parallelism. Aiming to address this challenge, a significant amount of work in the literature has proposed multi-access techniques that improve concurrent data structure parallelism. However, there is little work on analysing and modelling the execution behaviour of concurrent multi-access data structures, especially in a shared-memory setting. In this part of the thesis, we analyse and model the general execution behaviour of concurrent multi-access data structures in the shared-memory setting. We study the behaviour of the two popular random access patterns, shared (remote) and exclusive (local) access, and of the two atomic primitives most commonly used to design lock-free data structures, Compare-and-Swap and Fetch-and-Add.
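    As a toy illustration of trading strict ordering for parallelism by spreading operations over multiple instances of a structure, the sketch below load-balances enqueues and dequeues across several sub-queues. It uses coarse per-sub-queue locks rather than the lock-free CAS/FAA designs studied in the thesis, and all names are illustrative.

        import random
        import threading
        from collections import deque
        from typing import Any, Optional

        class RelaxedQueue:
            """k-out-of-order relaxed FIFO: items may be dequeued slightly out of
            global order because each operation targets one of `width` sub-queues."""

            def __init__(self, width: int = 4):
                self.subqueues = [deque() for _ in range(width)]
                self.locks = [threading.Lock() for _ in range(width)]

            def enqueue(self, item: Any) -> None:
                i = random.randrange(len(self.subqueues))   # pick a random sub-queue
                with self.locks[i]:
                    self.subqueues[i].append(item)

            def dequeue(self) -> Optional[Any]:
                start = random.randrange(len(self.subqueues))
                for k in range(len(self.subqueues)):        # probe sub-queues in turn
                    i = (start + k) % len(self.subqueues)
                    with self.locks[i]:
                        if self.subqueues[i]:
                            return self.subqueues[i].popleft()
                return None                                 # all sub-queues empty

        q = RelaxedQueue(width=4)
        for n in range(10):
            q.enqueue(n)
        print([q.dequeue() for _ in range(10)])   # roughly, but not exactly, FIFO order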

    Staged reads: mitigating the impact of DRAM writes on DRAM reads

    Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must eventually be processed, and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes, because a single off-chip memory bus is shared by reads and writes and the direction of the bus has to be explicitly turned around when switching from writes to reads. This is an expensive operation, and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched. As a result, no reads can be processed while a memory channel is busy servicing writes. This paper proposes a novel mechanism to boost read-write parallelism and perform useful components of read operations even when the memory system is busy performing writes. If some of the banks are busy servicing writes, we start issuing reads to the other, idle banks. The results of these reads are stored in a few registers near the memory chip's I/O pads and are returned immediately after the bus turnaround. The process is referred to as a Staged Read because it decouples a single read operation into two stages, with the first stage performed in parallel with writes. This innovation can also be viewed as a form of prefetch that is internal to a memory chip. The proposed technique works best when there is bank imbalance in the write stream, so we also introduce a write scheduling algorithm that artificially creates bank imbalance and allows useful read operations to be performed during the write drain. Across a suite of memory-intensive workloads, we show that Staged Reads can boost throughput by up to 33% (average 7%) with an average DRAM access latency improvement of 17%, while incurring a very small cost (0.25%) in terms of memory chip area. The throughput improvements are even greater for write-intensive workloads (average 11%) or future systems (average 12%).
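    The following toy scheduler sketch shows the core idea of overlapping the first stage of reads with a write drain when their banks are idle; the bank bookkeeping, timing, and queue handling are simplified assumptions rather than the paper's actual DRAM model.

        from collections import namedtuple

        Request = namedtuple("Request", "kind bank")   # kind is "R" (read) or "W" (write)

        def drain_writes_with_staged_reads(write_queue, read_queue):
            """During a write drain, start reads on banks that no pending write uses,
            park their data in staged-read registers, and return it after turnaround."""
            write_banks = {w.bank for w in write_queue}
            staged, deferred = [], []
            for r in read_queue:
                # Stage 1: the array access can proceed in parallel with writes
                # as long as the read's bank is not busy with a write.
                (staged if r.bank not in write_banks else deferred).append(r)

            log = [f"write to bank {w.bank}" for w in write_queue]
            log += [f"staged read (array access) on idle bank {r.bank}" for r in staged]
            log.append("-- bus turnaround --")
            # Stage 2: staged data streams out right after turnaround,
            # then the remaining reads are serviced normally.
            log += [f"return staged data from bank {r.bank}" for r in staged]
            log += [f"normal read on bank {r.bank}" for r in deferred]
            return log

        writes = [Request("W", b) for b in (0, 0, 1, 1)]
        reads = [Request("R", b) for b in (2, 5, 1, 0)]
        print("\n".join(drain_writes_with_staged_reads(writes, reads)))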

    Auto-tuning similarity search algorithms on multi-core architectures

    In recent times, large high-dimensional datasets have become ubiquitous. Video and image repositories, financial data, and sensor data are just a few examples of such datasets in practice. Many applications that use such datasets require the retrieval of data items similar to a given query item, or the nearest neighbors (NN or k-NN) of a given item. Another common query is the retrieval of multiple sets of nearest neighbors, i.e., multi-k-NN, for different query items on the same data. With commodity multi-core CPUs becoming more widespread at lower cost, developing parallel algorithms for these search problems has become increasingly important. While the core nearest neighbor search problem is relatively easy to parallelize, it is challenging to tune it for optimality. This is because the various performance-specific algorithmic parameters, or "tuning knobs", are inter-related and also depend on the data and query workloads. In this paper, we present (1) a detailed study of the various tuning knobs and their contributions to increasing query throughput for parallelized versions of the two most common classes of high-dimensional multi-NN search algorithms, linear scan and tree traversal, and (2) an offline auto-tuner for setting these knobs by iteratively measuring actual query execution times for a given workload and dataset. We show experimentally that our auto-tuner reaches near-optimal performance and significantly outperforms un-tuned versions of parallel multi-NN algorithms for real video repository data on a variety of multi-core platforms.
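    A minimal offline auto-tuner in the spirit described above might look like the following exhaustive sweep, which times each knob configuration on a sample query batch and keeps the fastest one. The knob names, candidate values, and the stand-in query function are assumptions for illustration only, not the paper's tuner.

        import itertools
        import time
        from typing import Callable, Dict, Iterable, Tuple

        def _timed(fn: Callable[[Dict[str, int]], None], cfg: Dict[str, int]) -> float:
            start = time.perf_counter()
            fn(cfg)
            return time.perf_counter() - start

        def autotune(run_queries: Callable[[Dict[str, int]], None],
                     knobs: Dict[str, Iterable[int]],
                     repeats: int = 3) -> Tuple[Dict[str, int], float]:
            """Try every combination of knob values, time the query batch,
            and return the configuration with the lowest measured time."""
            best_cfg, best_time = None, float("inf")
            names = list(knobs)
            for values in itertools.product(*(knobs[n] for n in names)):
                cfg = dict(zip(names, values))
                elapsed = min(_timed(run_queries, cfg) for _ in range(repeats))
                if elapsed < best_time:
                    best_cfg, best_time = cfg, elapsed
            return best_cfg, best_time

        # Stand-in for a batch of multi-k-NN queries; a real tuner would run the
        # actual parallel search code with these knob settings.
        def fake_knn_batch(cfg: Dict[str, int]) -> None:
            work = 100_000 // cfg["threads"] + cfg["leaf_size"]
            sum(i * i for i in range(work * cfg["batch"]))

        knob_space = {"threads": [1, 2, 4, 8], "batch": [1, 4, 16], "leaf_size": [32, 128]}
        print(autotune(fake_knn_batch, knob_space))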

    A survey of cross-layer power-reliability tradeoffs in multi and many core systems-on-chip

    As systems-on-chip increase in complexity, the underlying technology presents us with significant challenges due to increased power consumption as well as decreased reliability. Today, designers must consider building systems that achieve the requisite functionality and performance using components that may be unreliable. In order to do so, it is crucial to understand the close interplay between the different layers of a system: technology, platform, and application. This will enable the most general tradeoff exploration, reaping the most benefits in power, performance, and reliability. This paper surveys various cross-layer techniques and approaches to power, performance, and reliability tradeoffs at the technology, circuit, architecture, and application layers.

    Exploiting Adaptive Techniques to Improve Processor Energy Efficiency

    Rapid device miniaturization keeps inducing challenges in building energy-efficient microprocessors. As transistor sizes continuously decrease, more uncertainties emerge in their operation. At the same time, integrating more and more transistors on a single chip accentuates the need to lower the supply voltage. This dissertation investigates one of the primary device uncertainties, timing errors, as a microprocessor performance bottleneck in the near-threshold computing (NTC) era. It then proposes various innovative techniques to exploit these opportunities and maintain processor energy efficiency in the context of the emerging challenges. Evaluated with a cross-layer methodology, the proposed approaches achieve substantial improvements in processor energy efficiency compared to other state-of-the-art techniques.

    Doctor of Philosophy

    The internet-based information infrastructure that has powered the growth of modern personal/mobile computing is composed of powerful, warehouse-scale computers, or datacenters. These heavily subscribed datacenters perform data-processing jobs under intense quality-of-service guarantees. Further, high-performance compute platforms are being used to model and analyze increasingly complex scientific problems and natural phenomena. To ensure that the high-performance needs of these machines are met, it is necessary to increase the efficiency of the memory system that supplies data to the processing cores. Many of the microarchitectural innovations that were designed to scale the memory wall (e.g., out-of-order instruction execution, on-chip caches) are being rendered less effective due to several emerging trends (e.g., increased emphasis on energy consumption, limited access locality). This motivates the optimization of the main memory system itself. The key to an efficient main memory system is the memory controller; in particular, the scheduling algorithm in the memory controller greatly influences its performance. This dissertation explores this hypothesis in several contexts. It develops tools to better understand memory scheduling and develops scheduling innovations for CPUs and GPUs. We propose novel memory scheduling techniques that are strongly aware of the access patterns of the clients as well as the microarchitecture of the memory device. Based on these, we present (i) a Dynamic Random Access Memory (DRAM) chip microarchitecture optimized for reducing write-induced slowdown, (ii) a memory scheduling algorithm that exploits these features, (iii) several memory scheduling algorithms to reduce the memory-related stall experienced by irregular General Purpose Graphics Processing Unit (GPGPU) applications, and (iv) the Utah Simulated Memory Module (USIMM), a detailed, validated simulator for DRAM main memory that we use for analyzing and proposing scheduling algorithms.
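    As a small example of the kind of decision such a memory controller makes, the sketch below implements a first-ready, first-come-first-served (FR-FCFS)-style pick that prefers row-buffer hits, then reads over writes, then older requests. This is a generic textbook policy shown for illustration, not one of the dissertation's proposed schedulers or USIMM code.

        from dataclasses import dataclass
        from typing import Dict, List, Optional

        @dataclass
        class MemRequest:
            arrival: int      # cycle the request entered the queue
            bank: int
            row: int
            is_read: bool

        def pick_next(queue: List[MemRequest],
                      open_rows: Dict[int, Optional[int]]) -> Optional[MemRequest]:
            """FR-FCFS-style arbitration: among queued requests, prefer
            (1) row-buffer hits over misses, (2) reads over writes,
            (3) older requests over younger ones."""
            if not queue:
                return None

            def rank(req: MemRequest):
                row_hit = open_rows.get(req.bank) == req.row
                return (not row_hit, not req.is_read, req.arrival)

            best = min(queue, key=rank)
            queue.remove(best)
            open_rows[best.bank] = best.row   # issuing the request opens its row
            return best

        # Illustrative queue: bank 0 has row 7 open, bank 1 has row 3 open.
        open_rows = {0: 7, 1: 3}
        q = [MemRequest(10, 0, 5, True), MemRequest(12, 1, 3, True),
             MemRequest(11, 0, 7, False)]
        while (req := pick_next(q, open_rows)) is not None:
            kind = "read" if req.is_read else "write"
            print(f"issue {kind} to bank {req.bank}, row {req.row} (arrived {req.arrival})")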

    High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'11), San Antonio, Texas, USA, February 2011 (in conjunction with HPCA-17)

    High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach

    Last-Level Cache Partitioning through Memory Virtual Channels

    Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2023. Advisor: Jangwoo Kim. Ensuring fairness or providing isolation between multiple workloads with distinct characteristics that are collocated on a single, shared-memory system is a challenge. Recent multicore processors provide last-level cache (LLC) hardware partitioning to support such isolation, with the cache partitioning often specified by the user. While more LLC capacity often results in higher performance, in this dissertation we identify that a workload allocated more LLC capacity can suffer worse performance in real-machine experiments, which we refer to as MiW (more is worse). Through various controlled experiments, we identify that the other workload, with less LLC capacity, incurs more frequent LLC misses; it stresses the main memory system shared by both workloads and degrades the performance of the former workload even when LLC partitioning is used (a balloon effect). To resolve this problem, we propose virtualizing the data path of the main memory controllers and dedicating memory virtual channels (mVCs) to each group of applications, grouped as for LLC partitioning. mVCs can further fine-tune the performance of groups by differentiating buffer sizes among mVCs. They can reduce total system cost by allowing latency-critical and throughput-oriented workloads to execute together on shared machines whose performance criteria could otherwise be met only on dedicated machines. Experiments on a simulated chip multiprocessor show that our proposals effectively eliminate the MiW phenomenon, hence providing additional opportunities for workload consolidation in a datacenter. Our case study demonstrates a potential saving in machine count of 21.8% with mVC, in a scenario that would otherwise violate a service level objective (SLO).
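    To illustrate what dedicating a memory virtual channel per LLC-partition group could look like at the controller, the sketch below keeps a separate, size-limited request buffer per group and arbitrates between them round-robin, so a miss-heavy group cannot flood a shared queue. The buffer sizes, arbitration policy, and class names are illustrative assumptions, not the dissertation's hardware design.

        from collections import deque
        from typing import Dict, Optional

        class MemoryVirtualChannels:
            """One request buffer per application group: a miss-heavy group can
            only occupy its own buffer slots instead of a shared controller queue."""

            def __init__(self, buffer_sizes: Dict[str, int]):
                self.buffers = {g: deque() for g in buffer_sizes}
                self.capacity = dict(buffer_sizes)
                self._order = list(buffer_sizes)      # round-robin order over groups
                self._next = 0

            def enqueue(self, group: str, request: str) -> bool:
                buf = self.buffers[group]
                if len(buf) >= self.capacity[group]:
                    return False      # this group's channel is full; other groups unaffected
                buf.append(request)
                return True

            def issue(self) -> Optional[str]:
                """Pick the next request, cycling over groups so no group starves."""
                for _ in range(len(self._order)):
                    group = self._order[self._next]
                    self._next = (self._next + 1) % len(self._order)
                    if self.buffers[group]:
                        return self.buffers[group].popleft()
                return None

        # The latency-critical group gets a larger dedicated buffer than the batch group.
        mvc = MemoryVirtualChannels({"latency_critical": 8, "batch": 4})
        for i in range(6):
            mvc.enqueue("batch", f"batch-req-{i}")    # only 4 of these are accepted
        mvc.enqueue("latency_critical", "lc-req-0")
        print([mvc.issue() for _ in range(4)])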