Design Guidelines for High-Performance SCM Hierarchies
With emerging storage-class memory (SCM) nearing commercialization, there is
evidence that it will deliver the much-anticipated high density and access
latencies within only a few factors of DRAM. Nevertheless, the
latency-sensitive nature of memory-resident services makes seamless integration
of SCM in servers questionable. In this paper, we ask the question of how best
to introduce SCM for such servers to improve overall performance/cost over
existing DRAM-only architectures. We first show that even with the most
optimistic latency projections for SCM, the higher memory access latency
results in prohibitive performance degradation. However, we find that
deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the
performance of an SCM-mostly memory system competitive. The high degree of
spatial locality that memory-resident services exhibit not only simplifies the
DRAM cache's design as page-based, but also enables the amortization of
increased SCM access latencies and the mitigation of SCM's read/write latency
disparity.
We identify the set of memory hierarchy design parameters that plays a key
role in the performance and cost of a memory system combining an SCM technology
and a 3D stacked DRAM cache. We then introduce a methodology to drive
provisioning for each of these design parameters under a target
performance/cost goal. Finally, we use our methodology to derive concrete
results for specific SCM technologies. With PCM as a case study, we show that a
two bits/cell technology hits the performance/cost sweet spot, reducing the
memory subsystem cost by 40% while keeping performance within 3% of the best
performing DRAM-only system, whereas single-level and triple-level cell
organizations are impractical for use as memory replacements.
Comment: Published at MEMSYS'1
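The amortization argument in the abstract can be illustrated with a back-of-the-envelope average-memory-access-time (AMAT) calculation; all latencies and hit rates below are hypothetical placeholder values, not figures from the paper.

```python
# Illustrative AMAT sketch: a page-based stacked-DRAM cache with a high hit
# rate (thanks to spatial locality) amortizes the higher SCM access latency.

def amat(dram_cache_hit_rate, dram_latency_ns, scm_latency_ns):
    """Average memory access time for a DRAM-cache-over-SCM hierarchy."""
    miss_rate = 1.0 - dram_cache_hit_rate
    return dram_cache_hit_rate * dram_latency_ns + miss_rate * scm_latency_ns

# DRAM-only baseline vs. SCM with a stacked-DRAM page cache (made-up numbers).
dram_only = amat(1.0, 50.0, 0.0)        # every access served at DRAM latency
scm_mostly = amat(0.95, 50.0, 150.0)    # 95% of accesses hit the DRAM cache

slowdown = scm_mostly / dram_only
print(f"AMAT: DRAM-only {dram_only:.0f} ns, SCM-mostly {scm_mostly:.1f} ns "
      f"({(slowdown - 1) * 100:.0f}% higher)")
```

Even a 3x-slower SCM backend adds only a modest AMAT penalty when the cache absorbs most accesses, which is the intuition behind the paper's performance/cost sweet spot.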
Exploiting heterogeneity in Chip-Multiprocessor Design
In the past decade, semiconductor manufacturers have persistently built faster and smaller transistors to boost processor performance as projected by Moore's Law. Recently, as we enter the deep-submicron regime, continuing the same pace of processor development has become increasingly difficult due to constraints on power, temperature, and the scalability of transistors. To overcome these challenges, researchers have proposed several innovations at both the architecture and device levels that partially solve these problems. These diversities in processor architecture and manufacturing materials provide solutions for continuing Moore's Law by effectively exploiting heterogeneity; however, they also introduce a set of unprecedented challenges that have rarely been addressed in prior work. In this dissertation, we present a series of in-depth studies to comprehensively investigate the design and optimization of future multi-core and many-core platforms through exploiting heterogeneities. First, we explore a large design space of heterogeneous chip multiprocessors by exploiting architectural- and device-level heterogeneities, aiming to identify the optimal design patterns that lead to attractive energy- and cost-efficiency at the pre-silicon stage. After this high-level study, we pay specific attention to architectural asymmetry, developing a heterogeneity-aware task scheduler to optimize energy efficiency on a given single-ISA heterogeneous multi-processor. An advanced statistical tool is employed to facilitate the algorithm development. In the third study, we shift our focus to device-level heterogeneity and propose to effectively leverage the advantages provided by different materials to solve the increasingly important reliability issue for future processors
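A minimal sketch of the heterogeneity-aware scheduling idea described above: on a single-ISA heterogeneous processor, pick the core type that minimizes energy for each task. The core parameters and the energy model here are made-up illustrations; the dissertation's actual scheduler uses statistical modeling.

```python
# Toy energy-aware core selection: energy = power * (work / speed).
CORE_TYPES = {
    # name: (relative speed, active power in watts) -- hypothetical values
    "big":    (2.0, 4.0),
    "little": (1.0, 1.0),
}

def best_core(work_units):
    """Return (core, energy) for the core type minimizing task energy."""
    return min(
        ((name, power * work_units / speed)
         for name, (speed, power) in CORE_TYPES.items()),
        key=lambda pair: pair[1],
    )

core, energy = best_core(work_units=10.0)
print(core, energy)  # in this toy model the little core wins on energy
```

A real scheduler would also weigh performance targets and contention, but the core-selection kernel has this shape.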
Energy-Efficient Cache Design Techniques Using STT-RAM
Ph.D. thesis, Seoul National University, Department of Electrical and Computer Engineering, February 2019.
Over the last decade, the capacity of on-chip caches has continuously increased to mitigate the memory-wall problem. However, SRAM, the dominant memory technology for caches, is not well suited to such large caches because of its low density and high static power. One way to mitigate these downsides of the SRAM cache is to replace SRAM with a more efficient memory technology. Spin-Transfer Torque RAM (STT-RAM), one of the emerging memory technologies, is a promising candidate as an alternative to SRAM: it compensates for SRAM's drawbacks with its non-volatility and small cell size. However, STT-RAM has poor write characteristics, such as high write energy and long write latency, so simply replacing SRAM with STT-RAM increases cache energy. To overcome these poor write characteristics, this dissertation explores three design techniques for energy-efficient caches using STT-RAM.
The first part of the dissertation focuses on combining STT-RAM with an exclusive cache hierarchy. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchy levels. However, in exclusive cache hierarchies, every block evicted from the upper-level cache is written back to the last-level cache regardless of its dirtiness, thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches due to its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction.
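The placement decision at the heart of such a hybrid cache can be sketched as follows. The predictor, the threshold, and the dirty-block heuristic are placeholders for illustration, not the thesis's actual design.

```python
# Sketch of reuse-distance-guided placement for an SRAM/STT-RAM hybrid LLC:
# blocks expected to be written again soon go to SRAM to absorb writes; the
# rest go to the dense, write-expensive STT-RAM array.

SRAM_REUSE_THRESHOLD = 64  # hypothetical cutoff, in cache accesses

def place_block(predicted_reuse_distance, is_dirty):
    """Choose the target array for a block evicted from the upper-level cache."""
    if is_dirty and predicted_reuse_distance <= SRAM_REUSE_THRESHOLD:
        return "SRAM"   # write-intensive, soon-reused block
    return "STT-RAM"    # read-mostly or rarely reused block

print(place_block(10, True), place_block(500, True), place_block(10, False))
```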
The second part of the dissertation explores trade-offs in the design of volatile STT-RAM caches. Because the write operation of STT-RAM is inefficient, various solutions have been proposed to tackle this inefficiency. One of them is redesigning the STT-RAM cell for better write characteristics at the cost of shortened retention time (i.e., volatile STT-RAM). Since retention failure in STT-RAM is stochastic, the extra overhead of periodic scrubbing with an error-correcting code (ECC) is required to tolerate such failures. Using an analytic STT-RAM model, we have conducted extensive experiments on various volatile STT-RAM cache design parameters, including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of these parameter variations on last-level cache energy and performance and provide a guideline for designing a volatile STT-RAM cache with ECC and scrubbing.
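The retention/scrubbing trade-off can be illustrated with the standard thermal-stability model for an STT-RAM cell: the probability of a retention failure within time t is P = 1 - exp(-(t / tau0) * exp(-Delta)), where tau0 is the attempt time (~1 ns) and Delta the thermal stability factor. The Delta values and scrubbing period below are illustrative, not the thesis's parameters.

```python
import math

TAU0_S = 1e-9  # attempt time, ~1 ns

def retention_failure_prob(delta, t_seconds):
    """P(cell loses its bit within t) under the thermal-stability model."""
    return 1.0 - math.exp(-(t_seconds / TAU0_S) * math.exp(-delta))

# Lowering Delta buys faster, cheaper writes but raises the failure
# probability that shorter scrubbing periods (plus ECC) must then cover.
for delta in (30, 40):
    p = retention_failure_prob(delta, t_seconds=1e-3)  # 1 ms scrubbing period
    print(f"Delta={delta}: P(fail within 1 ms) = {p:.3e}")
```

Exponential sensitivity to Delta is exactly why scrubbing period, ECC strength, and target failure rate must be co-designed.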
The last part of the dissertation proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache architecture for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages a highly-associative cache design, achieving a more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks.
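A toy version of the inter-bank idea above: spread write-intensive data evenly across banks instead of letting a few banks' scarce SRAM saturate. Here each hot page is greedily mapped to the currently least write-loaded bank; Benzene's real mechanism is more involved, so this is only a sketch.

```python
import heapq

def balance_writes(page_write_counts, n_banks):
    """Assign pages to banks, always choosing the least write-loaded bank."""
    heap = [(0, bank) for bank in range(n_banks)]  # (write load, bank id)
    assignment = {}
    for page, writes in sorted(page_write_counts.items(),
                               key=lambda kv: -kv[1]):  # hottest pages first
        load, bank = heapq.heappop(heap)
        assignment[page] = bank
        heapq.heappush(heap, (load + writes, bank))
    return assignment

pages = {"A": 100, "B": 90, "C": 10, "D": 5}
print(balance_writes(pages, n_banks=2))  # the two hottest pages land on different banks
```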
Chapter 1 Introduction
1.1 Exclusive Last-Level Hybrid Cache
1.2 Designing Volatile STT-RAM Cache
1.3 Distributed Hybrid Cache
Chapter 2 Background
2.1 STT-RAM
2.1.1 Thermal Stability
2.1.2 Read and Write Operation of STT-RAM
2.1.3 Failures of STT-RAM
2.1.4 Volatile STT-RAM
2.1.5 Related Work
2.2 Exclusive Last-Level Hybrid Cache
2.2.1 Cache Hierarchies
2.2.2 Related Work
2.3 Distributed Hybrid Cache
2.3.1 Prediction Hybrid Cache
2.3.2 Distributed Cache Partitioning
2.3.3 Related Work
Chapter 3 Exclusive Last-Level Hybrid Cache
3.1 Motivation
3.1.1 Exclusive Cache Hierarchy
3.1.2 Reuse Distance
3.2 Architecture
3.2.1 Reuse Distance Predictor
3.2.2 Hybrid Cache Architecture
3.3 Evaluation
3.3.1 Methodology
3.3.2 LLC Energy Consumption
3.3.3 Main Memory Energy Consumption
3.3.4 Performance
3.3.5 Area Overhead
3.4 Summary
Chapter 4 Designing Volatile STT-RAM Cache
4.1 Analysis
4.1.1 Retention Failure of a Volatile STT-RAM Cell
4.1.2 Memory Array Design
4.2 Evaluation
4.2.1 Methodology
4.2.2 Last-Level Cache Energy
4.2.3 Performance
4.3 Summary
Chapter 5 Distributed Hybrid Cache
5.1 Motivation
5.2 Architecture
5.2.1 Intra-Bank Optimization
5.2.2 Inter-Bank Optimization
5.2.3 Other Optimizations
5.3 Evaluation Methodology
5.4 Evaluation Results
5.4.1 Energy Consumption and Performance
5.4.2 Analysis of Intra-bank Optimization
5.4.3 Analysis of Inter-bank Optimization
5.4.4 Impact of Inter-Bank Optimization on Network Energy
5.4.5 Sensitivity Analysis
5.4.6 Implementation Overhead
5.5 Summary
Chapter 6 Conclusion
Bibliography
Towards Energy-Efficient and Reliable Computing: From Highly-Scaled CMOS Devices to Resistive Memories
The continuous increase in transistor density based on Moore's Law has led us to highly scaled Complementary Metal-Oxide-Semiconductor (CMOS) technologies. These transistor-based process technologies offer improved density as well as a reduction in nominal supply voltage. We analyze different aspects of the 45nm and 15nm technologies, such as power consumption and cell area, on an IEEE 754 single-precision floating-point unit implementation. Based on the results, the 15nm technology offers four times less energy and a three-fold smaller footprint. New challenges also arise, such as the relative proportion of leakage power in standby mode, which can be addressed by post-CMOS technologies. Spin-Transfer Torque Random Access Memory (STT-MRAM) has been explored as a post-CMOS technology for embedded and data-storage applications seeking non-volatility, near-zero standby energy, and high density. Toward attaining these objectives in practical implementations, various techniques to mitigate the specific reliability challenges associated with STT-MRAM elements are surveyed, classified, and assessed herein. Cost and suitability metrics assessed include the area of nanomagnetic and CMOS components per bit, access time and complexity, Sense Margin (SM), and energy or power consumption costs versus resiliency benefits. To further improve the Process Variation (PV) immunity of Sense Amplifiers (SAs), a new SA called the Adaptive Sense Amplifier (ASA) is introduced. ASA achieves a low Bit Error Rate (BER) and a low Energy-Delay Product (EDP) by combining the properties of two commonly used SAs, the Pre-Charge Sense Amplifier (PCSA) and the Separated Pre-Charge Sense Amplifier (SPCSA). ASA can operate in either PCSA or SPCSA mode based on the requirements of the circuit, such as energy efficiency or reliability.
ASA is then utilized in a novel approach that actually leverages PV in Non-Volatile Memory (NVM) arrays, the Self-Organized Sub-bank (SOS) design. SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce latency and power consumption while maintaining acceptable access time
Paving the Path for Heterogeneous Memory Adoption in Production Systems
Systems from smartphones to data-centers to supercomputers are increasingly heterogeneous, comprising various memory technologies and core types. Heterogeneous memory systems provide an opportunity to suitably match varying memory access patterns in applications, reducing CPU time and thus increasing performance per dollar, resulting in aggregate savings of millions of dollars in large-scale systems. However, with increased provisioning of main memory capacity per machine and differences in memory characteristics (for example, bandwidth, latency, cost, and density), memory management in such heterogeneous memory systems poses multi-fold challenges to system programmability and design.
In this thesis, we tackle memory management of two heterogeneous memory systems: (a) CPU-GPU systems with a unified virtual address space, and (b) cloud computing platforms that can deploy cheaper but slower memory technologies alongside DRAM to reduce the cost of memory in data-centers. First, we show that operating systems do not have sufficient information to optimally manage pages in bandwidth-asymmetric systems and thus fail to maximize bandwidth to massively-threaded GPU applications, sacrificing GPU throughput. We present BW-AWARE placement/migration policies to support the OS in making optimal data management decisions. Second, we present a CPU-GPU cache coherence design where the CPU and GPU need not implement the same cache coherence protocol but still provide a cache-coherent memory interface to the programmer. Our proposal is the first practical approach to providing a unified, coherent CPU-GPU address space without requiring hardware cache coherence, with the potential to enable an explosion of algorithms that leverage tightly coupled CPU-GPU coordination.
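The intuition behind bandwidth-aware placement is that pages of a bandwidth-bound application should be spread across memories in proportion to each memory's bandwidth rather than packed into the fastest one. The bandwidth numbers below are illustrative, not from the thesis.

```python
# Split an application's pages across memories proportionally to bandwidth,
# so aggregate bandwidth from all memories can be exploited simultaneously.

def bw_aware_split(n_pages, bandwidths_gbps):
    """Return pages per memory, proportional to each memory's bandwidth."""
    total_bw = sum(bandwidths_gbps)
    split = [n_pages * bw // total_bw for bw in bandwidths_gbps]
    split[0] += n_pages - sum(split)  # assign the rounding remainder
    return split

# e.g. 1000 pages over stacked DRAM (160 GB/s) and DDR (80 GB/s)
print(bw_aware_split(1000, [160, 80]))  # -> [667, 333], roughly 2:1
```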
Finally, to reduce the cost of memory in cloud platforms, where the trend has been to map datasets in memory, we make a case for a two-tiered memory system in which cheaper (per bit) memories, such as Intel/Micron's 3D XPoint, are deployed alongside DRAM. We present Thermostat, an application-transparent, huge-page-aware software mechanism to place pages in a dual-technology hybrid memory system while achieving both the cost advantages of two-tiered memory and the performance advantages of transparent huge pages. With Thermostat's capability to control application slowdown on a per-application basis, cloud providers can realize cost savings from upcoming cheaper memory technologies by shifting infrequently accessed cold data to slow memory while satisfying the throughput demand of their customers.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/137052/1/nehaag_1.pd
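A minimal sketch of the Thermostat-style policy described above: demote the coldest pages to slow memory until an estimated slowdown budget is spent. The access-rate data and the cost model are hypothetical placeholders, not Thermostat's actual mechanism.

```python
# Greedily pick cold pages for the slow tier under a slowdown budget.

def pick_cold_pages(page_access_rates, slowdown_budget, slow_mem_penalty=3.0):
    """Demote coldest pages first while the estimated extra cost fits the budget."""
    demoted, spent = [], 0.0
    for page, rate in sorted(page_access_rates.items(), key=lambda kv: kv[1]):
        extra = rate * (slow_mem_penalty - 1.0)  # added cost of the slow tier
        if spent + extra > slowdown_budget:
            break
        demoted.append(page)
        spent += extra
    return demoted

rates = {"hot": 1000.0, "warm": 50.0, "cold1": 1.0, "cold2": 0.5}
print(pick_cold_pages(rates, slowdown_budget=10.0))  # only the cold pages move
```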
Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications
The challenging deployment of compute-intensive applications from domains
such as Artificial Intelligence (AI) and Digital Signal Processing (DSP) forces
the computing-systems community to explore new design approaches.
Approximate Computing appears as an emerging solution, allowing designers to tune
the quality of results in the design of a system in order to improve energy
efficiency and/or performance. This radical paradigm shift has attracted
interest from both academia and industry, resulting in significant research on
approximation techniques and methodologies at different design layers (from
system down to integrated circuits). Motivated by the wide appeal of
Approximate Computing over the last 10 years, we conduct a two-part survey to
cover key aspects (e.g., terminology and applications) and review the
state-of-the-art approximation techniques from all layers of the traditional
computing stack. In Part II of our survey, we classify and present the
technical details of application-specific and architectural approximation
techniques, which both target the design of resource-efficient
processors/accelerators & systems. Moreover, we present a detailed analysis of
the application spectrum of Approximate Computing and discuss open challenges
and future directions.
Comment: Under Review at ACM Computing Survey
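One concrete instance of an application-level approximation technique such surveys cover is loop perforation: skipping a fraction of loop iterations to trade result quality for work. The signal and skip rate below are arbitrary examples.

```python
# Loop perforation: compute a mean over every `stride`-th element only,
# doing a fraction of the work at the cost of a small quality loss.

def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, stride=2):
    """Approximate the mean using every `stride`-th element (perforated loop)."""
    sample = xs[::stride]
    return sum(sample) / len(sample)

xs = list(range(1, 101))          # 1..100, exact mean 50.5
approx = mean_perforated(xs, 2)   # roughly half the work
print(mean_exact(xs), approx)     # 50.5 vs 50.0: small error, 2x fewer iterations
```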
Low-power instruction-cache design for embedded microprocessors
Ph.D. (Doctor of Philosophy)
QoS-aware mechanisms for improving cost-efficiency of datacenters
Warehouse Scale Computers (WSCs) promise high cost-efficiency by amortizing power, cooling, and management overheads. WSCs today host a large variety of jobs with two broad performance-requirement categories: latency-critical (LC) and best-effort (BE). Ideally, to fully utilize all hardware resources, WSC operators could simply fill all the nodes with computing jobs. Unfortunately, because colocated jobs contend for shared resources, systems with high loads often experience performance degradation, which negatively impacts the Quality of Service (QoS) of LC jobs. In fact, service providers usually over-provision resources to avoid any interference with LC jobs, leading to significant resource inefficiencies. In this dissertation, I explore opportunities across different system-abstraction layers to improve the cost-efficiency of datacenters by increasing resource utilization of WSCs with little or no impact on the performance of LC jobs. The dissertation has three main components. First, I explore opportunities to improve the throughput of multicore systems by reducing the performance variation of LC jobs. The main insight is that by reshaping the latency distribution curve, performance headroom of LC jobs can be effectively converted to improved BE throughput. I develop, implement, and evaluate a runtime system that achieves this goal with existing hardware, leveraging the cache partitioning, per-core frequency scaling, and thread masking of server processors. Evaluation results show the proposed solution enables 30% higher system throughput compared to solutions proposed in prior works while maintaining at least as good QoS for LC jobs. Second, I study resource contention in near-future heterogeneous memory architectures (HMA). This study is motivated by recent developments in non-volatile memory (NVM) technologies, which enable higher storage density at the cost of performance.
To understand the performance and QoS impact of HMAs, I design and implement a performance emulator in the Linux kernel that runs unmodified workloads with high accuracy, low overhead, and complete transparency. I further propose and evaluate multiple data and resource management QoS mechanisms, such as locality-aware page admission, occupancy management, and write buffer jailing. Third, I focus on accelerated machine learning (ML) systems. By profiling the performance of production workloads and accelerators, I show that accelerated ML tasks are highly sensitive to main memory interference due to fine-grained interaction between CPU and accelerator tasks. As a result, memory resource contention can significantly decrease the performance and efficiency gains of accelerators. I propose a runtime system that leverages existing hardware capabilities and shows 17% higher system efficiency compared to previous approaches. This study further exposes opportunities for future processor architectures.
Electrical and Computer Engineerin
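The feedback idea in the first study above, converting LC latency headroom into BE throughput, can be sketched as a simple controller. The thresholds and step sizes are illustrative assumptions, not the dissertation's actual parameters.

```python
# Grow the best-effort (BE) resource share while the latency-critical (LC) job
# has QoS headroom; shrink it quickly when the QoS target is violated.

def adjust_be_share(lc_tail_latency_ms, qos_target_ms, be_share, step=0.05):
    """One control step over the BE jobs' resource share (0.0 .. 1.0)."""
    if lc_tail_latency_ms < 0.9 * qos_target_ms:   # comfortable headroom
        return min(1.0, be_share + step)
    if lc_tail_latency_ms > qos_target_ms:         # QoS violated: back off fast
        return max(0.0, be_share - 2 * step)
    return be_share                                # near the target: hold

share = 0.2
share = adjust_be_share(lc_tail_latency_ms=5.0, qos_target_ms=10.0, be_share=share)
print(round(share, 2))  # headroom -> BE share grows to 0.25
```

In a real system the actuators for this share would be mechanisms like cache-way partitioning and per-core frequency scaling, as the abstract describes.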