An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory
This paper proposes a novel intelligent framework for oversubscription
management in CPU-GPU unified virtual memory (UVM). We analyze the current
rule-based methods for GPU memory oversubscription with unified memory and the
learning-based methods proposed for other computer architecture components. We
then identify the performance gap between the existing rule-based methods and
the theoretical upper bound, as well as the advantages of applying machine
intelligence and the limitations of the existing learning-based methods. The
proposed framework consists of an access pattern classifier followed by a
pattern-specific Transformer-based model with a novel loss function aimed at
reducing page thrashing. A policy engine leverages the model's predictions to
perform accurate page prefetching and pre-eviction. We evaluate our intelligent
framework on a set of 11 memory-intensive benchmarks from popular benchmark
suites. Our solution outperforms the state-of-the-art (SOTA) methods for
oversubscription management, reducing the number of pages thrashed by 64.4\%
under 125\% memory oversubscription relative to the baseline, whereas the SOTA
method reduces thrashing by only 17.3\%. Our solution achieves an average IPC
improvement of 1.52X under 125\% memory oversubscription and 3.66X under 150\%
memory oversubscription. It also outperforms the existing learning-based
methods for page address prediction, improving average top-1 accuracy by
6.45\% (up to 41.2\%) for a single GPGPU workload and by 10.2\% (up to 30.2\%)
for multiple concurrent GPGPU workloads.
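To make the pipeline concrete, below is a minimal sketch of the classify-then-predict policy loop described above, with a toy stride-based classifier and predictor standing in for the paper's actual classifier and Transformer model; all names (classify_pattern, StridePredictor, PolicyEngine) are illustrative, not the paper's API.

```python
# Illustrative sketch only: a stride heuristic stands in for the paper's
# access pattern classifier and pattern-specific Transformer.
from collections import OrderedDict

def classify_pattern(history):
    """Toy classifier: label the access history by its dominant stride."""
    strides = {b - a for a, b in zip(history, history[1:])}
    return "streaming" if strides == {1} else "irregular"

class StridePredictor:
    """Stand-in for the pattern-specific model: extrapolate the last stride."""
    def top_k(self, history, k):
        stride = history[-1] - history[-2] if len(history) > 1 else 1
        return [history[-1] + stride * i for i in range(1, k + 1)]

class PolicyEngine:
    """Turn page predictions into prefetch and pre-eviction actions."""
    def __init__(self, models, capacity):
        self.models = models              # pattern label -> predictor
        self.resident = OrderedDict()     # resident pages, in LRU order
        self.capacity = capacity

    def on_fault(self, history):
        predictor = self.models.get(classify_pattern(history))
        actions = []
        for page in (predictor.top_k(history, k=4) if predictor else []):
            if page in self.resident:
                self.resident.move_to_end(page)   # refresh LRU position
                continue
            if len(self.resident) >= self.capacity:
                victim, _ = self.resident.popitem(last=False)
                actions.append(("pre-evict", victim))
            self.resident[page] = None
            actions.append(("prefetch", page))
        return actions

engine = PolicyEngine({"streaming": StridePredictor(),
                       "irregular": StridePredictor()}, capacity=8)
print(engine.on_fault([100, 101, 102, 103]))  # prefetch pages 104..107
```

In the actual framework, the predictor would be the pattern-specific Transformer trained with the thrashing-aware loss, and pre-eviction would likewise be driven by its predictions.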
Mitigating Interference During Virtual Machine Live Migration through Storage Offloading
Today's cloud landscape has evolved computing infrastructure into a dynamic, high-utilization, service-oriented paradigm. This shift has enabled the commoditization of large-scale storage and distributed computation, allowing engineers to tackle previously untenable problems without large upfront investment. A key enabler of flexibility in the cloud is the ability to transfer running virtual machines across subnets or even datacenters using live migration. However, live migration can be a costly process, one that has the potential to interfere with other applications not involved in the migration. This work investigates storage interference through experimentation with real-world systems and well-established benchmarks. To address migration interference in general, a buffering technique is presented that offloads the migration's reads, eliminating interference in the majority of scenarios.
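The abstract does not detail the buffering mechanism, so the sketch below illustrates one plausible reading: the migration's disk reads are redirected to a pre-populated staging buffer so they never compete with application I/O on the primary disk. All class and method names here are hypothetical.

```python
# Hypothetical model of read offloading during live migration: migration
# traffic is served from a staging copy instead of the primary disk.
class PrimaryDisk:
    def __init__(self, blocks):
        self.blocks = blocks
        self.app_reads = 0
        self.migration_reads = 0

    def read(self, block, source="app"):
        if source == "app":
            self.app_reads += 1
        else:
            self.migration_reads += 1
        return self.blocks[block]

class OffloadBuffer:
    """Staging copy on separate storage that absorbs migration reads."""
    def __init__(self, disk):
        # One-time snapshot here; a real system would fill it incrementally.
        self.snapshot = dict(disk.blocks)

    def read(self, block):
        return self.snapshot[block]

disk = PrimaryDisk({i: f"data-{i}" for i in range(4)})
staging = OffloadBuffer(disk)

# Migration traffic hits the staging buffer, not the primary disk ...
migrated = [staging.read(b) for b in range(4)]
# ... so application reads see no added contention.
disk.read(0)
assert disk.migration_reads == 0 and disk.app_reads == 1
```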
Workload Behavior Driven Memory Subsystem Design for Hyperscale
Hyperscalers run services across a large fleet of servers, serving billions
of users worldwide. These services, however, behave differently than commonly
available benchmark suites, resulting in server architectures that are not
optimized for cloud workloads. With datacenters becoming a primary server
processor market, optimizing server processors for cloud workloads by better
understanding their behavior has become crucial. To address this, in this
paper, we present MemProf, a memory profiler that profiles the three major
reasons for stalls in cloud workloads: code-fetch, memory bandwidth, and memory
latency. We use MemProf to understand the behavior of cloud workloads and
propose and evaluate micro-architectural and memory system design improvements
that improve cloud workloads' performance.
MemProf's code analysis shows that cloud workloads execute the same code
across CPU cores. Using this, we propose shared micro-architectural
structures--a shared L2 I-TLB and a shared L2 cache. Next, to help with memory
bandwidth stalls, using workloads' memory bandwidth distribution, we find that
only a few pages contribute to most of the system bandwidth. We use this
finding to evaluate a new high-bandwidth, small-capacity memory tier and show
that it performs 1.46x better than the current baseline configuration. Finally,
we look into ways to improve memory latency for cloud workloads. Profiling
using MemProf reveals that L2 hardware prefetchers, a common solution to reduce
memory latency, have very low coverage and consume a significant amount of
memory bandwidth. To help improve hardware prefetcher performance, we built a
memory tracing tool to collect and validate production memory access traces.
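As a rough illustration of the bandwidth finding, the sketch below shows how one might mine a memory access trace for the few pages that account for most accesses, i.e. the candidates for the proposed high-bandwidth, small-capacity tier. The trace format and the 80% coverage threshold are assumptions, not MemProf's actual interface.

```python
# Assumed trace format: a flat list of byte addresses. The coverage target
# is an illustrative parameter, not a value from the paper.
from collections import Counter

PAGE_SIZE = 4096

def hot_pages(trace_addrs, coverage=0.80):
    """Return the smallest set of pages covering `coverage` of all accesses."""
    counts = Counter(addr // PAGE_SIZE for addr in trace_addrs)
    total = sum(counts.values())
    covered, selected = 0, []
    for page, n in counts.most_common():
        selected.append(page)
        covered += n
        if covered / total >= coverage:
            break
    return selected

# Skewed toy trace: most accesses hit two pages.
trace = [0x1000] * 70 + [0x2000] * 20 + [0x3000, 0x4000] * 5
pages = hot_pages(trace)
print(f"{len(pages)} pages cover 80% of accesses")  # -> 2 pages
```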
Cost-effective compiler directed memory prefetching and bypassing
Ever-increasing memory latencies and deeper pipelines push memory farther from the processor. Prefetching techniques aim to bridge these two gaps by fetching data in advance into both the L1 cache and the register file. Our main contribution in this paper is a hybrid approach to the prefetching problem that combines software and hardware prefetching in a cost-effective way, requiring very little hardware support and minimally impacting the design of the processor pipeline. The prefetcher is built on top of static memory instruction bypassing, which is in charge of bringing prefetched values into the register file. In this paper we also present a thorough analysis of the limits of both prefetching and memory instruction bypassing. We also compare our prefetching technique with a prior speculative proposal that attacked the same problem, and we show that, at much lower cost, our hybrid solution is better than a realistic implementation of speculative prefetching and bypassing. On average, our hybrid implementation achieves a 13% speed-up over a version with software prefetching in a subset of numerical applications, and an average of 43% over a version with no software prefetching (up to 102% for specific benchmarks).
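As an illustration of the software half of such a hybrid scheme, the toy compiler pass below inserts a prefetch for a strided load far enough ahead of its use to hide memory latency. The IR shape, cycle estimates, and distance formula are generic textbook assumptions, not the paper's implementation.

```python
# Toy software-prefetch insertion pass over a made-up loop IR.
import math

def prefetch_distance(mem_latency_cycles, loop_body_cycles):
    """Iterations ahead a prefetch must run to hide memory latency."""
    return math.ceil(mem_latency_cycles / loop_body_cycles)

def insert_prefetches(loop_body, mem_latency_cycles=300, loop_body_cycles=20):
    """Emit a prefetch op ahead of each strided load in the loop body."""
    dist = prefetch_distance(mem_latency_cycles, loop_body_cycles)  # 15 iters
    out = []
    for op in loop_body:
        if op[0] == "load" and op[2] == "strided":
            base, stride = op[1], op[3]
            # Prefetch the address this load will touch `dist` iterations later.
            out.append(("prefetch", base, f"i + {dist}", stride))
        out.append(op)
    return out

loop = [("load", "a", "strided", 8), ("fadd", "sum", "a")]
for op in insert_prefetches(loop):
    print(op)
```

The bypassing half of the hybrid scheme would then route the prefetched values into the register file rather than only the L1 cache, as the abstract describes.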
Automatic Sharing Classification and Timely Push for Cache-coherent Systems
This paper proposes and evaluates Sharing/Timing Adaptive Push (STAP), a dynamic scheme for preemptively sending data from producers to consumers to minimize critical-path communication latency. STAP uses small hardware buffers to dynamically detect sharing patterns and timing requirements. The scheme applies to both intra-node and inter-socket directory-based shared memory networks. We integrate STAP into a MOESI cache-coherence protocol using heuristics to detect different data sharing patterns, including broadcasts, producer/consumer, and migratory-data sharing. Using 12 benchmarks from the PARSEC and SPLASH-2 suites in 3 different configurations, we show that our scheme significantly reduces communication latency in NUMA systems and achieves an average 10% performance improvement (up to 46%), with at most 2% on-chip storage overhead. When combined with existing prefetch schemes, STAP either outperforms prefetching or combines with it for improved performance (up to 15% extra) in most cases.
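The sketch below models, in software, the kind of sharing-pattern classification STAP performs with its small hardware buffers. The specific thresholds and rules are illustrative guesses at the broadcast, producer/consumer, and migratory categories named in the abstract, not the paper's exact heuristics.

```python
# Software model of per-line sharing-pattern detection; thresholds are
# illustrative, not STAP's actual hardware heuristics.
from collections import defaultdict

class SharingDetector:
    def __init__(self):
        self.writers = defaultdict(set)   # cache line -> cores that wrote it
        self.readers = defaultdict(set)   # cache line -> cores that read it

    def record(self, line, core, is_write):
        (self.writers if is_write else self.readers)[line].add(core)

    def classify(self, line):
        w, r = self.writers[line], self.readers[line]
        if len(w) == 1 and len(r) >= 4:
            return "broadcast"            # one writer, many readers: push to all
        if len(w) == 1 and r:
            return "producer/consumer"    # push to the known consumer set
        if len(w) > 1 and w == r:
            return "migratory"            # ownership moves from core to core
        return "unclassified"

d = SharingDetector()
d.record(0x80, core=0, is_write=True)
for c in (1, 2, 3):
    d.record(0x80, core=c, is_write=False)
print(d.classify(0x80))  # -> producer/consumer
```

Once a line is classified, a push scheme like STAP can forward newly produced data to the predicted consumers ahead of their demand misses, which is where the critical-path latency savings come from.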