3 research outputs found
Shared Resource Management for Non-Volatile Asymmetric Memory
Non-volatile memory (NVM), such as Phase-Change Memory (PCM), is a promising energy-efficient candidate to replace DRAM. It is desirable because of its non-volatility, good scalability and low idle power. NVM, nevertheless, faces important challenges. The main problems are: writes are much slower and more power hungry than reads and write bandwidth is much lower than read bandwidth. Hybrid main memory architecture, which consists of a large NVM and a small DRAM, may become a solution for architecting NVM as main memory. Adding an extra layer of cache mitigates the drawbacks of NVM writes. However, writebacks from the last-level cache (LLC) might still (a) overwhelm the limited NVM write bandwidth and stall the application, (b) shorten lifetime and (c) increase energy consumption.
Effectively utilizing shared resources, such as the last-level cache and the memory bandwidth, is crucial to achieving high performance for multi-core systems. No existing cache and bandwidth allocation scheme exploits the read/write asymmetry property, which is fundamental in NVM. This thesis tries to consider the asymmetry property in partitioning the cache and memory bandwidth for NVM systems.
The thesis proposes three writeback-aware schemes to manage the resources in NVM systems. First, a runtime mechanism, Writeback-aware Cache Partitioning (WCP), is proposed to partition the shared LLC among multiple applications. Unlike past partitioning schemes, WCP considers the reduction in cache misses as well as writebacks. Second, a new runtime mechanism, Writeback-aware Bandwidth Partitioning (WBP), partitions NVM service cycles among applications. WBP uses a bandwidth partitioning weight to reflect the importance of writebacks (in addition to LLC misses) to bandwidth allocation. A companion Dynamic Weight Adjustment scheme dynamically selects the cache partitioning weight to maximize system performance. Third, Unified Writeback-aware Partitioning (UWP) partitions the last-level cache and the memory bandwidth cooperatively. UWP can further improve the system performance by considering the interaction of cache partitioning and bandwidth partitioning. The three proposed schemes improve system performance by considering the unique read/write asymmetry property of NVM
Recommended from our members
Improving virtual memory performance in virtualized environments
Virtual Memory is a major system performance bottleneck in virtualized environments. In addition to expensive address translations, frequent virtual machine context switches are common in virtualized environments, resulting in increased TLB miss rates, subsequent expensive page walks and data cache contention due to incoming page table entries evicting useful data. Orthogonally, translation coherence, which is currently an expensive operation implemented in software, can consume up to 50% of the runtime of an application executing on the guest. To improve the performance of virtual memory in virtualized environments, two solutions have been proposed in this thesis - namely, (1) Context Switch Aware Large TLB (CSALT), an architecture which addresses the problem of increased TLB miss rates and their adverse impact on data caches. CSALT copes with the increased demand of context switches by storing a large number TLB entries. It mitigates data cache contention by employing a novel TLB-aware cache partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline which has a large L3 TLB (2) Translation Coherence using Addressable TLBs (TCAT), a hardware translation coherence scheme which eliminates almost all of the overheads associated with address translation coherence. TCAT overlays translation coherence atop cache coherence to accurately identify slave cores. It then leverages the addressable Part-Of-Memory TLB (POM-TLB) to eliminate expensive Inter Processor Interrupts (IPI) and achieve precise invalidations on the slave core. On 8-core systems with one virtual machine context executing multi-threaded workloads, TCAT achieves an average performance improvement of 13% over the kvmtlb baselineElectrical and Computer Engineerin
Online learning on the programmable dataplane
This thesis makes the case for managing computer networks with datadriven methods automated statistical inference and control based on measurement data and runtime observations—and argues for their tight integration with programmable dataplane hardware to make management decisions faster and from more precise data. Optimisation, defence, and measurement of networked infrastructure are each challenging tasks in their own right, which are currently dominated by the use of hand-crafted heuristic methods. These become harder to reason about and deploy as networks scale in rates and number of forwarding elements, but their design requires expert knowledge and care around unexpected protocol interactions. This makes tailored, per-deployment or -workload solutions infeasible to develop. Recent advances in machine learning offer capable function approximation and closed-loop control which suit many of these tasks. New, programmable dataplane hardware enables more agility in the network— runtime reprogrammability, precise traffic measurement, and low latency on-path processing. The synthesis of these two developments allows complex decisions to be made on previously unusable state, and made quicker by offloading inference to the network.
To justify this argument, I advance the state of the art in data-driven defence of networks, novel dataplane-friendly online reinforcement learning algorithms, and in-network data reduction to allow classification of switchscale data. Each requires co-design aware of the network, and of the failure modes of systems and carried traffic. To make online learning possible in the dataplane, I use fixed-point arithmetic and modify classical (non-neural) approaches to take advantage of the SmartNIC compute model and make use of rich device local state. I show that data-driven solutions still require great care to correctly design, but with the right domain expertise they can improve on pathological cases in DDoS defence, such as protecting legitimate UDP traffic. In-network aggregation to histograms is shown to enable accurate classification from fine temporal effects, and allows hosts to scale such classification to far larger flow counts and traffic volume. Moving reinforcement learning to the dataplane is shown to offer substantial benefits to stateaction latency and online learning throughput versus host machines; allowing policies to react faster to fine-grained network events. The dataplane environment is key in making reactive online learning feasible—to port further algorithms and learnt functions, I collate and analyse the strengths of current and future hardware designs, as well as individual algorithms