Improving address translation performance in virtualized multi-tenant systems
With the explosive growth in dataset sizes, application memory footprints are commonly reaching hundreds of GBs. Such huge datasets pressure the TLBs, resulting
in frequent misses that must be resolved through a page walk – a long-latency pointer
chase through multiple levels of the in-memory radix-tree-based page table. Page walk
latency is particularly high under virtualization where address translation mandates traversing two radix-tree page tables in a process called a nested page walk, performing
up to 24 memory accesses. Page walk latency can also be amplified by the effects of colocating applications on the same server, a practice commonly used to increase utilization. Under colocation, cache contention makes cache misses during a nested page walk more frequent, compounding page walk latency. Both virtualization and colocation are widely adopted in cloud platforms such as Amazon Web Services and Google Compute Engine. As a result, in cloud environments, page walk latency can reach hundreds of cycles, significantly reducing overall application performance.
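To make the 24-access figure concrete, here is a back-of-the-envelope sketch (my arithmetic, not code from the thesis) of a two-dimensional nested walk with 4-level guest and host page tables: each guest page-table access is a guest-physical address that must itself be translated through the host table, and the final guest-physical data address needs one more host walk.

    /* 4-level guest and host radix page tables, x86-64 style. */
    #include <stdio.h>

    int main(void) {
        int guest_levels = 4;
        int host_levels  = 4;
        /* One host walk per guest level (to translate each guest-physical
         * page-table address), one host walk for the final guest-physical
         * data address, plus the guest page-table accesses themselves. */
        int accesses = (guest_levels + 1) * host_levels + guest_levels;
        printf("max nested page walk accesses: %d\n", accesses); /* 24 */
        return 0;
    }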
This thesis addresses the problem of high page walk latency by (1) identifying
the sources of the high page walk latency under virtualization and/or colocation, and
(2) proposing hardware and software techniques that accelerate page walks by means
of new memory allocation strategies for the page table and data which can be easily
adopted by existing systems.
Firstly, we quantify how dataset size growth, virtualization, and colocation affect page walk latency. We also study how high page walk latency affects performance. Due to the lack of dedicated tools for evaluating address translation overhead
on modern processors, we design a methodology to vary the page walk latency experienced by an application running on real hardware. To quantify the performance impact
of address translation, we measure the application’s execution time while varying the
page walk latency. We find that under virtualization, address translation considerably
limits performance: an application can waste up to 68% of execution time due to stalls
originating from page walks. In addition, we investigate which accesses from a nested
page walk are most significant for the overall page walk latency by examining from
where in the memory hierarchy these accesses are served. We find that accesses to the
deeper levels of the page table radix tree are responsible for most of the overall page
walk latency.
Based on these observations, we introduce two address translation acceleration
techniques that can be applied to any ISA that employs radix-tree page tables and
nested page walks. The first of these techniques is Prefetched Address Translation
(ASAP), a new software-hardware approach for mitigating the high page walk latency
caused by virtualization and/or application colocation. At the heart of ASAP is a
lightweight technique for directly indexing individual levels of the page table radix
tree. Direct indexing enables ASAP to fetch nodes from deeper levels of the page
table without first accessing the preceding levels, thus lowering the page walk latency.
ASAP is fully compatible with the existing radix-tree-based page table and requires
only incremental and isolated changes to the memory subsystem.
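A minimal sketch of the direct-indexing idea, assuming (as an illustration, not ASAP's actual hardware design) that each page-table level is backed by a contiguous region, so a node's address at any level can be computed from the virtual address bits alone:

    #include <stdint.h>
    #include <stdio.h>

    #define ENTRY_SIZE 8u    /* bytes per page-table entry */
    #define INDEX_BITS 9u    /* 512 entries per 4KB node */
    #define PAGE_SHIFT 12u

    /* Hypothetical contiguous base address of each level, L0 (root) to L3. */
    static const uint64_t level_base[4] =
        { 0x100000, 0x200000, 0x400000, 0x800000 };

    /* Address of the entry at 'level' serving 'va', computed from the VA's
     * index bits alone -- no pointer chasing through the parent levels. */
    uint64_t entry_addr(uint64_t va, int level) {
        uint64_t idx = va >> (PAGE_SHIFT + (3 - level) * INDEX_BITS);
        idx &= (1ull << ((level + 1) * INDEX_BITS)) - 1;
        return level_base[level] + idx * ENTRY_SIZE;
    }

    int main(void) {
        uint64_t va = 0x7f1234567000ull;
        for (int level = 0; level < 4; level++)  /* all fetchable in parallel */
            printf("level %d entry at 0x%llx\n", level,
                   (unsigned long long)entry_addr(va, level));
        return 0;
    }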
The second technique is PTEMagnet, a new software-only approach for reducing
address translation latency under virtualization and application colocation. Initially,
we identify a new address translation bottleneck caused by memory fragmentation
stemming from the interaction of virtualization, application colocation, and the Linux
memory allocator. The fragmentation results in the effective cache footprint of the
host page table being larger than that of the guest page table. The bloated footprint
of the host page table leads to frequent cache misses during nested page walks, increasing page walk latency. In response to these observations, we propose PTEMag net. PTEMagnet prevents memory fragmentation by fine-grained reservation-based
memory allocation in the guest OS. PTEMagnet is fully legacy-preserving, requiring
no modifications to either user code or mechanisms for address translation and virtualization.
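A minimal sketch of fine-grained reservation-based allocation in the spirit of PTEMagnet (the reservation size and bookkeeping are illustrative assumptions, not PTEMagnet's actual implementation): on a fault, a process draws from its own small block of contiguous frames, so the pages of colocated processes do not interleave.

    #include <stdio.h>

    #define RESV_PAGES 8  /* frames per reservation (hypothetical size) */

    static unsigned long next_free_pfn = 1000;  /* toy frame allocator */

    struct reservation {
        unsigned long base_pfn;  /* first frame of the reserved block */
        int used;                /* frames handed out so far */
    };

    /* Hand out one frame, refilling the reservation with a fresh block of
     * contiguous frames whenever it is empty or exhausted. */
    unsigned long alloc_page(struct reservation *r) {
        if (r->base_pfn == 0 || r->used == RESV_PAGES) {
            r->base_pfn = next_free_pfn;
            next_free_pfn += RESV_PAGES;
            r->used = 0;
        }
        return r->base_pfn + r->used++;
    }

    int main(void) {
        struct reservation proc_a = {0, 0}, proc_b = {0, 0};
        for (int i = 0; i < 4; i++) {  /* interleaved faults from A and B */
            unsigned long a = alloc_page(&proc_a);
            unsigned long b = alloc_page(&proc_b);
            printf("A -> pfn %lu   B -> pfn %lu\n", a, b);
        }
        return 0;  /* A gets 1000..1003, B gets 1008..1011: no interleaving */
    }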
In summary, this thesis proposes non-disruptive upgrades to the virtual memory
subsystem for reducing page walk latency in virtualized deployments. In doing so,
this thesis evaluates the impact of page walk latency on the application’s performance, identifies the bottlenecks of the existing address translation mechanism caused
by virtualization, application colocation, and the Linux memory allocator, and proposes software-hardware and software-only solutions for eliminating these bottlenecks.
Page size aware cache prefetching
The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial cache prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection to 4KB physical page boundaries, even though modern systems use page sizes larger than 4KB to mitigate address translation overheads. This paper exploits the high usage of large pages in modern systems to increase the effectiveness of spatial cache prefetching. We design and propose the Page-size Propagation Module (PPM), a µarchitectural scheme that propagates page size information to the lower-level cache prefetchers, enabling safe prefetching beyond 4KB physical page boundaries when the accessed blocks reside in large pages, at the cost of augmenting the first-level caches' Miss Status Holding Register (MSHR) entries with one additional bit. PPM is compatible with any cache prefetcher without requiring design modifications. We capitalize on PPM's benefits by designing a module that consists of two page size aware prefetchers that inherently use different page sizes to drive prefetching. The composite module uses adaptive logic to dynamically enable the most appropriate page size aware prefetcher. Finally, we show that the proposed designs are transparent to the cache prefetcher in use. We apply the proposed page size exploitation techniques to four state-of-the-art spatial cache prefetchers. Our evaluation shows that our proposals improve single-core geomean performance by up to 8.1% (2.1% at minimum) over the original implementation of the considered prefetchers, across 80 memory-intensive workloads. In multi-core contexts, we report geomean speedups of up to 7.7% across different cache prefetchers and core configurations.

This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the European Union Horizon 2020 research and innovation program under grant agreement No 955606 (DEEP-SEA EU project), the National Science Foundation through grants CNS-1938064 and CCF-1912617, and the Semiconductor Research Corporation project GRC 2936.001. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry, and Competitiveness and the European Social Fund under FPI fellowship No. PRE2018-087046. Marc Casas has been partially supported by Grant RYC2017-23269, funded by MCIN/AEI/10.13039/501100011033 and ESF 'Investing in your future'.
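The core safety check is simple to state; below is a hedged sketch (function and variable names are mine, not PPM's actual design) of when a prefetch candidate may cross a 4KB boundary: only when the demand access and the candidate fall in the same physical page, which depends on the propagated page size.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Safe iff both addresses fall in the same physical page; with a 2MB
     * page the candidate may legally cross a 4KB-aligned boundary. */
    bool prefetch_is_safe(uint64_t demand_pa, uint64_t candidate_pa,
                          uint64_t page_size) {
        return (demand_pa / page_size) == (candidate_pa / page_size);
    }

    int main(void) {
        uint64_t pa = 0x12345F80;  /* demand access near a 4KB boundary */
        uint64_t pf = pa + 256;    /* candidate crosses that boundary */
        printf("4KB page: %s\n", prefetch_is_safe(pa, pf, 4096)    ? "safe" : "drop");
        printf("2MB page: %s\n", prefetch_is_safe(pa, pf, 2097152) ? "safe" : "drop");
        return 0;
    }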
Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings
Conventional virtual memory (VM) frameworks enable a virtual address to
flexibly map to any physical address. This flexibility necessitates large data
structures to store virtual-to-physical mappings, which leads to high address
translation latency and large translation-induced interference in the memory
hierarchy. On the other hand, restricting the address mapping so that a virtual
address can only map to a specific set of physical addresses can significantly
reduce address translation overheads by using compact and efficient translation
structures. However, restricting the address mapping flexibility across the
entire main memory severely limits data sharing across different processes and
increases data accesses to the swap space of the storage device, even in the
presence of free memory. We propose Utopia, a new hybrid virtual-to-physical
address mapping scheme that allows both flexible and restrictive hash-based
address mapping schemes to harmoniously co-exist in the system. The key idea of
Utopia is to manage physical memory using two types of physical memory
segments: restrictive and flexible segments. A restrictive segment uses a
restrictive, hash-based address mapping scheme that maps virtual addresses to
only a specific set of physical addresses and enables faster address
translation using compact translation structures. A flexible segment employs
the conventional fully-flexible address mapping scheme. By mapping data to a
restrictive segment, Utopia enables faster address translation with lower
translation-induced interference. Utopia improves performance by 24% in a
single-core system over the baseline system, whereas the best prior
state-of-the-art contiguity-aware translation scheme improves performance by
13%.

To appear in the 56th IEEE/ACM International Symposium on Microarchitecture (MICRO).
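A toy sketch of a restrictive, hash-based mapping in the spirit of Utopia (the set size and hash function are illustrative assumptions): a virtual page may only reside in a small set of candidate frames derived from its virtual page number, so translation only has to identify which candidate holds the page.

    #include <stdint.h>
    #include <stdio.h>

    #define SET_WAYS 4u    /* candidate frames per virtual page */
    #define NUM_SETS 1024u /* sets in the restrictive segment */

    /* The few frames a virtual page number is allowed to map to; locating
     * the page means checking SET_WAYS candidates, not a full radix walk. */
    void candidate_frames(uint64_t vpn, uint64_t out[SET_WAYS]) {
        uint64_t set = (vpn ^ (vpn >> 10)) % NUM_SETS;  /* toy hash */
        for (unsigned w = 0; w < SET_WAYS; w++)
            out[w] = set * SET_WAYS + w;
    }

    int main(void) {
        uint64_t frames[SET_WAYS];
        candidate_frames(0xABCDE, frames);
        for (unsigned w = 0; w < SET_WAYS; w++)
            printf("candidate frame %u: %llu\n", w,
                   (unsigned long long)frames[w]);
        return 0;
    }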
Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources
Address translation is a performance bottleneck in data-intensive workloads
due to large datasets and irregular access patterns that lead to frequent
high-latency page table walks (PTWs). PTWs can be reduced by using (i) large
hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both
solutions have significant drawbacks: increased access latency, power and area
(for hardware TLBs), and costly memory accesses, the need for large contiguous
memory blocks, and complex OS modifications (for software-managed TLBs). We
present Victima, a new software-transparent mechanism that drastically
increases the translation reach of the processor by leveraging the
underutilized resources of the cache hierarchy. The key idea of Victima is to
repurpose L2 cache blocks to store clusters of TLB entries, thereby providing
an additional low-latency and high-capacity component that backs up the
last-level TLB and thus reduces PTWs. Victima has two main components. First, a
PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on
the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache
replacement policy prioritizes keeping TLB entries in the cache hierarchy by
considering (i) the translation pressure (e.g., last-level TLB miss rate) and
(ii) the reuse characteristics of the TLB entries. Our evaluation results show
that in native (virtualized) execution environments Victima improves average
end-to-end application performance by 7.4% (28.7%) over the baseline four-level
radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art
software-managed TLB, across 11 diverse data-intensive workloads. Victima (i)
is effective in both native and virtualized environments, (ii) is completely
transparent to application and system software, and (iii) incurs very small
area and power overheads on a modern high-end CPU.

To appear in the 56th IEEE/ACM International Symposium on Microarchitecture (MICRO).
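An illustrative layout sketch (field names and sizes are my assumptions, not Victima's exact format) of how a 64-byte L2 cache block can be repurposed to hold a cluster of TLB entries:

    #include <stdint.h>
    #include <stdio.h>

    struct tlb_entry {
        uint32_t vpn_tag;  /* tag bits of the virtual page number */
        uint32_t pfn;      /* physical frame number + permission bits */
    };                     /* 8 bytes */

    struct tlb_cluster {   /* exactly one 64-byte L2 cache block */
        struct tlb_entry entries[8];
    };

    int main(void) {
        printf("TLB entries per 64B cache block: %zu\n",
               sizeof(struct tlb_cluster) / sizeof(struct tlb_entry));
        return 0;
    }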
A Survey of Techniques for Architecting TLBs
A translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
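For readers new to the area, a minimal sketch of the lookup a TLB performs on every memory access (a toy direct-mapped design; real TLBs are set-associative and larger), showing why a miss is so costly: it forces a multi-access page table walk.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 64

    struct tlb_slot { uint64_t vpn; uint64_t pfn; bool valid; };
    static struct tlb_slot tlb[TLB_ENTRIES];

    /* Hit: translation completes in one lookup. Miss: the hardware must
     * walk the page table (several dependent memory accesses). */
    bool tlb_lookup(uint64_t vaddr, uint64_t *paddr) {
        uint64_t vpn = vaddr >> 12;
        struct tlb_slot *s = &tlb[vpn % TLB_ENTRIES];
        if (s->valid && s->vpn == vpn) {
            *paddr = (s->pfn << 12) | (vaddr & 0xFFF);
            return true;
        }
        return false;
    }

    int main(void) {
        tlb[0xDEAD % TLB_ENTRIES] =
            (struct tlb_slot){ .vpn = 0xDEAD, .pfn = 0x42, .valid = true };
        uint64_t pa;
        if (tlb_lookup(0xDEAD123, &pa))
            printf("hit: 0x%llx\n", (unsigned long long)pa);  /* 0x42123 */
        return 0;
    }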
Designing systems for emerging memory technologies
Emerging memory technologies open new challenges in system software: diversity and large capacity.
Non-volatile memory (NVM) technologies will offer excellent performance, byte-addressability, and large capacity, blurring the line between traditional volatile DRAM and non-volatile storage. NVM diverges from DRAM in significant ways, such as limited write bandwidth. The future storage market is likely to be diversified, comprising DRAM, NVM, SSD, and hard disks. Unfortunately, current file systems, built on old design ideas, cannot efficiently take advantage of these different storage media. Strata is a cross-media file system that fundamentally redesigns the file system to leverage the different strengths of each storage technology while compensating for their weaknesses.
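A sketch of the cross-media write path as described above (illustrative, not Strata's actual API): small synchronous writes are absorbed by a fast NVM log and later digested to capacity-optimized media in large sequential chunks.

    #include <stdio.h>
    #include <string.h>

    #define NVM_LOG_SIZE 4096

    static char nvm_log[NVM_LOG_SIZE];  /* stand-in for an NVM region */
    static size_t log_tail;

    /* Fast path: NVM absorbs small synchronous writes at low latency. */
    int log_write(const char *buf, size_t len) {
        if (log_tail + len > NVM_LOG_SIZE) return -1;  /* full: digest first */
        memcpy(nvm_log + log_tail, buf, len);
        log_tail += len;
        return 0;
    }

    /* Background path: drain the log to bulk media in one large sequential
     * write, which SSDs handle far better than many small ones. */
    void digest_to_ssd(FILE *ssd) {
        fwrite(nvm_log, 1, log_tail, ssd);
        log_tail = 0;
    }

    int main(void) {
        log_write("hello", 5);
        log_write(" world", 6);
        FILE *ssd = tmpfile();  /* stand-in for the SSD layer */
        if (!ssd) return 1;
        printf("log holds %zu bytes\n", log_tail);
        digest_to_ssd(ssd);
        printf("log holds %zu bytes after digest\n", log_tail);
        fclose(ssd);
        return 0;
    }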
Modern applications such as large-scale machine learning and graph analytics load huge datasets into memory for fast computation. For these workloads, merely adding more RAM to a machine reaches a point of diminishing returns for performance because their poor spatial locality causes them to suffer high virtual-to-physical address translation costs. NVM will make this problem worse because it provides cheaper cost-per-capacity than DRAM. Ingens, an efficient memory management system, addresses the shortcomings in modern operating systems and hypervisors that underlie these excessive address translation overheads, redesigning huge page memory management to make huge pages widely usable in practice.
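A sketch of utilization-based huge page promotion in the spirit of Ingens (the threshold and bitmap bookkeeping are illustrative assumptions): a 2MB region is promoted only once enough of its 4KB pages are actually in use, avoiding memory bloat from sparsely used regions.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGES_PER_HUGE 512  /* 2MB / 4KB */
    #define PROMOTE_THRESH 460  /* ~90% utilization (hypothetical cutoff) */

    struct region {             /* one 2MB-aligned virtual region */
        uint64_t present[PAGES_PER_HUGE / 64];  /* bitmap of faulted pages */
    };

    /* Promote to a huge page only when the region is densely used. */
    bool should_promote(const struct region *r) {
        int used = 0;
        for (int i = 0; i < PAGES_PER_HUGE / 64; i++)
            used += __builtin_popcountll(r->present[i]);
        return used >= PROMOTE_THRESH;
    }

    int main(void) {
        struct region r = {{0}};
        r.present[0] = ~0ull;  /* only 64 of 512 pages used */
        printf("sparse region: %s\n", should_promote(&r) ? "promote" : "wait");
        for (int i = 0; i < PAGES_PER_HUGE / 64; i++) r.present[i] = ~0ull;
        printf("dense region:  %s\n", should_promote(&r) ? "promote" : "wait");
        return 0;
    }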
A Survey on the Integration of NAND Flash Storage in the Design of File Systems and the Host Storage Software Stack
With the ever-increasing amount of data generated in the world, estimated to reach over 200 zettabytes by 2025, pressure on efficient data storage systems is intensifying. The shift from HDD to flash-based SSD represents one of the most fundamental shifts in storage technology, increasing performance capabilities significantly. However, flash storage has different characteristics from prior HDD technology, so existing storage software was ill-suited to leverage its capabilities. As a result, a plethora of storage applications have been designed to better integrate with flash storage and align with flash characteristics.
In this literature study, we evaluate the effect the introduction of flash storage has had on the design of file systems, which provide one of the most essential mechanisms for managing persistent storage. We analyze the mechanisms for effectively managing flash storage, managing the overheads of the newly introduced design requirements, and leveraging the capabilities of flash storage. Numerous methods have been adopted in file systems; however, they prominently revolve around similar design decisions: adhering to flash hardware constraints and limiting software intervention. The future design of storage software remains an active area, as the constant growth in flash-based storage devices and interfaces provides increasing opportunities to enhance flash integration in the host storage software stack.
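A minimal sketch of the design constraint these file systems share (numbers illustrative): flash pages within an erase block must be programmed sequentially and cannot be overwritten until the whole block is erased, so allocation is strictly append-only.

    #include <stdio.h>

    #define PAGES_PER_BLOCK 64  /* flash pages per erase block (illustrative) */

    struct erase_block {
        int write_ptr;  /* next page to program; earlier pages are immutable */
    };

    /* Flash pages are programmed in order and cannot be rewritten until the
     * whole block is erased, so allocation is strictly append-only. */
    int alloc_flash_page(struct erase_block *b) {
        if (b->write_ptr == PAGES_PER_BLOCK)
            return -1;  /* block full: pick another block, erase this later */
        return b->write_ptr++;
    }

    int main(void) {
        struct erase_block blk = {0};
        for (int i = 0; i < 3; i++)
            printf("programmed page %d\n", alloc_flash_page(&blk));
        return 0;
    }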