
    Improving address translation performance in virtualized multi-tenant systems

    With the explosive growth in dataset sizes, application memory footprints commonly reach hundreds of GBs. Such huge datasets pressure the TLBs, resulting in frequent misses that must be resolved through a page walk – a long-latency pointer chase through multiple levels of the in-memory radix-tree-based page table. Page walk latency is particularly high under virtualization, where address translation requires traversing two radix-tree page tables in a process called a nested page walk, performing up to 24 memory accesses. Page walk latency can also be amplified by the colocation of applications on the same server, a practice commonly used to increase utilization. Under colocation, cache contention makes cache misses during a nested page walk more frequent, further inflating page walk latency. Both virtualization and colocation are widely adopted in cloud platforms such as Amazon Web Services and Google Cloud Engine. As a result, in cloud environments, page walk latency can reach hundreds of cycles, significantly reducing overall application performance. This thesis addresses the problem of high page walk latency by (1) identifying the sources of high page walk latency under virtualization and/or colocation, and (2) proposing hardware and software techniques that accelerate page walks by means of new memory allocation strategies for the page table and data, which can be easily adopted by existing systems. Firstly, we quantify how dataset size growth, virtualization, and colocation affect page walk latency. We also study how high page walk latency affects performance. Due to the lack of dedicated tools for evaluating address translation overhead on modern processors, we design a methodology to vary the page walk latency experienced by an application running on real hardware. To quantify the performance impact of address translation, we measure the application’s execution time while varying the page walk latency. We find that under virtualization, address translation considerably limits performance: an application can waste up to 68% of execution time due to stalls originating from page walks. In addition, we investigate which accesses from a nested page walk are most significant for the overall page walk latency by examining from where in the memory hierarchy these accesses are served. We find that accesses to the deeper levels of the page table radix tree are responsible for most of the overall page walk latency. Based on these observations, we introduce two address translation acceleration techniques that can be applied to any ISA that employs radix-tree page tables and nested page walks. The first of these techniques is Prefetched Address Translation (ASAP), a new software-hardware approach for mitigating the high page walk latency caused by virtualization and/or application colocation. At the heart of ASAP is a lightweight technique for directly indexing individual levels of the page table radix tree. Direct indexing enables ASAP to fetch nodes from deeper levels of the page table without first accessing the preceding levels, thus lowering the page walk latency. ASAP is fully compatible with the existing radix-tree-based page table and requires only incremental and isolated changes to the memory subsystem.
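    As background for the direct-indexing idea, here is a minimal sketch (standard x86-64 4-level paging, shown purely for illustration and not ASAP’s hardware) of how the radix page table is indexed: each level consumes nine bits of the virtual address, and a conventional walker performs one memory access per level, which is why a nested walk under virtualization can reach up to 24 accesses.

    /* Minimal sketch: per-level indices of the x86-64 4-level radix page table.
     * A conventional hardware walker visits PML4 -> PDPT -> PD -> PT, one memory
     * access per level; under virtualization each guest-level access additionally
     * needs a host page walk, which is how a nested walk reaches up to 24 accesses. */
    #include <stdint.h>
    #include <stdio.h>

    static unsigned level_index(uint64_t vaddr, int level)
    {
        /* level 3 = PML4 (bits 47..39), ..., level 0 = PT (bits 20..12) */
        return (unsigned)((vaddr >> (12 + 9 * level)) & 0x1FF);
    }

    int main(void)
    {
        uint64_t vaddr = 0x00007f1234567000ULL;   /* example user-space address */

        printf("PML4 index: %u\n", level_index(vaddr, 3));
        printf("PDPT index: %u\n", level_index(vaddr, 2));
        printf("PD   index: %u\n", level_index(vaddr, 1));
        printf("PT   index: %u\n", level_index(vaddr, 0));
        printf("page offset: %llu\n", (unsigned long long)(vaddr & 0xFFF));
        return 0;
    }

    The per-level indices are available from the virtual address alone; what a scheme like ASAP adds is a way to turn them into the locations of deeper page-table nodes without first visiting the upper levels.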
    The second technique is PTEMagnet, a new software-only approach for reducing address translation latency under virtualization and application colocation. We first identify a new address translation bottleneck caused by memory fragmentation stemming from the interaction of virtualization, application colocation, and the Linux memory allocator. The fragmentation results in the effective cache footprint of the host page table being larger than that of the guest page table. The bloated footprint of the host page table leads to frequent cache misses during nested page walks, increasing page walk latency. In response to these observations, we propose PTEMagnet. PTEMagnet prevents memory fragmentation through fine-grained, reservation-based memory allocation in the guest OS. PTEMagnet is fully legacy-preserving, requiring no modifications to either user code or the mechanisms for address translation and virtualization. In summary, this thesis proposes non-disruptive upgrades to the virtual memory subsystem for reducing page walk latency in virtualized deployments. In doing so, this thesis evaluates the impact of page walk latency on application performance, identifies the bottlenecks of the existing address translation mechanism caused by virtualization, application colocation, and the Linux memory allocator, and proposes software-hardware and software-only solutions for eliminating those bottlenecks.
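    To make the reservation idea concrete, below is a minimal model in C of fine-grained, reservation-based allocation; the group size, bump allocator, and data structures are illustrative assumptions, not the actual Linux or PTEMagnet implementation. The first fault into an aligned group of guest pages reserves a contiguous run of frames, and later faults in that group are served from the reservation, so the corresponding page table entries stay packed into few cache lines.

    /* Illustrative model of fine-grained reservation-based allocation
     * (assumed parameters; not the PTEMagnet/Linux code). */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define GROUP_PAGES 8      /* reservation granularity, assumed */
    #define MAX_GROUPS  128    /* vpn assumed < MAX_GROUPS * GROUP_PAGES */

    static int next_free_frame = 0;          /* trivial bump allocator over frames */
    static int group_base[MAX_GROUPS];       /* group -> first reserved frame, -1 if none */

    static int alloc_frame(uint64_t vpn)
    {
        uint64_t group = vpn / GROUP_PAGES;
        if (group_base[group] < 0) {
            /* first fault in this group: reserve GROUP_PAGES contiguous frames */
            group_base[group] = next_free_frame;
            next_free_frame += GROUP_PAGES;
        }
        /* later faults in the group land inside the same reserved run */
        return group_base[group] + (int)(vpn % GROUP_PAGES);
    }

    int main(void)
    {
        memset(group_base, -1, sizeof(group_base));
        /* faults arrive out of order, yet each group's frames stay contiguous */
        uint64_t faults[] = { 3, 100, 0, 101, 7, 102 };
        for (size_t i = 0; i < sizeof(faults) / sizeof(faults[0]); i++)
            printf("vpn %3llu -> frame %d\n",
                   (unsigned long long)faults[i], alloc_frame(faults[i]));
        return 0;
    }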

    Page size aware cache prefetching

    The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial cache prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection to 4KB physical page boundaries, even though modern systems use page sizes larger than 4KB to mitigate address translation overheads. This paper exploits the high usage of large pages in modern systems to increase the effectiveness of spatial cache prefetching. We design and propose the Page-size Propagation Module (PPM), a µarchitectural scheme that propagates the page size information to the lower-level cache prefetchers, enabling safe prefetching beyond 4KB physical page boundaries when the accessed blocks reside in large pages, at the cost of augmenting the first-level caches’ Miss Status Holding Register (MSHR) entries with one additional bit. PPM is compatible with any cache prefetcher without requiring design modifications. We capitalize on PPM’s benefits by designing a module that consists of two page size aware prefetchers that inherently use different page sizes to drive prefetching. The composite module uses adaptive logic to dynamically enable the most appropriate page size aware prefetcher. Finally, we show that the proposed designs are transparent to which cache prefetcher is used. We apply the proposed page size exploitation techniques to four state-of-the-art spatial cache prefetchers. Our evaluation shows that our proposals improve single-core geomean performance by up to 8.1% (2.1% at minimum) over the original implementation of the considered prefetchers, across 80 memory-intensive workloads. In multi-core contexts, we report geomean speedups of up to 7.7% across different cache prefetchers and core configurations. This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the European Union Horizon 2020 research and innovation program under grant agreement No 955606 (DEEP-SEA EU project), the National Science Foundation through grants CNS-1938064 and CCF-1912617, and the Semiconductor Research Corporation project GRC 2936.001. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry, and Competitiveness and the European Social Fund under the FPI fellowship No. PRE2018-087046. Marc Casas has been partially supported by the Grant RYC2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF ‘Investing in your future’.
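    To illustrate the page-boundary check a page size aware prefetcher relies on, the sketch below only allows a prefetch candidate that stays inside the same physical page as the triggering access; with a propagated large-page indication (the one extra MSHR bit PPM adds), the allowed window grows from 4KB to 2MB. The page sizes are the x86-64 defaults and the function is an illustrative assumption, not PPM’s actual logic.

    /* Illustrative page-size-aware boundary check (assumed logic, not PPM RTL). */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool same_page(uint64_t trigger_pa, uint64_t candidate_pa, bool large_page)
    {
        uint64_t page_size = large_page ? 2ULL * 1024 * 1024 : 4096ULL;  /* 2MB vs 4KB */
        return (trigger_pa / page_size) == (candidate_pa / page_size);
    }

    int main(void)
    {
        uint64_t trigger   = 0x40000FC0;     /* just below a 4KB page boundary */
        uint64_t candidate = trigger + 128;  /* crosses into the next 4KB page */

        printf("4KB page: prefetch %s\n", same_page(trigger, candidate, false) ? "allowed" : "dropped");
        printf("2MB page: prefetch %s\n", same_page(trigger, candidate, true)  ? "allowed" : "dropped");
        return 0;
    }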

    Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings

    Conventional virtual memory (VM) frameworks enable a virtual address to flexibly map to any physical address. This flexibility necessitates large data structures to store virtual-to-physical mappings, which leads to high address translation latency and large translation-induced interference in the memory hierarchy. On the other hand, restricting the address mapping so that a virtual address can only map to a specific set of physical addresses can significantly reduce address translation overheads by using compact and efficient translation structures. However, restricting the address mapping flexibility across the entire main memory severely limits data sharing across different processes and increases data accesses to the swap space of the storage device, even in the presence of free memory. We propose Utopia, a new hybrid virtual-to-physical address mapping scheme that allows both flexible and restrictive hash-based address mapping schemes to harmoniously co-exist in the system. The key idea of Utopia is to manage physical memory using two types of physical memory segments: restrictive and flexible segments. A restrictive segment uses a restrictive, hash-based address mapping scheme that maps virtual addresses to only a specific set of physical addresses and enables faster address translation using compact translation structures. A flexible segment employs the conventional fully-flexible address mapping scheme. By mapping data to a restrictive segment, Utopia enables faster address translation with lower translation-induced interference. Utopia improves performance by 24% in a single-core system over the baseline system, whereas the best prior state-of-the-art contiguity-aware translation scheme improves performance by 13%. To appear in the 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023.
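    As a rough illustration of the restrictive, hash-based side of such a scheme, the sketch below constrains each virtual page to a small, hash-selected set of candidate physical frames, much as a set-associative cache constrains block placement, so a translation can be located with a compact lookup instead of a full radix walk. The set count, associativity, and hash function are illustrative assumptions, not Utopia’s actual parameters.

    /* Illustrative restrictive (hash-based) mapping: a VPN may only map to a
     * small set of frames selected by a hash (assumed parameters). */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SETS  (1u << 16)   /* number of frame sets in the restrictive segment */
    #define SET_WAYS  4            /* each VPN may occupy one of 4 candidate frames   */

    static uint32_t vpn_hash(uint64_t vpn)
    {
        /* simple mixing hash over the VPN; a real design would differ */
        vpn ^= vpn >> 21;
        vpn *= 0x9E3779B97F4A7C15ULL;
        return (uint32_t)(vpn >> 32) & (NUM_SETS - 1);
    }

    int main(void)
    {
        uint64_t vpn = 0x00007f1234567000ULL >> 12;   /* VPN of an example address */
        uint32_t set = vpn_hash(vpn);

        printf("VPN 0x%llx may only map to frames:", (unsigned long long)vpn);
        for (int way = 0; way < SET_WAYS; way++)
            printf(" %u", set * SET_WAYS + way);
        printf("\n");
        return 0;
    }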

    Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

    Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments, Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU. To appear in the 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023.
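    To sketch what a TLB-aware replacement decision could look like (an illustrative policy and threshold, not Victima’s exact algorithm), the code below protects cache blocks that hold TLB-entry clusters from eviction only while translation pressure, approximated here by the last-level TLB miss rate, is high.

    /* Illustrative TLB-aware victim selection (assumed threshold and scoring). */
    #include <stdbool.h>
    #include <stdio.h>

    struct block {
        bool holds_tlb_entries;   /* repurposed L2 block storing a TLB-entry cluster */
        int  age;                 /* LRU-style age; larger = older                   */
    };

    /* Returns the way to evict from a set of `ways` blocks. */
    static int pick_victim(const struct block *set, int ways, double stlb_miss_rate)
    {
        const double pressure_threshold = 0.05;   /* assumed */
        int victim = 0, best_score = -1000;

        for (int i = 0; i < ways; i++) {
            int score = set[i].age;
            /* under high translation pressure, make TLB-holding blocks look
             * "younger" so ordinary data blocks are evicted first */
            if (set[i].holds_tlb_entries && stlb_miss_rate > pressure_threshold)
                score -= 100;
            if (score > best_score) {
                best_score = score;
                victim = i;
            }
        }
        return victim;
    }

    int main(void)
    {
        struct block set[4] = {
            { .holds_tlb_entries = true,  .age = 9 },
            { .holds_tlb_entries = false, .age = 7 },
            { .holds_tlb_entries = false, .age = 3 },
            { .holds_tlb_entries = false, .age = 1 },
        };
        printf("low  pressure -> evict way %d\n", pick_victim(set, 4, 0.01));
        printf("high pressure -> evict way %d\n", pick_victim(set, 4, 0.20));
        return 0;
    }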

    A Survey of Techniques for Architecting TLBs

    A “translation lookaside buffer” (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.

    A Survey on the Integration of NAND Flash Storage in the Design of File Systems and the Host Storage Software Stack

    With the ever-increasing amount of data generated in the world, estimated to reach over 200 Zettabytes by 2025, pressure on efficient data storage systems is intensifying. The shift from HDDs to flash-based SSDs represents one of the most fundamental changes in storage technology, increasing performance capabilities significantly. However, flash storage comes with different characteristics than prior HDD storage technology, and existing storage software was therefore unsuitable for leveraging the capabilities of flash storage. As a result, a plethora of storage applications have been designed to better integrate with flash storage and align with flash characteristics. In this literature study we evaluate the effect the introduction of flash storage has had on the design of file systems, which provide one of the most essential mechanisms for managing persistent storage. We analyze the mechanisms for effectively managing flash storage, managing the overheads of the introduced design requirements, and leveraging the capabilities of flash storage. Numerous methods have been adopted in file systems; however, they prominently revolve around similar design decisions: adhering to flash hardware constraints and limiting software intervention. The design of storage software remains a prominent topic, as the constant growth in flash-based storage devices and interfaces provides increasing opportunities to enhance flash integration in the host storage software stack.