
    Near-Memory Address Translation

    Memory and logic integration on the same chip is becoming increasingly cost-effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that limiting the associativity of the virtual-to-physical mapping incurs no penalty and, when combined with careful data placement in the MPU's memory, breaks the translate-then-fetch serialization, allowing translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages, respectively.
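
    To make the parallel translate-and-fetch idea concrete, below is a minimal sketch of a set-associative virtual-to-physical mapping with a per-partition inverted table, in the spirit of DIPTA. The constants, structure fields, and function names are illustrative assumptions, not the paper's actual design.

```c
/*
 * Minimal sketch of a set-associative virtual-to-physical mapping with a
 * per-partition inverted table, in the spirit of DIPTA.  The constants,
 * structure fields, and function names are illustrative assumptions.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12          /* 4 KB pages              */
#define WAYS       4           /* associativity limit     */
#define SETS       (1u << 14)  /* toy size: frames / WAYS */

/* One inverted-page-table entry per physical frame, stored in the same
 * memory partition as the frame it describes. */
struct dipta_entry {
    uint64_t vpn;    /* virtual page number currently mapped here */
    uint16_t asid;   /* address-space identifier                  */
    bool     valid;
};

static struct dipta_entry ipt[SETS][WAYS];

/* With limited associativity, the candidate frames for a virtual address
 * are fixed by its VPN bits, so the MPU can start fetching from all WAYS
 * candidates while the lookup below resolves which one is correct. */
static inline uint64_t set_of(uint64_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) % SETS;
}

/* Resolve the exact frame; this can run in parallel with the data fetch. */
static long dipta_lookup(uint64_t vaddr, uint16_t asid)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint64_t set = set_of(vaddr);

    for (unsigned way = 0; way < WAYS; way++) {
        struct dipta_entry *e = &ipt[set][way];
        if (e->valid && e->vpn == vpn && e->asid == asid)
            return (long)(set * WAYS + way);   /* physical frame number */
    }
    return -1;                                 /* not resident: fault   */
}

int main(void)
{
    uint64_t vaddr = 0x7f1234567000ULL;
    ipt[set_of(vaddr)][1] = (struct dipta_entry){
        .vpn = vaddr >> PAGE_SHIFT, .asid = 42, .valid = true };
    printf("frame = %ld\n", dipta_lookup(vaddr, 42));
    return 0;
}
```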

    The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework

    Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework, the Virtual Block Interface (VBI). We design VBI based on the key idea that delegating memory management duties to hardware can reduce the overheads and software complexity associated with virtual memory. VBI introduces a set of variable-sized virtual blocks (VBs) to applications. Each VB is a contiguous region of the globally-visible VBI address space, and an application can allocate each semantically meaningful unit of information (e.g., a data structure) in a separate VB. VBI decouples access protection from memory allocation and address translation. While the OS controls which programs have access to which VBs, dedicated hardware in the memory controller manages the physical memory allocation and address translation of the VBs. This approach enables several architectural optimizations to (1) efficiently and flexibly cater to different and increasingly diverse system configurations, and (2) eliminate key inefficiencies of conventional virtual memory. We demonstrate the benefits of VBI with two important use cases: (1) reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and (2) two heterogeneous main memory architectures, where VBI increases the effectiveness of managing fast memory regions. For both cases, VBI significantly improves performance over conventional virtual memory.
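
    As a rough illustration of the decoupling described above, the hypothetical sketch below models the memory-controller side of VBI as one descriptor per virtual block, so translating an access reduces to a table lookup plus bounds and protection checks. All names, fields, and sizes are assumptions for exposition, not the paper's actual interface.

```c
/*
 * Hypothetical sketch of per-virtual-block bookkeeping in the spirit of
 * VBI.  The descriptor layout, table size, and function names are
 * assumptions for exposition only.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_VBS 1024

struct vb_desc {
    uint64_t base_pa;  /* physical base chosen by the memory controller */
    uint64_t size;     /* block length in bytes (variable per VB)       */
    uint8_t  prot;     /* access bits granted by the OS                 */
    bool     valid;
};

static struct vb_desc vb_table[MAX_VBS];

/* A VBI address names a block and an offset within it; the hardware-side
 * translation is a single descriptor lookup plus bounds/protection checks. */
static long vb_translate(uint32_t vb_id, uint64_t offset, uint8_t access)
{
    if (vb_id >= MAX_VBS || !vb_table[vb_id].valid)
        return -1;                             /* no such block        */
    struct vb_desc *d = &vb_table[vb_id];
    if (offset >= d->size || (d->prot & access) != access)
        return -1;                             /* bounds or protection */
    return (long)(d->base_pa + offset);
}

int main(void)
{
    /* The OS grants read/write access; the controller picks the base. */
    vb_table[7] = (struct vb_desc){ .base_pa = 0x100000, .size = 4096,
                                    .prot = 0x3, .valid = true };
    printf("%ld\n", vb_translate(7, 128, 0x1));  /* prints 1048704 */
    return 0;
}
```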

    Useful and efficient huge page management on the Linux kernel

    Modern workloads consume vast amounts of memory, leading the computer hardware industry to manufacture memories of ever-growing size. This increased memory consumption entails an increase in virtual-to-physical address translations, which pass through the translation lookaside buffer (TLB), a part of the processor's memory management unit with finite capacity. More translations mean more TLB misses, which severely hurt workload performance, and several solutions have been proposed to overcome this problem. One of the most promising ideas was hardware support for pages larger than those used so far (huge pages), aiming to dramatically reduce the number of address translation misses. Although huge page support was added to processors in the 1990s, only recent processors provide thousands of TLB entries for huge pages. On the software side, current huge page management techniques introduce additional overheads and increase memory footprint; the recent addition of thousands of TLB translation entries makes it imperative to develop sophisticated software techniques that handle huge pages more efficiently. In this thesis we present the benefits and costs associated with the use of huge pages and show how they can be used effectively. Based on these observations, we design a memory management mechanism for huge page support in the Linux kernel, built on a simple memory-usage tracking mechanism and a novel memory compaction algorithm. The evaluation of our system indicates that it effectively tackles the problems associated with the use of huge pages while preserving the benefits they offer.
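
    A utilization-based promotion check of the kind described above might look like the sketch below: a 2 MB region is backed by a huge page only once enough of its 4 KB pages are resident and recently accessed, limiting memory bloat. The thresholds, fields, and names are illustrative assumptions, not the thesis implementation.

```c
/*
 * Illustrative sketch of a utilization-based huge-page promotion check.
 * Thresholds, fields, and names are assumptions, not the thesis code.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SUBPAGES_PER_HUGE 512   /* 2 MB region / 4 KB base pages   */
#define PROMOTE_THRESHOLD 460   /* ~90% of subpages must be mapped */

struct region_stats {
    uint16_t resident;   /* 4 KB pages currently mapped in the region     */
    uint16_t accessed;   /* pages seen with the accessed bit set recently */
};

/* Promote only hot, well-utilized regions; sparse or cold regions stay as
 * base pages (or become candidates for compaction instead). */
static bool should_promote(const struct region_stats *r)
{
    return r->resident >= PROMOTE_THRESHOLD &&
           r->accessed >= r->resident / 2;
}

int main(void)
{
    struct region_stats hot  = { .resident = 500, .accessed = 480 };
    struct region_stats cold = { .resident = 120, .accessed = 10  };
    printf("promote hot: %d, promote cold: %d\n",
           should_promote(&hot), should_promote(&cold));
    return 0;
}
```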

    Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

    Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power, and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments, Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU.
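
    The sketch below illustrates what a TLB-aware victim choice could look like, in the spirit of Victima's replacement policy: when the last-level TLB miss rate is high, ordinary data blocks are preferred as victims over L2 blocks repurposed to hold TLB-entry clusters. The thresholds, fields, and names are assumptions for exposition, not Victima's actual implementation.

```c
/*
 * Illustrative sketch of a TLB-aware victim choice inside one L2 set.
 * Thresholds, fields, and names are assumptions, not Victima's design.
 */
#include <stdbool.h>
#include <stdio.h>

#define WAYS 8

struct l2_block {
    bool holds_tlb_entries;  /* block repurposed to cache a TLB-entry cluster   */
    int  rrpv;               /* re-reference prediction value (higher = colder) */
};

/* Under high translation pressure, evict TLB-holding blocks only if every
 * block in the set holds TLB entries. */
static int pick_victim(const struct l2_block set[WAYS], double stlb_miss_rate)
{
    bool protect_tlb = stlb_miss_rate > 0.10;  /* translation pressure */
    int victim = -1, worst = -1;

    for (int w = 0; w < WAYS; w++) {
        if (protect_tlb && set[w].holds_tlb_entries)
            continue;
        if (set[w].rrpv > worst) { worst = set[w].rrpv; victim = w; }
    }
    if (victim < 0)                            /* all ways hold TLB data */
        for (int w = 0; w < WAYS; w++)
            if (set[w].rrpv > worst) { worst = set[w].rrpv; victim = w; }
    return victim;
}

int main(void)
{
    struct l2_block set[WAYS] = {
        { true, 3 }, { false, 2 }, { false, 3 }, { true, 1 },
        { false, 0 }, { false, 1 }, { true, 2 }, { false, 2 },
    };
    printf("victim way = %d\n", pick_victim(set, 0.25));  /* prints 2 */
    return 0;
}
```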