Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems runs into virtual memory as
the first major obstacle: without efficient hardware support for address
translation, MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that while limiting the associativity of the
virtual-to-physical mapping incurs no penalty, it can break the
translate-then-fetch serialization if combined with careful data placement in
the MPU's memory, allowing for translation and data fetch to proceed
independently and in parallel. We propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages, respectively.
Comment: 15 pages, 9 figures
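The core observation above can be illustrated with a toy model (parameters here are illustrative, not the paper's; DIPTA additionally distributes the inverted table across memory partitions so each partition holds the translations for its own data): once the virtual-to-physical mapping is set-associative, the virtual page number alone determines the small set of candidate frames, so data fetch for all ways can start in parallel with translation instead of waiting for it.

```python
# Toy model of limited-associativity virtual-to-physical mapping
# (hypothetical parameters: 4-way associative, 256 sets).
WAYS = 4
SETS = 256

def candidate_frames(vpn):
    """The VPN fixes the set, so the (at most WAYS) candidate frames
    are known before translation completes -- the memory system can
    fetch all ways in parallel with the translation lookup."""
    s = vpn % SETS
    return [s * WAYS + w for w in range(WAYS)]

# Inverted table: one VPN tag per frame, kept next to the data it maps.
inverted = {}

def map_page(vpn):
    for f in candidate_frames(vpn):
        if f not in inverted:
            inverted[f] = vpn
            return f
    raise MemoryError("set full: must evict within the set")

def translate(vpn):
    for f in candidate_frames(vpn):
        if inverted.get(f) == vpn:
            return f  # way selection, analogous to a cache-tag match
    return None  # not mapped

f = map_page(42)
print(translate(42) == f)
```

Translation here degenerates to a tag comparison against the ways of one set, which is why it can complete together with the data fetch rather than before it.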
The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
Computers continue to diversify with respect to system designs, emerging
memory technologies, and application memory demands. Unfortunately, continually
adapting the conventional virtual memory framework to each possible system
configuration is challenging, and often results in performance loss or requires
non-trivial workarounds. To address these challenges, we propose a new virtual
memory framework, the Virtual Block Interface (VBI). We design VBI based on the
key idea that delegating memory management duties to hardware can reduce the
overheads and software complexity associated with virtual memory. VBI
introduces a set of variable-sized virtual blocks (VBs) to applications. Each
VB is a contiguous region of the globally-visible VBI address space, and an
application can allocate each semantically meaningful unit of information
(e.g., a data structure) in a separate VB. VBI decouples access protection from
memory allocation and address translation. While the OS controls which programs
have access to which VBs, dedicated hardware in the memory controller manages
the physical memory allocation and address translation of the VBs. This
approach enables several architectural optimizations to (1) efficiently and
flexibly cater to different and increasingly diverse system configurations, and
(2) eliminate key inefficiencies of conventional virtual memory. We demonstrate
the benefits of VBI with two important use cases: (1) reducing the overheads of
address translation (for both native execution and virtual machine
environments), as VBI reduces the number of translation requests and associated
memory accesses; and (2) two heterogeneous main memory architectures, where VBI
increases the effectiveness of managing fast memory regions. For both cases,
VBI significantly improves performance over conventional virtual memory.
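The division of labor described above can be sketched in a few lines (class and method names here are illustrative, not the paper's interface): the OS only records which process may access which virtual block, while a model of the memory controller owns per-VB physical allocation and address translation.

```python
# Minimal sketch of the VBI split between OS and memory controller.
class VB:
    """A variable-sized virtual block in the global VBI address space."""
    def __init__(self, vb_id, size):
        self.vb_id, self.size = vb_id, size

class OSKernel:
    """The OS only manages access protection, not translation."""
    def __init__(self):
        self.acl = {}  # pid -> set of VB ids the process may access
    def grant(self, pid, vb):
        self.acl.setdefault(pid, set()).add(vb.vb_id)
    def may_access(self, pid, vb_id):
        return vb_id in self.acl.get(pid, set())

class MemController:
    """Hardware-side model: allocates frames and translates VB offsets."""
    PAGE = 4096
    def __init__(self):
        self.next_frame = 0
        self.base = {}  # vb_id -> first frame (toy contiguous allocation)
    def allocate(self, vb):
        frames = -(-vb.size // self.PAGE)  # ceiling division
        self.base[vb.vb_id] = self.next_frame
        self.next_frame += frames
    def translate(self, vb_id, offset):
        return self.base[vb_id] * self.PAGE + offset

kernel, mc = OSKernel(), MemController()
vb = VB(7, 3 * 4096)
kernel.grant(pid=1234, vb=vb)
mc.allocate(vb)
print(kernel.may_access(1234, vb.vb_id), mc.translate(vb.vb_id, 100))
```

Because protection checks and translation live in different components, the controller is free to change how a VB is physically laid out (e.g., in a fast memory region) without any OS involvement.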
Useful and efficient huge page management on the Linux kernel
Modern workloads consume a vast amount of memory, leading the computer hardware industry to manufacture memories of ever-growing sizes. This increased memory consumption entails an increase in virtual-to-physical address translations, which pass through the translation lookaside buffer (TLB), a finite-size structure in the processor's memory management unit. Increased TLB translations, and consequently increased translation misses, hurt workloads' performance, and multiple solutions have been proposed to overcome this problem. A very promising idea was to add hardware support for pages of larger sizes (huge pages), in order to dramatically reduce the number of address translation misses. While huge page support was first introduced in processors in the 1990s, only recent processors provide thousands of TLB entries for huge pages. At the software level, modern huge page management techniques lead to increased memory footprint and additional overheads; the recent addition of thousands of TLB entries makes it imperative to develop sophisticated software techniques that use huge pages efficiently.
In this thesis we demonstrate the benefits and drawbacks of using huge pages and show how they can be used effectively. Based on these findings, we implement a framework for huge page support in the Linux kernel, which relies on a simple memory-usage tracking mechanism and a novel memory compaction algorithm. The evaluation of our system indicates that it effectively tackles the problems associated with the use of huge pages, while maintaining the benefits they offer.
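The huge page support discussed above is exposed to user space today through the Linux transparent huge page (THP) hint; a minimal sketch of requesting it (Linux-only, and the kernel is free to decline the hint, e.g., when THP is disabled or no contiguous 2 MiB region is available):

```python
import mmap

# Allocate a 4 MiB anonymous mapping and hint the kernel to back it
# with transparent huge pages. MADV_HUGEPAGE is Linux-specific, so we
# guard the call for portability.
SIZE = 4 * 1024 * 1024
buf = mmap.mmap(-1, SIZE)  # anonymous, private mapping

if hasattr(mmap, "MADV_HUGEPAGE"):
    buf.madvise(mmap.MADV_HUGEPAGE)  # advisory only; kernel may refuse

# Touch the pages so the kernel actually allocates (and may promote)
# them; untouched anonymous memory is not yet backed by frames.
buf[:] = b"\x00" * SIZE
buf[0] = 1
print(buf[0], len(buf))
```

With a 2 MiB huge page, a single TLB entry covers 512 base pages, which is exactly the reduction in translation pressure the thesis targets.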
Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources
Address translation is a performance bottleneck in data-intensive workloads
due to large datasets and irregular access patterns that lead to frequent
high-latency page table walks (PTWs). PTWs can be reduced by using (i) large
hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both
solutions have significant drawbacks: increased access latency, power and area
(for hardware TLBs), and costly memory accesses, the need for large contiguous
memory blocks, and complex OS modifications (for software-managed TLBs). We
present Victima, a new software-transparent mechanism that drastically
increases the translation reach of the processor by leveraging the
underutilized resources of the cache hierarchy. The key idea of Victima is to
repurpose L2 cache blocks to store clusters of TLB entries, thereby providing
an additional low-latency and high-capacity component that backs up the
last-level TLB and thus reduces PTWs. Victima has two main components. First, a
PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on
the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache
replacement policy prioritizes keeping TLB entries in the cache hierarchy by
considering (i) the translation pressure (e.g., last-level TLB miss rate) and
(ii) the reuse characteristics of the TLB entries. Our evaluation results show
that in native (virtualized) execution environments Victima improves average
end-to-end application performance by 7.4% (28.7%) over the baseline four-level
radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art
software-managed TLB, across 11 diverse data-intensive workloads. Victima (i)
is effective in both native and virtualized environments, (ii) is completely
transparent to application and system software, and (iii) incurs very small
area and power overheads on a modern high-end CPU.
Comment: To appear in the 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023
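The TLB-aware replacement idea can be sketched with a toy L2 set in which a block holds either ordinary data or a cluster of TLB entries (parameters and the pressure threshold are illustrative; Victima's real policy additionally consults the PTW cost predictor before caching a translation):

```python
import collections

class L2Set:
    """Toy L2 cache set whose blocks hold either data or TLB clusters."""
    def __init__(self, ways=8):
        self.ways = ways
        self.blocks = collections.OrderedDict()  # tag -> kind, LRU order

    def insert(self, tag, kind, tlb_pressure):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # hit: refresh LRU position
            return
        if len(self.blocks) == self.ways:
            # Default victim: plain LRU.
            victim = next(iter(self.blocks))
            # TLB-aware twist: under high last-level TLB miss rate,
            # prefer evicting a data block so TLB clusters survive.
            if tlb_pressure > 0.5:
                for t, k in self.blocks.items():
                    if k == "data":
                        victim = t
                        break
            del self.blocks[victim]
        self.blocks[tag] = kind

s = L2Set(ways=2)
s.insert(0x1, "tlb", tlb_pressure=0.9)
s.insert(0x2, "data", tlb_pressure=0.9)
s.insert(0x3, "data", tlb_pressure=0.9)  # evicts the data block, not TLB
print(sorted(s.blocks.values()))
```

Under low translation pressure the policy degrades to plain LRU, which is why repurposing otherwise underutilized L2 capacity costs data-side performance so little.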