8 research outputs found

    Morsels: Explicit Virtual Memory Objects

    The tremendous growth of RAM capacity, now exceeding multiple terabytes, necessitates a reevaluation of traditional memory-management methods, which were developed when resources were scarce. Current virtual-memory subsystems handle address-space regions as sets of individual 4-KiB pages with demand paging and copy-on-write, resulting in significant management overhead. Although huge pages reduce the number of managed entities, they induce internal fragmentation and have a coarse copy granularity. To address these problems, we introduce Morsels, a novel virtual-memory-management paradigm that is purely based on hardware data structures and enables the efficient sharing of virtual-memory objects between processes and devices while being well suited for non-volatile memory. Our benchmarks show that Morsels reduce the mapping time for a 6.82-GiB machine-learning model by up to 99.8 percent compared to conventional memory mapping in Linux.
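
    For context, the "conventional memory mapping in Linux" baseline that Morsels is compared against looks roughly like the sketch below: a large model file is mapped with mmap(2) and then faulted in page by page on first access. The file name and the page-touching loop are illustrative assumptions, not details taken from the paper.

        /* Sketch of the conventional Linux baseline: map a large read-only
         * model file and let demand paging populate it 4 KiB at a time.
         * "model.bin" is a hypothetical file name. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("model.bin", O_RDONLY);
            if (fd < 0) { perror("open"); return EXIT_FAILURE; }

            struct stat st;
            if (fstat(fd, &st) < 0) { perror("fstat"); return EXIT_FAILURE; }

            /* The whole file is mapped at once, but physical frames are only
             * allocated page by page on first access (demand paging). */
            void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (base == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

            /* Touch one byte per 4-KiB page; for a multi-GiB model this
             * page-by-page population dominates the mapping cost. */
            volatile unsigned long sum = 0;
            for (off_t off = 0; off < st.st_size; off += 4096)
                sum += ((unsigned char *)base)[off];
            (void)sum;

            munmap(base, st.st_size);
            close(fd);
            return 0;
        }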

    Useful and efficient huge page management on the Linux kernel

    Modern workloads consume vast amounts of memory, leading the computer hardware industry to manufacture memories of ever-growing size. This increased memory consumption entails a corresponding increase in virtual-to-physical address translations, all of which pass through the translation lookaside buffer (TLB), a finite-size structure in the processor's memory-management unit. More address translations mean more translation misses, which severely hurt workload performance, and various solutions have been proposed to overcome this problem. One of the most promising ideas was hardware support for pages larger than those used so far (huge pages), aiming to dramatically reduce the number of address-translation misses. Although huge-page support was added to processors in the '90s, only recent processor generations provide thousands of TLB entries for huge pages. At the software level, modern huge-page management techniques introduce additional overheads and increase memory footprint; the recent addition of thousands of TLB entries makes it imperative to develop sophisticated software techniques that manage huge pages more efficiently. In this thesis we present the benefits and costs associated with the use of huge pages and show how they can be used effectively. Based on these findings, we design a memory-management framework for huge-page support in the Linux kernel, which relies on a simple memory-usage tracking mechanism and a novel memory-compaction algorithm. The evaluation of our system indicates that it effectively tackles the problems associated with the use of huge pages while preserving the benefits they offer.
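
    The thesis targets the standard Linux huge-page machinery; a minimal sketch of that baseline interface is shown below, where madvise(MADV_HUGEPAGE) asks the transparent-huge-page subsystem to back an anonymous region with huge pages. The region size is an arbitrary assumption, and the sketch shows the stock kernel API rather than the framework developed in the thesis.

        /* Minimal sketch of the stock Linux transparent-huge-page interface.
         * The 1-GiB region size is arbitrary. */
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        #define REGION_SIZE (1UL << 30)   /* 1 GiB of anonymous memory */

        int main(void)
        {
            void *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

            /* Hint that this region should be backed by huge pages; the kernel
             * may still fall back to 4-KiB pages if no contiguous physical
             * memory is available (the fragmentation problem discussed above). */
            if (madvise(buf, REGION_SIZE, MADV_HUGEPAGE) != 0)
                perror("madvise(MADV_HUGEPAGE)");

            memset(buf, 0, REGION_SIZE);  /* touch the region so pages are faulted in */

            munmap(buf, REGION_SIZE);
            return 0;
        }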

    Making Huge Pages Actually Useful

    The virtual-to-physical address translation overhead, a major performance bottleneck for modern workloads, can be effectively alleviated with huge pages. However, because huge pages must map physically contiguous memory, operating systems have not been able to use them well due to memory fragmentation, despite hardware support for huge pages having been available for nearly two decades. This paper presents a comprehensive study of the interaction of fragmentation with huge pages in the Linux kernel. We observe that when huge pages are used, problems such as high CPU utilization and latency spikes occur because of unnecessary work (e.g., useless page migration) performed by memory-management-related subsystems due to the poor handling of unmovable (i.e., kernel) pages. This behavior is even more harmful in virtualized systems, where unnecessary work may be performed in both guest and host OSs. We present Illuminator, an efficient memory manager that gives various subsystems, such as the page allocator, the ability to track all unmovable pages. It allows subsystems to make informed decisions and eliminate unnecessary work, which in turn leads to cost-effective huge-page allocations. Illuminator reduces the cost of compaction (by up to 99%), improves application performance (by up to 2.3x), and reduces the maximum latency of the MySQL database server (by 30x).
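
    As a rough illustration of the underlying idea (not Illuminator's actual implementation), the sketch below counts how many unmovable pages each fixed-size page block contains, so that compaction can skip blocks that cannot become fully movable. All names, sizes, and hooks are hypothetical.

        /* Conceptual sketch: per-block tracking of unmovable (kernel) pages so
         * that compaction avoids useless migration work. Not Illuminator's code;
         * every identifier here is hypothetical. */
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define NR_PAGEBLOCKS 4096   /* hypothetical zone size, in page blocks */

        static uint16_t unmovable_count[NR_PAGEBLOCKS];  /* unmovable pages per block */

        /* Hook called by a (hypothetical) page allocator whenever a kernel page
         * is allocated in or freed from a block. */
        static void account_unmovable(unsigned int block, bool alloc)
        {
            if (alloc)
                unmovable_count[block]++;
            else
                unmovable_count[block]--;
        }

        /* Compaction consults the tracking data and only scans blocks that can
         * actually become fully movable. */
        static bool block_worth_compacting(unsigned int block)
        {
            return unmovable_count[block] == 0;
        }

        int main(void)
        {
            account_unmovable(7, true);   /* a kernel page lands in block 7 */
            printf("compact block 7? %s\n",
                   block_worth_compacting(7) ? "yes" : "no (skip it)");
            return 0;
        }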

    Datacenter Architectures for the Microservices Era

    Modern internet services are shifting away from single-binary, monolithic services toward numerous loosely coupled microservices that interact via Remote Procedure Calls (RPCs), to improve the programmability, reliability, manageability, and scalability of cloud services. Computer system designers face many new challenges with microservice-based architectures, as individual RPCs/tasks take only a few microseconds in most microservices. In this dissertation, I address the most notable challenges that arise from the differences between modern microservice-based and classic monolithic cloud services, and I design novel server architectures and runtime systems that enable efficient execution of µs-scale microservices on modern hardware.

    In the first part of the dissertation, I address the problem of Killer Microseconds, which refers to µs-scale "holes" in CPU schedules caused by stalls to access fast I/O devices or by brief idle times between requests in high-throughput µs-scale microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support for hiding the latency of µs-scale stalls. In chapter II, I propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity achieves 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency than an SMT-based server design, on average.

    In chapters III-IV, I comprehensively investigate the problem of tail latency in the context of microservices and address multiple aspects of it. First, in chapter III, I characterize the tail-latency behavior of microservices and provide general guidelines for optimizing computer systems from a queuing perspective to minimize tail latency. Queuing is a major contributor to end-to-end tail latency, wherein nominal tasks are enqueued behind rare, long ones due to Head-of-Line (HoL) blocking. Next, in chapter IV, I introduce Q-Zilla, a scheduling framework that tackles tail latency from a queuing perspective, and CoreZilla, a microarchitectural instantiation of the framework. Q-Zilla is composed of the ServerQueue Decoupled Size-Interval Task Assignment (SQD-SITA) scheduling algorithm and the Express-lane Simultaneous Multithreading (ESMT) microarchitecture, which together address HoL blocking by providing an "express lane" for short tasks, protecting them from queuing behind rare, long ones. By combining the ESMT microarchitecture and the SQD-SITA scheduling algorithm, CoreZilla improves tail latency over a conventional SMT core with 2, 4, and 8 contexts by 2.25×, 3.23×, and 4.38×, on average, respectively, and outperforms a theoretical 32-core scale-up organization by 12%, on average, with 8 contexts.

    Finally, in chapters V-VI, I investigate the tail-latency problem of microservices from a cluster-level rather than a server-level perspective. Whereas Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction, with microservice-based applications it is unclear how to scale individual microservices when end-to-end SLOs are violated or when the service is underutilized. I introduce Parslo, an analytical framework for partial SLO allocation in virtualized cloud microservices.
    Parslo takes a microservice graph as input and employs a Gradient Descent-based approach to allocate "partial SLOs" to different microservice nodes, enabling independent auto-scaling of individual microservices. Parslo achieves the optimal solution, minimizing the total cost for the entire service deployment, and is applicable to general microservice graphs.

    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/167978/1/miramir_1.pd
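
    The express-lane idea behind SQD-SITA and CoreZilla (chapter IV) can be illustrated with a toy dispatcher: tasks whose expected service time falls below a size-interval cutoff go to a separate short-task queue so they never wait behind rare, long tasks. The cutoff, queue sizes, and names below are hypothetical; this is a conceptual sketch, not the dissertation's algorithm.

        /* Toy size-interval dispatcher illustrating the express-lane idea:
         * short tasks are kept out of the queue where long tasks may cause
         * head-of-line blocking. All constants and names are hypothetical. */
        #include <stdio.h>

        #define QUEUE_CAP        1024
        #define SHORT_CUTOFF_US  10.0   /* hypothetical size-interval boundary */

        struct queue {
            double service_us[QUEUE_CAP];
            int    tail;
        };

        static void enqueue(struct queue *q, double service_us)
        {
            q->service_us[q->tail] = service_us;
            q->tail = (q->tail + 1) % QUEUE_CAP;
        }

        /* Size-interval task assignment: short tasks take the express lane,
         * long tasks go to the main queue. */
        static void dispatch(struct queue *express, struct queue *main_q,
                             double expected_service_us)
        {
            if (expected_service_us < SHORT_CUTOFF_US)
                enqueue(express, expected_service_us);
            else
                enqueue(main_q, expected_service_us);
        }

        int main(void)
        {
            struct queue express = { .tail = 0 }, main_q = { .tail = 0 };
            dispatch(&express, &main_q, 2.0);    /* 2 us RPC -> express lane */
            dispatch(&express, &main_q, 500.0);  /* 500 us task -> main queue */
            printf("express=%d main=%d\n", express.tail, main_q.tail);
            return 0;
        }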