5 research outputs found

    Towards Fast, Adaptive, and Hardware-Assisted User-Space Scheduling

    Modern datacenter applications are prone to high tail latencies since their requests typically follow highly dispersive distributions. Delivering fast interrupts is essential to reducing tail latency. Prior work has proposed both OS- and system-level solutions to reduce tail latencies for microsecond-scale workloads through better scheduling. Unfortunately, existing approaches such as customized dataplane OSes require significant OS changes, experience scalability limitations, or do not reach the full performance capabilities the hardware offers. The emergence of new hardware features like UINTR has exposed new opportunities to rethink the design paradigms and abstractions of traditional scheduling systems. We propose LibPreemptible, a preemptive user-level threading library that is flexible, lightweight, and adaptive. LibPreemptible is built on a set of optimizations: LibUtimer for scalability, a deadline-oriented API for flexible policies, and a time-quantum controller for adaptiveness. Compared to the prior state-of-the-art scheduling system Shinjuku, our system achieves significant tail latency and throughput improvements for various workloads without modifying the kernel. We also demonstrate the flexibility of LibPreemptible across scheduling policies for real applications experiencing varying load levels and characteristics. Comment: Accepted by HPCA202
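
    The abstract mentions a deadline-oriented API driven by user-level timer interrupts. As a rough illustration only, the sketch below shows what earliest-deadline-first selection over user-level threads could look like; the names (struct uthread, pick_next_edf) and the EDF policy are assumptions for this sketch and are not taken from the actual LibPreemptible API.

```c
/* Hypothetical sketch of a deadline-oriented user-level scheduler core.
 * The names (uthread, pick_next_edf) are illustrative and are not the
 * actual LibPreemptible API. */
#include <stddef.h>
#include <stdint.h>

struct uthread {
    uint64_t deadline_ns;   /* absolute deadline supplied by the application */
    void (*entry)(void *);  /* thread body */
    void *arg;
    int runnable;
};

/* Earliest-deadline-first selection: on each preemption tick delivered by
 * the user-level timer (e.g., a UINTR-based interrupt), pick the runnable
 * thread with the smallest absolute deadline. */
static struct uthread *pick_next_edf(struct uthread *pool, size_t n)
{
    struct uthread *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!pool[i].runnable)
            continue;
        if (best == NULL || pool[i].deadline_ns < best->deadline_ns)
            best = &pool[i];
    }
    return best;  /* NULL means no runnable thread */
}
```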

    RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs

    Modern online services come with stringent quality requirements in terms of response-time tail latency. Because of their decomposition into fine-grained communicating software layers, a single user request fans out into a plethora of short, μs-scale RPCs, aggravating the need for faster inter-server communication. In reaction to that need, we are witnessing a technological transition characterized by the emergence of hardware-terminated user-level protocols (e.g., InfiniBand/RDMA) and new architectures with fully integrated Network Interfaces (NIs). Such architectures offer a unique opportunity for a new NI-driven approach to balancing RPCs among the cores of manycore server CPUs, yielding major tail latency improvements for μs-scale RPCs. We introduce RPCValet, an NI-driven RPC load-balancing design for architectures with hardware-terminated protocols and integrated NIs that delivers near-optimal tail latency. RPCValet's RPC dispatch decisions emulate the theoretically optimal single-queue system without incurring the synchronization overheads currently associated with single-queue implementations. Our design improves throughput under tight tail latency goals by up to 1.4x and reduces tail latency before saturation by up to 4x for RPCs with μs-scale service times, as compared to current systems with hardware support for RPC load distribution. RPCValet performs within 15% of the theoretically optimal single-queue system.
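
    To make the single-queue emulation concrete, here is a minimal, hypothetical sketch of the dispatch rule such a system implies: an arriving RPC is sent to any idle core, and otherwise stays in one shared queue rather than being bound early to a per-core queue. The names (core_busy, dispatch_rpc) and the fixed core count are illustrative assumptions, not RPCValet's interface.

```c
/* Illustrative sketch of the dispatch policy a single-queue system implies:
 * an arriving RPC goes to any idle core; if all cores are busy, it waits in
 * one shared queue instead of being bound to a per-core queue early.
 * Names (core_busy, dispatch_rpc) are hypothetical, not RPCValet's API. */
#include <stdbool.h>

#define NUM_CORES 16

static bool core_busy[NUM_CORES];

/* Returns the core the RPC is sent to, or -1 if it must stay queued. */
static int dispatch_rpc(void)
{
    for (int c = 0; c < NUM_CORES; c++) {
        if (!core_busy[c]) {
            core_busy[c] = true;  /* NI-side bookkeeping, no core-to-core locks */
            return c;
        }
    }
    return -1;  /* all cores busy: keep the RPC in the single shared queue */
}
```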

    RSS++: load and state-aware receive side scaling

    While the current literature typically focuses on load-balancing among multiple servers, in this paper we demonstrate the importance of load-balancing within a single machine (potentially with hundreds of CPU cores). In this context, we propose a new load-balancing technique (RSS++) that dynamically modifies the receive-side scaling (RSS) indirection table to spread the load more evenly across the CPU cores. RSS++ incurs up to 14x lower 95th percentile tail latency and orders of magnitude fewer packet drops compared to RSS under high CPU utilization. RSS++ allows higher CPU utilization and dynamic scaling of the number of allocated CPU cores to accommodate the input load, while avoiding the typical 25% over-provisioning. RSS++ has been implemented for both (i) DPDK and (ii) the Linux kernel. Additionally, we implement a new state-migration technique, which facilitates sharding and reduces contention between CPU cores accessing per-flow data. RSS++ keeps flow state in groups that can be migrated at once, leading to 20% higher efficiency than a state-of-the-art shared flow table.
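
    Since the abstract's key mechanism is rewriting the RSS indirection table at run time, the sketch below illustrates one plausible rebalancing step: moving a bucket from the most loaded core to the least loaded one based on per-bucket packet counts. The table size, counters, and greedy policy are assumptions made for this sketch, not RSS++'s actual algorithm; a real implementation would also push the updated table to the NIC (for example via DPDK's rte_eth_dev_rss_reta_update()).

```c
/* Minimal sketch of the rebalancing idea: the RSS indirection table maps
 * hash buckets to cores; periodically move buckets away from the most
 * loaded core.  Bucket counts and the table layout are illustrative. */
#include <stdint.h>

#define RETA_SIZE 128   /* number of indirection-table buckets (illustrative) */
#define NUM_CORES 8

/* One rebalancing step: find the most and least loaded cores and move one
 * bucket between them.  Repeating this converges toward an even split. */
static void rebalance_step(uint16_t reta[RETA_SIZE],
                           const uint64_t bucket_pkts[RETA_SIZE])
{
    uint64_t core_load[NUM_CORES] = {0};
    for (int b = 0; b < RETA_SIZE; b++)
        core_load[reta[b]] += bucket_pkts[b];

    int hot = 0, cold = 0;
    for (int c = 1; c < NUM_CORES; c++) {
        if (core_load[c] > core_load[hot])  hot = c;
        if (core_load[c] < core_load[cold]) cold = c;
    }

    /* Move the lightest bucket currently owned by the hot core, so the
     * shift does not overshoot the cold core. */
    int victim = -1;
    for (int b = 0; b < RETA_SIZE; b++) {
        if (reta[b] != hot)
            continue;
        if (victim < 0 || bucket_pkts[b] < bucket_pkts[victim])
            victim = b;
    }
    if (victim >= 0)
        reta[victim] = (uint16_t)cold;
}
```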

    DBOS: a DBMS-oriented operating system

    This paper lays out the rationale for building a completely new operating system (OS) stack. Rather than build on a single-node OS together with separate cluster schedulers, distributed filesystems, and network managers, we argue that a distributed transactional DBMS should be the basis for a scalable cluster OS. We show herein that such a database OS (DBOS) can do scheduling, file management, and inter-process communication with performance competitive with existing systems. In addition, implementing OS services as standard database queries provides significantly better analytics and a dramatic reduction in code complexity, while low-latency transactions and high availability need to be implemented only once.
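
    To illustrate the idea of OS services expressed as standard database queries, the following hypothetical sketch phrases a scheduling decision as a SQL query over a table of workers, issued through libpq. The schema (workers, worker_id, free_slots, load) and the query are invented for this illustration and are not the DBOS schema or implementation.

```c
/* Hypothetical illustration of "scheduling as a query": pick the least
 * loaded worker from a cluster-state table.  The schema and query are
 * invented for this sketch; libpq is used only as a familiar SQL client. */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=cluster_os");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* The "scheduler" is just a declarative query over cluster state. */
    PGresult *res = PQexec(conn,
        "SELECT worker_id FROM workers "
        "WHERE free_slots > 0 ORDER BY load ASC LIMIT 1");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
        printf("dispatch task to worker %s\n", PQgetvalue(res, 0, 0));

    PQclear(res);
    PQfinish(conn);
    return 0;
}
```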

    HovercRaft: Achieving Scalability and Fault-tolerance for microsecond-scale Datacenter Services

    Cloud platform services must simultaneously be scalable, meet low tail-latency service-level objectives, and be resilient to a combination of software, hardware, and network failures. Replication plays a fundamental role in meeting both the scalability and the fault-tolerance requirements, but is subject to opposing pressures: (1) scalability is typically achieved by relaxing consistency; (2) fault-tolerance is typically achieved through the consistent replication of state machines. Adding nodes to a system can therefore either increase performance at the expense of consistency, or increase resiliency at the expense of performance. We propose HovercRaft, a new approach by which adding nodes increases both the resilience and the performance of general-purpose state-machine replication. We achieve this through an extension of the Raft protocol that carefully eliminates CPU and I/O bottlenecks and load-balances requests. Our implementation uses state-of-the-art kernel-bypass techniques, datacenter transport protocols, and in-network programmability to deliver up to 1 million operations/second for clusters of up to 9 nodes, linear speedup over an unreplicated configuration for selected workloads, and a 4× speedup for the YCSB-E benchmark running on Redis over an unreplicated deployment.
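
    One way to picture how adding nodes can add performance rather than only resilience: once an entry is committed and applied everywhere, a deterministic rule can decide which single replica does the response work for the client, spreading that work across the cluster. The modulo rule below is an assumption for illustration, not HovercRaft's actual load-balancing mechanism.

```c
/* Illustrative sketch of spreading committed-request work across replicas:
 * every node applies the same deterministic rule, so exactly one replica
 * sends the response for each committed log entry and no extra coordination
 * is needed.  The modulo rule is an assumption for illustration, not
 * HovercRaft's actual policy. */
#include <stdbool.h>
#include <stdint.h>

struct cluster {
    uint32_t num_nodes;  /* current cluster size */
    uint32_t my_id;      /* this replica's id in [0, num_nodes) */
};

/* True if this replica is responsible for replying to the client for the
 * given entry; adding nodes shrinks each replica's share of the work. */
static bool i_should_respond(const struct cluster *c, uint64_t log_index)
{
    return (log_index % c->num_nodes) == c->my_id;
}
```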