Productive Development of Scalable Network Functions with NFork
Despite decades of research, developing correct and scalable concurrent
programs is still challenging. Network functions (NFs) are not an exception.
This paper presents NFork, a system that helps NF domain experts to
productively develop concurrent NFs by abstracting away concurrency from
developers. The key scheme behind NFork's design is to exploit NF
characteristics to overcome the limitations of prior work on concurrency
programming. Developers write NFs as sequential programs, and during runtime,
NFork performs transparent parallelization by processing packets in different
cores. Exploiting NF characteristics, NFork leverages transactional memory and
develops efficient concurrent data structures to achieve scalability and
guarantee the absence of concurrency bugs.
Since NFork manages concurrency, it further provides (i) a profiler that
reveals the root causes of scalability bottlenecks inherent to the NF's
semantics and (ii) actionable recipes for developers to mitigate these root
causes by relaxing the NF's semantics. We show that NFs developed with NFork
achieve competitive scalability with those in Cisco VPP [16], and NFork's
profiler and recipes can effectively aid developers in optimizing NF
scalability.
Comment: 16 pages, 8 figures
Scalable Range Locks for Scalable Address Spaces and Beyond
Range locks are a synchronization construct designed to provide concurrent
access to multiple threads (or processes) to disjoint parts of a shared
resource. Originally conceived in the file system context, range locks are
gaining increasing interest in the Linux kernel community seeking to alleviate
bottlenecks in the virtual memory management subsystem. The existing
implementation of range locks in the kernel, however, uses an internal spin
lock to protect the underlying tree structure that keeps track of acquired and
requested ranges. This spin lock becomes a point of contention on its own when
the range lock is frequently acquired. Furthermore, where and exactly how
specific (refined) ranges can be locked remains an open question.
In this paper, we make two independent but related contributions. First, we
propose an alternative approach for building range locks based on linked lists.
The lists are easy to maintain in a lock-less fashion, and in fact, our range
locks do not use any internal locks in the common case. Second, we show how the
range of the lock can be refined in the mprotect operation through a
speculative mechanism. This refinement, in turn, allows concurrent execution of
mprotect operations on non-overlapping memory regions. We implement our new
algorithms and demonstrate their effectiveness in user-space and kernel-space,
achieving up to 9x speedup compared to the stock version of the Linux
kernel. Beyond the virtual memory management subsystem, we discuss other
applications of range locks in parallel software. As a concrete example, we
show how range locks can be used to facilitate the design of scalable
concurrent data structures, such as skip lists.
Comment: 17 pages, 9 figures, Eurosys 202
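To make the range-lock semantics described above concrete, here is a minimal sketch in Python. It only illustrates the interface: holders of disjoint ranges proceed concurrently, while overlapping requests wait. It deliberately guards its bookkeeping with a single internal lock, which is the kind of design the abstract identifies as a bottleneck; it is not the lock-less linked-list algorithm the paper proposes. The class and method names (`RangeLock`, `try_acquire`, `acquire`, `release`) are hypothetical.

```python
import threading


class RangeLock:
    """Toy range lock over half-open ranges [start, end).

    Assumption: a single internal Condition protects the list of held
    ranges. This is a semantic illustration only, not a lock-less design.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # list of (start, end) tuples currently locked

    def _overlaps(self, start, end):
        # Two half-open ranges overlap when each starts before the other ends.
        return any(s < end and start < e for s, e in self._held)

    def try_acquire(self, start, end):
        """Lock [start, end) if no held range overlaps it; never blocks."""
        if start >= end:
            raise ValueError("empty or inverted range")
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def acquire(self, start, end):
        """Block until [start, end) overlaps no held range, then lock it."""
        if start >= end:
            raise ValueError("empty or inverted range")
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        """Unlock a range previously acquired with the same bounds."""
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()
```

With this sketch, two callers locking `[0, 4096)` and `[4096, 8192)` both succeed at once, whereas a request for `[1000, 5000)` has to wait until both of those ranges are released.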
Scaling synchronization primitives
Over the past decade, multicore machines have become the norm. A single machine is capable of having thousands of hardware threads or cores. Even cloud providers offer such large multicore machines for data processing engines and databases. Thus, a fundamental question arises: how efficient are the existing synchronization primitives, timestamping and locking, that developers use for designing concurrent, scalable, and performant applications? This dissertation focuses on understanding the scalability aspect of these primitives and presents new algorithms and approaches, which leverage either the hardware or application domain knowledge, to scale up to hundreds of cores.
First, the thesis presents Ordo, a scalable ordering or timestamping primitive that forms the basis of designing scalable timestamp-based concurrency control mechanisms. Ordo relies on invariant hardware clocks and provides a notion of a globally synchronized clock within a machine. We use the Ordo primitive to redesign a synchronization mechanism and concurrency control mechanisms in databases and software transactional memory.
Later, this thesis focuses on the scalability aspect of locks in both virtualized and non-virtualized scenarios. In a virtualized environment, we identify that these locks suffer from various preemption issues due to a semantic gap between the hypervisor scheduler and a virtual machine scheduler, known as the double scheduling problem. We address this problem by bridging this gap: both the hypervisor and virtual machines share minimal scheduling information to avoid the preemption problems.
Finally, we focus on the design of lock algorithms in general. We find that locks in practice have discrepancies from locks in design. For example, popular spinlocks suffer from excessive cache-line bouncing in multicore (NUMA) systems, while state-of-the-art locks exhibit sub-par single-thread performance. We classify several dominating factors that impact the performance of lock algorithms. We then propose a new technique, shuffling, that can dynamically accommodate all these factors without slowing down the critical path of the lock. The key idea of shuffling is to re-order the queue of threads waiting to acquire the lock according to some pre-established policy. Using shuffling, we propose a family of locking algorithms, called SHFLLOCKS, that respect all factors, efficiently utilize waiters, and achieve the best performance.
Ph.D.
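The shuffling idea, re-ordering the queue of waiting threads according to a policy, can be illustrated with a small Python sketch. It assumes one possible policy: waiters on the same NUMA socket as the thread at the head of the queue are moved directly behind it, so the lock tends to stay on one socket for longer. The `Waiter` type, its `socket` field, and the `shuffle` function are hypothetical names for illustration. The sketch operates on a plain list and says nothing about how the actual SHFLLOCKS algorithms manipulate their queue or keep this work off the critical path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waiter:
    thread_id: int
    socket: int  # NUMA socket the waiting thread runs on (assumed field)


def shuffle(waiters):
    """Return a re-ordered copy of the wait queue.

    Assumed policy: keep the head in place, then group every waiter that
    shares the head's socket right behind it. Relative order within each
    group is preserved, so no waiter overtakes a peer on its own socket.
    """
    if not waiters:
        return []
    head = waiters[0]
    same_socket = [w for w in waiters[1:] if w.socket == head.socket]
    other_sockets = [w for w in waiters[1:] if w.socket != head.socket]
    return [head] + same_socket + other_sockets


queue = [Waiter(1, 0), Waiter(2, 1), Waiter(3, 0), Waiter(4, 1)]
shuffled = shuffle(queue)
```

Here threads 1 and 3 share socket 0, so thread 3 moves ahead of thread 2. A practical policy would also need to bound how long waiters on other sockets can be bypassed, which this sketch does not attempt.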
Fuzzing file systems via two-dimensional input space exploration
File systems, a basic building block of an OS, are too big and too complex to be bug free. Nevertheless, file systems rely on regular stress-testing tools and formal checkers to find bugs, which are limited due to the ever-increasing complexity of both file systems and OSes. Thus, fuzzing, proven to be an effective and practical approach, becomes a preferable choice, as it does not need much knowledge about a target. However, three main challenges exist in fuzzing file systems: mutating a large image blob that degrades overall performance, generating image-dependent file operations, and reproducing found bugs, which is difficult for existing OS fuzzers. Hence, we present JANUS, the first feedback-driven fuzzer that explores the two-dimensional input space of a file system, i.e., mutating metadata on a large image, while emitting image-directed file operations. In addition, JANUS relies on a library OS rather than on traditional VMs for fuzzing, which enables JANUS to load a fresh copy of the OS, thereby leading to better reproducibility of bugs. We evaluated JANUS on eight file systems and found 90 bugs in the upstream Linux kernel, 62 of which have been acknowledged. Forty-three bugs have been fixed with 32 CVEs assigned. In addition, JANUS achieves higher code coverage on all the file systems after 12 hours of fuzzing, when compared with Syzkaller, the state-of-the-art fuzzer for file systems. JANUS visits 4.19x and 2.01x more code paths in Btrfs and ext4, respectively. Moreover, JANUS is able to reproduce 88-100% of the crashes, while Syzkaller fails on all of them.
NrOS: Effective Replication and Sharing in an Operating System
Writing a correct operating system kernel is notoriously hard. Kernel code requires manual memory management and type-unsafe code and must efficiently handle complex, asynchronous events. In addition, increasing CPU core counts further complicate kernel development. Typically, monolithic kernels share state across cores and rely on one-off synchronization patterns that are specialized for each kernel structure or subsystem. Hence, kernel developers are constantly refining synchronization within OS kernels to improve scalability at the risk of introducing subtle bugs.
We present NrOS, a new OS kernel with a safer approach to synchronization that runs many POSIX programs. NrOS is primarily constructed as a simple, sequential kernel with no concurrency, making it easier to develop and reason about its correctness. This kernel is scaled across NUMA nodes using node replication, a scheme inspired by state machine replication in distributed systems. NrOS replicates kernel state on each NUMA node and uses operation logs to maintain strong consistency between replicas. Cores can safely and concurrently read from their local kernel replica, eliminating remote NUMA accesses.
Our evaluation shows that NrOS scales to 96 cores with performance that nearly always dominates Linux at scale, in some cases by orders of magnitude, while retaining much of the simplicity of a sequential kernel.
RS3LA
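The node replication scheme described above (one replica of a sequential data structure per NUMA node, kept consistent through a shared operation log) can be sketched in a few lines of Python. This is a single-threaded illustration under simplifying assumptions, not NrOS's implementation: the state is a plain dictionary, the log is an unbounded list, and the names `OperationLog` and `Replica` are hypothetical. A real system would additionally need concurrency control around the log and the replicas, plus log garbage collection.

```python
class OperationLog:
    """Shared, append-only log of mutating operations (assumed unbounded)."""

    def __init__(self):
        self.entries = []

    def append(self, op):
        self.entries.append(op)


class Replica:
    """One copy of the sequential state, e.g. one per NUMA node."""

    def __init__(self, log):
        self.log = log
        self.state = {}   # the sequential data structure being replicated
        self.applied = 0  # number of log entries already applied locally

    def _catch_up(self):
        # Replay every log entry this replica has not applied yet, in order.
        while self.applied < len(self.log.entries):
            key, value = self.log.entries[self.applied]
            self.state[key] = value
            self.applied += 1

    def write(self, key, value):
        # Mutations go through the shared log so all replicas see one order.
        self.log.append((key, value))
        self._catch_up()

    def read(self, key):
        # Reads are served locally after syncing to the current log tail.
        self._catch_up()
        return self.state.get(key)


log = OperationLog()
node0 = Replica(log)
node1 = Replica(log)
node0.write("pid", 1)
seen_on_node1 = node1.read("pid")
```

Because every replica applies the same log entries in the same order, a read on `node1` observes the write issued through `node0` without touching `node0`'s copy of the state.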