OpLog: a library for scaling update-heavy data structures
Existing techniques (e.g., RCU) can achieve good multi-core scaling for read-mostly data, but for update-heavy data structures only special-purpose techniques exist. This paper presents OpLog, a general-purpose library supporting good scalability for update-heavy data structures. OpLog achieves scalability by logging each update in a low-contention per-core log; it combines logs only when required by a read to the data structure. OpLog achieves generality by logging operations without having to understand them, which eases applying it to existing data structures. OpLog can further increase performance if the programmer indicates which operations can be combined in the logs. An evaluation shows how to apply OpLog to three update-heavy Linux kernel data structures. Measurements on a 48-core AMD server show that the result significantly improves the performance of the Apache web server and the Exim mail server under certain workloads.
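The core OpLog idea — absorb updates into private per-core logs and merge them only on reads — can be sketched as follows. This is our own minimal Python illustration (class and method names are ours, not the library's C++ API), using per-thread logs to stand in for per-core logs:

```python
import threading
from collections import defaultdict

class OpLogCounter:
    """Sketch of the OpLog pattern: updates append to a low-contention
    per-thread log; logs are merged into the real data structure only
    when a read needs the current value."""

    def __init__(self):
        self._logs = defaultdict(list)    # one update log per thread
        self._value = 0                   # the underlying data structure
        self._merge_lock = threading.Lock()

    def add(self, delta):
        # Update path: append to this thread's private log; no shared lock.
        self._logs[threading.get_ident()].append(delta)

    def read(self):
        # Read path: absorb every per-thread log, then answer the query.
        with self._merge_lock:
            for log in self._logs.values():
                while log:
                    self._value += log.pop(0)
            return self._value

c = OpLogCounter()
for d in (1, 2, 3):
    c.add(d)
print(c.read())  # 6
```

A real implementation would also exploit the combining hint the abstract mentions, e.g. collapsing a run of logged increments into a single sum before merging.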
Automated Detection of Serializability Violations Under Weak Consistency
While a number of weak consistency mechanisms have been developed in recent years to improve performance and ensure availability in distributed, replicated systems, ensuring the correctness of transactional applications running on top of such systems remains a difficult and important problem. Serializability is a well-understood correctness criterion for transactional programs; understanding whether applications are serializable when executed in a weakly-consistent environment, however, remains a challenging exercise. In this work, we combine a dependency graph-based characterization of serializability and leverage the framework of abstract executions to develop a fully-automated approach for statically finding bounded serializability violations under any weak consistency model. We reduce the problem of checking serializability to satisfiability of a formula in First-Order Logic (FOL), which allows us to harness the power of existing SMT solvers. We provide rules to automatically construct the FOL encoding from programs written in SQL (allowing loops and conditionals) and express consistency specifications as FOL formulae. In addition to detecting bounded serializability violations, we also provide two orthogonal schemes to reason about unbounded executions by providing sufficient conditions (again, in the form of FOL formulae) whose satisfiability implies the absence of anomalies in any arbitrary execution. We have applied the proposed technique on TPC-C, a real-world database program with complex application logic, and were able to discover anomalies under Parallel Snapshot Isolation (PSI), and verify serializability for unbounded executions under Snapshot Isolation (SI), two consistency mechanisms substantially weaker than serializability.
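The dependency graph-based characterization the abstract relies on says an execution is serializable exactly when its graph of ww/wr/rw dependencies among transactions is acyclic. A minimal sketch of that check (our illustration, not the paper's FOL/SMT encoding) is a plain cycle search:

```python
def has_cycle(edges):
    """Detect a cycle in a transaction dependency graph, given as
    {txn: set of txns it depends on}. A cycle among ww/wr/rw
    dependencies means the execution is not serializable."""
    WHITE, GRAY, BLACK = 0, 1, 2     # unvisited / on stack / done
    color = {}

    def visit(t):
        color[t] = GRAY
        for u in edges.get(t, ()):
            if color.get(u, WHITE) == GRAY:
                return True          # back edge -> dependency cycle
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in edges)

# T1 -rw-> T2 and T2 -rw-> T1: the classic write-skew shape, not serializable.
print(has_cycle({"T1": {"T2"}, "T2": {"T1"}}))  # True
```

The paper instead expresses this acyclicity condition symbolically over all bounded executions permitted by a consistency model, and hands the resulting FOL formula to an SMT solver.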
Distributed Multi-writer Multi-reader Atomic Register with Optimistically Fast Read and Write
A distributed multi-writer multi-reader (MWMR) atomic register is an important primitive that enables a wide range of distributed algorithms; hence, improving its performance can have large-scale consequences. Since the seminal work of ABD emulation in message-passing networks [JACM '95], many researchers have studied fast implementations of atomic registers under various conditions. "Fast" means that a read or a write can be completed in 1 round-trip time (RTT) by contacting a simple majority. In this work, we explore an atomic register with optimal resilience and "optimistically fast" read and write operations. That is, both operations can be fast if there is no concurrent write.
This paper has three contributions: (i) We present Gus, the emulation of an MWMR atomic register with optimal resilience and optimistically fast reads and writes when there are up to 5 nodes; (ii) We show that when there are > 5 nodes, it is impossible to emulate an MWMR atomic register with both properties; and (iii) We implement Gus in the framework of EPaxos and Gryff, and show that Gus provides lower tail latency than state-of-the-art systems such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context of geo-replicated object storage systems.
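The "optimistically fast" condition can be illustrated with an ABD-style read decision rule. This is our own simplification, not Gus's actual protocol: a reader collects `(timestamp, value)` pairs from a majority; if all replies agree (no concurrent write), the read completes in 1 RTT, otherwise a write-back phase is needed:

```python
def fast_read(replies):
    """Decide whether a majority read can finish in one round trip.
    `replies` is a list of (timestamp, value) pairs from a majority
    of replicas. Returns (value, number_of_rtts)."""
    ts, val = max(replies)                    # highest timestamp wins
    if all(r == (ts, val) for r in replies):
        return val, 1                         # fast path: no concurrent write
    # Slow path: write (ts, val) back to a majority before returning,
    # so later reads cannot observe an older value (atomicity).
    return val, 2

print(fast_read([(3, "x"), (3, "x"), (3, "x")]))  # ('x', 1)
print(fast_read([(3, "y"), (2, "x"), (3, "y")]))  # ('y', 2)
```

The paper's impossibility result (ii) shows this kind of optimism cannot be preserved for both reads and writes with optimal resilience once the system exceeds 5 nodes.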
RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design
Software-defined networking (SDN) and software-defined flash (SDF) have been serving as the backbone of modern data centers. They are managed separately to handle I/O requests. At first glance, this is a reasonable design that follows the rack-scale hierarchical design principles. However, it suffers from suboptimal end-to-end performance due to the lack of coordination between SDN and SDF.
In this paper, we co-design the SDN and SDF stack by redefining the functions of their control plane and data plane, and splitting them up within a new architecture named RackBlox. RackBlox decouples the storage management functions of flash-based solid-state drives (SSDs), and allows the SDN to track and manage the states of SSDs in a rack. Therefore, we can enable state sharing between SDN and SDF, and facilitate global storage resource management. RackBlox has three major components: (1) coordinated I/O scheduling, in which it dynamically adjusts the I/O scheduling in the storage stack with the measured and predicted network latency, such that it can coordinate the effort of I/O scheduling across the network and storage stack for achieving predictable end-to-end performance; (2) coordinated garbage collection (GC), in which it coordinates the GC activities across the SSDs in a rack to minimize their impact on incoming I/O requests; (3) rack-scale wear leveling, in which it enables global wear leveling among SSDs in a rack by periodically swapping data, for achieving improved device lifetime for the entire rack. We implement RackBlox using programmable SSDs and a programmable switch. Our experiments demonstrate that RackBlox can reduce the tail latency of I/O requests by up to 5.8x over state-of-the-art rack-scale storage systems.
Comment: 14 pages. Published in ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23).
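Component (2), coordinated GC, can be sketched as a rack-level coordinator that serializes GC across devices and steers I/O away from the collecting SSD. This is our own simplified illustration of the idea, not RackBlox's implementation (class and method names are hypothetical):

```python
class RackGCCoordinator:
    """Sketch of coordinated GC: allow at most one SSD in the rack to
    run garbage collection at a time, so incoming I/O can be routed
    to replicas on SSDs that are not collecting."""

    def __init__(self, ssds):
        self.ssds = list(ssds)
        self.gc_active = None        # SSD currently collecting, if any

    def request_gc(self, ssd):
        # Grant GC only if no other device in the rack is collecting.
        if self.gc_active is None:
            self.gc_active = ssd
            return True
        return False                 # defer GC; try again later

    def finish_gc(self, ssd):
        if self.gc_active == ssd:
            self.gc_active = None

    def route_io(self, replicas):
        # Prefer a replica whose SSD is not busy with GC.
        for ssd in replicas:
            if ssd != self.gc_active:
                return ssd
        return replicas[0]           # all candidates busy: fall back

coord = RackGCCoordinator(["ssd0", "ssd1", "ssd2"])
coord.request_gc("ssd1")
print(coord.route_io(["ssd1", "ssd2"]))  # ssd2
```

In the real system this coordination lives in the switch's control plane, which also feeds measured network latency into the storage-side I/O scheduler (component 1).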
Non-Uniform Replication
Replication is a key technique in the design of efficient and reliable distributed systems. As information grows, it becomes difficult or even impossible to store all information at every replica. A common approach to deal with this problem is to rely on partial replication, where each replica maintains only a part of the total system information. As a consequence, a remote replica might need to be contacted for computing the reply to some given query, which leads to high latency costs, particularly in geo-replicated settings. In this work, we introduce the concept of non-uniform replication, where each replica stores only part of the information, but where all replicas store enough information to answer every query. We apply this concept to eventual consistency and conflict-free replicated data types. We show that this model can address useful problems and present two data types that solve such problems. Our evaluation shows that non-uniform replication is more efficient than traditional replication, using less storage space and network bandwidth.
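A concrete instance of "store only part, yet answer every query" is a replicated top-K set: each replica keeps only the K largest entries it knows of, which is exactly the state needed to answer the top-K query. The sketch below is our illustration of that idea as a state-based CRDT (the merge is commutative, associative, and idempotent); it is not the paper's exact data type:

```python
class TopKReplica:
    """Non-uniform replication sketch: each replica keeps at most K
    (score, item) pairs -- enough to answer top-K queries -- instead
    of the full update history."""

    def __init__(self, k):
        self.k = k
        self.top = []                # at most k (score, item) pairs

    def add(self, item, score):
        self.top = sorted(set(self.top) | {(score, item)},
                          reverse=True)[:self.k]

    def merge(self, other):
        # Only the other replica's top-K state is shipped, not its
        # full history -- this is the bandwidth/storage saving.
        self.top = sorted(set(self.top) | set(other.top),
                          reverse=True)[:self.k]

    def query(self):
        return [item for _, item in self.top]

a, b = TopKReplica(2), TopKReplica(2)
a.add("x", 5); a.add("y", 1)
b.add("z", 9)
a.merge(b)
print(a.query())  # ['z', 'x']
```

Every replica converges to the same top-K under repeated pairwise merges, even though no replica ever stores the complete set of added elements.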
Understanding the Genetic Makeup of Linux Device Drivers
Attempts have been made to understand driver development in terms of code clones. In this paper, we propose an alternate view, based on the metaphor of a gene. Guided by this metaphor, we study the structure of Linux 3.10 ethernet platform driver probe functions.