
    Firmament: Fast, Centralized Cluster Scheduling at Scale

    Centralized datacenter schedulers can make high-quality placement decisions when scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of high latency at scale, which degrades response time for interactive tasks and reduces cluster utilization. This paper describes Firmament, a centralized scheduler that scales to over ten thousand machines at sub-second placement latency even though it continuously reschedules all tasks via a min-cost max-flow (MCMF) optimization. Firmament achieves low latency by using multiple MCMF algorithms, by solving the problem incrementally, and via problem-specific optimizations. Experiments with a Google workload trace from a 12,500-machine cluster show that Firmament improves placement latency by 20x over Quincy [22], a prior centralized scheduler using the same MCMF optimization. Moreover, even though Firmament is centralized, it matches the placement latency of distributed schedulers for workloads of short tasks. Finally, Firmament exceeds the placement quality of four widely-used centralized and distributed schedulers on a real-world cluster, and hence improves batch task response time by 6x. This work was supported by a Google European Doctoral Fellowship, by NSF award CNS-1413920, and by the Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL) under contract FA8750-11-C-0249.
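
    To make the flow-based formulation concrete, the Python sketch below places three tasks on two machines by solving a min-cost flow over a task-machine-sink graph, the same shape of network that Quincy and Firmament optimize. This is a minimal illustration using networkx, not the paper's solver; all task names, costs, and capacities are invented.

        # Minimal sketch: task placement as min-cost max-flow, in the
        # spirit of Quincy/Firmament. Costs/capacities are illustrative.
        import networkx as nx

        G = nx.DiGraph()

        tasks = ["t0", "t1", "t2"]
        machines = ["m0", "m1"]

        # Each task supplies one unit of flow; the sink absorbs all of it.
        for t in tasks:
            G.add_node(t, demand=-1)
        G.add_node("sink", demand=len(tasks))

        # Arc cost encodes placement preference (e.g., data locality);
        # capacity bounds how many tasks a machine may run.
        costs = {("t0", "m0"): 1, ("t0", "m1"): 5,
                 ("t1", "m0"): 4, ("t1", "m1"): 2,
                 ("t2", "m0"): 3, ("t2", "m1"): 3}
        for (t, m), c in costs.items():
            G.add_edge(t, m, capacity=1, weight=c)
        for m in machines:
            G.add_edge(m, "sink", capacity=2, weight=0)

        # Solving the flow problem yields a cost-optimal placement of
        # every task at once, which is why rescheduling is continuous.
        flow = nx.min_cost_flow(G)
        for t in tasks:
            placed = [m for m, f in flow[t].items() if f > 0]
            print(t, "->", placed[0])

    Firmament's contribution is making exactly this kind of whole-cluster optimization fast enough (via incremental solving and algorithm choice) to rerun at sub-second latency.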

    Verifying Reliable Network Components in a Distributed Separation Logic with Dependent Separation Protocols

    We present a foundationally verified implementation of a reliable communication library for asynchronous client-server communication, and a stack of formally verified components on top thereof. Our library is implemented in an OCaml-like language on top of UDP and features characteristic traits of existing protocols, such as a simple handshaking protocol, bidirectional channels, and retransmission/acknowledgement mechanisms. We verify the library in the Aneris distributed separation logic using a novel proof pattern, dubbed the session escrow pattern, based on the existing escrow proof pattern and the so-called dependent separation protocols, which hitherto have only been used in a non-distributed concurrent setting. We demonstrate how our specification of the reliable communication library simplifies formal reasoning about applications, such as a remote procedure call library, which we in turn use to verify a lazily replicated key-value store with leader-followers and clients thereof. Our development is highly modular: each component is verified relative to specifications of the components it uses (not their implementations). All our results are formalized in the Coq proof assistant. We are grateful to Chet Murthy for helpful discussions. This work was supported in part by a Villum Investigator grant (no. 25804), Center for Basic Research in Program Verification (CPV), from the VILLUM Foundation.
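
    The retransmission/acknowledgement mechanism the library provides can be illustrated with a stop-and-wait loop over UDP. The Python sketch below is only an unverified illustration of that mechanism; the paper's library is written in an OCaml-like language and proved correct in Aneris, and the function name, message format, and parameters here are invented.

        # Unverified sketch of a stop-and-wait retransmission/ACK loop
        # over UDP. Hypothetical names: send_reliable, "ACK|<seq>" format.
        import socket

        def send_reliable(sock, dest, seq, payload, timeout=0.5, retries=10):
            """Send (seq, payload) and retransmit until the matching ACK arrives."""
            sock.settimeout(timeout)
            msg = f"{seq}|{payload}".encode()
            for _ in range(retries):
                sock.sendto(msg, dest)
                try:
                    data, _ = sock.recvfrom(1024)
                    if data.decode() == f"ACK|{seq}":
                        return True          # acknowledged: delivery confirmed
                except socket.timeout:
                    continue                 # lost or delayed: retransmit
            raise TimeoutError("peer unreachable after retries")

    The point of the paper is that once such a loop is verified against a dependent separation protocol, clients of the library can reason as if the channel were reliable, without re-proving the retransmission logic.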

    Towards Network-Accelerated Databases

    Throughout the last years, data processing systems have seen substantial changes, notably a move towards disaggregation of resources. This shift separates compute and storage resources into distinct servers for better resource utilization, as they can now be scaled independently based on demand. This development is crucial for cloud-native Database Management Systems (DBMS), which largely build on such disaggregated structures. This thesis examines two significant hardware trends in disaggregated architectures for DBMSs: modern networks and heterogeneous computing. Modern networks such as Remote Direct Memory Access (RDMA) are critical for efficient, high-throughput, low-latency data transfer, but make it challenging for DBMSs to achieve optimal performance, because RDMA exposes a low-level interface with a multitude of performance-critical aspects to consider. To address this challenge, this thesis introduces a high-level programming interface, the Data Flow Interface, specifically targeting the needs of data-intensive processing systems. In addition, this thesis highlights the emerging trend toward programmable network devices that offer data processing capabilities in the network. This trend is especially interesting for distributed DBMSs, which must transfer large amounts of data over the network due to the disaggregated architecture, and whose typical distributed operations, such as joins, have to shuffle data between compute nodes. In the thesis, in-network processing devices are evaluated with typical DBMS operations to investigate their benefits and potential shortcomings. Another trend in the data center is the increasing heterogeneity of computing units such as GPUs and FPGAs, adopted for their fast processing capabilities. Incorporating these heterogeneous devices into disaggregated architectures with fast networks has many merits: specialized compute units can be exposed as network-attached disaggregated accelerator pools and thus provide flexible and scalable high-performance data processing. Integrating heterogeneous compute units with fast RDMA-capable networks is non-trivial, however, since networks like RDMA are typically not directly supported for devices other than CPUs and are therefore hard to integrate efficiently. The challenge of achieving efficient communication between different types of compute devices is addressed by proposing a network-driven communication scheme that leverages a programmable switch to carry out the network communication on behalf of the compute devices.
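
    To give a flavour of what a high-level interface over a low-level transport buys, the hypothetical Python sketch below mocks a flow-style channel with bounded buffering and backpressure. It is not the thesis's actual Data Flow Interface, which targets RDMA; all names are invented, and a real implementation would manage pinned memory and queue pairs instead of an in-process queue.

        # Hypothetical flow-style channel hiding transport details.
        # Invented names; a real version would sit on RDMA verbs.
        from dataclasses import dataclass, field
        from collections import deque

        @dataclass
        class FlowChannel:
            """One-directional flow of buffers between two nodes.
            This mock just queues bytes in process; the point is the
            high-level push/pull API with explicit backpressure."""
            capacity: int
            _queue: deque = field(default_factory=deque)

            def push(self, buf: bytes) -> bool:
                if len(self._queue) >= self.capacity:
                    return False            # backpressure instead of blocking
                self._queue.append(buf)
                return True

            def pull(self) -> bytes | None:
                return self._queue.popleft() if self._queue else None

        # Usage: a producer pushes partitioned tuple batches (e.g., during
        # a distributed join's shuffle phase), a consumer pulls them.
        ch = FlowChannel(capacity=4)
        ch.push(b"tuple-batch-0")
        print(ch.pull())  # b'tuple-batch-0'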

    Cost- and workload-driven data management in the cloud

    This thesis deals with the challenge of finding the right balance between consistency, availability, latency, and costs, captured by the CAP/PACELC trade-offs, in the context of distributed data management in the Cloud. At the core of this work, cost- and workload-driven data management protocols, called CCQ protocols, are developed. First, this includes the development of C3, an adaptive consistency protocol that adjusts consistency at runtime by considering consistency and inconsistency costs. Second, the development of Cumulus, an adaptive data partitioning protocol that adapts partitions to the application workload so that expensive distributed transactions are minimized or avoided. And third, the development of QuAD, a quorum-based replication protocol that constructs quorums so that, given a set of constraints, the best possible performance is achieved. The behavior of each CCQ protocol is steered by a cost model, which aims at reducing the costs and overhead of providing the desired data management guarantees. The CCQ protocols are able to continuously assess their behavior and, if necessary, adapt it at runtime based on the application workload and the cost model. This property is crucial for applications deployed in the Cloud, as they are characterized by a highly dynamic workload and high scalability and availability demands. The dynamic adaptation of behavior at runtime does not come for free, and may generate considerable overhead that outweighs the gain of adaptation. The CCQ cost models therefore incorporate a control mechanism that avoids expensive and unnecessary adaptations that provide no benefit to applications. Adaptation is a distributed activity that requires coordination between the sites of a distributed database system. The CCQ protocols implement safe online adaptation approaches, which exploit the properties of 2PC and 2PL to ensure that all sites behave in accordance with the cost model, even in the presence of arbitrary failures. It is crucial to guarantee a globally consistent view of the behavior, as otherwise the effects of the cost models are nullified. The presented protocols are implemented as part of a prototypical database system. Their modular architecture allows for a seamless extension of the optimization capabilities at any level of their implementation. Finally, the protocols are quantitatively evaluated in a series of experiments executed in a real Cloud environment. The results show their feasibility and their ability to reduce application costs and to dynamically adjust their behavior at runtime without violating correctness.
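
    As a rough illustration of quorum construction, the Python sketch below picks read and write quorums that satisfy the classical intersection constraint R + W > N while preferring low-latency sites. QuAD's actual construction is driven by its cost model and constraint set; the site latencies and majority-based sizing here are invented.

        # Latency-aware quorum selection under the classical intersection
        # constraint R + W > N. Site latencies below are invented.
        site_latency_ms = {"s1": 5, "s2": 20, "s3": 8, "s4": 40, "s5": 12}

        def pick_quorum(size):
            """Choose the `size` lowest-latency sites."""
            ranked = sorted(site_latency_ms, key=site_latency_ms.get)
            return set(ranked[:size])

        N = len(site_latency_ms)
        W = N // 2 + 1        # write quorum: majority of sites
        R = N - W + 1         # smallest read quorum with R + W > N

        write_q = pick_quorum(W)
        read_q = pick_quorum(R)
        assert read_q & write_q    # any read quorum meets the last write
        print("write quorum:", write_q, "read quorum:", read_q)

    A cost-driven protocol like QuAD replaces the fixed latency ranking with a model that weighs, for example, monetary cost and workload mix, and re-derives the quorums when that model says adaptation pays off.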