1,889 research outputs found

    TANDEM: taming failures in next-generation datacenters with emerging memory

    Get PDF
    The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery, minimizing service interruptions. Applications, owing to recovery, entail additional measures to maintain a recoverable state of data and computation logic during their failure-free execution. However, these precautionary measures have severe implications on performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges; Their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for implementing recovery, complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness including recovery; and (4) how to preserve programmer productivity (Programmability). This thesis aims to address these challenges through the following approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery. We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeop- ardizing recovery and correctness. Therefore, recovery needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitecture mechanism, lazy release persistence (LRP). Using standard LFDs benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance. We continue our discussion with memory disaggregation which decouples memory from traditional monolithic servers, offering a promising pathway for achieving very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVS. Our experimental implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel target litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART’s target testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating any intervention from the programmers

    ENHANCING CLOUD SYSTEM RUNTIME TO ADDRESS COMPLEX FAILURES

    Get PDF
    As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them. This dissertation aims to enhance distributed systems’ capabilities to detect, localize, and react to complex failures at runtime. To this end, this dissertation makes contributions to address three emerging categories of failures in cloud systems. The first part delves into the investigation of partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with silent semantic failures prevalent in cloud systems, showcasing our study findings, and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems

    Language Design for Reactive Systems: On Modal Models, Time, and Object Orientation in Lingua Franca and SCCharts

    Get PDF
    Reactive systems play a crucial role in the embedded domain. They continuously interact with their environment, handle concurrent operations, and are commonly expected to provide deterministic behavior to enable application in safety-critical systems. In this context, language design is a key aspect, since carefully tailored language constructs can aid in addressing the challenges faced in this domain, as illustrated by the various concurrency models that prevent the known pitfalls of regular threads. Today, many languages exist in this domain and often provide unique characteristics that make them specifically fit for certain use cases. This thesis evolves around two distinctive languages: the actor-oriented polyglot coordination language Lingua Franca and the synchronous statecharts dialect SCCharts. While they take different approaches in providing reactive modeling capabilities, they share clear similarities in their semantics and complement each other in design principles. This thesis analyzes and compares key design aspects in the context of these two languages. For three particularly relevant concepts, it provides and evaluates lean and seamless language extensions that are carefully aligned with the fundamental principles of the underlying language. Specifically, Lingua Franca is extended toward coordinating modal behavior, while SCCharts receives a timed automaton notation with an efficient execution model using dynamic ticks and an extension toward the object-oriented modeling paradigm

    UMSL Bulletin 2023-2024

    Get PDF
    The 2023-2024 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1088/thumbnail.jp

    UMSL Bulletin 2022-2023

    Get PDF
    The 2022-2023 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1087/thumbnail.jp

    Auditable and performant Byzantine consensus for permissioned ledgers

    Get PDF
    Permissioned ledgers allow users to execute transactions against a data store, and retain proof of their execution in a replicated ledger. Each replica verifies the transactions’ execution and ensures that, in perpetuity, a committed transaction cannot be removed from the ledger. Unfortunately, this is not guaranteed by today’s permissioned ledgers, which can be re-written if an arbitrary number of replicas collude. In addition, the transaction throughput of permissioned ledgers is low, hampering real-world deployments, by not taking advantage of multi-core CPUs and hardware accelerators. This thesis explores how permissioned ledgers and their consensus protocols can be made auditable in perpetuity; even when all replicas collude and re-write the ledger. It also addresses how Byzantine consensus protocols can be changed to increase the execution throughput of complex transactions. This thesis makes the following contributions: 1. Always auditable Byzantine consensus protocols. We present a permissioned ledger system that can assign blame to individual replicas regardless of how many of them misbehave. This is achieved by signing and storing consensus protocol messages in the ledger and providing clients with signed, universally-verifiable receipts. 2. Performant transaction execution with hardware accelerators. Next, we describe a cloud-based ML inference service that provides strong integrity guarantees, while staying compatible with current inference APIs. We change the Byzantine consensus protocol to execute machine learning (ML) inference computation on GPUs to optimize throughput and latency of ML inference computation. 3. Parallel transactions execution on multi-core CPUs. Finally, we introduce a permissioned ledger that executes transactions, in parallel, on multi-core CPUs. We separate the execution of transactions between the primary and secondary replicas. The primary replica executes transactions on multiple CPU cores and creates a dependency graph of the transactions that the backup replicas utilize to execute transactions in parallel.Open Acces

    Microcredentials to support PBL

    Get PDF

    Efficient Model Checking: The Power of Randomness

    Get PDF

    NCC: Natural Concurrency Control for Strictly Serializable Datastores by Avoiding the Timestamp-Inversion Pitfall

    Full text link
    Strictly serializable datastores greatly simplify the development of correct applications by providing strong consistency guarantees. However, existing techniques pay unnecessary costs for naturally consistent transactions, which arrive at servers in an order that is already strictly serializable. We find these transactions are prevalent in datacenter workloads. We exploit this natural arrival order by executing transaction requests with minimal costs while optimistically assuming they are naturally consistent, and then leverage a timestamp-based technique to efficiently verify if the execution is indeed consistent. In the process of designing such a timestamp-based technique, we identify a fundamental pitfall in relying on timestamps to provide strict serializability, and name it the timestamp-inversion pitfall. We find timestamp-inversion has affected several existing works. We present Natural Concurrency Control (NCC), a new concurrency control technique that guarantees strict serializability and ensures minimal costs -- i.e., one-round latency, lock-free, and non-blocking execution -- in the best (and common) case by leveraging natural consistency. NCC is enabled by three key components: non-blocking execution, decoupled response control, and timestamp-based consistency check. NCC avoids timestamp-inversion with a new technique: response timing control, and proposes two optimization techniques, asynchrony-aware timestamps and smart retry, to reduce false aborts. Moreover, NCC designs a specialized protocol for read-only transactions, which is the first to achieve the optimal best-case performance while ensuring strict serializability, without relying on synchronized clocks. Our evaluation shows that NCC outperforms state-of-the-art solutions by an order of magnitude on many workloads

    Automated and foundational verification of low-level programs

    Get PDF
    Formal verification is a promising technique to ensure the reliability of low-level programs like operating systems and hypervisors, since it can show the absence of whole classes of bugs and prevent critical vulnerabilities. However, to realize the full potential of formal verification for real-world low-level programs one has to overcome several challenges, including: (1) dealing with the complexities of realistic models of real-world programming languages; (2) ensuring the trustworthiness of the verification, ideally by providing foundational proofs (i.e., proofs that can be checked by a general-purpose proof assistant); and (3) minimizing the manual effort required for verification by providing a high degree of automation. This dissertation presents multiple projects that advance formal verification along these three axes: RefinedC provides the first approach for verifying C code that combines foundational proofs with a high degree of automation via a novel refinement and ownership type system. Islaris shows how to scale verification of assembly code to realistic models of modern instruction set architectures-in particular, Armv8-A and RISC-V. DimSum develops a decentralized approach for reasoning about programs that consist of components written in multiple different languages (e.g., assembly and C), as is common for low-level programs. RefinedC and Islaris rest on Lithium, a novel proof engine for separation logic that combines automation with foundational proofs.Formale Verifikation ist eine vielversprechende Technik, um die Verlässlichkeit von grundlegenden Programmen wie Betriebssystemen sicherzustellen. Um das volle Potenzial formaler Verifikation zu realisieren, müssen jedoch mehrere Herausforderungen gemeistert werden: Erstens muss die Komplexität von realistischen Modellen von Programmiersprachen wie C oder Assembler gehandhabt werden. Zweitens muss die Vertrauenswürdigkeit der Verifikation sichergestellt werden, idealerweise durch maschinenüberprüfbare Beweise. Drittens muss die Verifikation automatisiert werden, um den manuellen Aufwand zu minimieren. Diese Dissertation präsentiert mehrere Projekte, die formale Verifikation entlang dieser Achsen weiterentwickeln: RefinedC ist der erste Ansatz für die Verifikation von C Code, der maschinenüberprüfbare Beweise mit einem hohen Grad an Automatisierung vereint. Islaris zeigt, wie die Verifikation von Assembler zu realistischen Modellen von modernen Befehlssatzarchitekturen wie Armv8-A oder RISC-V skaliert werden kann. DimSum entwickelt einen neuen Ansatz für die Verifizierung von Programmen, die aus Komponenten in mehreren Programmiersprachen bestehen (z.B., C und Assembler), wie es oft bei grundlegenden Programmen wie Betriebssystemen der Fall ist. RefinedC und Islaris basieren auf Lithium, eine neue Automatisierungstechnik für Separationslogik, die maschinenüberprüfbare Beweise und Automatisierung verbindet.This research was supported in part by a Google PhD Fellowship, in part by awards from Android Security's ASPIRE program and from Google Research, and in part by a European Research Council (ERC) Consolidator Grant for the project "RustBelt", funded under the European Union’s Horizon 2020 Framework Programme (grant agreement no. 683289)
    • …
    corecore