2,094 research outputs found

    TANDEM: taming failures in next-generation datacenters with emerging memory

    Get PDF
    The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery, minimizing service interruptions. Applications, owing to recovery, entail additional measures to maintain a recoverable state of data and computation logic during their failure-free execution. However, these precautionary measures have severe implications on performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges; Their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for implementing recovery, complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness including recovery; and (4) how to preserve programmer productivity (Programmability). This thesis aims to address these challenges through the following approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery. We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeop- ardizing recovery and correctness. Therefore, recovery needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitecture mechanism, lazy release persistence (LRP). Using standard LFDs benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance. We continue our discussion with memory disaggregation which decouples memory from traditional monolithic servers, offering a promising pathway for achieving very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVS. Our experimental implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel target litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART’s target testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating any intervention from the programmers

    ENHANCING CLOUD SYSTEM RUNTIME TO ADDRESS COMPLEX FAILURES

    Get PDF
    As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them. This dissertation aims to enhance distributed systems’ capabilities to detect, localize, and react to complex failures at runtime. To this end, this dissertation makes contributions to address three emerging categories of failures in cloud systems. The first part delves into the investigation of partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with silent semantic failures prevalent in cloud systems, showcasing our study findings, and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems

    Language Design for Reactive Systems: On Modal Models, Time, and Object Orientation in Lingua Franca and SCCharts

    Get PDF
    Reactive systems play a crucial role in the embedded domain. They continuously interact with their environment, handle concurrent operations, and are commonly expected to provide deterministic behavior to enable application in safety-critical systems. In this context, language design is a key aspect, since carefully tailored language constructs can aid in addressing the challenges faced in this domain, as illustrated by the various concurrency models that prevent the known pitfalls of regular threads. Today, many languages exist in this domain and often provide unique characteristics that make them specifically fit for certain use cases. This thesis evolves around two distinctive languages: the actor-oriented polyglot coordination language Lingua Franca and the synchronous statecharts dialect SCCharts. While they take different approaches in providing reactive modeling capabilities, they share clear similarities in their semantics and complement each other in design principles. This thesis analyzes and compares key design aspects in the context of these two languages. For three particularly relevant concepts, it provides and evaluates lean and seamless language extensions that are carefully aligned with the fundamental principles of the underlying language. Specifically, Lingua Franca is extended toward coordinating modal behavior, while SCCharts receives a timed automaton notation with an efficient execution model using dynamic ticks and an extension toward the object-oriented modeling paradigm

    Teleoperation Methods for High-Risk, High-Latency Environments

    Get PDF
    In-Space Servicing, Assembly, and Manufacturing (ISAM) can enable larger-scale and longer-lived infrastructure projects in space, with interest ranging from commercial entities to the US government. Servicing, in particular, has the potential to vastly increase the usable lifetimes of satellites. However, the vast majority of spacecraft on low Earth orbit today were not designed to be serviced on-orbit. As such, several of the manipulations during servicing cannot easily be automated and instead require ground-based teleoperation. Ground-based teleoperation of on-orbit robots brings its own challenges of high latency communications, with telemetry delays of several seconds, and difficulties in visualizing the remote environment due to limited camera views. We explore teleoperation methods to alleviate these difficulties, increase task success, and reduce operator load. First, we investigate a model-based teleoperation interface intended to provide the benefits of direct teleoperation even in the presence of time delay. We evaluate the model-based teleoperation method using professional robot operators, then use feedback from that study to inform the design of a visual planning tool for this task, Interactive Planning and Supervised Execution (IPSE). We describe and evaluate the IPSE system and two interfaces, one 2D using a traditional mouse and keyboard and one 3D using an Intuitive Surgical da Vinci master console. We then describe and evaluate an alternative 3D interface using a Meta Quest head-mounted display. Finally, we describe an extension of IPSE to allow human-in-the-loop planning for a redundant robot. Overall, we find that IPSE improves task success rate and decreases operator workload compared to a conventional teleoperation interface

    Towards a centralized multicore automotive system

    Get PDF
    Today’s automotive systems are inundated with embedded electronics to host chassis, powertrain, infotainment, advanced driver assistance systems, and other modern vehicle functions. As many as 100 embedded microcontrollers execute hundreds of millions of lines of code in a single vehicle. To control the increasing complexity in vehicle electronics and services, automakers are planning to consolidate different on-board automotive functions as software tasks on centralized multicore hardware platforms. However, these vehicle software services have different and contrasting timing, safety, and security requirements. Existing vehicle operating systems are ill-equipped to provide all the required service guarantees on a single machine. A centralized automotive system aims to tackle this by assigning software tasks to multiple criticality domains or levels according to their consequences of failures, or international safety standards like ISO 26262. This research investigates several emerging challenges in time-critical systems for a centralized multicore automotive platform and proposes a novel vehicle operating system framework to address them. This thesis first introduces an integrated vehicle management system (VMS), called DriveOS™, for a PC-class multicore hardware platform. Its separation kernel design enables temporal and spatial isolation among critical and non-critical vehicle services in different domains on the same machine. Time- and safety-critical vehicle functions are implemented in a sandboxed Real-time Operating System (OS) domain, and non-critical software is developed in a sandboxed general-purpose OS (e.g., Linux, Android) domain. To leverage the advantages of model-driven vehicle function development, DriveOS provides a multi-domain application framework in Simulink. This thesis also presents a real-time task pipeline scheduling algorithm in multiprocessors for communication between connected vehicle services with end-to-end guarantees. The benefits and performance of the overall automotive system framework are demonstrated with hardware-in-the-loop testing using real-world applications, car datasets and simulated benchmarks, and with an early-stage deployment in a production-grade luxury electric vehicle

    Design and Implementation of a Portable Framework for Application Decomposition and Deployment in Edge-Cloud Systems

    Get PDF
    The emergence of cyber-physical systems has brought about a significant increase in complexity and heterogeneity in the infrastructure on which these systems are deployed. One particular example of this complexity is the interplay between cloud, fog, and edge computing. However, the complexity of these systems can pose challenges when it comes to implementing self-organizing mechanisms, which are often designed to work on flat networks. Therefore, it is essential to separate the application logic from the specific deployment aspects to promote reusability and flexibility in infrastructure exploitation. To address this issue, a novel approach called "pulverization" has been proposed. This approach involves breaking down the system into smaller computational units, which can then be deployed on the available infrastructure. In this thesis, the design and implementation of a portable framework that enables the "pulverization" of cyber-physical systems are presented. The main objective of the framework is to pave the way for the deployment of cyber-physical systems in the edge-cloud continuum by reducing the complexity of the infrastructure and exploit opportunistically the heterogeneous resources available on it. Different scenarios are presented to highlight the effectiveness of the framework in different heterogeneous infrastructures and devices. Current limitations and future work are examined to identify improvement areas for the framework

    Distributed System Fuzzing

    Full text link
    Grey-box fuzzing is the lightweight approach of choice for finding bugs in sequential programs. It provides a balance between efficiency and effectiveness by conducting a biased random search over the domain of program inputs using a feedback function from observed test executions. For distributed system testing, however, the state-of-practice is represented today by only black-box tools that do not attempt to infer and exploit any knowledge of the system's past behaviours to guide the search for bugs. In this work, we present Mallory: the first framework for grey-box fuzz-testing of distributed systems. Unlike popular black-box distributed system fuzzers, such as Jepsen, that search for bugs by randomly injecting network partitions and node faults or by following human-defined schedules, Mallory is adaptive. It exercises a novel metric to learn how to maximize the number of observed system behaviors by choosing different sequences of faults, thus increasing the likelihood of finding new bugs. The key enablers for our approach are the new ideas of timeline-driven testing and timeline abstraction that provide the feedback function guiding a biased random search for failures. Mallory dynamically constructs Lamport timelines of the system behaviour, abstracts these timelines into happens-before summaries, and introduces faults guided by its real-time observation of the summaries. We have evaluated Mallory on a diverse set of widely-used industrial distributed systems. Compared to the start-of-the-art black-box fuzzer Jepsen, Mallory explores more behaviours and takes less time to find bugs. Mallory discovered 22 zero-day bugs (of which 18 were confirmed by developers), including 10 new vulnerabilities, in rigorously-tested distributed systems such as Braft, Dqlite, and Redis. 6 new CVEs have been assigned

    SUTMS - Unified Threat Management Framework for Home Networks

    Get PDF
    Home networks were initially designed for web browsing and non-business critical applications. As infrastructure improved, internet broadband costs decreased, and home internet usage transferred to e-commerce and business-critical applications. Today’s home computers host personnel identifiable information and financial data and act as a bridge to corporate networks via remote access technologies like VPN. The expansion of remote work and the transition to cloud computing have broadened the attack surface for potential threats. Home networks have become the extension of critical networks and services, hackers can get access to corporate data by compromising devices attacked to broad- band routers. All these challenges depict the importance of home-based Unified Threat Management (UTM) systems. There is a need of unified threat management framework that is developed specifically for home and small networks to address emerging security challenges. In this research, the proposed Smart Unified Threat Management (SUTMS) framework serves as a comprehensive solution for implementing home network security, incorporating firewall, anti-bot, intrusion detection, and anomaly detection engines into a unified system. SUTMS is able to provide 99.99% accuracy with 56.83% memory improvements. IPS stands out as the most resource-intensive UTM service, SUTMS successfully reduces the performance overhead of IDS by integrating it with the flow detection mod- ule. The artifact employs flow analysis to identify network anomalies and categorizes encrypted traffic according to its abnormalities. SUTMS can be scaled by introducing optional functions, i.e., routing and smart logging (utilizing Apriori algorithms). The research also tackles one of the limitations identified by SUTMS through the introduction of a second artifact called Secure Centralized Management System (SCMS). SCMS is a lightweight asset management platform with built-in security intelligence that can seamlessly integrate with a cloud for real-time updates

    Tradition and Innovation in Construction Project Management

    Get PDF
    This book is a reprint of the Special Issue 'Tradition and Innovation in Construction Project Management' that was published in the journal Buildings

    Design of autonomous robotic system for removal of porcupine crab spines

    Get PDF
    Among various types of crabs, the porcupine crab is recognized as a highly potential crab meat resource near the off-shore northwest Atlantic ocean. However, their long, sharp spines make it difficult to be manually handled. Despite the fact that automation technology is widely employed in the commercial seafood processing industry, manual processing methods still dominate in today’s crab processing, which causes low production rates and high manufacturing costs. This thesis proposes a novel robot-based porcupine crab spine removal method. Based on the 2D image and 3D point cloud data captured by the Microsoft Azure Kinect 3D RGB-D camera, the crab’s 3D point cloud model can be reconstructed by using the proposed point cloud processing method. After that, the novel point cloud slicing method and the 2D image and 3D point cloud combination methods are proposed to generate the robot spine removal trajectory. The 3D model of the crab with the actual dimension, robot working cell, and endeffector are well established in Solidworks [1] and imported into the Robot Operating System (ROS) [2] simulation environment for methodology validation and design optimization. The simulation results show that both the point cloud slicing method and the 2D and 3D combination methods can generate a smooth and feasible trajectory. Moreover, compared with the point cloud slicing method, the 2D and 3D combination method is more precise and efficient, which has been validated in the real experiment environment. The automated experiment platform, featuring a 3D-printed end-effector and crab model, has been successfully set up. Results from the experiments indicate that the crab model can be accurately reconstructed, and the central line equations of each spine were calculated to generate a spine removal trajectory. Upon execution with a real robot arm, all spines were removed successfully. This thesis demonstrates the proposed method’s capability to achieve expected results and its potential for application in various manufacturing processes such as painting, polishing, and deburring for parts of different shapes and materials
    • …
    corecore