    Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)

    Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors. Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging.
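To make the idea of systematic testing concrete, here is a minimal sketch that is not taken from the paper and does not use its tooling. It assumes a toy scenario of our own: two workers each perform a non-atomic increment (a read step followed by a write step) on a shared counter. Instead of running the workers and hoping for a bad schedule, the sketch enumerates every interleaving and records the ones that violate the specification "the counter ends at 2". The names `interleavings` and `run_schedule` are hypothetical.

```python
from itertools import combinations

STEPS_PER_WORKER = 2  # step 0 = read the shared counter, step 1 = write it back


def interleavings(steps=STEPS_PER_WORKER):
    """Yield every schedule of two workers as a list of worker ids (0 or 1)."""
    total = 2 * steps
    # Choose which positions in the schedule belong to worker 0.
    for positions in combinations(range(total), steps):
        chosen = set(positions)
        yield [0 if i in chosen else 1 for i in range(total)]


def run_schedule(schedule):
    """Execute the two non-atomic increments in the given order."""
    shared = 0
    local = {0: None, 1: None}   # value each worker read
    progress = {0: 0, 1: 0}      # next step index for each worker
    for worker in schedule:
        if progress[worker] == 0:
            local[worker] = shared          # read step
        else:
            shared = local[worker] + 1      # write step
        progress[worker] += 1
    return shared


all_schedules = list(interleavings())
# Each failing schedule doubles as a replayable trace of the bug.
failing = [s for s in all_schedules if run_schedule(s) != 2]
```

Because every schedule is explored, the lost-update bug is found deterministically in a very small setting, and the failing schedule itself can be replayed while debugging.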

    Property-Based Testing - The ProTest Project

    The ProTest project is an FP7 STREP on property-based testing. The purpose of the project is to develop software engineering approaches that improve the reliability of service-oriented networks and support fault-finding and diagnosis based on specified properties of the system. To do so, we will build automated tools that generate and run tests, monitor execution at run-time, and log events for analysis. The Erlang / Open Telecom Platform has been chosen as our initial implementation vehicle due to its robustness and reliability within the telecoms sector. It is noted for its success in the ATM telecoms switches built by Ericsson, one of the project partners, as well as for multiple other uses, such as at Facebook and Yahoo. In this paper we provide an overview of the project goals and detail initial progress in developing property-based testing techniques and tools for the concurrent functional programming language Erlang.
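As an illustration of what property-based testing means, the following is a minimal sketch in Python rather than Erlang, and it does not reproduce any ProTest tool. It assumes a toy run-length codec of our own and a hand-rolled checker; `check_property`, `random_text`, and `roundtrip_holds` are hypothetical names. The property states that decoding an encoded string returns the original string, and the checker tries it on many randomly generated inputs.

```python
import random


def run_length_encode(text):
    """Encode a string as a list of (character, count) pairs."""
    encoded = []
    for ch in text:
        if encoded and encoded[-1][0] == ch:
            encoded[-1] = (ch, encoded[-1][1] + 1)
        else:
            encoded.append((ch, 1))
    return encoded


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return "".join(ch * count for ch, count in pairs)


def check_property(prop, generator, trials=200, seed=0):
    """Run prop on random inputs; return the first counterexample, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = generator(rng)
        if not prop(value):
            return value
    return None


def random_text(rng):
    """Generate a short random string over a two-letter alphabet."""
    return "".join(rng.choice("ab") for _ in range(rng.randint(0, 12)))


def roundtrip_holds(text):
    """Property: decoding the encoding gives back the input."""
    return run_length_decode(run_length_encode(text)) == text


counterexample = check_property(roundtrip_holds, random_text)
```

The test author writes one property instead of many hand-picked cases. Mature tools add features this sketch omits, such as shrinking a counterexample to a minimal failing input.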

    Context-aware Path Ranking for Knowledge Base Completion

    Knowledge base (KB) completion aims to infer missing facts from existing ones in a KB. Among various approaches, path ranking (PR) algorithms have received increasing attention in recent years. PR algorithms enumerate paths between entity pairs in a KB and use those paths as features to train a model for missing fact prediction. Because of their good performance and high model interpretability, several PR methods have been proposed. However, most existing methods suffer from scalability (high RAM consumption) and feature explosion (training on an exponentially large number of features) problems. This paper proposes a Context-aware Path Ranking (C-PR) algorithm to solve these problems by introducing a selective path exploration strategy. C-PR learns global semantics of entities in the KB using word embedding and leverages the knowledge of entity semantics to enumerate contextually relevant paths using a bidirectional random walk. Experimental results on three large KBs show that the path features (fewer in number) discovered by C-PR not only improve predictive performance but also are more interpretable than those of existing baselines.
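The path-enumeration step that PR algorithms share can be sketched as follows. This is a plain exhaustive enumeration under our own assumptions, not the C-PR algorithm: it uses no embeddings and no random walk, and the toy triples and the names `build_graph` and `relation_paths` are hypothetical. It returns the relation sequences that connect an entity pair, which a PR model would then use as features.

```python
def build_graph(triples):
    """Index (head, relation, tail) triples by entity, adding inverse edges."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, []).append((relation + "^-1", head))
    return graph


def relation_paths(graph, source, target, max_length):
    """Return relation sequences of simple paths from source to target."""
    found = set()

    def walk(node, path, visited):
        if node == target and path:
            found.add(tuple(path))
            return
        if len(path) == max_length:
            return  # length bound keeps the feature space finite
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                walk(neighbor, path + [relation], visited | {neighbor})

    walk(source, [], {source})
    return found


triples = [
    ("alice", "born_in", "paris"),
    ("paris", "city_of", "france"),
    ("alice", "works_for", "acme"),
    ("acme", "based_in", "france"),
]
graph = build_graph(triples)
features = relation_paths(graph, "alice", "france", max_length=2)
```

Even in this toy graph the number of candidate paths grows quickly with `max_length`, which is the feature explosion that a selective exploration strategy aims to avoid.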

    Network emulation focusing on QoS-Oriented satellite communication

    This chapter presents the basics of network emulation and a complete case study of QoS-oriented satellite communication.

    Towards Data-Driven Autonomics in Data Centers

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict if machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for BigQuery and classification analyses are publicly available from the authors' website.
    Comment: 12 pages, 6 figures
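The evaluation style described above, reporting the true positive rate and precision under a cap on the false positive rate, can be sketched as follows. This is an illustration under our own assumptions, not the authors' scripts: the scores and labels are made-up toy data, `rates_at_fpr_cap` is a hypothetical helper, and the data is assumed to contain at least one positive and one negative example.

```python
def rates_at_fpr_cap(scores, labels, max_fpr):
    """Find the lowest threshold whose false positive rate stays within max_fpr.

    Returns (threshold, true_positive_rate, false_positive_rate, precision),
    or None if even the highest threshold exceeds the cap.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    # Lowering the threshold can only add predicted positives, so the
    # false positive rate is non-decreasing along this loop.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fpr = fp / negatives
        if fpr > max_fpr:
            break
        best = (threshold, tp / positives, fpr, tp / (tp + fp))
    return best


# Toy data (hypothetical): 4 machines that failed, 20 that did not.
scores = [0.9, 0.8, 0.6, 0.3] + [0.7, 0.5, 0.4] + [0.2] * 17
labels = [1, 1, 1, 1] + [0] * 20

threshold, tpr, fpr, precision = rates_at_fpr_cap(scores, labels, max_fpr=0.05)
```

With this toy data the cap of 5% allows one false alarm among the 20 healthy machines, and the helper reports the best true positive rate and precision reachable within that budget.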

    Crux: Locality-Preserving Distributed Services

    Distributed systems achieve scalability by distributing load across many machines, but wide-area deployments can introduce worst-case response latencies proportional to the network's diameter. Crux is a general framework to build locality-preserving distributed systems, by transforming an existing scalable distributed algorithm A into a new locality-preserving algorithm ALP, which guarantees for any two clients u and v interacting via ALP that their interactions exhibit worst-case response latencies proportional to the network latency between u and v. Crux builds on compact-routing theory, but generalizes these techniques beyond routing applications. Crux provides weak and strong consistency flavors, and shows latency improvements for localized interactions in both cases, specifically up to several orders of magnitude for weakly-consistent Crux (from roughly 900ms to 1ms). We deployed on PlanetLab locality-preserving versions of a Memcached distributed cache, a Bamboo distributed hash table, and a Redis publish/subscribe service. Our results indicate that Crux is effective and applicable to a variety of existing distributed algorithms.
    Comment: 11 figures