Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)
Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors. Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging.
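The idea of systematic testing can be illustrated with a toy sketch. This is not the methodology or tooling from the paper: the store, the two clients, and the names `interleavings`, `run`, and `find_violations` are hypothetical, and the sketch covers only concurrency, not failures. It enumerates every schedule of two unsynchronized read-then-write clients and reports the schedules that break the expected final value, each of which serves as a replayable trace.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's internal order."""
    n, m = len(a), len(b)
    for positions in combinations(range(n + m), n):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n + m)]


def run(schedule):
    """Execute one schedule against a toy shared counter and return its final value."""
    store = {"x": 0}
    local = {}  # last value each client read
    for client, op in schedule:
        if op == "read":
            local[client] = store["x"]
        else:  # "write": store the previously read value plus one
            store["x"] = local[client] + 1
    return store["x"]


def find_violations():
    """Return every schedule in which an increment is lost (final value != 2)."""
    a = [("A", "read"), ("A", "write")]
    b = [("B", "read"), ("B", "write")]
    return [s for s in interleavings(a, b) if run(s) != 2]


if __name__ == "__main__":
    for trace in find_violations():
        print(trace)
```

Exhaustive enumeration is only feasible for very small settings like this one, which is consistent with the abstract's point that bugs are uncovered in a small setting and witnessed by a trace.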
Property-Based Testing - The ProTest Project
The ProTest project is an FP7 STREP on property-based testing. The purpose of the project is to develop software engineering approaches that improve the reliability of service-oriented networks and support fault-finding and diagnosis based on specified properties of the system. To do so, we will build automated tools that generate and run tests, monitor execution at run-time, and log events for analysis.
The Erlang / Open Telecom Platform has been chosen as our initial implementation vehicle due to its robustness and reliability within the telecoms sector. It is noted for its success in the ATM telecoms switches built by Ericsson, one of the project partners, as well as for multiple other uses, such as at Facebook and Yahoo. In this paper we provide an overview of the project goals and detail initial progress in developing property-based testing techniques and tools for the concurrent functional programming language Erlang.
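A minimal sketch of the property-based testing idea follows. It is written in Python with only the standard library, not in Erlang and not with the ProTest tools. The helper names (`check_property`, `gen_int_list`, `buggy_sort`) are hypothetical. A generator produces inputs of growing size, a property is checked against each one, and the first falsifying input is returned as a counterexample.

```python
import random


def check_property(prop, gen, trials=200, seed=0):
    """Return the first generated input that falsifies prop, or None if none does."""
    rng = random.Random(seed)
    for i in range(trials):
        value = gen(rng, size=i % 11)  # cycle through input sizes 0..10
        if not prop(value):
            return value
    return None


def gen_int_list(rng, size):
    """Generate a list of `size` integers drawn from 0..5."""
    return [rng.randint(0, 5) for _ in range(size)]


def prop_sort_preserves_length(sort_fn):
    """Build the property: sorting must not change the number of elements."""
    return lambda xs: len(sort_fn(xs)) == len(xs)


def buggy_sort(xs):
    """Deliberately wrong sort used for illustration: it drops duplicates."""
    return sorted(set(xs))


if __name__ == "__main__":
    print(check_property(prop_sort_preserves_length(sorted), gen_int_list))
    print(check_property(prop_sort_preserves_length(buggy_sort), gen_int_list))
```

Real property-based testing tools additionally shrink a counterexample to a minimal failing input; this sketch omits that step.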
Context-aware Path Ranking for Knowledge Base Completion
Knowledge base (KB) completion aims to infer missing facts from existing ones
in a KB. Among various approaches, path ranking (PR) algorithms have received
increasing attention in recent years. PR algorithms enumerate paths between
entity pairs in a KB and use those paths as features to train a model for
missing fact prediction. Because PR models perform well and are highly
interpretable, several such methods have been proposed. However, most existing
methods suffer from poor scalability (high RAM consumption) and feature explosion
(training on an exponentially large number of features). This paper
proposes a Context-aware Path Ranking (C-PR) algorithm to solve these problems
by introducing a selective path exploration strategy. C-PR learns global
semantics of entities in the KB using word embedding and leverages the
knowledge of entity semantics to enumerate contextually relevant paths using
bidirectional random walk. Experimental results on three large KBs show that
the path features (fewer in number) discovered by C-PR not only improve
predictive performance but also are more interpretable than those of existing baselines.
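To make the notion of path features concrete, here is a small sketch of plain path enumeration between an entity pair. It is not the C-PR algorithm: it uses an exhaustive, length-bounded depth-first search rather than embedding-guided bidirectional random walks. The toy triples and the names `build_graph` and `relation_paths` are hypothetical.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (head, relation, tail) triples as adjacency lists, adding inverse edges."""
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
        graph[tail].append((rel + "^-1", head))
    return graph


def relation_paths(graph, source, target, max_len):
    """Return the relation sequences of length <= max_len that link source to target."""
    found = set()

    def walk(node, path, visited):
        if node == target and path:
            found.add(tuple(path))
            return
        if len(path) == max_len:
            return
        for rel, nxt in graph.get(node, ()):
            if nxt not in visited:  # keep paths simple (no repeated entities)
                walk(nxt, path + [rel], visited | {nxt})

    walk(source, [], {source})
    return found


if __name__ == "__main__":
    kb = [
        ("alice", "born_in", "paris"),
        ("paris", "capital_of", "france"),
        ("alice", "works_in", "lyon"),
        ("lyon", "city_of", "france"),
    ]
    print(relation_paths(build_graph(kb), "alice", "france", max_len=2))
```

Each returned relation sequence would become one feature for the entity pair. The exhaustive search shown here is what grows exponentially with path length, which is the problem a selective exploration strategy is meant to address.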
Network emulation focusing on QoS-Oriented satellite communication
This chapter presents the basics of network emulation and a complete case study of QoS-oriented satellite communication.
Towards Data-Driven Autonomics in Data Centers
Continued reliance on human operators for managing data centers is a major
impediment to their ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using generated
data, opens one possible path towards limiting the role of operators in data
centers. In this paper, we present a data-science study of a public Google
dataset collected in a 12K-node cluster with the goal of building and
evaluating a predictive model for node failures. We use BigQuery, the big data
SQL platform from the Google Cloud suite, to process massive amounts of data
and generate a rich feature set characterizing machine state over time. We
describe how an ensemble classifier can be built out of many Random Forest
classifiers each trained on these features, to predict if machines will fail in
a future 24-hour window. Our evaluation reveals that if we limit false positive
rates to 5%, we can achieve true positive rates between 27% and 88% with
precision varying between 50% and 72%. We discuss the practicality of including
our predictive model as the central component of a data-driven autonomic
manager and operating it on-line with live data streams (rather than off-line
on data logs). All of the scripts used for BigQuery and classification analyses
are publicly available from the authors' website. Comment: 12 pages, 6 figures
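The evaluation criterion quoted above (true positive rate and precision at a capped false positive rate) can be sketched as follows. This is an illustrative calculation only, not the authors' scripts: the function names are hypothetical, and the scores are assumed to be failure probabilities from some classifier, with label 1 meaning the machine failed.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR, precision) when predicting failure for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fpr, precision


def best_threshold(scores, labels, max_fpr=0.05):
    """Return (threshold, TPR, FPR, precision) for the lowest observed score
    threshold whose FPR stays within max_fpr, or None if no threshold qualifies."""
    best = None
    for t in sorted(set(scores), reverse=True):
        tpr, fpr, precision = rates_at_threshold(scores, labels, t)
        if fpr > max_fpr:
            break  # FPR can only grow as the threshold drops further
        best = (t, tpr, fpr, precision)
    return best


if __name__ == "__main__":
    # Made-up data: 4 failing machines and 40 healthy ones.
    scores = [0.9, 0.8, 0.6, 0.1] + [0.7] + [0.2] * 39
    labels = [1, 1, 1, 1] + [0] * 40
    print(best_threshold(scores, labels))
```

Lowering the threshold raises the true positive rate and the false positive rate together, so capping the latter fixes the operating point at which the former is reported.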
Crux: Locality-Preserving Distributed Services
Distributed systems achieve scalability by distributing load across many
machines, but wide-area deployments can introduce worst-case response latencies
proportional to the network's diameter. Crux is a general framework to build
locality-preserving distributed systems, by transforming an existing scalable
distributed algorithm A into a new locality-preserving algorithm ALP, which
guarantees for any two clients u and v interacting via ALP that their
interactions exhibit worst-case response latencies proportional to the network
latency between u and v. Crux builds on compact-routing theory, but generalizes
these techniques beyond routing applications. Crux provides weak and strong
consistency flavors, and shows latency improvements for localized interactions
in both cases, specifically up to several orders of magnitude for
weakly-consistent Crux (from roughly 900ms to 1ms). On PlanetLab, we deployed
locality-preserving versions of a Memcached distributed cache, a Bamboo
distributed hash table, and a Redis publish/subscribe service. Our results indicate
that Crux is effective and applicable to a variety of existing distributed
algorithms. Comment: 11 figures