Exploring heterogeneity of unreliable machines for p2p backup
P2P architecture is a viable option for enterprise backup. In contrast to
dedicated backup servers, nowadays a standard solution, making backups directly
on an organization's workstations should be cheaper (as existing hardware is
used), more efficient (as there is no single bottleneck server) and more
reliable (as the machines are geographically dispersed).
We present the architecture of a p2p backup system that uses pairwise
replication contracts between a data owner and a replicator. In contrast to
standard p2p storage systems that directly use a DHT, the contracts allow our
system to optimize replicas' placement depending on a specific optimization
strategy, and so to take advantage of the heterogeneity of the machines and the
network. Such optimization is particularly appealing in the context of backup:
replicas can be geographically dispersed, the load sent over the network can be
minimized, or the optimization goal can be to minimize the backup/restore time.
However, managing the contracts, keeping them consistent and adjusting them in
response to a dynamically changing environment is challenging.
We built a scientific prototype and ran experiments on 150 workstations
in the university's computer laboratories and, separately, on 50 PlanetLab
nodes. We found that the main factor affecting the quality of the system is
the availability of the machines. Yet, our main conclusion is that it is
possible to build an efficient and reliable backup system on highly unreliable
machines (our computers had just 13% average availability).
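
To give a feel for what 13% average availability implies, here is a minimal sketch of the chance that at least one replica is online at a given moment. It assumes machines are online independently of each other and all share the same availability; both are simplifying assumptions made for illustration, not the model used in the paper. The function names are hypothetical.

```python
def restore_probability(availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` machines is online.

    Assumes independent machines with identical availability (illustrative
    assumption, not taken from the paper).
    """
    return 1.0 - (1.0 - availability) ** replicas


def replicas_needed(availability: float, target: float) -> int:
    """Smallest replica count whose restore probability reaches `target`."""
    if not (0.0 < availability <= 1.0):
        raise ValueError("availability must be in (0, 1]")
    if not (0.0 <= target < 1.0):
        raise ValueError("target must be in [0, 1)")
    k = 1
    while restore_probability(availability, k) < target:
        k += 1
    return k


if __name__ == "__main__":
    # 13% is the average availability reported in the abstract;
    # the 99% target is a hypothetical choice.
    print(replicas_needed(0.13, 0.99))
```

Under these assumptions, a 99% chance of finding some replica online at an arbitrary instant would take 34 replicas. This measures momentary reachability only; data stored on a machine that is currently offline is not lost.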
Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures
While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte-Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.
Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation
In distributed computing systems (DCSs) where server nodes can fail permanently with nonzero probability, the system performance can be assessed by means of the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework to analytically characterize the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system composed of heterogeneous nodes with stochastic service and failure times and a communication network imposing random tangible delays. The framework also permits arbitrarily specified, distributed load-balancing actions to be taken by the individual nodes in order to improve the service reliability. The presented analysis is based upon a novel use of the concept of stochastic regeneration, which is exploited to derive a system of difference-differential equations characterizing the service reliability. The theory is further utilized to optimize certain load-balancing policies for maximal service reliability; the optimization is carried out by means of an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte Carlo simulations and experimental data collected from a DCS testbed.
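
The definition of service reliability can be made concrete with a toy case that is much simpler than the framework described in the abstract. The sketch below assumes a single node, exponentially distributed service and failure times, and no network delays or load balancing; all of these are illustrative assumptions, and the function names are hypothetical. In this special case each task finishes before the failure with probability mu / (mu + lambda), so the reliability for m queued tasks is that ratio raised to the power m, which a Monte Carlo estimate should approximate.

```python
import random


def closed_form_reliability(tasks: int, service_rate: float, failure_rate: float) -> float:
    """Reliability of one node with exponential service and failure times.

    By memorylessness, each task beats the failure clock with probability
    service_rate / (service_rate + failure_rate), independently of the others.
    """
    p = service_rate / (service_rate + failure_rate)
    return p ** tasks


def simulate_reliability(tasks: int, service_rate: float, failure_rate: float,
                         runs: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate: fraction of runs where all tasks finish first."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(runs):
        failure_time = rng.expovariate(failure_rate)
        finish_time = sum(rng.expovariate(service_rate) for _ in range(tasks))
        if finish_time < failure_time:
            successes += 1
    return successes / runs


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    print(closed_form_reliability(5, 2.0, 0.1))
    print(simulate_reliability(5, 2.0, 0.1))
```

The two printed values should agree to within sampling error. Heterogeneous nodes, random delays, and task reallocation, which the paper does handle, are outside the scope of this sketch.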
An efficient and versatile approach to trust and reputation using hierarchical Bayesian modelling
In many dynamic open systems, autonomous agents must interact with one another to achieve their goals. Such agents may be self-interested and, when trusted to perform an action, may betray that trust by not performing the action as required. Due to the scale and dynamism of these systems, agents will often need to interact with other agents with which they have little or no past experience. Each agent must therefore be capable of assessing and identifying reliable interaction partners, even if it has no personal experience with them. To this end, we present HABIT, a Hierarchical And Bayesian Inferred Trust model for assessing how much an agent should trust its peers based on direct and third party information. This model is robust in environments in which third party information is malicious, noisy, or otherwise inaccurate. Although existing approaches claim to achieve this, most rely on heuristics with little theoretical foundation. In contrast, HABIT is based exclusively on principled statistical techniques: it can cope with multiple discrete or continuous aspects of trustee behaviour; it does not restrict agents to using a single shared representation of behaviour; it can improve assessment by using any observed correlation between the behaviour of similar trustees or information sources; and it provides a pragmatic solution to the whitewasher problem (in which unreliable agents assume a new identity to avoid bad reputation). In this paper, we describe the theoretical aspects of HABIT, and present experimental results that demonstrate its ability to predict agent behaviour in both a simulated environment, and one based on data from a real-world webserver domain. 
In particular, these experiments show that HABIT can predict trustee performance based on multiple representations of behaviour, and is up to twice as accurate as BLADE, an existing state-of-the-art trust model that is both statistically principled and has been previously shown to outperform a number of other probabilistic trust models.
COST-EFFICIENT RESOURCE PROVISIONING FOR CLOUD-ENABLED SCHEDULERS
Over the last decade, public cloud platforms have rapidly become the de facto computing platform for our society. To support the wide range of users and their diverse applications, public cloud platforms started to offer the same VMs under many purchasing options that differ in cost, performance, availability, and time commitments. Popular purchasing options include on-demand, reserved, and transient VM types. Reserved VMs require long time commitments, whereas users can acquire and release the on-demand (and transient) VMs at any time. While transient VMs cost significantly less than on-demand VMs, platforms may revoke them at any time. In general, the stronger the commitment, i.e., longer and less flexible, the lower the price. However, longer and less flexible time commitments can increase cloud costs for users if future workloads cannot utilize the VMs they committed to buying. Interestingly, this wide range of purchasing options provides opportunities for cost savings. However, large cloud customers often find it challenging to choose the right mix of purchasing options to minimize their long-term costs while retaining the ability to adjust their capacity up and down in response to workload variations. Thus, optimizing the cloud costs requires users to select a mix of VM purchasing options based on their short- and long-term expectations of workload utilization. Notably, hybrid clouds combine multiple VM purchasing options or private clusters with public cloud VMs to optimize the cloud costs based on their workload expectations. In this thesis, we address the challenge of choosing a mix of different VM purchasing options in the context of large cloud customers and thereby optimizing their cloud costs.
To this end, we make the following contributions: (i) design and implement a container orchestration platform (using Kubernetes) to optimize the cost of executing mixed interactive and batch workloads on cloud platforms using on-demand and transient VMs, (ii) develop simple analytical models for different straggler mitigation techniques to better understand the cost of synchronization in distributed machine learning workloads and compare their cost and performance on on-demand and transient VMs, (iii) design multiple policies to optimize long-term cloud costs by selecting a mix of VM purchasing options based on short- and long-term expectations of workload utilization (with no job waiting), (iv) introduce the concept of waiting policy for cloud-enabled schedulers, and show that provisioning long-term resources (e.g., reserved VMs) to optimize the cloud costs is dependent on it, and (v) design and implement speculative execution and ML-based waiting time predictions (for waiting policies) to show that optimizing job waiting in the cloud is possible without accurate job runtime predictions.
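
The trade-off between commitment and flexibility can be illustrated with a deliberately small sketch. It assumes a known hourly demand trace, reserved VMs billed for every hour whether used or not, on-demand VMs billed only for the hours they cover, and no job waiting. The prices, the demand trace, and the function names are hypothetical, and the brute-force search stands in for the thesis's policies rather than reproducing them.

```python
from typing import Sequence, Tuple


def total_cost(reserved: int, demand: Sequence[int],
               reserved_price: float, on_demand_price: float) -> float:
    """Cost of holding `reserved` VMs for the whole trace plus on-demand overflow."""
    commitment = reserved * reserved_price * len(demand)
    overflow = sum(max(0, d - reserved) for d in demand)
    return commitment + overflow * on_demand_price


def best_reservation(demand: Sequence[int], reserved_price: float,
                     on_demand_price: float) -> Tuple[int, float]:
    """Try every reservation level from 0 to peak demand and keep the cheapest."""
    candidates = range(max(demand) + 1)
    best = min(candidates,
               key=lambda r: total_cost(r, demand, reserved_price, on_demand_price))
    return best, total_cost(best, demand, reserved_price, on_demand_price)


if __name__ == "__main__":
    # Hypothetical trace: steady load of about 2 VMs with one short spike.
    demand = [2, 2, 3, 10, 2, 2]
    print(best_reservation(demand, reserved_price=0.6, on_demand_price=1.0))
```

With this trace the search reserves 2 VMs for the steady load and leaves the spike to on-demand capacity, which matches the intuition that committing to peak capacity wastes money when the peak is rare.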