
    Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures

    While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte-Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.
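The idea of drawing correlated-failure patterns from probabilistic shared risk groups can be illustrated with a minimal Python sketch. It is not the paper's model: it assumes each group is a pair of an event probability and a map from CE name to the conditional probability that the CE fails when the event occurs, and it ignores task service times and reallocation entirely. The names `sample_failures`, `estimate_survival`, and the example group values are hypothetical.

```python
import random

def sample_failures(groups, rng):
    """Draw one correlated-failure pattern and return the set of failed CEs.

    `groups` is a list of (event_prob, members) pairs, where `members` maps
    a CE name to its failure probability given that the shared event occurs.
    This data layout is an assumption made for illustration.
    """
    failed = set()
    for event_prob, members in groups:
        if rng.random() < event_prob:            # the shared risk event occurs
            for ce, cond_prob in members.items():
                if rng.random() < cond_prob:     # this CE fails given the event
                    failed.add(ce)
    return failed

def estimate_survival(groups, ces, trials=20000, seed=0):
    """Monte Carlo estimate of P(at least one CE in `ces` does not fail)."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        failed = sample_failures(groups, rng)
        if any(ce not in failed for ce in ces):
            survived += 1
    return survived / trials

if __name__ == "__main__":
    # Hypothetical groups: ce2 belongs to both, so its failures correlate
    # with those of ce1 and ce3.
    groups = [
        (0.10, {"ce1": 0.9, "ce2": 0.9}),
        (0.05, {"ce2": 0.5, "ce3": 0.8}),
    ]
    print(estimate_survival(groups, ["ce1", "ce2", "ce3"]))
```

Because a single event can take down several CEs in one draw, the sampled patterns are correlated even though every individual random draw is independent.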

    Data Preservation Under Spatial Failures in Sensor Networks

    In this paper, we address the problem of preserving generated data in a sensor network in case of node failures. We focus on the type of node failures that have explicit spatial shapes such as circles or rectangles (e.g., modeling a bomb attack or a river overflow). We consider two different schemes for introducing redundancy in the network, by simply replicating data or by using erasure codes, with the objective of minimizing the communication cost incurred to build such data redundancy. We prove that the problem is NP-hard using either replication or coding. We propose an O(α)-approximation algorithm for each of the schemes, where α is the “fatness” of the potential node failure events. We also design a distributed approximation algorithm using erasure codes. Simulation results show that by exploiting the spatial properties of the node failure patterns, one can substantially reduce the communication cost, compared with resilient data storage schemes in the prior literature.
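The difference between the two redundancy schemes under a circular failure can be sketched in a few lines of Python. This is only an illustration of the survival condition, not the paper's approximation algorithm. It assumes nodes are 2-D points, that a node fails when it lies inside or on the failure circle, and that the erasure code lets the data be rebuilt from any `k` of the stored fragments. All function names are hypothetical.

```python
import math

def surviving(points, center, radius):
    """Return the points strictly outside the circular failure region."""
    return [p for p in points if math.dist(p, center) > radius]

def replication_survives(replicas, center, radius):
    """Replication: the data survives if at least one full copy survives."""
    return len(surviving(replicas, center, radius)) >= 1

def coding_survives(fragments, k, center, radius):
    """Erasure coding (assumed: any k fragments suffice to rebuild the data)."""
    return len(surviving(fragments, center, radius)) >= k

if __name__ == "__main__":
    replicas = [(0, 0), (10, 0)]
    fragments = [(0, 0), (4, 0), (8, 0), (12, 0)]
    print(replication_survives(replicas, (0, 0), 2))   # one copy lies outside
    print(coding_survives(fragments, 2, (2, 0), 3))    # two fragments remain
```

Placing copies or fragments far enough apart that no single failure shape covers too many of them is what drives the communication cost the abstract refers to.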