9 research outputs found

    Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation

    In distributed computing systems (DCSs) where server nodes can fail permanently with nonzero probability, the system performance can be assessed by means of the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework to analytically characterize the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system composed of heterogeneous nodes with stochastic service and failure times and a communication network imposing random tangible delays. The framework also permits arbitrarily specified, distributed load-balancing actions to be taken by the individual nodes in order to improve the service reliability. The presented analysis is based upon a novel use of the concept of stochastic regeneration, which is exploited to derive a system of difference-differential equations characterizing the service reliability. The theory is further utilized to optimize certain load-balancing policies for maximal service reliability; the optimization is carried out by means of an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte Carlo simulations and experimental data collected from a DCS testbed.
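As a rough illustration of the service-reliability metric defined in the abstract above, the following sketch estimates it by Monte Carlo for a deliberately simplified system. The simplifications are assumptions of this example, not the paper's model: service and failure times are exponential, there are no communication delays and no load balancing, and each node serves only its own queue, so a run succeeds only if every node finishes its queue before it fails. The function names and parameter values are hypothetical. Under these assumptions a closed form exists, which gives a sanity check for the simulation.

```python
import random


def closed_form_reliability(tasks, service_rates, failure_rates):
    """Reliability when each node serves only its own queue (assumed setup).

    With exponential service rate mu and failure rate lam, a node finishes
    one task before failing with probability mu / (mu + lam); by
    memorylessness, it finishes m tasks with that probability raised to m.
    """
    reliability = 1.0
    for m, mu, lam in zip(tasks, service_rates, failure_rates):
        reliability *= (mu / (mu + lam)) ** m
    return reliability


def estimate_reliability(tasks, service_rates, failure_rates,
                         runs=20000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(runs):
        all_served = True
        for m, mu, lam in zip(tasks, service_rates, failure_rates):
            fail_time = rng.expovariate(lam)  # time of permanent failure
            busy_time = sum(rng.expovariate(mu) for _ in range(m))
            if busy_time > fail_time:
                all_served = False  # node failed with tasks still queued
                break
        successes += all_served
    return successes / runs


if __name__ == "__main__":
    # Hypothetical two-node system: queue lengths, service and failure rates.
    tasks, mu, lam = [3, 2], [1.0, 2.0], [0.1, 0.2]
    print("closed form:", closed_form_reliability(tasks, mu, lam))
    print("Monte Carlo:", estimate_reliability(tasks, mu, lam))
```

The two numbers should agree to within sampling error. Modeling load balancing, transfer delays, or non-exponential times would require a richer simulation than this sketch.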

    Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures

    While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte-Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.

    Maximizing service reliability in distributed computing systems with random failures: Theory and implementation

    In distributed computing systems (DCSs) where server nodes can fail permanently with nonzero probability, the system performance can be assessed by means of the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework to analytically characterize the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system composed of heterogeneous nodes with stochastic service and failure times and a communication network imposing random tangible delays. The framework also permits arbitrarily specified, distributed load-balancing actions to be taken by the individual nodes in order to improve the service reliability. The presented analysis is based upon a novel use of the concept of stochastic regeneration, which is exploited to derive a system of difference-differential equations characterizing the service reliability. The theory is further utilized to optimize certain load-balancing policies for maximal service reliability; the optimization is carried out by means of an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte Carlo simulations and experimental data collected from a DCS testbed.

    Theory of Resource Allocation for Robust Distributed Computing

    Lately, distributed computing (DC) has emerged in several application scenarios such as grid computing, high-performance and reconfigurable computing, wireless sensor networks, battle management systems, peer-to-peer networks, and donation grids. When DC is performed in these scenarios, the distributed computing system (DCS) supporting the applications not only exhibits heterogeneous computing resources and a significant communication latency, but also becomes highly dynamic because the communication network and the computing servers are affected by a wide class of anomalies that change the topology of the system in a random fashion. These anomalies exhibit spatial and/or temporal correlation when they result, for instance, from wide-area power or network outages. These correlated failures may not only inflict a large amount of damage on the system, but may also induce further failures in other servers as a result of the lack of reliable communication between the components of the DCS. In order to provide a robust DC environment in the presence of component failures, it is key to develop a general framework for accurately modeling the complex dynamics of a DCS. In this dissertation, a novel approach has been undertaken for modeling a general class of DCSs and for analytically characterizing the performance and reliability of parallel applications executed on such systems. A general probabilistic model has been constructed by assuming that the random times governing the dynamics of the DCS follow arbitrary probability distributions with heterogeneous parameters. Auxiliary age variables have been introduced in the modeling of a DCS, and a hybrid continuous and discrete state-space model of the system has been constructed.
This hybrid model has enabled the development of an age-dependent stochastic regeneration theory, which, in turn, has been employed to analytically characterize the average execution time, the quality-of-service, and the reliability in serving an application. These are three metrics of performance and reliability of practical interest in DC. Analytical approximations as well as mathematical lower and upper bounds for these metrics have also been derived in an attempt to reduce the amount of computational resources demanded by the exact characterizations. In order to systematically assess the reliability of DCSs in the presence of correlated component failures, a novel probabilistic model for spatially correlated failures has been developed. The model, based on graph theory and Markov random fields, captures both geographical and logical correlations induced by the arbitrary topology of the communication network of a DCS. The modeling framework, in conjunction with a general class of dynamic task reallocation (DTR) control policies, has been used to optimize the performance and reliability of applications in the presence of independent as well as spatially correlated anomalies. Theoretical predictions, Monte-Carlo simulations, and experimental results have shown that optimizing these metrics can significantly impact the performance of a DCS. Moreover, the general setting developed here has provided insights into: (i) the effect of different stochastic models on the accuracy of the performance and reliability metrics, (ii) the dependence of the DTR policies on system parameters such as failure rates and task-processing rates, (iii) the severe impact of correlated failures on the reliability of DCSs, (iv) the dependence of the DTR policies on the degree of correlation in the failures, and (v) the fundamental trade-off between minimizing the execution time of an application and maximizing its reliability.

    On dynamic resource allocation in systems with bursty sources

    There is a trend to use computing resources in a way that is more removed from the technical constraints. Users buy compute time on machines that they do not control and whose specifics they do not necessarily know. Conversely, this means the providers of such resources have more freedom in allocating them amongst different tasks. They can use this freedom to provide more, or better, service by reallocating resources as demand for them changes. However, deciding when to reallocate resources is not trivial. In order to make good reallocation decisions, this thesis constructs a series of models. Each of the models concerns a resource allocation problem in the presence of bursty sources. The focus of the modelling, however, varies. In its most basic form it considers several different job types competing over the allocation of a limited number of servers. The goal there is to minimize the (weighted) mean time jobs spend in the system. The weighting can reflect the relative importance of the different job types. Reallocation of servers between job types is in general considered to be neither free nor instantaneous. We then show how to find the optimal static allocation of servers over job types. Finding the optimal dynamic allocation of servers is formulated as solving a Markov decision process. We show that this is infeasible in practice for all but the simplest systems. Instead, a number of heuristics are introduced. Some are fluid-approximation based and some are parameterless, i.e. do not require a priori knowledge of the parameters of the system. The performance of these heuristic policies is then explored in a series of simulations. A slightly different model is formulated next. Its goal is not to optimize the allocation of servers over several job types, but rather between powered up and powered down states. In the powered up state servers can provide service for incoming jobs.
In the powered down state servers cannot service incoming jobs but yield a profit due to power savings. Balancing power and performance is again formulated as a Markov decision process. This is not explicitly solved; instead, some of the heuristics considered earlier are adapted to give dynamic policies for powering servers up and down. Their performance is again tested in a number of simulations, including some where the arrival process is not only bursty but also non-Markovian. The third and final model considers allocation of servers over different job types again. This time the servers experience breakdowns and subsequent repairs. During a repair period the servers cannot process any incoming jobs. To reduce the complexity of this model, it is assumed that switches of servers between job types are instantaneous, albeit not necessarily free. This is modeled as a Markov decision process and we show how to find the optimal static allocation of servers. For the dynamic allocation, previously considered heuristics are adapted again. Simulations then show the performance of these heuristics and the optimal static allocation in a number of scenarios.

    Empirical and Analytical Evaluation of Systems with Multiple Unreliable Servers

    We construct, analyze and solve models of systems where a number of servers offer services to an incoming stream of demands. Each server goes through alternating periods of being operative and inoperative. The objective is to evaluate and optimize performance and cost metrics. A large real-life data set containing information about server breakdowns is analyzed first. The results indicate that the durations of the operative periods are not distributed exponentially. However, hyperexponential distributions are found to be a good fit for the observed data. A model based on these distributions is then formulated, and is solved exactly using the method of spectral expansion. A simple approximation which is accurate for heavily loaded systems is also proposed. The results of a number of numerical experiments are reported.
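To make the contrast between exponential and hyperexponential operative periods in the abstract above concrete, the sketch below computes the mean and squared coefficient of variation (SCV) of a hyperexponential distribution, that is, a probabilistic mixture of exponentials. The function name and the example parameters are hypothetical and are not taken from the paper's data set. An exponential distribution always has an SCV of exactly 1, whereas a proper mixture with distinct rates has an SCV above 1, which is one reason such mixtures can fit highly variable durations.

```python
def hyperexp_moments(probs, rates):
    """Mean and squared coefficient of variation of a hyperexponential.

    The distribution picks phase i with probability probs[i] and then
    draws an exponential duration with rate rates[i].
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("phase probabilities must sum to 1")
    mean = sum(p / r for p, r in zip(probs, rates))
    second_moment = sum(2.0 * p / r ** 2 for p, r in zip(probs, rates))
    scv = second_moment / mean ** 2 - 1.0  # variance divided by mean squared
    return mean, scv


if __name__ == "__main__":
    # Single phase: reduces to a plain exponential distribution.
    print(hyperexp_moments([1.0], [2.0]))
    # Hypothetical two-phase mixture: mostly short periods, a few long ones.
    print(hyperexp_moments([0.9, 0.1], [1.0, 0.1]))
```

Fitting the phase probabilities and rates to measured breakdown data is a separate estimation problem that this sketch does not address.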
