
    Performance and Reliability of Non-Markovian Heterogeneous Distributed Computing Systems

    Average service time, quality-of-service (QoS), and service reliability associated with heterogeneous parallel and distributed computing systems (DCSs) are analytically characterized in a realistic setting for which tangible, stochastic communication delays are present with nonexponential distributions. The departure from the traditionally assumed exponential distributions for event times, such as task-execution times, communication arrival times and load-transfer delays, gives rise to a non-Markovian dynamical problem for which a novel age-dependent, renewal-based distributed queuing model is developed. Numerical examples offered by the model shed light on the operational and system settings for which the Markovian setting, resulting from employing an exponential-distribution assumption on the event times, yields inaccurate predictions. A key benefit of the model is that it offers a rigorous framework for devising optimal dynamic task reallocation (DTR) policies systematically in heterogeneous DCSs by optimally selecting the fraction of the excess loads that need to be exchanged among the servers, thereby controlling the degree of cooperative processing in a DCS. Key results on performance prediction and optimization of DCSs are validated using Monte-Carlo (MC) simulation as well as experiments on a distributed computing testbed. The scalability, in the number of servers, of the age-dependent model is studied and a linearly scalable analytical approximation is derived.
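The abstract's point that an exponential assumption can mispredict reliability can be illustrated with a toy calculation. The sketch below is not the paper's age-dependent model. It assumes a single server whose failure time is exponential with rate `failure_rate`, and it defines reliability as the probability that one task finishes before the server fails. All function names and parameters are hypothetical. It compares an exponential service time against a uniform one with the same mean.

```python
import math
import random


def reliability_exponential(mean_service, failure_rate):
    """P(W < Y) for W ~ Exp(mean = mean_service), Y ~ Exp(rate = failure_rate)."""
    # Closed form of E[exp(-failure_rate * W)] for exponential W.
    return 1.0 / (1.0 + failure_rate * mean_service)


def reliability_uniform(mean_service, failure_rate):
    """P(W < Y) for W ~ Uniform(0, 2 * mean_service), same mean as above."""
    a = 2.0 * failure_rate * mean_service
    # Closed form of E[exp(-failure_rate * W)] for uniform W on [0, 2 * mean].
    return (1.0 - math.exp(-a)) / a


def reliability_monte_carlo(sample_service, failure_rate, trials=200_000, seed=0):
    """Estimate P(W < Y) by simulation; sample_service(rng) draws one W."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        service_time = sample_service(rng)
        failure_time = rng.expovariate(failure_rate)
        if service_time < failure_time:
            successes += 1
    return successes / trials
```

With `mean_service = 1.0` and `failure_rate = 1.0`, the exponential formula gives 0.5 while the uniform formula gives roughly 0.43, so in this toy setting the exponential assumption overstates reliability even though both service times share the same mean. Calling `reliability_monte_carlo(lambda rng: rng.uniform(0.0, 2.0), 1.0)` provides a simulation cross-check of the uniform case.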

    Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures

    While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte-Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.
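To see why correlated failures matter, the following sketch compares a toy shared-risk model with an independent-failure model that has the same per-CE failure probability. It is only loosely inspired by the PSRG idea and is not the paper's procedure. The assumptions are that a single shared risk event occurs with probability `q_event`, that it knocks out each CE with probability `r_hit`, and that each CE can also fail on its own with probability `p_own`. All names are hypothetical.

```python
import random


def marginal_failure_prob(p_own, q_event, r_hit):
    """Per-CE failure probability under the shared-risk model."""
    return 1.0 - (1.0 - p_own) * (1.0 - q_event * r_hit)


def prob_all_fail_correlated(n, p_own, q_event, r_hit):
    """P(all n CEs fail) when one shared risk event can hit every CE."""
    # Condition on whether the shared event occurred; CEs are independent given that.
    fail_given_event = 1.0 - (1.0 - p_own) * (1.0 - r_hit)
    return (1.0 - q_event) * p_own ** n + q_event * fail_given_event ** n


def prob_all_fail_independent(n, p_own, q_event, r_hit):
    """P(all n CEs fail) if CEs failed independently with the same marginal."""
    return marginal_failure_prob(p_own, q_event, r_hit) ** n


def simulate_all_fail(n, p_own, q_event, r_hit, trials=200_000, seed=0):
    """Monte-Carlo estimate of P(all n CEs fail) under the shared-risk model."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        event = rng.random() < q_event  # did the shared risk event occur?
        all_failed = True
        for _ in range(n):
            failed = rng.random() < p_own or (event and rng.random() < r_hit)
            if not failed:
                all_failed = False
                break
        if all_failed:
            count += 1
    return count / trials
```

For `n = 3`, `p_own = 0.05`, `q_event = 0.1`, and `r_hit = 0.8`, the correlated model gives a total-failure probability of about 0.053, against about 0.002 under the independence assumption with the same marginal. That gap is the kind of effect a reallocation policy would miss if it ignored correlation.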

    Stochastic Dynamics of Cascading Failures in Electric-Cyber Infrastructures

    Emerging smart grids consist of tightly-coupled systems, namely a power grid and a communication system. While today's power grids are highly reliable and modern control and communication systems have been deployed to further enhance their reliability, historical data suggest that they are still vulnerable to large failures. A small set of initial disturbances in power grids in conjunction with lack of effective, corrective actions in a timely manner can trigger a sequence of dependent component failures, called cascading failures. The main thrust of this dissertation is to build a probabilistic framework for modeling cascading failures in power grids while capturing their interactions with the coupled communication systems so that the risk of cascading failures in the composite complex electric-cyber infrastructures can be examined, analyzed and predicted. A scalable and analytically tractable continuous-time Markov chain model for stochastic dynamics of cascading failures in power grids is constructed while retaining key physical attributes and operating characteristics of the power grid. The key idea of the proposed framework is to simplify the state space of the complex power system while capturing the effects of the omitted variables through the transition probabilities and their parametric dependence on physical attributes and operating characteristics of the system. In particular, the effects of the interdependencies between the power grid and the communication system have been captured by a parametric formulation of the transition probabilities using Monte-Carlo simulations of cascading failures. The cascading failures are simulated with a coupled power-system simulation framework, which is also developed in this dissertation. Specifically, the probabilistic model enables the prediction of the evolution of the blackout probability in time.
Furthermore, the asymptotic analysis of the blackout probability as time tends to infinity enables the calculation of the probability mass function of the blackout size, which has been shown to have a heavy tail, e.g., a power-law distribution, specifically when the grid is operating under stress scenarios. A key benefit of the model is that it enables the characterization of the severity of cascading failures in terms of a set of operating characteristics of the power grid. As a generalization of the Markov chain model, a regeneration-based model for cascading failures is also developed. The regeneration-based framework is capable of modeling cascading failures in a more general setting where the probability distribution of events in the system follows an arbitrarily specified distribution with non-Markovian characteristics. Further, a novel interdependent Markov chain model is developed, which provides a general probabilistic framework for capturing the effects of interactions among interdependent infrastructures on cascading failures. A key insight obtained from this model is that interdependencies between two systems can make two individually reliable systems behave unreliably. In particular, we show that, due to the interdependencies, two chains with non-heavy-tailed asymptotic failure distributions can result in a heavy-tailed distribution when coupled. Lastly, another aspect of future smart grids is studied by characterizing the fundamental bounds on the information rate in the sensor network that monitors the power grid. Specifically, a distributed source coding framework is presented that enables an improved estimate of the lower bound for the minimum required communication capacity to accurately describe the state of components in the information-centric power grid.
The models developed in this dissertation provide critical understanding of cascading failures in electric-cyber infrastructures and facilitate reliable and quick detection of the risk of blackouts and precursors to cascading failures. These capabilities can guide the design of efficient communication systems and cascade-aware control policies for future smart grids.