7 research outputs found

    Optimal discrete stopping times for reliability growth tests

    Often, the duration of a reliability growth development test is specified in advance and the decision to terminate or continue testing is made at discrete time intervals. These features are normally not captured by reliability growth models. This paper adapts a standard reliability growth model to determine the optimal planned termination time for testing. The underlying stochastic process is developed from an Order Statistic argument, with Bayesian inference used to estimate the number of faults within the design and classical inference procedures used to assess the rate of fault detection. Inference procedures within this framework are explored, where it is shown that the Maximum Likelihood Estimators possess a small bias and converge to the Minimum Variance Unbiased Estimator after a few tests for designs with a moderate number of faults. It is shown that the Likelihood function can be bimodal when there is conflict between the observed rate of fault detection and the prior distribution describing the number of faults in the design. An illustrative example is provided.
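    The abstract does not reproduce the paper's model, but the flavour of maximum-likelihood estimation of the number of faults can be sketched with a simple Jelinski-Moranda-type model (a hypothetical stand-in, not the authors' formulation), in which the i-th inter-failure time is exponential with rate proportional to the number of remaining faults:

```python
import math

def jm_loglik(N, phi, times):
    """Jelinski-Moranda log-likelihood: the i-th inter-failure time
    is exponential with rate phi * (N - i + 1)."""
    ll = 0.0
    for i, t in enumerate(times, start=1):
        rate = phi * (N - i + 1)
        ll += math.log(rate) - rate * t
    return ll

def estimate_faults(times, n_max=50):
    """Profile-likelihood estimate of the initial number of faults N.
    For each candidate N, the per-fault rate phi has a closed-form MLE."""
    n = len(times)
    best = None
    for N in range(n, n_max + 1):
        denom = sum((N - i + 1) * t for i, t in enumerate(times, start=1))
        phi = n / denom
        ll = jm_loglik(N, phi, times)
        if best is None or ll > best[1]:
            best = (N, ll)
    return best[0]

# Inter-failure times that stretch out as faults are removed.
times = [0.5, 0.7, 1.1, 1.9, 3.5]
print(estimate_faults(times))
```

    The profile over N is where the bimodality discussed in the paper can arise once a prior on N is added.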

    Optimal Locally Repairable Codes and Connections to Matroid Theory

    Petabyte-scale distributed storage systems are currently transitioning to erasure codes to achieve higher storage efficiency. Classical codes like Reed-Solomon are highly sub-optimal for distributed environments due to their high overhead in single-failure events. Locally Repairable Codes (LRCs) form a new family of codes that are repair efficient. In particular, LRCs minimize the number of nodes participating in single node repairs, during which they generate little network traffic. Two large-scale distributed storage systems have already implemented different types of LRCs: Windows Azure Storage and the Hadoop Distributed File System RAID used by Facebook. The fundamental bounds for LRCs, namely the best possible distance for a given code locality, were recently discovered, but few explicit constructions exist. In this work, we present explicit and optimal LRCs that are simple to construct. Our construction is based on grouping Reed-Solomon (RS) coded symbols to obtain RS coded symbols over a larger finite field. We then partition these RS symbols into small groups, and re-encode them using a simple local code that offers low repair locality. For the analysis of the optimality of the code, we derive a new result on the matroid represented by the code generator matrix. Comment: Submitted for publication; a shorter version was presented at ISIT 201
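    As a toy illustration of the local-repair idea (XOR parity over small integers standing in for finite-field RS symbols; the paper's actual construction re-encodes over a larger field):

```python
def add_local_parity(symbols, group_size):
    """Split coded symbols into groups and append one XOR parity per
    group, so any single erasure is repairable from its group alone."""
    groups = [symbols[i:i + group_size]
              for i in range(0, len(symbols), group_size)]
    coded = []
    for g in groups:
        parity = 0
        for s in g:
            parity ^= s
        coded.append(g + [parity])
    return coded

def repair(group, erased_index):
    """Recover one erased symbol by XOR-ing the surviving group members."""
    value = 0
    for i, s in enumerate(group):
        if i != erased_index:
            value ^= s
    return value

data = [3, 7, 1, 4, 9, 2]          # stand-ins for RS-coded symbols
coded = add_local_parity(data, 3)   # two groups, locality 3
assert repair(coded[0], 1) == 7     # single-node repair touches only its group
```

    Only `group_size` surviving symbols are read per repair, which is the locality benefit the abstract describes.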

    Reliability Guided Resource Allocation for Large-scale Supercomputing Systems

    In high performance computing systems, parallel applications request a large number of resources for long time periods. In this scenario, if a resource fails during the application runtime, it causes all applications using that resource to fail. The probability of application failure is tied to the inherent reliability of the resources used by the application. Our investigation of high performance computing systems operating in the field has revealed a significant difference in the measured operational reliability of individual computing nodes. By adding awareness of the individual system nodes' reliability to the scheduler, along with the predicted reliability needs of parallel applications, reliable resources can be matched with the most demanding applications to reduce the probability of application failure arising from resource failure. In this thesis, the researcher describes a new approach developed for resource allocation that can enhance the reliability and reduce the costs of failures of large-scale parallel applications that use high performance computing systems. This approach is based on a multi-class Erlang loss system that allows us to partition system resources based on predicted resource reliability, and to size each of these partitions to bound the probability of blocking requests to each partition while simultaneously improving the reliability of the most demanding parallel applications running on the system. Using this model, the partition mean time to failure (MTTF) is maximized and the probability of blocking of resource requests directed to each partition by a scheduling system can be controlled. This new technique can be used to determine the size of the system needed to service peak loads with a bounded probability of blocking resource requests. This approach would be useful for high performance computing system operators seeking to improve the reliability, efficiency and cost-effectiveness of their systems.
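    The Erlang loss machinery referenced above rests on the classical Erlang B blocking formula; a minimal sketch of sizing one partition to a blocking bound (illustrative parameter values, not from the thesis) might look like:

```python
def erlang_b(servers, offered_load):
    """Erlang B blocking probability via the standard recurrence:
    B(0, a) = 1;  B(c, a) = a*B(c-1, a) / (c + a*B(c-1, a))."""
    b = 1.0
    for c in range(1, servers + 1):
        b = offered_load * b / (c + offered_load * b)
    return b

def size_partition(offered_load, max_blocking):
    """Smallest partition (in nodes) whose blocking probability
    stays below the target bound."""
    c = 1
    while erlang_b(c, offered_load) > max_blocking:
        c += 1
    return c

# Example: a partition offered 10 Erlangs of load, blocking bounded at 1%.
print(size_partition(10.0, 0.01))
```

    Repeating this per reliability class gives the multi-class partition sizing the thesis builds on.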

    Method for evaluating an extended fault tree to analyse the dependability of complex systems: application to a satellite-based railway system

    Evaluating the dependability of complex systems requires the evolution of the system states over time to be analysed. The problem is to develop modelling approaches that adequately take into account the evolution of the different operating and failed states of the system components. The Fault Tree (FT) is a well-known method that efficiently analyses the failure causes of a system and serves for reliability and availability evaluations. As the FT is not adapted to dynamic systems with repairable multi-state components, extensions of the FT (eFT) have been developed. However, efficient quantitative evaluation processes for eFTs are missing. Petri nets have the advantage of allowing such evaluation, but their construction is difficult to manage and their simulation performance is unsatisfactory. Therefore, we propose in this paper a new powerful process to analyse eFTs quantitatively. This is based on the use of the Petri net (PN) method, which relies on the failed states highlighted by the eFT, combined with a new analytical modelling approach for critical events that depend on time duration. The performance of the new process is demonstrated through a theoretical example of an eFT, and the practical use of the method is shown on a satellite-based railway system.
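    The eFT method targets repairable, multi-state, time-dependent components, which a static fault tree cannot express; as a baseline for comparison, classical bottom-up FT evaluation for independent basic events (a hypothetical railway-flavoured tree, not the paper's case study) looks like:

```python
def ft_probability(gate, probs):
    """Evaluate a static fault tree bottom-up for independent basic
    events.  A node is either a basic-event name or a tuple
    ('AND' | 'OR', [children])."""
    if isinstance(gate, str):
        return probs[gate]
    kind, children = gate
    p_children = [ft_probability(c, probs) for c in children]
    if kind == "AND":
        p = 1.0
        for q in p_children:
            p *= q
        return p
    # OR of independent events: 1 - product of survival probabilities
    p = 1.0
    for q in p_children:
        p *= (1.0 - q)
    return 1.0 - p

# Hypothetical top event: GNSS loss OR both redundant odometers failing.
tree = ("OR", ["gnss", ("AND", ["odo1", "odo2"])])
probs = {"gnss": 1e-4, "odo1": 1e-3, "odo2": 1e-3}
print(ft_probability(tree, probs))
```

    The eFT/PN process in the paper replaces these constant probabilities with time-dependent, repairable component behaviour.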

    Network reliability analysis through survival signature and machine learning techniques

    As complex networks become ubiquitous in modern society, ensuring their reliability is crucial due to the potential consequences of network failures. However, the analysis and assessment of network reliability become computationally challenging as networks grow in size and complexity. This research proposes a novel graph-based neural network framework for accurately and efficiently estimating the survival signature and network reliability. The method incorporates a novel strategy to aggregate feature information from neighboring nodes, effectively capturing the response flow characteristics of networks. Additionally, the framework utilizes higher-order graph neural networks to further aggregate feature information from neighboring nodes and the node itself, enhancing the understanding of network topology structure. An adaptive framework along with several efficient algorithms is further proposed to improve prediction accuracy. Compared to traditional machine learning-based approaches, the proposed graph-based neural network framework integrates response flow characteristics and network topology structure information, resulting in highly accurate network reliability estimates. Moreover, once the graph-based neural network is properly constructed based on the original network, it can be directly used to estimate the network reliability of different network variants, i.e., sub-networks, which is not feasible with traditional non-machine learning methods. Several applications demonstrate the effectiveness of the proposed method in addressing network reliability analysis problems.
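    The survival signature that the neural network learns to approximate has a brute-force definition that is tractable only for tiny systems; for a single component type, Phi(l) is the fraction of l-subsets of working components under which the system functions. A sketch on a toy bridge network (not an example from the paper):

```python
from itertools import combinations
from math import comb

# Toy bridge network: edges are the components; the system works if
# node 0 can still reach node 3 over the surviving edges.
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

def connected(up_edges, source=0, target=3):
    """Breadth-first reachability over the surviving edges."""
    reached, frontier = {source}, [source]
    while frontier:
        n = frontier.pop()
        for a, b in up_edges:
            for x, y in ((a, b), (b, a)):
                if x == n and y not in reached:
                    reached.add(y)
                    frontier.append(y)
    return target in reached

def survival_signature(edges):
    """Phi(l) = fraction of l-subsets of components that keep the
    system working (one component type, exchangeable components)."""
    m = len(edges)
    phi = []
    for l in range(m + 1):
        working = sum(connected(sub) for sub in combinations(edges, l))
        phi.append(working / comb(m, l))
    return phi

print(survival_signature(EDGES))
```

    The exponential number of subsets is exactly why the paper trains a graph neural network to estimate Phi for large networks and their sub-network variants.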

    Planning Capacity for 5G and Beyond Wireless Networks by Discrete Fireworks Algorithm With Ensemble of Local Search Methods

    In densely populated urban centers, planning optimized capacity for fifth-generation (5G) and beyond wireless networks is a challenging task. In this paper, we propose a mathematical framework for capacity planning of 5G and beyond wireless networks. We consider a single-hop wireless network consisting of base stations (BSs), relay stations (RSs), and user equipment (UEs). Wireless network planning (WNP) should decide the placement of BSs and RSs at the candidate sites, the possible connections among them, and their further connections to UEs. The objective of the planning is to minimize the hardware and operational cost while planning the capacity of 5G and beyond wireless networks. The formulated WNP is an integer programming problem. Finding an optimal solution by exhaustive search is not practical due to the demand for high computing resources. As a practical approach, a new population-based meta-heuristic algorithm is proposed to find a high-quality solution. The proposed discrete fireworks algorithm (DFWA) uses an ensemble of local search methods: insert, swap, and interchange. The performance of the proposed DFWA is compared against the low-complexity biogeography-based optimization (LC-BBO), the discrete artificial bee colony (DABC), and the genetic algorithm (GA). Simulation results and statistical tests demonstrate that the proposed algorithm can comparatively find good-quality solutions with moderate computing resources.
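    The three local search moves named above (insert, swap, interchange) can be sketched on a site-placement vector; this is a generic hill-climbing ensemble with a toy cost function, not the authors' DFWA:

```python
import random

def swap(sol):
    """Exchange the positions of two randomly chosen placements."""
    s = sol[:]
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

def insert(sol):
    """Remove one placement and re-insert it at another position."""
    s = sol[:]
    i, j = random.sample(range(len(s)), 2)
    s.insert(j, s.pop(i))
    return s

def interchange(sol, sites):
    """Replace one chosen site with an unused candidate site."""
    s = sol[:]
    unused = [c for c in sites if c not in s]
    if unused:
        s[random.randrange(len(s))] = random.choice(unused)
    return s

def local_search(sol, sites, cost, iters=200):
    """Hill-climb with the ensemble of moves, keeping improvements."""
    moves = [swap, insert, lambda s: interchange(s, sites)]
    best, best_cost = sol, cost(sol)
    for _ in range(iters):
        cand = random.choice(moves)(best)
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost

# Toy objective: each candidate site's index is its cost, so cheaper
# sites should displace expensive ones via the interchange move.
random.seed(1)
sites = list(range(20))
start = [15, 17, 18, 19]
print(local_search(start, sites, cost=sum))
```

    In the paper these moves refine the sparks generated by the fireworks explosions; the real cost function covers hardware and operational cost under the WNP constraints.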