7 research outputs found

    Identifying recovery patterns from resource usage data of cluster systems

    Get PDF
    Failure of Cluster Systems has proven to be of adverse effect and it can be costly. System administrators have employed divide and conquer approach to diagnosing the root-cause of such failure in order to take corrective or preventive measures. Most times, event logs are the source of the information about the failures. Events that characterized failures are then noted and categorized as causes of failure. However, not all the ’causative’ events lead to eventual failure, as some faults sequence experience recovery. Such sequences or patterns constitute challenge to system administrators and failure prediction tools as they add to false positives. Their presence are always predicted as “failure causing“, while in reality, they will not. In order to detect such recovery patterns of events from failure patterns, we proposed a novel approach that utilizes resource usage data of cluster systems to identify recovery and failure sequences. We further propose an online detection approach to the same problem. We experiment our approach on data from Ranger Supercomputer System and the results are positive.Keywords: Change point detection; resource usage data; recovery sequence; detection; large-scale HPC system

    Towards efficient error detection in large-scale HPC systems

    Get PDF
    The need for computer systems to be reliable has increasingly become important as the dependence on their accurate functioning by users increases. The failure of these systems could very costly in terms of time and money. In as much as system's designers try to design fault-free systems, it is practically impossible to have such systems as different factors could affect them. In order to achieve system's reliability, fault tolerance methods are usually deployed; these methods help the system to produce acceptable results even in the presence of faults. Root cause analysis, a dependability method for which the causes of failures are diagnosed for the purpose of correction or prevention of future occurrence is less efficient. It is reactive and would not prevent the first failure from occurring. For this reason, methods with predictive capabilities are preferred; failure prediction methods are employed to predict the potential failures to enable preventive measures to be applied. Most of the predictive methods have been supervised, requiring accurate knowledge of the system's failures, errors and faults. However, with changing system components and system updates, supervised methods are ineffective. Error detection methods allows error patterns to be detected early to enable preventive methods to be applied. Performing this detection in an unsupervised way could be more effective as changes to systems or updates would less affect such a solution. In this thesis, we introduced an unsupervised approach to detecting error patterns in a system using its data. More specifically, the thesis investigates the use of both event logs and resource utilization data to detect error patterns. It addresses both the spatial and temporal aspects of achieving system dependability. The proposed unsupervised error detection method has been applied on real data from two different production systems. The results are positive; showing average detection F-measure of about 75%

    Monitorização aplicacional - Sputnik, Checklist e Gateway Asterisk

    Get PDF
    Tese de mestrado em Engenharia Informática, apresentada à Universidade de Lisboa, através da Faculdade de Ciências, 2012Todos os Sistemas de Informação encontram-se susceptíveis a falhas, sejam elas humanas, de infra-estrutura ou aplicacionais e, portanto, necessitam de constante monitorização para não haver quebras de serviço que afectem o negócio. Esta necessidade de monitorizar e de actuar o mais rapidamente possível sobre sistemas críticos e o facto de trabalhar numa área em que a função principal é monitorizar os sistemas aplicacionais e infra-estruturais da PT, abriu-me portas para encontrar novas soluções que ajudem as equipas de operação. O Sputnik é uma plataforma de representação gráfica e intuitiva de monitorizações infra-estruturais e aplicacionais, permitindo representar circuitos, tabelas, gráficos ou por exemplo, verificando a disponibilidade de um servidor ou o acumular de registos numa tabela. Com esta plataforma os problemas são detectados/visualizados em real-time e resolvidos ou escalados de forma rápida e eficiente. A Checklist é uma plataforma centralizada onde o conhecimento e a informação estão organizados/estruturados em CIs (Configuration Items) e relações. Pretende ser a base de consulta para qualquer tipo de informação relevante, incluindo relações entre os vários CIs existentes, para a DE/OA. A Gateway Asterisk é uma plataforma piloto (proof-of-concept) destinada a despoletar chamadas telefónicas automáticas substituindo a necessidade de acção humana, agilizando assim o processo de escalamento. No contexto do meu PEI as chamadas são destinadas a alertar as equipas responsáveis em caso de falha dos CIs. Neste PEI pretendi não só melhorar/desenvolver as plataformas de alarmística sobre os sistemas da PTSI mas também melhorar os meus conhecimentos técnicos, de gestão e dar um contributo efectivo sobre os processos de monitorização e alarmística minimizando a acção humana e auxiliando-a sempre que necessário.All information systems are prone to failures, whether they’re human, infrastructural or applicational, therefore requiring constant monitoring in order to prevent service unavailability affecting business. Sputnik is a platform for graphical and intuitive representation of infrastructural and applicational monitoring, allowing to represent circuits, tables, graphs or for example checking the availability of a server or the accumulation of records in a table. This platform allows real-time problem detection/visualization and supports it’s resolution/escalation in a efficient way. The Checklist is a centralized platform where knowledge and information are organized/ structured in CI’s (Configuration Items) and relationships between them. Intended as a framework for searching any type of relevant information, including relationships between the existing CIs. Asterisk Gateway is a pilot platform (proof-of-concept), designed to trigger automatic phone calls replacing the need for human action, thus speeding the process of scaling. In the context of this PEI, calls are triggered when a CI falis, alerting the respective support team. In this PEI I intended not only, to improve/develop alarmistic platforms over the PTSI systems but also to improve my technical knowledge, management and effective input on the process of monitoring and alarms, minimizing human action and helping them when necessary

    Dynamic model-based safety analysis: from state machines to temporal fault trees

    Get PDF
    Finite state transition models such as State Machines (SMs) have become a prevalent paradigm for the description of dynamic systems. Such models are well-suited to modelling the behaviour of complex systems, including in conditions of failure, and where the order in which failures and fault events occur can affect the overall outcome (e.g. total failure of the system). For the safety assessment though, the SM failure behavioural models need to be converted to analysis models like Generalised Stochastic Petri Nets (GSPNs), Markov Chains (MCs) or Fault Trees (FTs). This is particularly important if the transformed models are supported by safety analysis tools.This thesis, firstly, identifies a number of problems encountered in current safety analysis techniques based on SMs. One of the existing approaches consists of transforming the SMs to analysis-supported state-transition formalisms like GSPNs or MCs, which are very powerful in capturing the dynamic aspects and in the evaluation of safety measures. But in this approach, qualitative analysis is not encouraged; here the focus is primarily on probabilistic analysis. Qualitative analysis is particularly important when probabilistic data are not available (e.g., at early stages of design). In an alternative approach though, the generation of combinatorial, Boolean FTs has been applied to SM-based models. FTs are well-suited to qualitative analysis, but cannot capture the significance of the temporal order of events expressed by SMs. This makes the approach potentially error prone for the analysis of dynamic systems. In response, we propose a new SM-based safety analysis technique which converts SMs to Temporal Fault Trees (TFTs) using Pandora — a recent technique for introducing temporal logic to FTs. Pandora provides a set of temporal laws, which allow the significance of the SM temporal semantics to be preserved along the logical analysis, and thereby enabling a true qualitative analysis of a dynamic system. The thesis develops algorithms for conversion of SMs to TFTs. It also deals with the issue of scalability of the approach by proposing a form of compositional synthesis in which system large TFTs can be generated from individual component SMs using a process of composition. This has the dual benefits of allowing more accurate analysis of different sequences of faults, and also helping to reduce the cost of performing temporal analysis by producing smaller, more manageable TFTs via the compositionality.The thesis concludes that this approach can potentially address limitations of earlier work and thus help to improve the safety analysis of increasingly complex dynamic safety-critical systems

    Distributed Security Policy Analysis

    Get PDF
    Computer networks have become an important part of modern society, and computer network security is crucial for their correct and continuous operation. The security aspects of computer networks are defined by network security policies. The term policy, in general, is defined as ``a definite goal, course or method of action to guide and determine present and future decisions''. In the context of computer networks, a policy is ``a set of rules to administer, manage, and control access to network resources''. Network security policies are enforced by special network appliances, so called security controls.Different types of security policies are enforced by different types of security controls. Network security policies are hard to manage, and errors are quite common. The problem exists because network administrators do not have a good overview of the network, the defined policies and the interaction between them. Researchers have proposed different techniques for network security policy analysis, which aim to identify errors within policies so that administrators can correct them. There are three different solution approaches: anomaly analysis, reachability analysis and policy comparison. Anomaly analysis searches for potential semantic errors within policy rules, and can also be used to identify possible policy optimizations. Reachability analysis evaluates allowed communication within a computer network and can determine if a certain host can reach a service or a set of services. Policy comparison compares two or more network security policies and represents the differences between them in an intuitive way. Although research in this field has been carried out for over a decade, there is still no clear answer on how to reduce policy errors. The different analysis techniques have their pros and cons, but none of them is a sufficient solution. More precisely, they are mainly complements to each other, as one analysis technique finds policy errors which remain unknown to another. Therefore, to be able to have a complete analysis of the computer network, multiple models must be instantiated. An analysis model that can perform all types of analysis techniques is desirable and has three main advantages. Firstly, the model can cover the greatest number of possible policy errors. Secondly, the computational overhead of instantiating the model is required only once. Thirdly, research effort is reduced because improvements and extensions to the model are applied to all three analysis types at the same time. Fourthly, new algorithms can be evaluated by comparing their performance directly to each other. This work proposes a new analysis model which is capable of performing all three analysis techniques. Security policies and the network topology are represented by the so-called Geometric-Model. The Geometric-Model is a formal model based on the set theory and geometric interpretation of policy rules. Policy rules are defined according to the condition-action format: if the condition holds then the action is applied. A security policy is expressed as a set of rules, a resolution strategy which selects the action when more than one rule applies, external data used by the resolution strategy and a default action in case no rule applies. This work also introduces the concept of Equivalent-Policy, which is calculated on the network topology and the policies involved. All analysis techniques are performed on it with a much higher performance. A precomputation phase is required for two reasons. Firstly, security policies which modify the traffic must be transformed to gain linear behaviour. Secondly, there are much fewer rules required to represent the global behaviour of a set of policies than the sum of the rules in the involved policies. The analysis model can handle the most common security policies and is designed to be extensible for future security policy types. As already mentioned the Geometric-Model can represent all types of security policies, but the calculation of the Equivalent-Policy has some small dependencies on the details of different policy types. Therefore, the computation of the Equivalent-Policy must be tweaked to support new types. Since the model and the computation of the Equivalent-Policy was designed to be extendible, the effort required to introduce a new security policy type is minimal. The anomaly analysis can be performed on computer networks containing different security policies. The policy comparison can perform an Implementation-Verification among high-level security requirements and an entire computer network containing different security policies. The policy comparison can perform a ChangeImpact-Analysis of an entire network containing different security policies. The proposed model is implemented in a working prototype, and a performance evaluation has been performed. The performance of the implementation is more than sufficient for real scenarios. Although the calculation of the Equivalent-Policy requires a significant amount of time, it is still manageable and is required only once. The execution of the different analysis techniques is fast, and generally the results are calculated in real time. The implementation also exposes an API for future integration in different frameworks or software packages. Based on the API, a complete tool was implemented, with a graphical user interface and additional features

    Third workshop on proactive failure avoidance, recovery, and maintenance (PFARM)

    No full text
    corecore