19 research outputs found

    On-line failure prediction in safety-critical systems

    Get PDF
    In safety-critical systems such as Air Traffic Control system, SCADA systems, Railways Control Systems, there has been a rapid transition from monolithic systems to highly modular ones, using off-the-shelf hardware and software applications possibly developed by different manufactures. This shift increased the probability that a fault occurring in an application propagates to others with the risk of a failure of the entire safety-critical system. This calls for new tools for the on-line detection of anomalous behaviors of the system, predicting thus a system failure before it happens, allowing the deployment of appropriate mitigation policies. The paper proposes a novel architecture, namely CASPER, for online failure prediction that has the distinctive features to be (i) black-box: no knowledge of applications internals and logic of the system is required (ii) non-intrusive: no status information of the components is used such as CPU or memory usage; The architecture has been implemented to predict failures in a real Air Traffic Control System. CASPER exhibits high degree of accuracy in predicting failures with low false positive rate. The experimental validation shows how operators are provided with predictions issued a few hundred of seconds before the occurrence of the failure

    Contributions à la réplication de données dans les systèmes distribués à grande échelle

    Get PDF
    Data replication is a key mechanism for building a reliable and efficient data management system. Indeed, by keeping several replicas for each piece of data, it is possible to improve durability. Furthermore, well-placed copies reduce data accesstime. However, having multiple copies for a single piece of data creates consistency problems when the data is updated. Over the last years, I made contributions related to these three aspects: data durability, data access performance and data consistency. RelaxDHT and SPLAD enhance data durability by placing data copies smartly. Caju, AREN and POPS reduce access time by improving data locality and by taking popularity into account. To enhance data lookup performance, DONUT creates efficient shortcuts taking data distribution into account. Finally, in the replicated database context, Gargamel parallelizes independent transactions only, improving database performance and avoiding aborting transactions. My research has been carried out in collaboration with height PhD students, four of which have defended. In my future work, I plan to extend these contributions by (i) designing a storage system tailored for MMOGs, which are very demanding, and (ii) designing a data management system that is able to re-distribute data automatically in order to scale the number of servers up and down according to the changing workload, leading to a greener data management.La réplication de données est une technique clé pour permettre aux systèmes de gestion de données distribués à grande échelle d'offrir un stockage fiable et performant. Comme il gère un nombre suffisant de copies de chaque donnée, le système peut améliorer la pérennité. De plus, la présence de copies bien placées réduit les temps d'accès. Cependant, cette même existence de plusieurs copies pose des problèmes de cohérence en cas de modification. Ces dernières années, mes contributions ont porté sur ces trois aspects liés à la réplication de données: la pérennité des données, la performance desaccès et la gestion de la cohérence. RelaxDHT et SPLAD permettent d'améliorer la pérennité des données en jouant sur le placement des copies. Caju, AREN et POPS permettent de réduire les temps d'accès aux données en améliorant la localité et en prenant en compte la popularité. Pour accélérer la localisation des copies, DONUT crée des raccourcis efficaces prenant en compte la distribution des données. Enfin, dans le contexte des bases de données répliquées,Gargamel permet de ne paralléliser que les transactions qui sont indépendantes, améliorant ainsi les performances et évitant tout abandon de transaction pour cause de conflit. Ces travaux ont été réalisés avec huit étudiants en thèse dont quatre ont soutenu. Pour l'avenir, je me propose d'étendre ces travaux, d'une part en concevant un système de gestion de données pour les MMOGs, une classe d'application particulièrement exigeante; et, d'autre part, en concevant des mécanismes de gestion de données permettant de n'utiliser que la quantité strictement nécessaire de ressources, en redistribuant dynamiquement les données en fonction des besoins, un pas vers une gestion plus écologique des données

    Online failure prediction in air traffic control systems

    Get PDF
    This thesis introduces a novel approach to online failure prediction for mission critical distributed systems that has the distinctive features to be black-box, non-intrusive and online. The approach combines Complex Event Processing (CEP) and Hidden Markov Models (HMM) so as to analyze symptoms of failures that might occur in the form of anomalous conditions of performance metrics identified for such purpose. The thesis presents an architecture named CASPER, based on CEP and HMM, that relies on sniffed information from the communication network of a mission critical system, only, for predicting anomalies that can lead to software failures. An instance of Casper has been implemented, trained and tuned to monitor a real Air Traffic Control (ATC) system developed by Selex ES, a Finmeccanica Company. An extensive experimental evaluation of CASPER is presented. The obtained results show (i) a very low percentage of false positives over both normal and under stress conditions, and (ii) a sufficiently high failure prediction time that allows the system to apply appropriate recovery procedures

    Online failure prediction in air traffic control systems

    Get PDF
    This thesis introduces a novel approach to online failure prediction for mission critical distributed systems that has the distinctive features to be black-box, non-intrusive and online. The approach combines Complex Event Processing (CEP) and Hidden Markov Models (HMM) so as to analyze symptoms of failures that might occur in the form of anomalous conditions of performance metrics identified for such purpose. The thesis presents an architecture named CASPER, based on CEP and HMM, that relies on sniffed information from the communication network of a mission critical system, only, for predicting anomalies that can lead to software failures. An instance of Casper has been implemented, trained and tuned to monitor a real Air Traffic Control (ATC) system developed by Selex ES, a Finmeccanica Company. An extensive experimental evaluation of CASPER is presented. The obtained results show (i) a very low percentage of false positives over both normal and under stress conditions, and (ii) a sufficiently high failure prediction time that allows the system to apply appropriate recovery procedures

    Online disturbance prediction for enhanced availability in smart grids

    Get PDF
    A gradual move in the electric power industry towards Smart Grids brings new challenges to the system's efficiency and dependability. With a growing complexity and massive introduction of renewable generation, particularly at the distribution level, the number of faults and, consequently, disturbances (errors and failures) is expected to increase significantly. This threatens to compromise grid's availability as traditional, reactive management approaches may soon become insufficient. On the other hand, with grids' digitalization, real-time status data are becoming available. These data may be used to develop advanced management and control methods for a sustainable, more efficient and more dependable grid. A proactive management approach, based on the use of real-time data for predicting near-future disturbances and acting in their anticipation, has already been identified by the Smart Grid community as one of the main pillars of dependability of the future grid. The work presented in this dissertation focuses on predicting disturbances in Active Distributions Networks (ADNs) that are a part of the Smart Grid that evolves the most. These are distribution networks with high share of (renewable) distributed generation and with systems in place for real-time monitoring and control. Our main goal is to develop a methodology for proactive network management, in a sense of proactive mitigation of disturbances, and to design and implement a method for their prediction. We focus on predicting voltage sags as they are identified as one of the most frequent and severe disturbances in distribution networks. We address Smart Grid dependability in a holistic manner by considering its cyber and physical aspects. As a result, we identify Smart Grid dependability properties and develop a taxonomy of faults that contribute to better understanding of the overall dependability of the future grid. As the process of grid's digitization is still ongoing there is a general problem of a lack of data on the grid's status and especially disturbance-related data. These data are necessary to design an accurate disturbance predictor. To overcome this obstacle we introduce a concept of fault injection to simulation of power systems. We develop a framework to simulate a behavior of distribution networks in the presence of faults, and fluctuating generation and load that, alone or combined, may cause disturbances. With the framework we generate a large set of data that we use to develop and evaluate a voltage-sag disturbance predictor. To quantify how prediction and proactive mitigation of disturbances enhance availability we create an availability model of a proactive management. The model is generic and may be applied to evaluate the effect of proactive management on availability in other types of systems, and adapted for quantifying other types of properties as well. Also, we design a metric and a method for optimizing failure prediction to maximize availability with proactive approach. In our conclusion, the level of availability improvement with proactive approach is comparable to the one when using high-reliability and costly components. Following the results of the case study conducted for a 14-bus ADN, grid's availability may be improved by up to an order of magnitude if disturbances are managed proactively instead of reactively. The main results and contributions may be summarized as follows: (i) Taxonomy of faults in Smart Grid has been developed; (ii) Methodology and methods for proactive management of disturbances have been proposed; (iii) Model to quantify availability with proactive management has been developed; (iv) Simulation and fault-injection framework has been designed and implemented to generate disturbance-related data; (v) In the scope of a case study, a voltage-sag predictor, based on machine- learning classification algorithms, has been designed and the effect of proactive disturbance management on downtime and availability has been quantified

    Resilience-Building Technologies: State of Knowledge -- ReSIST NoE Deliverable D12

    Get PDF
    This document is the first product of work package WP2, "Resilience-building and -scaling technologies", in the programme of jointly executed research (JER) of the ReSIST Network of Excellenc

    Preliminary Specification of Services and Protocols

    Get PDF
    This document describes the preliminary specification of services and protocols for the Crutial Architecture. The Crutial Architecture definition, first addressed in Crutial Project Technical Report D4 (January 2007), intends to reply to a grand challenge of computer science and control engineering: how to achieve resilience of critical information infrastructures, in particular in the electrical sector. The definitions herein elaborate on the major architectural options and components established in the Preliminary Architecture Specification (D4), with special relevance to the Crutial middleware building blocks, and are based on the fault, synchrony and topological models defined in the same document. The document, in general lines, describes the Runtime Support Services and APIs, and the Middleware Services and APIs. Then, it delves into the protocols, describing: Runtime Support Protocols, and Middleware Services Protocols. The Runtime Support Services and APIs chapter features as a main component, the Proactive-Reactive Recovery Service, whose aim is to guarantee perpetual execution of any components it protects. The Middleware Services and APIs chapter describes our approach to intrusion-tolerant middleware. The middleware comprises several layers. The Multipoint Network layer is the lowest layer of CRUTIAL's middleware, and features an abstraction of basic communication services, such as provided by standard protocols, like IP, IPsec, UDP, TCP and SSL/TLS. The Communication Support Services feature two important building blocks: the Randomized Intrusion-Tolerant Services (RITAS), and the Overlay Protection Layer (OPL) against DoS attacks. The Activity Support Services currently defined comprise the CIS Protection service, and the Access Control and Authorization service. Protection as described in this report is implemented by mechanisms and protocols residing on a device called Crutial Information Switch (CIS). The Access Control and Authorization service is implemented through PolyOrBAC, which defines the rules for information exchange and collaboration between sub-modules of the architecture, corresponding in fact to different facilities of the CII's organizations.The Monitoring and Failure Detection layer contains a preliminary definition of the middleware services devoted to monitoring and failure detection activities. The remaining chapters describe the protocols implementing the above-mentioned services: Runtime Support Protocols, and Middleware Services Protocol
    corecore