1,887 research outputs found

    Policy Based Priority Object Consistency Check and Correction in a Filesystem

    File storage in cloud deployments and enterprise needs for file-based storage are growing rapidly in today's IT environment. A filesystem enables application programs to store and retrieve user data files, and applications depend on the filesystem to provide users with correct and accurate information. Therefore, the filesystem has to maintain metadata consistency across the many data structures that point to filesystem objects such as metadata files, application files and directories. In both offline and online filesystem checks, the traditional approach has been to verify metadata consistency across all filesystem objects. We propose an alternative method that guarantees all-time consistency for selected filesystem objects.
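
    A minimal sketch of how such a policy-driven, selective check might look; the policy format, priority threshold and checksum-based validation below are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch: check metadata consistency only for filesystem
    # objects whose policy priority meets a configured threshold
    # (illustrative, not the paper's actual implementation).
    import hashlib
    import os

    POLICY = {  # assumed policy: path prefix -> priority (higher = more critical)
        "/fs/metadata": 3,
        "/fs/app": 2,
        "/fs/scratch": 1,
    }

    def priority_of(path):
        """Return the policy priority for a path (0 if no rule matches)."""
        return max((p for prefix, p in POLICY.items() if path.startswith(prefix)),
                   default=0)

    def check_object(path, recorded_checksum):
        """Compare the recorded checksum against the current on-disk content."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest() == recorded_checksum

    def selective_check(objects, min_priority=2):
        """Check only objects at or above the priority threshold; report mismatches."""
        inconsistent = []
        for path, recorded_checksum in objects.items():
            if priority_of(path) >= min_priority and os.path.exists(path):
                if not check_object(path, recorded_checksum):
                    inconsistent.append(path)
        return inconsistent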

    Framework for a space shuttle main engine health monitoring system

    A framework developed for a health management system (HMS) directed at improving the safety of operation of the Space Shuttle Main Engine (SSME) is summarized. An emphasis was placed on near-term technology through requirements to use existing SSME instrumentation and to demonstrate the HMS during SSME ground tests within five years. The HMS framework was developed through an analysis of SSME failure modes, fault detection algorithms, sensor technologies, and hardware architectures. A key feature of the HMS framework design is that a clear path from the ground test system to a flight HMS was maintained. Fault detection techniques based on time series, nonlinear regression, and clustering algorithms were developed and demonstrated on data from SSME ground test failures. The fault detection algorithms exhibited 100 percent detection of faults, had an extremely low false alarm rate, and were robust to sensor loss. These algorithms were incorporated into a hierarchical decision-making strategy for overall assessment of SSME health. A preliminary design for a hardware architecture capable of supporting real-time operation of the HMS functions was developed. Utilizing modular, commercial off-the-shelf components produced a reliable, low-cost design with the flexibility to incorporate advances in algorithm and sensor technology as they become available.
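
    As an illustration of the time-series style of fault detection described above, the sketch below flags a fault when a sensor reading drifts beyond a threshold from its recent moving average; the window size and threshold are assumptions for illustration, not the actual HMS algorithms.

    # Illustrative sketch of a moving-average residual detector for a single
    # sensor channel (assumed parameters; not the actual SSME HMS algorithms).
    from collections import deque

    class ResidualDetector:
        def __init__(self, window=20, threshold=3.0):
            self.window = deque(maxlen=window)  # recent nominal samples
            self.threshold = threshold          # allowed deviation in std. devs.

        def update(self, sample):
            """Return True if the new sample looks faulty."""
            if len(self.window) < self.window.maxlen:
                self.window.append(sample)      # still building the baseline
                return False
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = var ** 0.5 or 1e-9
            faulty = abs(sample - mean) > self.threshold * std
            if not faulty:
                self.window.append(sample)      # only track nominal behaviour
            return faulty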

    FAST: a fault detection and identification software tool

    The aim of this work is to improve the reliability and safety of complex critical control systems by contributing to the systematic application of fault diagnosis. In order to ease the adoption of fault detection and isolation (FDI) tools in industry, a systematic approach is required that allows process engineers to analyze a system from this perspective. In this way, it should be possible to determine whether the system provides the fault diagnosis and redundancy required by the criticality of the process. In addition, it should be possible to evaluate what-if scenarios by slightly modifying the process (for instance, adding sensors or changing their placement) and evaluating the impact in terms of fault diagnosis and redundancy possibilities. Hence, this work proposes an approach for analyzing a process from the FDI perspective and provides the tool FAST, which covers everything from the analysis and design phase to the final FDI supervisor implementation in a real process. To synthesize the process information, a very simple format based on XML has been defined. This format provides the information needed to systematically perform the structural analysis of the process. Any process can be analyzed; the only restriction is that the models of the process components need to be available in the FAST tool. Processes are described in FAST in terms of process variables, components and relations, and the tool performs the structural analysis of the process, obtaining: (i) the structural matrix, (ii) the perfect matching, (iii) the analytical redundancy relations (if any) and (iv) the fault signature matrix. To aid in the analysis, FAST can operate stand-alone in simulation mode, allowing the process engineer to evaluate faults and their detectability and to implement changes in the process components and topology to improve the diagnosis and redundancy capabilities. On the other hand, FAST can operate on-line, connected to the process plant through an OPC interface. The OPC interface makes it possible to connect to almost any process that features a SCADA system for supervisory control. When running in on-line mode, the process is monitored by a software agent known as the Supervisor Agent. FAST also has the capability of implementing distributed FDI using its multi-agent architecture. The tool is able to partition complex industrial processes into subsystems, identify which process variables need to be shared by each subsystem and instantiate a Supervision Agent for each of the partitioned subsystems. Once instantiated, the Supervision Agents start diagnosing their local components and handle requests for the values of the variables that FAST has identified as shared with other agents, supporting the distributed FDI process.
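
    To make the structural-analysis step concrete, the sketch below parses a hypothetical XML process description into a structural (incidence) matrix of relations versus variables; the XML element and attribute names are assumptions for illustration and are not FAST's actual input format.

    # Hypothetical sketch: build a structural (incidence) matrix from an XML
    # process description (element/attribute names assumed, not FAST's format).
    import xml.etree.ElementTree as ET

    PROCESS_XML = """
    <process>
      <variable name="q_in"/><variable name="h"/><variable name="q_out"/>
      <relation name="r1" component="tank" variables="q_in h q_out"/>
      <relation name="r2" component="valve" variables="h q_out"/>
      <relation name="r3" component="level_sensor" variables="h"/>
    </process>
    """

    def structural_matrix(xml_text):
        root = ET.fromstring(xml_text)
        variables = [v.get("name") for v in root.findall("variable")]
        matrix = {}
        for rel in root.findall("relation"):
            used = set(rel.get("variables").split())
            # Row of the structural matrix: 1 if the relation involves the variable.
            matrix[rel.get("name")] = [1 if v in used else 0 for v in variables]
        return variables, matrix

    if __name__ == "__main__":
        variables, matrix = structural_matrix(PROCESS_XML)
        print(variables)
        for name, row in matrix.items():
            print(name, row)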

    Machine Learning applied to fault correlation

    Master's dissertation in Informatics Engineering. Over the last years, one of the areas that has most evolved and extended its application to a multitude of possibilities is Artificial Intelligence (AI). With the increasing complexity of the problems to be solved, human resolution becomes impossible, as the amount of information and the number of patterns a person can detect are limited, while AI thrives on the dimension of the problem under analysis. Furthermore, as more and more traditional devices are computerized nowadays, an increasing number of elements are producing data with many potential applications. Consequently, we find ourselves at the height of Big Data, where huge volumes of data are generated and it is entirely unfeasible to process and analyze them manually. Additionally, with the increasing complexity of network topologies, it is necessary to ensure the correct functioning of all equipment, avoiding cascading failures among devices, which can lead to catastrophic consequences depending on their use. Thus, Root Cause Analysis (RCA) tools become fundamental, since they are developed to identify automatically, through rules established by their users, the underlying causes when some equipment malfunctions. However, with growing network complexity, the definition of rules becomes exponentially more complicated as the possible points of failure scale drastically. In this context, framed by the Altice Labs RCA and network environment use case, the main objective of this research project is defined. The aim is to use Machine Learning (ML) techniques to extrapolate the relationships between different types of equipment alarms, gathered by the Alarm Manager tool, in order to better understand the impact of a failure on the entire system, thus easing and helping the process of manual implementation of RCA rules. As this tool manages millions of daily alarms, it is unfeasible to process them manually, making the application of ML essential. Furthermore, ML algorithms have tremendous capabilities to detect patterns that humans could not, ideally exposing which specific failure causes a series of malfunctions and thus allowing system administrators to focus their attention on the source problem instead of its multiple consequences. The ML approach proposed in this project is based on the causality among alarms, instead of their features, and uses the Cartesian product of a specific problem, the involved technology, and the manufacturer to extrapolate the correlations among faults. The results achieved reveal the tremendous potential of this approach and open the road to automating the definition of RCA rules, which represents a new vision of how to manage network failures efficiently.
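
    A minimal sketch of the causality-oriented grouping described above: alarms are keyed by the (problem, technology, manufacturer) triple, and co-occurrences within a time window are counted as candidate correlations. The alarm fields and window size are illustrative assumptions, not the Alarm Manager schema or the thesis's actual method.

    # Illustrative sketch: count co-occurrences of alarm types, keyed by the
    # (problem, technology, manufacturer) triple, within a time window.
    from collections import Counter

    def correlate(alarms, window_seconds=300):
        """alarms: list of dicts with 'timestamp', 'problem', 'technology'
        and 'manufacturer'. Returns a Counter of co-occurring alarm-type pairs."""
        alarms = sorted(alarms, key=lambda a: a["timestamp"])
        pairs = Counter()
        for i, a in enumerate(alarms):
            key_a = (a["problem"], a["technology"], a["manufacturer"])
            for b in alarms[i + 1:]:
                if b["timestamp"] - a["timestamp"] > window_seconds:
                    break  # alarms are sorted, so later ones are out of window
                key_b = (b["problem"], b["technology"], b["manufacturer"])
                if key_a != key_b:
                    pairs[tuple(sorted((key_a, key_b)))] += 1
        return pairs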

    Fast Distributed Simulation for Dependability Analysis of a Cache-based RAID System

    Coordinated Science Laboratory (formerly known as Control Systems Laboratory). Supported by DARPA/ITO (DABT63-94-C-0045), NASA Langley Research Center (NASA NAG 1-613) and Tandem Computer.

    Methods for Advanced Wind Turbine Condition Monitoring and Early Diagnosis: A Literature Review

    Condition monitoring and early fault diagnosis for wind turbines have become essential industry practice, as they help improve wind farm reliability, overall performance and productivity. If not detected and rectified at an early stage, some faults can be catastrophic, with significant loss of revenue along with interruption to businesses relying mainly on wind energy. The failure of a wind turbine results in system downtime and repair or replacement expenses that significantly reduce annual income. Such failures call for more systematized operation and maintenance schemes to ensure the reliability of wind energy conversion systems. Condition monitoring and fault diagnosis systems for wind turbines play an important role in reducing maintenance and operational costs and increasing system reliability. This paper aims to provide the reader with an overall view of wind turbine condition monitoring and fault diagnosis, covering various potential fault types and locations along with the signals to be analyzed and the different signal processing methods.
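
    As one concrete example of the kind of signal processing surveyed in such reviews, the sketch below estimates the spectral energy of a vibration signal near an assumed characteristic bearing fault frequency; the sampling rate, fault frequency and threshold are illustrative values, not taken from the paper.

    # Illustrative sketch: flag a possible bearing fault when spectral energy
    # near an assumed characteristic fault frequency exceeds a threshold.
    import numpy as np

    def fault_band_energy(vibration, fs=10_000, fault_hz=157.0, band=5.0):
        """Return the fraction of signal energy within +/- band Hz of fault_hz."""
        spectrum = np.abs(np.fft.rfft(vibration)) ** 2
        freqs = np.fft.rfftfreq(len(vibration), d=1.0 / fs)
        in_band = (freqs > fault_hz - band) & (freqs < fault_hz + band)
        return spectrum[in_band].sum() / spectrum.sum()

    # Usage: flag if, say, more than 10% of the energy sits in the fault band.
    # faulty = fault_band_energy(samples) > 0.10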

    Automated Anomaly Detection in Virtualized Services Using Deep Packet Inspection

    Virtualization technologies have proven to be important drivers for the fast and cost-efficient development and deployment of services. While the benefits are tremendous, there are many challenges to be faced when developing or porting services to virtualized infrastructure. Critical applications like Virtualized Network Functions, in particular, must meet high requirements in terms of reliability and resilience. An important tool for meeting such requirements is detecting anomalous system components and recovering from the anomaly before it turns into a fault and subsequently into a failure visible to the client. Anomaly detection for virtualized services relies on collecting system metrics that represent the normal operation state of every component and allow the use of machine learning algorithms to automatically build models representing that state. This paper presents an approach for collecting service-layer metrics while treating services as black boxes. This allows service providers to implement anomaly detection on the application layer without the need to modify third-party software. Deep Packet Inspection is used to analyse the traffic of virtual machines on the hypervisor layer, producing both generic and protocol-specific communication metrics. An evaluation shows that the resulting metrics represent the normal operation state of an example Virtualized Network Function and are therefore a valuable contribution to automatic anomaly detection in virtualized services.
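
    The sketch below illustrates the general idea of learning a model of normal operation from collected per-VM traffic metrics and flagging deviations; the metric names and the simple z-score model are assumptions for illustration, not the paper's DPI pipeline.

    # Illustrative sketch: learn mean/std of traffic metrics during normal
    # operation and flag samples that deviate strongly (metric names assumed).
    import statistics

    METRICS = ["requests_per_s", "bytes_per_s", "error_responses_per_s"]

    def fit_normal_model(samples):
        """samples: list of dicts of metric values observed during normal operation."""
        return {m: (statistics.mean(s[m] for s in samples),
                    statistics.pstdev(s[m] for s in samples) or 1e-9)
                for m in METRICS}

    def is_anomalous(model, sample, z_max=4.0):
        """Flag the sample if any metric deviates more than z_max std. devs."""
        return any(abs(sample[m] - mean) / std > z_max
                   for m, (mean, std) in model.items())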

    A support architecture for reliable distributed computing systems

    The Clouds kernel design went through several design phases and is nearly complete. The object manager, the process manager, the storage manager, the communications manager and the actions manager are examined.

    Monitoring and Failure Recovery of Cloud-Managed Digital Signage

    Digital signage is widely used in various fields such as transport systems, trading outlets, entertainment and others to display information in the form of images, videos and text. The reliability of these resources, the availability of the required services and the security measures in place play a key role in the adoption of such systems. Efficient management of a digital signage system is a challenging task for service providers. Many causes can lead to the malfunctioning of such a system, including faulty displays and network, hardware or software failures, which are quite repetitive. The traditional process of recovering from such failures often involves tedious and cumbersome diagnosis. In many cases, technicians need to visit the site physically, increasing maintenance costs and recovery time. In this thesis, we propose a solution that monitors, diagnoses and recovers from known failures by connecting the displays to a cloud. A cloud-based remote and autonomous server configures the content of the remote displays and updates them dynamically. Each display tracks its running processes and sends trace and system logs to the server periodically. These logs, stored at the server using a customized log management module, are analysed for failures. In addition, the displays incorporate self-recovery procedures to deal with failures when they are unable to connect to the cloud. The proposed solution is implemented on a Linux system and evaluated by deploying the server on the Amazon Web Services (AWS) cloud. The main result of the thesis is a collection of techniques for resolving display system failures remotely.
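
    A minimal sketch of the monitoring loop described above: the display periodically posts its logs to the cloud server and falls back to a local self-recovery action when the server is unreachable. The endpoint URL, log path, display identifier and restart command are hypothetical.

    # Hypothetical sketch of a display-side monitoring loop: periodically send
    # logs to the cloud server; if the server is unreachable, run a local
    # self-recovery action. URL, log path and recovery command are assumptions.
    import json
    import subprocess
    import time
    import urllib.request

    SERVER_URL = "https://example-signage-server/logs"  # hypothetical endpoint
    LOG_PATH = "/var/log/signage/display.log"           # hypothetical log file

    def send_logs():
        with open(LOG_PATH, "r", encoding="utf-8") as f:
            payload = json.dumps({"display_id": "display-01", "log": f.read()})
        req = urllib.request.Request(SERVER_URL, data=payload.encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=10)

    def self_recover():
        # Example recovery action: restart the player service (hypothetical name).
        subprocess.run(["systemctl", "restart", "signage-player"], check=False)

    while True:
        try:
            send_logs()
        except OSError:
            self_recover()
        time.sleep(60)  # assumed reporting interval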