
    Sequential pattern mining with uncertain data

    In recent years, emerging applications such as sensor monitoring systems, RFID networks, and location-based services have led to a proliferation of uncertain data. Traditional data mining algorithms, however, are usually inapplicable to uncertain data because of its probabilistic nature. Uncertainty must be handled carefully; otherwise, it can significantly degrade the quality of the underlying data mining applications. We therefore extend traditional data mining algorithms into uncertain versions that still produce accurate results. In particular, we use sequential pattern mining as a motivating example to illustrate how uncertain information can be incorporated into the data mining process. We use possible-world semantics to interpret two typical types of uncertainty: tuple-level existential uncertainty and attribute-level temporal uncertainty. In an uncertain database, whether a pattern is frequent is itself probabilistic; we therefore define the concept of probabilistic frequent sequential patterns and design various algorithms to mine such patterns efficiently in uncertain databases. We also implement our algorithms on distributed computing platforms such as MapReduce and Spark so that they can be applied to large-scale databases. Our work further covers uncertainty computation in supervised machine learning: we develop an artificial neural network to classify numeric uncertain data, design a Naive Bayesian classifier for categorical uncertain data streams, and propose a discretization algorithm to pre-process numerical uncertain data, since many classifiers work only with categorical data. Experimental results on both synthetic and real-world uncertain datasets demonstrate that our methods are effective and efficient.
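Under possible-world semantics with tuple-level existential uncertainty, a pattern is "probabilistically frequent" when the probability that its support meets the minimum threshold exceeds a user-set confidence. The following is a minimal, illustrative sketch of that definition (function and parameter names are assumptions, and brute-force world enumeration is used only for clarity; the abstract's actual algorithms are more efficient):

```python
from itertools import product

def prob_frequent(seq_probs, contains, minsup, tau):
    """Decide probabilistic frequency by enumerating possible worlds.

    seq_probs: existence probability of each uncertain sequence.
    contains:  True if that sequence contains the candidate pattern.
    A pattern is probabilistically frequent if P(support >= minsup) >= tau.
    """
    p_frequent = 0.0
    n = len(seq_probs)
    for world in product([0, 1], repeat=n):  # each world: subset of existing tuples
        p_world = 1.0
        support = 0
        for exists, p, has in zip(world, seq_probs, contains):
            p_world *= p if exists else (1.0 - p)
            if exists and has:
                support += 1
        if support >= minsup:
            p_frequent += p_world
    return p_frequent >= tau, p_frequent

# Three sequences, all containing the pattern, existing with prob. 0.9/0.8/0.3:
ok, p = prob_frequent([0.9, 0.8, 0.3], [True, True, True], minsup=2, tau=0.5)
```

Enumeration is exponential in the number of sequences, which is exactly why the abstract's dedicated mining algorithms (and their MapReduce/Spark implementations) matter for large databases.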

    A Study on Data Filtering Techniques for Event-Driven Failure Analysis

    High-performance sensors and modern data-logging technology with real-time telemetry allow system failures to be analyzed very precisely. Fault detection, isolation, and identification are the typical steps in analyzing the root causes of failures. Such systematic failure analysis provides not only useful clues for rectifying a system's abnormal behavior, but also key information for redesigning and retrofitting the current system. The main barriers to effective failure analysis are that (i) the gathered sensor data logs, usually event logs containing massive datasets, are too large, and (ii) noise and redundant information in the gathered sensor data make precise analysis difficult. The objective of this thesis is therefore to develop an event-driven failure analysis method that accounts for both functional interactions between subsystems and diverse user behaviors. To do this, we first apply various data filtering techniques for data cleaning and reduction, and then convert the filtered data into a new event-sequence format (a process we call "eventization"). Four eventization strategies are examined for data filtering: equal-width binning, entropy, domain expert knowledge, and probability distribution estimation; each aims to extract only the important information from the raw sensor data while minimizing information loss. Through numerical simulation, we identify the optimal values of the eventization parameters. Finally, the event-sequence information, which includes the time gap between event occurrences, is decoded to investigate the correlation between specific event-sequence patterns and various system failures. The extracted patterns are stored in a failure pattern library, which then serves as the main reference source for predicting failures in real time during the failure prognosis phase. The efficiency of the developed procedure is examined on a terminal-box data log from marine diesel engines.
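Of the four eventization strategies listed, equal-width binning is the simplest to illustrate: continuous sensor readings are discretized into k equal-width intervals, and each interval index becomes an event symbol. A minimal sketch (function name, symbol format, and the choice of k are illustrative, not the thesis's actual parameters):

```python
def eventize_equal_width(values, k):
    """Map continuous readings onto k equal-width bins, emitting event symbols."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    events = []
    for v in values:
        b = min(int((v - lo) / width), k - 1)  # clamp the maximum onto the last bin
        events.append(f"E{b}")
    return events

# Toy sensor trace discretized into 3 event symbols:
events = eventize_equal_width([0.1, 0.4, 0.9, 0.55, 0.2], k=3)
```

The thesis's simulation step would then tune parameters such as k to balance information loss against data reduction.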

    Proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering

    These are the online proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE), which was held in the Trippenhuis, Amsterdam, in August 2012.

    Machine Learning for the Early Detection of Acute Episodes in Intensive Care Units

    In Intensive Care Units (ICUs), mere seconds might define whether a patient lives or dies. Predictive models capable of detecting acute events in advance may allow for anticipated interventions, which could mitigate the consequences of those events and save a greater number of lives. Several predictive models developed for this purpose have failed to meet the high requirements of ICUs. This might be due to the complexity of anomaly prediction tasks and the inefficient utilization of ICU data. Moreover, some essential intensive care demands, such as continuous monitoring, are often not considered when developing these solutions, making them unfit for real contexts. This work approaches two topics within the mentioned problem: the relevance of ICU data used to predict acute episodes, and the benefits of applying Layered Learning (LL) techniques to counter the complexity of these tasks. The first topic was undertaken through a study on the relevance of information retrieved from physiological signals and clinical data for the early detection of Acute Hypotensive Episodes (AHE) in ICUs. Then, the potential of LL was assessed through an in-depth analysis of the applicability of a recently proposed approach on the same topic. Furthermore, different optimization strategies enabled by LL configurations were proposed, including a new approach aimed at false alarm reduction. The results regarding data relevance might contribute to a shift in paradigm in terms of information retrieved for AHE prediction. It was found that most of the information commonly used in the literature might be wrongly perceived as valuable, since only three features related to blood pressure measures presented actual distinctive traits. On another note, the different LL-based strategies developed confirm the versatile possibilities offered by this paradigm.
Although these methodologies did not promote significant performance improvements in this specific context, they can be further explored and adapted to other domains.
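The Layered Learning idea the abstract describes decomposes one hard prediction task into simpler stacked subtasks, each layer consuming the previous layer's output. A minimal, illustrative two-layer sketch (the models, the naive trend forecast, and the 60 mmHg threshold are assumptions for illustration, not the thesis's actual pipeline): layer 1 forecasts the next mean arterial pressure (MAP) value, and layer 2 turns that forecast into an AHE alarm.

```python
def layer1_forecast(map_history):
    """Layer 1: naive trend forecast -- last MAP value plus the mean recent change."""
    deltas = [b - a for a, b in zip(map_history, map_history[1:])]
    trend = sum(deltas) / len(deltas)
    return map_history[-1] + trend

def layer2_alarm(forecast_map, threshold=60.0):
    """Layer 2: raise an AHE alarm when the forecast MAP falls below the threshold."""
    return forecast_map < threshold

# A falling MAP trajectory (mmHg) flows through both layers:
history = [70.0, 66.0, 62.0, 58.0]
alarm = layer2_alarm(layer1_forecast(history))
```

Splitting the task this way is what enables the layer-specific optimizations (such as a dedicated false-alarm-reduction stage) that the abstract mentions.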

    Using root cause analysis to handle intrusion detection alarms

    Owing to a continuously increasing number of hacker attacks on the information systems of companies and institutions, intrusion detection systems have gained importance as a new security technology. These systems monitor computers, networks, and other resources and raise alarms when security violations are detected. Unfortunately, today's intrusion detection systems generally raise very many, mostly false, alarms. This poses the problem of how to handle this flood of false alarms. This dissertation presents a new approach to this problem. Central to this approach is the notion that every alarm has a unique root cause. This dissertation makes the observation that a few dozen root causes are responsible for over 90% of all alarms. Building on this observation, the following two-step method for handling intrusion detection alarms is proposed: the first step identifies root causes that generate many alarms, and the second step removes these root causes, which in most cases greatly reduces the future alarm load. Alternatively, alarms whose root cause is not security-relevant can be removed automatically by filters. To support the discovery of root causes, we present a new data mining method for clustering alarms. This method rests on the observation that most root causes manifest themselves in alarm groups with characteristic structural properties. We formalize these structural properties and present a clustering method that finds alarm groups possessing them. In general, such alarm groups make it possible to identify the underlying root causes. Subsequently, the identified root causes can be eliminated, or the false alarms can be filtered out. In both cases, the number of alarms that must still be analyzed in the future decreases. The proposed method for handling alarms is tested in experiments with alarms from 16 different intrusion detection installations. These experiments confirm that the described alarm clustering method makes it very easy to discover root causes. Moreover, the experiments show that the alarm load can be reduced by 70% on average when the identified root causes are addressed appropriately.
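The clustering idea above can be sketched in miniature: alarms sharing a root cause tend to agree on some attributes while varying in others, so grouping by a generalized signature (with the varying attribute abstracted away) collects them into a few large clusters. The alarm fields, signatures, and addresses below are invented toy data, and this simple group-by stands in for the dissertation's actual attribute-generalization method:

```python
from collections import Counter

# Toy alarm log: one scanning host triggers many alarms that differ only in port.
alarms = [
    {"sig": "port-scan", "src": "10.0.0.5", "dst_port": 80},
    {"sig": "port-scan", "src": "10.0.0.5", "dst_port": 443},
    {"sig": "port-scan", "src": "10.0.0.5", "dst_port": 22},
    {"sig": "sql-injection", "src": "10.0.0.9", "dst_port": 80},
]

# Generalize the varying attribute (dst_port -> ANY) so alarms with the same
# underlying root cause fall into a single cluster.
clusters = Counter((a["sig"], a["src"]) for a in alarms)
largest, count = clusters.most_common(1)[0]
```

Inspecting the largest cluster then points an analyst at the root cause (here, one scanning source), which can be fixed or filtered, shrinking the future alarm load.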

    Acute myocardial infarction patient data to assess healthcare utilization and treatments.

    The goal of this study is to use a data mining framework to assess the three main treatments for acute myocardial infarction: thrombolytic therapy, percutaneous coronary intervention (percutaneous angioplasty), and coronary artery bypass surgery. The need for a data mining framework in this study arises from the use of real-world data rather than the highly clean and homogeneous data found in most clinical trials and epidemiological studies. The assessment is based on determining a profile of patients undergoing an episode of acute myocardial infarction, determining resource utilization by treatment, and creating a model that predicts resource utilization and cost for each treatment. Text mining is used to find a subset of input attributes that characterize subjects who undergo the different treatments for acute myocardial infarction, as well as distinct resource utilization profiles. Classical statistical methods are used to evaluate the results of text clustering. The features selected by supervised learning are used to build predictive models for resource utilization and are compared with the features selected by traditional statistical methods for a predictive model with the same outcome. Sequence analysis is used to determine the sequence of treatment of acute myocardial infarction. The resulting sequence is used to construct a probability tree that forms the basis for a cost-effectiveness analysis comparing acute myocardial infarction treatments. To determine effectiveness, survival analysis methodology is implemented to assess the occurrence of death during the hospitalization, the likelihood of a repeated episode of acute myocardial infarction, and the length of time until the recurrence of an episode of acute myocardial infarction or the occurrence of death. The complexity of this study stems mainly from the data source used: administrative data from insurance claims.
Such a data source was not originally designed for the study of health outcomes or health resource utilization. However, by transforming record tables from many-to-many relations to one-to-one relations, they became useful for tracking the evolution of disease and disease outcomes. Also, by transforming tables from a wide format to a long format, the records became analyzable by many data mining algorithms. Moreover, this study contributed to the fields of applied mathematics and public health by implementing a sequence analysis on consecutive procedures to determine the sequence of events that describes the evolution of a hospitalization for acute myocardial infarction. The same data transformation and algorithm can be used in the study of rare diseases whose evolution is not well understood.
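The wide-to-long transformation described above, reshaping one claims row with several procedure columns into one row per (patient, step, procedure), is the step that makes the records consumable by sequence-analysis algorithms. A minimal sketch with invented column names and toy data (the study's actual claims schema is not shown in the abstract):

```python
# Wide format: one row per patient, procedures spread across columns.
wide = [
    {"patient": "P1", "proc1": "thrombolysis", "proc2": "PCI", "proc3": None},
    {"patient": "P2", "proc1": "PCI", "proc2": "CABG", "proc3": None},
]

# Long format: one row per procedure occurrence, ordered by step --
# exactly the shape a sequence-mining algorithm expects.
long_rows = [
    {"patient": r["patient"], "step": i, "procedure": r[f"proc{i}"]}
    for r in wide
    for i in (1, 2, 3)
    if r[f"proc{i}"] is not None
]
```

From rows like these, the per-patient procedure sequences (e.g. thrombolysis followed by PCI) can be read off directly and fed into the probability-tree construction the abstract describes.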