8 research outputs found

    TPDA2 Algorithm for Learning BN Structure from Missing Value and Outliers in Data Mining

    Full text link
    The Three-Phase Dependency Analysis (TPDA) algorithm has been shown to be among the most efficient structure-learning algorithms, requiring at most O(N^4) Conditional Independence (CI) tests. By integrating TPDA with a node topological sort algorithm, it can be used to learn Bayesian Network (BN) structure from data with missing values (referred to as the TPDA1 algorithm). Outliers can then be reduced by applying an outlier detection & removal algorithm as a pre-processing step for TPDA1. The proposed TPDA2 algorithm combines these ideas: outlier detection & removal, TPDA, and node topological sort.
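
A minimal sketch of the pipeline this abstract describes, under simplifying assumptions: complete (non-missing) Gaussian data, a z-score outlier filter, a Fisher-z conditional-independence test, and a drastically reduced draft-and-thin edge search constrained by a given node topological order. All function names and thresholds are illustrative, not the authors' implementation, and the paper's missing-value handling is omitted.

```python
import numpy as np
from scipy import stats

def remove_outliers(X, z_thresh=3.0):
    """Drop rows whose z-score exceeds z_thresh in any column
    (a simple stand-in for the outlier detection & removal step)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    return X[(np.abs((X - mu) / sigma) < z_thresh).all(axis=1)]

def fisher_z_ci_test(X, i, j, cond=()):
    """p-value for independence of columns i and j given `cond`, via
    partial correlation and Fisher's z-transform (Gaussian assumption)."""
    corr = np.corrcoef(X[:, [i, j, *cond]], rowvar=False)
    prec = np.linalg.pinv(corr)
    r = np.clip(-prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1]), -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(X.shape[0] - len(cond) - 3)
    return 2.0 * (1.0 - stats.norm.cdf(abs(z)))

def tpda2_sketch(X, topo_order, alpha=0.05):
    """Outlier removal, then a heavily simplified draft/thin search in which
    edges may only point from earlier to later nodes in topo_order."""
    X = remove_outliers(X)
    edges = set()
    # Drafting: connect ordered pairs that are not marginally independent.
    for a, i in enumerate(topo_order):
        for j in topo_order[a + 1:]:
            if fisher_z_ci_test(X, i, j) < alpha:
                edges.add((i, j))
    # Thinning: drop an edge if its endpoints become independent
    # given the child's remaining parents.
    for i, j in sorted(edges):
        others = tuple(p for p, c in edges if c == j and p != i)
        if others and fisher_z_ci_test(X, i, j, others) >= alpha:
            edges.discard((i, j))
    return edges

# Toy usage on a synthetic chain a -> b -> c.
rng = np.random.default_rng(0)
a = rng.normal(size=500); b = 2 * a + rng.normal(size=500); c = b + rng.normal(size=500)
print(tpda2_sketch(np.column_stack([a, b, c]), topo_order=[0, 1, 2]))  # usually recovers {(0, 1), (1, 2)}
```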

    The GC3 framework : grid density based clustering for classification of streaming data with concept drift.

    Get PDF
    Data mining is the process of discovering patterns in large sets of data. In recent years there has been a paradigm shift in how the data is viewed. Instead of considering the data as static and available in databases, data is now regarded as a stream as it continuously flows into the system. One of the challenges posed by the stream is its dynamic nature, which leads to a phenomenon known as Concept Drift. This creates a need for stream mining algorithms which are adaptive incremental learners capable of evolving and adjusting to the changes in the stream. Several models have been developed to deal with Concept Drift. These systems are discussed in this thesis and a new system, the GC3 framework, is proposed. The GC3 framework leverages the advantages of Grid Density based Clustering and ensemble based classifiers for streaming data to detect the cause of the drift and deal with it accordingly. In order to demonstrate the functionality and performance of the framework, a synthetic data stream called the TJSS stream is developed, which embodies a variety of drift scenarios, and the model's behavior is analyzed over time. Experimental evaluation with the synthetic stream and two real world datasets demonstrated high prediction capability of the proposed system with a small ensemble size and labeling ratio. Comparison of the methodology with a traditional static model with no drift detection capability, and with existing ensemble techniques for stream classification, showed promising results. Also, the analysis of the data structures maintained by the framework provided insight into the dynamics of the drift over time. The experimental analysis of the GC3 framework shows it to be promising for use in dynamic drifting environments where concepts can be incrementally learned in the presence of only partially labeled data.
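
To make the grid-density idea concrete, the toy summary below maps each streamed sample to a grid cell and keeps an exponentially decayed count per cell, so stale regions fade while newly dense regions can signal a drifted concept. The DecayingGrid class and its parameters are illustrative assumptions; the actual GC3 grid structures and ensemble logic are not reproduced here.

```python
from collections import defaultdict
import math

class DecayingGrid:
    """Toy grid-density summary for a data stream: each cell keeps an
    exponentially decayed count, so regions no longer being visited fade
    while newly dense regions stand out."""
    def __init__(self, cell_width=0.5, decay=0.999):
        self.cell_width = cell_width
        self.decay = decay
        self.weights = defaultdict(float)   # cell -> decayed count
        self.t_last = defaultdict(int)      # cell -> time of last update
        self.t = 0

    def _cell(self, x):
        return tuple(math.floor(v / self.cell_width) for v in x)

    def update(self, x):
        self.t += 1
        c = self._cell(x)
        # Lazily apply the decay accumulated since this cell was last touched.
        self.weights[c] *= self.decay ** (self.t - self.t_last[c])
        self.weights[c] += 1.0
        self.t_last[c] = self.t
        return c

    def dense_cells(self, threshold=5.0):
        return {c for c, w in self.weights.items() if w >= threshold}

grid = DecayingGrid()
for x in [(0.1, 0.2), (0.2, 0.1), (3.0, 3.1), (0.15, 0.25)]:
    grid.update(x)
print(grid.dense_cells(threshold=2.0))  # {(0, 0)}: only the repeatedly visited region is dense
```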

    Statistische und Probabilistische Methoden für Data Stream Mining (Statistical and Probabilistic Methods for Data Stream Mining)

    Get PDF
    The aim of this work is not only to highlight and summarize issues and challenges that arise when mining data streams, but also to find possible solutions to the problems illustrated. Due to the streaming nature of the data, it is impossible to hold the whole data set in main memory, so efficient on-line computations are needed. For instance, incremental calculations can be used to avoid restarting the computation from scratch each time new data arrive and to save memory. Another important aspect of data stream analysis is that the data generating process does not remain static, i.e. the underlying probabilistic model cannot be assumed to be stationary; changes in the data structure may occur over time. Dealing with non-stationary data requires change detection and on-line adaptation. Furthermore, real data is often contaminated with noise, which poses a specific problem for approaches dealing with data streams: they must be able to distinguish between changes due to noise and changes of the underlying data generating process or its parameters. In this work we propose a variety of methods that fulfil specific requirements of data stream mining. We also carry out a theoretical analysis of the effects of noise and changes in a data stream on sliding-window-based evolving systems, in order to illustrate the problem of a suboptimal window size. To make the validation of an evolving system meaningful, we propose some simple benchmark tests that give an idea of how much an evolving system might be misled by noise.
The main goal of this work is to illustrate central problems and important aspects of data stream mining and to present possible solutions to the problems discussed. Since the amount of data in a stream is potentially unbounded and the statistical properties of the data can change over time, classical data mining and statistical methods cannot be applied directly to data streams. For this reason, existing approaches are adapted to the data stream setting and new methods are developed in this work. For example, incremental or recursive computations of statistical parameters and statistical tests are presented, which are needed to run computations on-line and on hardware, such as electronic control units, with rather limited computing and memory capacity. A key problem is the distinction between random fluctuations in the sense of noise and genuine changes in data streams. Hypothesis tests with incremental computation lend themselves to this change detection problem, and this work presents incremental and window-based statistical tests for change detection. The majority of existing data stream mining algorithms do not use explicit change detection methods, but instead rely on sliding windows of fixed width for prediction. Only a few of these methods address how the window size should be chosen and what effect changes in the data have on prediction quality. To this end, a theoretical analysis of the optimal window width is carried out for two data models, showing that a suboptimal window size can drastically reduce prediction quality. Moreover, the presented data models can be used as benchmark tests for window-based approaches; this gives an impression of how strongly an "evolving system" that automatically adapts to data streams is adversely affected by noise in the data.
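
As an illustration of the incremental statistics and window-based change tests discussed above, the sketch below maintains Welford-style running moments and compares a reference window against a recent window with a crude two-sample z-statistic; the names, the threshold, and the specific test are assumptions made for illustration, not the methods proposed in the thesis.

```python
import math
import random

class RunningStats:
    """Welford-style incremental mean/variance: each sample updates the
    statistics in O(1) time and memory, so nothing is recomputed from scratch."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def mean_shift_detected(reference, recent, z_crit=3.0):
    """Crude two-sample z-test on the window means; flags a change only when
    the difference clearly exceeds what noise alone would explain."""
    se = math.sqrt(reference.variance / reference.n + recent.variance / recent.n)
    return se > 0 and abs(recent.mean - reference.mean) / se > z_crit

# Toy usage: a level shift halfway through the stream triggers the detector.
random.seed(0)
ref, cur = RunningStats(), RunningStats()
for t in range(400):
    x = random.gauss(0.0 if t < 200 else 1.5, 1.0)
    (ref if t < 200 else cur).update(x)
print(mean_shift_detected(ref, cur))  # expected: True
```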

    On robust and adaptive soft sensors.

    Get PDF
    In process industries, there is a great demand for additional process information such as the product quality level or the exact process state estimation. At the same time, there is a large amount of process data like temperatures, pressures, etc. measured and stored every moment. This data is mainly measured for process control and monitoring purposes, but its potential reaches far beyond these applications. The task of soft sensors is the maximal exploitation of this potential by extracting and transforming the latent information from the data into more useful process knowledge. Theoretically, achieving this goal should be straightforward, since the process data as well as the tools for soft sensor development, in the form of computational learning methods, are both readily available. In practice, however, there are still several obstacles which prevent soft sensors from broader application in the process industry. The identification of the sources of these obstacles and proposing a concept for dealing with them is the general purpose of this work. The proposed solution addressing the issues of current soft sensors is a conceptual architecture for the development of robust and adaptive soft sensing algorithms. The architecture reflects the results of two review studies that were conducted during this project. The first one focuses on the process industry aspects of soft sensor development and application. The main conclusions of this study are that soft sensor development is currently being done in a non-systematic, ad-hoc way, which results in a large amount of manual work needed for their development and maintenance. It is also found that a large part of the issues can be related to the process data upon which the soft sensors are built. The second review study dealt with the same topic but this time from the machine learning viewpoint. The review focused on the identification of machine learning tools which support the goals of this work. The machine learning concepts considered are: (i) general regression techniques for building soft sensors; (ii) ensemble methods; (iii) local learning; (iv) meta-learning; and (v) concept drift detection and handling. The proposed architecture arranges the above techniques into a three-level hierarchy, where the actual prediction-making models operate at the bottom level. Their predictions are flexibly merged by applying ensemble methods at the next higher level. Finally, from the top level, the underlying algorithm is managed by means of meta-learning methods. The architecture has a modular structure that allows new pre-processing, predictive or adaptation methods to be plugged in. Another important property of the architecture is that each of the levels can be equipped with adaptation mechanisms, which aim at prolonging the lifetime of the resulting soft sensors. The relevance of the architecture is demonstrated by means of a complex soft sensing algorithm, which can be seen as an instance of it. This algorithm provides mechanisms for autonomous selection of data pre-processing and predictive methods and their parameters. It also includes five different adaptation mechanisms, some of which can be applied on a sample-by-sample basis without any requirement to store the on-line data. Other, more complex ones are started only on demand if the performance of the soft sensor drops below a defined level. The actual soft sensors are built by applying the soft sensing algorithm to three industrial data sets.
The different application scenarios aim at analysing the fulfilment of the defined goals. It is shown that the soft sensors are able to follow changes in a dynamic environment and keep a stable performance level by exploiting the implemented adaptation mechanisms. It is also demonstrated that, although the algorithm is rather complex, it can be applied to develop simple and transparent soft sensors. In another experiment, the soft sensors are built without any manual model selection or parameter tuning, which demonstrates the ability of the algorithm to reduce the effort required for soft sensor development. If desired, however, the algorithm is at the same time very flexible and provides a number of parameters that can be manually optimised. Evidence of the algorithm's ability to deploy soft sensors with minimal training data, and thus to avoid time-consuming and costly training data collection, is also given in this work.
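
A minimal sketch of the ensemble level of such an architecture, assuming scikit-learn is available: a few incrementally trained regressors whose combination weights are adapted sample by sample from smoothed errors, together with a simple on-demand adaptation trigger. The class, the choice of SGDRegressor, and all thresholds are illustrative assumptions; the meta-learning level and the richer adaptation mechanisms of the thesis are not shown.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

class AdaptiveSoftSensorSketch:
    """Ensemble of incrementally trained regressors; members with lower
    recent error receive higher combination weights, and an alarm is raised
    when even the best member's smoothed error exceeds a threshold."""
    def __init__(self, n_models=3, forgetting=0.95, alarm=2.0):
        self.models = [SGDRegressor(learning_rate="constant", eta0=0.01 * (k + 1),
                                    random_state=k) for k in range(n_models)]
        self.errors = np.ones(n_models)   # exponentially smoothed absolute errors
        self.forgetting = forgetting
        self.alarm = alarm

    def partial_fit(self, x, y):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        for k, m in enumerate(self.models):
            if hasattr(m, "coef_"):       # score only after the first update
                err = abs(float(m.predict(x)[0]) - y)
                self.errors[k] = (self.forgetting * self.errors[k]
                                  + (1.0 - self.forgetting) * err)
            m.partial_fit(x, [y])

    def predict(self, x):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        w = 1.0 / (self.errors + 1e-9)
        w /= w.sum()
        return float(w @ [float(m.predict(x)[0]) for m in self.models])

    def needs_adaptation(self):
        """On-demand trigger, analogous to starting heavier adaptation only
        when performance drops below a defined level."""
        return float(self.errors.min()) > self.alarm

# Toy usage on a synthetic linear process.
rng = np.random.default_rng(0)
sensor = AdaptiveSoftSensorSketch()
for _ in range(500):
    x = rng.normal(size=4)
    y = 2.0 * x[0] - x[2] + rng.normal(scale=0.1)
    sensor.partial_fit(x, y)
print(round(sensor.predict([1.0, 0.0, 0.0, 0.0]), 2), sensor.needs_adaptation())
```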

    An adaptive learning approach for noisy data streams

    No full text
    Two critical challenges typically associated with mining data streams are concept drift and data contamination. To address these challenges, we seek learning techniques and models that are robust to noise and can adapt to changes in a timely fashion. We approach the stream-mining problem using a statistical estimation framework, and propose a fast and robust discriminative model for learning from noisy data streams. We build an ensemble of classifiers to achieve timely adaptation by weighting classifiers in a way that maximizes the likelihood of the data. We further employ robust statistical techniques to alleviate the problem of noise sensitivity. Experimental results on both synthetic and real-life data sets demonstrate the effectiveness of this new model learning approach.
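
A hedged sketch of the likelihood-based weighting idea: each ensemble member is weighted by the probability it assigns to the most recent labeled samples, so members that still explain the data dominate after a drift. The function, the GaussianNB members, and the soft-max normalisation are illustrative assumptions; the paper's robust-statistics safeguards against noise are not reproduced here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def likelihood_weights(classifiers, X_recent, y_recent, eps=1e-9):
    """Weight ensemble members by the log-likelihood of the newest labeled
    chunk, normalised with a stabilised soft-max."""
    logliks = []
    for clf in classifiers:
        proba = clf.predict_proba(X_recent)
        cols = np.searchsorted(clf.classes_, y_recent)
        logliks.append(np.log(proba[np.arange(len(y_recent)), cols] + eps).sum())
    logliks = np.asarray(logliks)
    w = np.exp(logliks - logliks.max())
    return w / w.sum()

# Toy usage: after a drift, the member trained on the new concept should
# receive nearly all of the weight.
rng = np.random.default_rng(1)
X_old = rng.normal(size=(300, 2)); y_old = (X_old[:, 0] > 0).astype(int)
X_new = rng.normal(size=(300, 2)); y_new = (X_new[:, 1] > 0).astype(int)  # drifted concept
ensemble = [GaussianNB().fit(X_old, y_old), GaussianNB().fit(X_new, y_new)]
print(likelihood_weights(ensemble, X_new[-50:], y_new[-50:]))
```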