193 research outputs found

    An ensemble based on neural networks with random weights for online data stream regression

    Most information sources in today's technological world generate data sequentially and rapidly, in the form of data streams. The evolving nature of the underlying processes often causes changes in the data distribution, known as concept drift, which are difficult to detect and reduce the accuracy of supervised learning algorithms. Consequently, online machine learning algorithms that can actively update themselves in response to changes in the data distribution are required. Although many strategies have been developed to tackle this problem, most are designed for classification. In regression, therefore, there is a need for accurate algorithms with dynamic updating mechanisms that can operate within a computational time compatible with today's demanding market. In this article, the authors propose a new bagging ensemble approach based on Neural Networks with Random Weights for online data stream regression. Compared with a recent algorithm for online data stream regression from the literature, the proposed method improves prediction accuracy and reduces the required computational time. The experiments use four synthetic datasets to evaluate the algorithm's response to concept drift, along with four benchmark datasets from different industries. The results indicate improved prediction accuracy, effective handling of concept drift, and much faster updating times than the existing approach. Additionally, the use of Design of Experiments as an effective tool for hyperparameter tuning is demonstrated.
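The abstract above does not give implementation details, so the following is only a minimal sketch of how a bagging ensemble of random-weight networks could be updated online. It assumes (none of this is confirmed by the source) that each member has fixed random hidden weights with a tanh activation, that output weights are updated by recursive least squares, and that bagging is done with Poisson(1) resampling in the style of online bagging. The class names `RandomWeightNet` and `OnlineBaggingRegressor` are hypothetical.

```python
import numpy as np


class RandomWeightNet:
    """Single-hidden-layer network with fixed random hidden weights.

    Only the linear output weights are learned, here (as an assumption)
    with a recursive least squares (RLS) update, one sample at a time.
    """

    def __init__(self, n_inputs, n_hidden, rng, reg=1.0):
        self.W = rng.normal(size=(n_inputs, n_hidden))  # fixed, never trained
        self.b = rng.normal(size=n_hidden)              # fixed, never trained
        self.beta = np.zeros(n_hidden)                  # learned output weights
        self.P = np.eye(n_hidden) / reg                 # inverse of regularised Gram matrix

    def _hidden(self, x):
        return np.tanh(x @ self.W + self.b)

    def predict(self, x):
        return float(self._hidden(x) @ self.beta)

    def update(self, x, y):
        h = self._hidden(x)
        Ph = self.P @ h
        gain = Ph / (1.0 + h @ Ph)
        self.beta += gain * (y - h @ self.beta)  # correct by the prediction error
        self.P -= np.outer(gain, Ph)             # rank-one update of the inverse


class OnlineBaggingRegressor:
    """Averages several RandomWeightNet members trained on Poisson-resampled data."""

    def __init__(self, n_inputs, n_members=10, n_hidden=20, seed=0):
        self.rng = np.random.default_rng(seed)
        self.members = [
            RandomWeightNet(n_inputs, n_hidden, self.rng) for _ in range(n_members)
        ]

    def predict(self, x):
        return float(np.mean([m.predict(x) for m in self.members]))

    def update(self, x, y):
        for m in self.members:
            # Each member sees the sample k ~ Poisson(1) times, which mimics
            # bootstrap resampling without storing the stream.
            for _ in range(self.rng.poisson(1.0)):
                m.update(x, y)
```

In a streaming loop, calling `predict` on each arriving sample before `update` gives test-then-train usage. This sketch has no explicit drift handling; a forgetting factor in the RLS step or replacement of weak members would be further assumptions.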

    Using Diversity Ensembles with Time Limits to Handle Concept Drift

    While traditional supervised learning focuses on static datasets, an increasing amount of data comes in the form of streams, where data is continuous and typically processed only once. A common problem with data streams is that the underlying concept we are trying to learn can be constantly evolving. This concept drift has attracted researchers' interest over the last few years, and there is a need for improved machine learning algorithms that can deal with it. A promising approach uses an ensemble of a diverse set of classifiers, whose constituent classifiers are re-trained when a concept drift is detected. The number of classifiers to maintain and the frequency of re-training are critical factors that determine classification accuracy in the presence of concept drift. This dissertation systematically investigated these issues in order to develop an improved classifier for online ensemble learning. The impact of reducing the time for which additional ensembles are kept in memory was studied using artificial and real-world datasets. The findings revealed that in many cases the number of time steps for which additional ensembles are in memory can be reduced without sacrificing prequential accuracy. It was also found that this new ensemble approach performed well in the presence of false concept drift.
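Prequential accuracy, mentioned above, is commonly computed in a test-then-train loop: each arriving example is first used to test the current model and only then to train it. The sketch below illustrates that loop. The `MajorityClassLearner` is a hypothetical stand-in model used only to make the example runnable; it is not the ensemble studied in the dissertation.

```python
from collections import Counter


class MajorityClassLearner:
    """Hypothetical stand-in learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No prediction is possible before the first label has been seen.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Test-then-train evaluation over an iterable of (x, y) pairs."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:  # test on the example first
            correct += 1
        model.learn(x, y)          # then train on the same example
        total += 1
    return correct / total if total else 0.0
```

Because every example is tested before it is learned, no separate hold-out set is needed, which suits streams that are processed only once.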

    A reduced labeled samples (RLS) framework for classification of imbalanced concept-drifting streaming data.

    Stream processing frameworks are designed to process streaming data that arrives over time. An example of such data is the stream of emails that a user receives every day. Most real-world data streams are also imbalanced, as is the stream of emails, which contains few spam emails compared with many legitimate ones. Classifying an imbalanced data stream is challenging for several reasons. First, data streams are huge and cannot be stored in memory for one-time processing. Second, if the data is imbalanced, the accuracy on the majority class dominates the results. Third, data streams change over time, which degrades model performance, so the model should be updated when such changes are detected. Finally, the true labels of all samples are not available immediately after classification, and in real-world applications only a fraction of the data can be labeled, because labeling is expensive and time consuming. In this thesis, a framework is proposed for modeling streaming data when the classes of the data samples are imbalanced. This framework is called Reduced Labeled Samples (RLS). RLS is a chunk-based learning framework that builds a model from a partially labeled data stream when the characteristics of the data change. In RLS, only a fraction of the samples are labeled and used in modeling, and the performance is not significantly different from that obtained with 100% labeling. RLS maintains an ensemble of classifiers to boost performance. RLS uses the information from labeled data in a supervised fashion, and is also extended to use the information from unlabeled data in a semi-supervised fashion. RLS addresses both binary and multi-class partially labeled data streams, and the results show that the basis of RLS is effective even for multi-class classification problems.
Overall, RLS is shown to be an effective framework for processing imbalanced and partially labeled data streams.

    Concept Drift Adaptation with Incremental–Decremental SVM

    Data classification in streams where the underlying distribution changes over time is known to be difficult. This problem, known as concept drift detection, involves two aspects: (i) detecting the concept drift and (ii) adapting the classifier. Online training only considers the most recent samples, which form the so-called shifting window. Dynamic adaptation to concept drift is performed by varying the width of the window. Defining an online Support Vector Machine (SVM) classifier that can cope with concept drift by dynamically changing the window size while avoiding retraining from scratch is currently an open problem. We introduce the Adaptive Incremental–Decremental SVM (AIDSVM), a model that adjusts the shifting window width using the Hoeffding statistical test. We evaluate AIDSVM performance on both synthetic and real-world drift datasets. Experiments show a significant accuracy improvement when encountering concept drift, compared with similar drift detection models defined in the literature. The AIDSVM is efficient, since it is not retrained from scratch after the shifting window slides.
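The abstract does not spell out how the Hoeffding test drives the window width, so the following is only a sketch of one common way such a test can be used. It assumes (this is an assumption, not AIDSVM's documented procedure) that the window holds per-sample 0/1 errors, that older and newer parts of the window are compared, and that the older part is dropped when the two error rates differ by more than a Hoeffding bound. The function names and the `min_size` parameter are hypothetical.

```python
import math


def hoeffding_epsilon(n_old, n_new, delta=0.05):
    """Two-sided Hoeffding bound on the difference of two sample means.

    For independent values in [0, 1], the gap between the two sample means
    deviates from its expectation by more than epsilon with probability
    at most delta.
    """
    return math.sqrt(0.5 * (1.0 / n_old + 1.0 / n_new) * math.log(2.0 / delta))


def shrink_window(errors, delta=0.05, min_size=10):
    """Return the part of the window to keep (errors are 0/1, oldest first).

    Every split into an older and a newer part is tested; the split with the
    largest margin above the Hoeffding bound is taken as the drift point and
    the older part is discarded. Without such a split the window is unchanged.
    """
    n = len(errors)
    best_split, best_margin = None, 0.0
    for split in range(min_size, n - min_size + 1):
        old, new = errors[:split], errors[split:]
        gap = abs(sum(old) / len(old) - sum(new) / len(new))
        margin = gap - hoeffding_epsilon(len(old), len(new), delta)
        if margin > best_margin:
            best_split, best_margin = split, margin
    return errors if best_split is None else errors[best_split:]
```

For example, a window of 50 correct predictions followed by 50 errors is cut back to the 50 most recent entries, while a window with a steady error rate is left at full width. An incremental–decremental learner would then unlearn the discarded samples instead of retraining from scratch.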

    Data driven methods for updating fault detection and diagnosis system in chemical processes

    Modern industrial processes are becoming more complex, and consequently monitoring them has become a challenging task. Fault Detection and Diagnosis (FDD), as a key element of process monitoring, needs to be investigated because of its essential role in decision-making processes. Among the available FDD methods, data-driven approaches are currently receiving increasing attention because of their relative simplicity of implementation. Regardless of the FDD type, one of the main traits of a reliable FDD system is its ability to be updated when conditions that were not considered during its initial training appear in the process. These new conditions may emerge either gradually or abruptly, but they are equally important, since in both cases they lead to poor FDD performance. Some methods have been proposed for these updating tasks, though mostly outside the research area of chemical engineering. They can be categorized into those dedicated to managing Concept Drift (CD), which appears gradually, and those that deal with novel classes, which appear abruptly. Besides lacking clear updating strategies, the available methods are reported to suffer from weak performance and inefficient training times. Accordingly, this thesis is mainly dedicated to data-driven FDD updating in chemical processes. The proposed schemes for handling novel classes of faults are based on unsupervised methods, while for coping with CD both supervised and unsupervised updating frameworks have been investigated. Furthermore, to enhance the functionality of FDD systems, some major data processing methods have been investigated, including imputation of missing values, feature selection, and feature extension. The suggested algorithms and frameworks for FDD updating have been evaluated through different benchmarks and scenarios.
As part of the results, the suggested algorithms for supervised handling of CD surpass the performance of traditional incremental learning by up to 50% in terms of the MGM score (a dimensionless score defined from the weighted F1 score and the training time). This improvement is achieved by algorithms that detect and forget redundant information and properly adjust the data window for timely updating and retraining of the fault detection system. Moreover, the proposed unsupervised FDD updating framework for dealing with novel faults in static and dynamic process conditions achieves up to 90% in terms of the NPP score (a dimensionless score defined from the number of samples whose class is correctly predicted). This result relies on an innovative framework that assigns samples either to new classes or to available classes by exploiting one-class classification techniques and clustering approaches.

    Learning in Dynamic Data-Streams with a Scarcity of Labels

    Analysing data in real time is a natural and necessary progression from traditional data mining. However, real-time analysis presents challenges beyond those of batch analysis: along with strict time and memory constraints, change is a major consideration. In a dynamic stream, the underlying process generating the stream is assumed to be non-stationary, so concepts within the stream will drift and change over time. Falsely assuming that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity-of-labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available), or in which manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants, Ant Colony Stream Clustering (ACSC), is proposed. This algorithm is shown to be faster and more accurate than comparable peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and, crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms; it can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method, allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification, and a novel framework (named COCEL: Clustering and One class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream).

    Concept drift from 1980 to 2020: a comprehensive bibliometric analysis with future research insight

    In nonstationary environments, high-dimensional data streams are generated unceasingly, and the underlying distribution of the training and target data may change over time. These changes are referred to as concept drift in the literature. Learning from evolving data streams demands adaptive or evolving approaches to handle concept drift, which is a comparatively new research topic. In this work, a wide-ranging comparative analysis of concept drift is presented to highlight state-of-the-art approaches, covering the last four decades, namely from 1980 to 2020. Given the scope and discipline, the Web of Science Core Collection database is taken as the basis of this study, and 1,564 publications related to concept drift are retrieved. From the classification and feature analysis of the valid literature data, bibliometric indicators are reported at the levels of countries/regions, institutions, and authors. The overall analyses of publications, citations, and cooperation networks reveal not only the most authoritative publications but also the most prolific institutions, influential authors, dynamic networks, etc. Furthermore, deeper analyses involving text mining, such as burst detection analysis, co-occurrence analysis, timeline view analysis, and bibliographic coupling analysis, are conducted to disclose current challenges and future research directions. This paper serves as a reference for further research on concept drift, highlighting emerging and trending topics and possible research directions with several graphs visualized using the VOSviewer and CiteSpace software.