1,232 research outputs found

    Concept Drift Adaptation for Real-time Prediction

    University of Technology Sydney, Faculty of Engineering and Information Technology. Concept drift refers to the phenomenon of distribution changes in a data stream. Using concept drift adaptation techniques to predict the target variable(s) of real-time data streams has gained increasing attention from researchers in recent years. This research aims to develop a set of concept drift adaptation methods for predicting the target variable of real-time data streams. The literature review reveals two issues in the area of concept drift: i) how the concept drift problem limits learning capability; and ii) how to adapt in more realistic scenarios in which data streams carry uncertainties other than concept drift. To address issue i), this research identifies three root causes of limited learning capability when concept drift occurs. When concept drift occurs in a data stream, prediction accuracy decreases because 1) the training set contains more than one pattern, so the predictor cannot be learned well; 2) a newly arrived data instance may present old patterns while an older instance presents the new pattern; and 3) few data instances are available when a new concept is identified at its early stage. Three concept drift adaptation methods are designed to address these three situations separately. Situation 1) is addressed by the FUZZ-CARE approach, which learns how many patterns exist in the training set and the membership degree of each instance in each pattern. To learn the predictor from the most relevant data rather than simply the most recently arrived data, the SEGA method sequentially picks out the best segments of the training data to update the predictors; this addresses situation 2). The AFN is designed to address situation 3) by generating samples of the new concept from previous data instances. To address issue ii), this research discusses the concept drift phenomenon under two more realistic scenarios. The first is concept drift in noisy data: the NoA method is designed to handle concept drift when the data stream contains signal noise. The second is concept drift in data that also exhibits temporal dependency: a theoretical study is conducted for the regression of data streams with concept drift and temporal dependency, and based on this study the DAR framework is established. To conclude, this thesis not only provides a set of effective drift adaptation methods for real-time prediction, but also contributes to the development of the concept drift research area.
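    As a rough illustration of situation 1), where a training window mixes more than one pattern, the sketch below clusters the window with a plain fuzzy c-means routine and weights each instance by its membership in the pattern that dominates the most recent arrivals before fitting a regressor. This is a toy sketch of the general idea only; the clustering routine, the pattern count and the recency-based weighting are assumptions made for the example, not the thesis's FUZZ-CARE method.

```python
# Toy sketch: soft memberships over a mixed training window used as sample weights.
import numpy as np
from sklearn.linear_model import LinearRegression

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means; returns (centers, membership matrix U of shape (n, c))."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Toy drifting stream: two regression patterns mixed in one training window.
rng = np.random.default_rng(1)
X_old = rng.uniform(-1, 1, (200, 2)); y_old = 3 * X_old[:, 0] + rng.normal(0, 0.1, 200)
X_new = rng.uniform(-1, 1, (50, 2));  y_new = -2 * X_new[:, 1] + rng.normal(0, 0.1, 50)
X = np.vstack([X_old, X_new]); y = np.concatenate([y_old, y_new])

# Cluster in the joint (X, y) space, then assume the newest instances define the
# current concept and weight training instances by membership in that pattern.
centers, U = fuzzy_cmeans(np.column_stack([X, y]), c=2)
current = U[-20:].mean(axis=0).argmax()
model = LinearRegression().fit(X, y, sample_weight=U[:, current])
print("weighted fit coefficients:", model.coef_)
```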

    A fuzzy kernel c-means clustering model for handling concept drift in regression

    © 2017 IEEE. Concept drift, given the huge volume of high-speed data streams, requires traditional machine learning models to be self-adaptive. Techniques to handle drift are especially needed in regression cases for a wide range of applications in the real world. There is, however, a shortage of research on drift adaptation for regression cases in the literature. One of the main obstacles to further research is the resulting model complexity when regression methods and drift handling techniques are combined. This paper proposes a self-adaptive algorithm, based on a fuzzy kernel c-means clustering approach and a lazy learning algorithm, called FKLL, to handle drift in regression learning. Using FKLL, drift adaptation first updates the learning set using lazy learning; fuzzy kernel c-means clustering is then used to determine the most relevant learning set. Experiments show that, compared to the original lazy learning algorithm and other state-of-the-art regression methods, the FKLL algorithm responds to drift as soon as the learning sets are updated and is also suitable for dealing with recurring drift.
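    The lazy-learning half of such a scheme can be pictured as a prequential loop in which the learning set is a bounded window refreshed as each instance arrives, so the nearest-neighbour prediction always draws on recent data. The sketch below shows only that loop; the window size and the plain k-NN regressor are assumptions made for the illustration, and the fuzzy kernel c-means step that FKLL uses to pick the most relevant learning set is deliberately omitted.

```python
# Minimal prequential loop: predict with the current learning set, then absorb the instance.
from collections import deque
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

window = deque(maxlen=300)          # bounded learning set, so old concepts fade out
knn = KNeighborsRegressor(n_neighbors=5)

def predict_then_update(x, y_true):
    """Prequential step: predict with the current window, then add (x, y) to it."""
    if len(window) >= 5:
        X, y = map(np.array, zip(*window))
        y_hat = knn.fit(X, y).predict(x.reshape(1, -1))[0]
    else:
        y_hat = 0.0                 # cold start before enough instances have arrived
    window.append((x, y_true))
    return y_hat

# Toy stream with an abrupt drift halfway through.
rng = np.random.default_rng(0)
errors = []
for t in range(1000):
    x = rng.uniform(-1, 1, 2)
    y = (2 * x[0] if t < 500 else -3 * x[1]) + rng.normal(0, 0.05)
    errors.append(abs(predict_then_update(x, y) - y))
print("mean absolute error:", np.mean(errors))
```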

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the de facto standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data, and is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
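    The core interpolation step SMOTE introduced can be restated compactly: each synthetic sample lies on the line segment between a minority-class instance and one of its k nearest minority-class neighbours. The sketch below reimplements that step for illustration only; for practical use, the maintained implementation in the imbalanced-learn package is the usual choice.

```python
# Compact restatement of the SMOTE interpolation step (Chawla et al., 2002).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, seed=0):
    """Generate n_synthetic samples from the minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X_min)
    samples = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                       # pick a minority instance
        j = rng.choice(idx[i][1:])                         # one of its k minority neighbours
        gap = rng.random()                                 # interpolation factor in [0, 1]
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

# Toy usage: oversample a 20-instance minority class with 80 synthetic instances.
X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote(X_min, n_synthetic=80)
print(X_new.shape)   # (80, 3)
```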

    IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective

    With the widespread use of sensors and smart devices in recent years, the data generation speed of Internet of Things (IoT) systems has increased dramatically. In IoT systems, massive volumes of data must be processed, transformed, and analyzed on a frequent basis to enable various IoT services and functionalities. Machine Learning (ML) approaches have shown their capacity for IoT data analytics. However, applying ML models to IoT data analytics tasks still faces many difficulties and challenges, specifically effective model selection, design/tuning, and updating, which have created massive demand for experienced data scientists. Additionally, the dynamic nature of IoT data may introduce concept drift issues, causing model performance degradation. To reduce human effort, Automated Machine Learning (AutoML) has become a popular field that aims to automatically select, construct, tune, and update machine learning models to achieve the best performance on specified tasks. In this paper, we conduct a review of existing methods in the model selection, tuning, and updating procedures in the area of AutoML in order to identify and summarize the optimal solutions for every step of applying ML algorithms to IoT data analytics. To justify our findings and help industrial users and researchers better implement AutoML approaches, a case study of applying AutoML to IoT anomaly detection problems is conducted in this work. Lastly, we discuss and classify the challenges and research directions for this domain. Comment: Published in Engineering Applications of Artificial Intelligence (Elsevier, IF:7.8); Code/An AutoML tutorial is available at Github link: https://github.com/Western-OC2-Lab/AutoML-Implementation-for-Static-and-Dynamic-Data-Analytic
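    A minimal picture of the select/tune/update loop such a review surveys is: cross-validated search over a few candidate model families, deployment of the best one, and a re-run of the search when a simple drift signal fires. The sketch below is a generic toy version of that loop; the candidate models, the accuracy-drop trigger and the batch sizes are illustrative assumptions, not the pipeline from the paper or its tutorial repository.

```python
# Toy select/tune/update loop with a crude drift-triggered re-selection step.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def select_and_tune(X, y):
    """Pick the best of two candidate model families by 3-fold CV accuracy."""
    candidates = [
        GridSearchCV(RandomForestClassifier(), {"n_estimators": [50, 200]}, cv=3),
        GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=3),
    ]
    fitted = [c.fit(X, y) for c in candidates]
    return max(fitted, key=lambda c: c.best_score_).best_estimator_

# Toy "IoT" stream processed in batches; re-select whenever batch accuracy drops sharply.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(300, 4)); y0 = (X0[:, 0] > 0).astype(int)
model = select_and_tune(X0, y0)
baseline = model.score(X0, y0)
for batch in range(5):
    Xb = rng.normal(size=(100, 4))
    yb = ((Xb[:, 1] > 0) if batch >= 3 else (Xb[:, 0] > 0)).astype(int)   # drift at batch 3
    acc = model.score(Xb, yb)
    if acc < baseline - 0.2:          # crude drift trigger: large accuracy drop
        model = select_and_tune(Xb, yb)
        baseline = model.score(Xb, yb)
        print(f"batch {batch}: drift suspected (acc={acc:.2f}), model re-selected")
```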

    Data Stream Clustering: A Review

    The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing of data streams is attracting growing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with little prior information about the data and it does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still-open problems. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems in data stream clustering are also discussed. Comment: Accepted for publication in Artificial Intelligence Review
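    To make the one-pass, constant-memory constraint concrete, the sketch below shows a bare-bones sequential (online) k-means that absorbs one instance at a time and nudges the nearest centre by a running-mean update. It is only a baseline illustration; the micro-cluster summaries, fading windows and outlier buffers used by the algorithms such a review covers are not shown, and the fixed number of clusters is an assumption.

```python
# Bare-bones online (sequential) k-means: one pass, constant memory.
import numpy as np

class OnlineKMeans:
    def __init__(self, k, dim):
        self.centers = np.zeros((k, dim))
        self.counts = np.zeros(k, dtype=int)

    def partial_fit(self, x):
        """Absorb a single instance: assign it to the nearest centre and nudge that centre."""
        empty = np.flatnonzero(self.counts == 0)
        if empty.size:                         # seed unused centres with the first arrivals
            j = empty[0]
        else:
            j = np.argmin(np.linalg.norm(self.centers - x, axis=1))
        self.counts[j] += 1
        self.centers[j] += (x - self.centers[j]) / self.counts[j]   # running mean update
        return j

# Toy stream drawn from three Gaussians, processed one instance at a time.
rng = np.random.default_rng(0)
means = np.array([[0, 0], [5, 5], [-5, 5]])
model = OnlineKMeans(k=3, dim=2)
for _ in range(3000):
    x = rng.normal(means[rng.integers(3)], 0.5)
    model.partial_fit(x)
print(np.round(model.centers, 2))
```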

    Data-based fault-tolerant model predictive controller: an application to a complex dearomatization process

    The tightening global competition during the last few decades has been the driving force for the optimisation of industrial plant operations through the use of advanced control methods, such as model predictive control (MPC). As faults in process measurements and actuators have become more common due to the increasing complexity of control systems, the need for fault-tolerant control (FTC) to prevent the degradation of controller performance, and thereby to sustain the optimisation of plant operations, has increased. Traditionally, the most actively studied fault detection and diagnosis (FDD) components of FTC strategies have been based on model-based approaches. In the modern process industries, however, there is a need for data-based FDD components due to the complexity and limited availability of mechanistic models. Recently, active FTC strategies using fault accommodation and controller reconfiguration have become popular due to increased computation capacity and the easier adaptability and lower overall implementation costs of active FTC strategies. The main focus of this thesis is the development of an active data-based fault-tolerant MPC (FTMPC) for an industrial dearomatization process. Three different parallel-running FTC strategies are developed that utilise data-based FDD methods together with fault accommodation and controller reconfiguration based FTC methods. The performance of three data-based FDD methods is first compared within a recognised testing environment. Based on this preliminary performance testing, the best FDD method is selected for the final FTMPC. Next, the performance of the FTMPC is validated with a simulation model of the industrial dearomatization process, and finally the profitability of the FTMPC is evaluated based on the validation results. According to the testing, the FTMPC performs efficiently and detects and prevents the effects of the most common faults in the analyser, flow and temperature measurements, and the controller actuators. The reliability of the MPC is increased and the profitability of the dearomatization process is enhanced due to lower off-spec production.
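    One common data-based FDD building block of the kind such strategies draw on is PCA residual monitoring: a PCA model fitted on fault-free operating data, with the squared reconstruction error (the SPE or Q statistic) of each new measurement vector compared against a control limit derived from the training residuals. The sketch below illustrates that idea only; the sensor layout, the number of components and the percentile-based limit are assumptions, and this is not necessarily the FDD method selected in the thesis.

```python
# Generic PCA/SPE fault detection sketch on simulated correlated sensor data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
T = rng.normal(size=(500, 3))                       # three latent process factors
A = rng.normal(size=(3, 6))
X_normal = T @ A + rng.normal(0, 0.1, (500, 6))     # six correlated fault-free sensors
pca = PCA(n_components=3).fit(X_normal)

def spe(X):
    """Squared prediction error per sample after projecting onto the PCA subspace."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

threshold = np.percentile(spe(X_normal), 99)        # simple empirical control limit

# A biased/stuck sensor shows up as an inflated SPE on new measurements.
x_ok = rng.normal(size=(1, 3)) @ A + rng.normal(0, 0.1, (1, 6))
x_fault = x_ok.copy(); x_fault[0, 2] += 3.0         # bias fault on sensor channel 2
print(spe(x_ok)[0] > threshold, spe(x_fault)[0] > threshold)   # typically: False True
```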