3,755 research outputs found

    Accumulating regional density dissimilarity for concept drift detection in data streams

    © 2017 Elsevier Ltd. In a non-stationary environment, newly received data may have different knowledge patterns from the data used to train learning models. As time passes, a learning model's performance may become increasingly unreliable. This problem is known as concept drift and is a common issue in real-world domains. Concept drift detection has attracted increasing attention in recent years. However, very few existing methods pay attention to small regional drifts, and their accuracy may vary due to differing statistical significance tests. This paper presents a novel concept drift detection method, based on regional-density estimation, named nearest neighbor-based density variation identification (NN-DVI). It consists of three components. The first is a k-nearest neighbor-based space-partitioning schema (NNPS), which transforms unmeasurable discrete data instances into a set of shared subspaces for density estimation. The second is a distance function that accumulates the density discrepancies in these subspaces and quantifies the overall differences. The third component is a tailored statistical significance test by which the confidence interval of a concept drift can be accurately determined. The distance applied in NN-DVI is sensitive to regional drift and has been proven to follow a normal distribution. As a result, NN-DVI's accuracy and false-alarm rate are statistically guaranteed. Additionally, several benchmarks have been used to evaluate the method, including both synthetic and real-world datasets. The overall results show that NN-DVI has better performance in addressing problems related to concept drift detection.

    Concept drift adaptation for learning with streaming data

    University of Technology Sydney. Faculty of Engineering and Information Technology. The term concept drift refers to a change in the distribution underlying the data. It is an inherent property of evolving data streams. Concept drift detection and adaptation has been considered an important component of learning under evolving data streams and has attracted increasing attention in recent years. According to the existing literature, the most commonly used definition of concept drift is constrained to discrete feature space. The categorization of concept drift is complicated and contributes little to solving concept drift problems. As a result, there is no uniform description of concept drift that covers both discrete and continuous feature space and that can serve as a guideline for addressing the root causes of concept drift. Existing concept drift handling methods mainly focus on identifying the best time to intercept training samples from data streams to construct the cleanest concept. Most only consider concept drift as a time-related distribution change and disregard the spatial information related to the drift. As a result, if a drift detection or adaptation method does not have spatial information regarding the drift regions, it can only update learning models or their training dataset in terms of time-related information, which may result in an incomplete model update or unnecessary training data reduction. In particular, if a false alarm is raised, updating the entire training set is costly and may degrade the overall performance of the learners. For the same reason, regional drifts will not trigger the adaptation process until they become globally significant, which delays drift detection. These disadvantages limit the accuracy of machine learning under evolving data streams.
To better address concept drift problems, this thesis proposes a novel Regional Drift Adaptation (RDA) framework that introduces spatial-related information into concept drift detection and adaptation. In other words, RDA-based algorithms consider both time-related and spatial information for concept drift handling (concept drift handling includes both drift detection and adaptation). In this thesis, a formal definition of regional drift is given, and it is theoretically proved that any type of concept drift can be represented as a set of regional drifts. According to these findings, a series of regional drift-oriented drift adaptation algorithms have been developed, including the Nearest Neighbor-based Density Variation Identification (NN-DVI) algorithm, which focuses on improving concept drift detection accuracy; the Local Drift Degree-based Density Synchronization Drift Adaptation (LDD-DSDA) algorithm, which focuses on boosting the performance of learners with concept drift adaptation; and the online Regional Drift Adaptation (online-RDA) algorithm, which incrementally solves concept drift problems quickly and with limited storage requirements. Finally, an extensive evaluation on various benchmarks, consisting of both synthetic and real-world data streams, was conducted. The competitive results underline the effectiveness of RDA in relation to concept drift handling. To conclude, this thesis targets an urgent issue in modern machine learning research. The approach taken in the thesis of building a regional concept drift detection and adaptation system is novel. There has previously been no systematic study on handling concept drift from a spatial perspective. The findings of this thesis contribute to both scientific research and practical applications.

    Multimodal Batch-Wise Change Detection

    We address the problem of detecting distribution changes in a novel batch-wise and multimodal setup. This setup is characterized by a stationary condition where batches are drawn from potentially different modalities among a set of distributions in R^d represented in the training set. Existing change detection (CD) algorithms assume that there is a unique, possibly multipeaked, distribution characterizing stationary conditions, and in a batch-wise multimodal context they exhibit either low detection power or poor control of false positives. We present MultiModal QuantTree (MMQT), a novel CD algorithm that uses a single histogram to model the batch-wise multimodal stationary conditions. During testing, MMQT automatically identifies which modality has generated the incoming batch and detects changes by means of a modality-specific statistic. We leverage the theoretical properties of QuantTree to: 1) automatically estimate the number of modalities in a training set and 2) derive a principled calibration procedure that guarantees false-positive control. Our experiments show that MMQT achieves high detection power and accurate control over false positives in synthetic and real-world multimodal CD problems. Moreover, we show the potential of MMQT in Stream Learning applications, where it proves effective at detecting concept drifts and the emergence of novel classes by solely monitoring the input distribution.

    FAC-fed: Federated adaptation for fairness and concept drift aware stream classification

    Federated learning is an emerging collaborative learning paradigm of machine learning involving distributed and heterogeneous clients. Enormous collections of continuously arriving heterogeneous data residing on distributed clients require federated adaptation of efficient mining algorithms to enable fair and high-quality predictions with privacy guarantees and minimal response delay. In this context, we propose a federated adaptation that mitigates discrimination embedded in the streaming data while handling concept drifts (FAC-Fed). We present a novel adaptive data augmentation method that mitigates client-side discrimination embedded in the data during optimization, resulting in an optimized and fair centralized server. Extensive experiments on a set of publicly available streaming and static datasets confirm the effectiveness of the proposed method. To the best of our knowledge, this work is the first attempt towards fairness-aware federated adaptation for stream classification. Therefore, to demonstrate the superiority of our proposed method over the state of the art, we compare the centralized version of our proposed method with three centralized stream classification baseline models (FABBOO, FAHT, CSMOTE). The experimental results show that our method outperforms the current methods in terms of both discrimination mitigation and predictive performance.

    Discrimination and Class Imbalance Aware Online Naive Bayes

    Fairness-aware mining of massive data streams is a growing and challenging concern in the contemporary domain of machine learning. Many stream learning algorithms are used to replace humans at critical decision-making points, e.g., hiring staff, assessing credit risk, etc. This calls for handling massive incoming information with minimum response delay while ensuring fair and high quality decisions. Recent discrimination-aware learning methods are optimized based on overall accuracy. However, the overall accuracy is biased in favor of the majority class; therefore, state-of-the-art methods mainly diminish discrimination by partially or completely ignoring the minority class. In this context, we propose a novel adaptation of Naïve Bayes to mitigate discrimination embedded in the streams while maintaining high predictive performance for both the majority and minority classes. Our proposed algorithm is simple, fast, and attains multi-objective optimization goals. To handle class imbalance and concept drifts, a dynamic instance weighting module is proposed, which gives more importance to recent instances and less importance to obsolete instances based on their membership in minority or majority class. We conducted experiments on a range of streaming and static datasets and deduced that our proposed methodology outperforms existing state-of-the-art fairness-aware methods in terms of both discrimination score and balanced accuracy.
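
The idea of giving recent instances more weight than obsolete ones can be illustrated with a minimal sketch. It assumes plain exponential decay of the counts in a categorical Naïve Bayes; it is not the weighting module from the paper, which also depends on minority or majority class membership and on fairness criteria not shown here. The class name `DecayedNaiveBayes` and the parameters `decay` and `alpha` are hypothetical.

```python
import math
from collections import defaultdict


class DecayedNaiveBayes:
    """Categorical Naive Bayes whose counts fade exponentially over time.

    Illustrative assumption only: every stored count is multiplied by
    `decay` before each new instance is added, so older instances weigh less.
    """

    def __init__(self, decay=0.99, alpha=1.0):
        self.decay = decay    # 1.0 means no forgetting
        self.alpha = alpha    # Laplace smoothing constant
        self.class_weight = defaultdict(float)    # class -> decayed count
        self.feature_weight = defaultdict(float)  # (class, index, value) -> decayed count
        self.seen_values = defaultdict(set)       # index -> values observed so far

    def learn(self, x, y):
        # Fade everything learned so far, then add the new instance with weight 1.
        for key in self.class_weight:
            self.class_weight[key] *= self.decay
        for key in self.feature_weight:
            self.feature_weight[key] *= self.decay
        self.class_weight[y] += 1.0
        for i, v in enumerate(x):
            self.feature_weight[(y, i, v)] += 1.0
            self.seen_values[i].add(v)

    def predict(self, x):
        total = sum(self.class_weight.values())
        n_classes = len(self.class_weight)
        best_class, best_score = None, -math.inf
        for c, w_c in self.class_weight.items():
            # Smoothed log prior plus smoothed log likelihoods.
            score = math.log((w_c + self.alpha) / (total + self.alpha * n_classes))
            for i, v in enumerate(x):
                n_values = max(len(self.seen_values.get(i, ())), 1)
                num = self.feature_weight.get((c, i, v), 0.0) + self.alpha
                den = w_c + self.alpha * n_values
                score += math.log(num / den)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

With `decay` below 1, a label change late in the stream takes over the prediction quickly, whereas `decay=1.0` reduces to ordinary count-based Naïve Bayes, which keeps favouring whichever class has more instances overall.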

    Concept Drift Adaptation for Real-time Prediction

    University of Technology Sydney. Faculty of Engineering and Information Technology. Concept drift refers to the phenomenon of distribution changes in a data stream. Using concept drift adaptation techniques to predict the target variable(s) of real-time data streams has gained ever-increasing attention from researchers in recent years. This research aims to develop a set of concept drift adaptation methods for predicting the target variable of real-time data streams. The literature review reveals two issues in the area of concept drift: i) how the concept drift problem limits the learning capability; ii) how to adapt in more realistic scenarios in which data streams have uncertainties other than concept drift. To address issue i), this research identifies three root causes of limited learning capability when concept drift occurs. It is found that when concept drift occurs in a data stream, the prediction accuracy decreases because 1) the training set contains more than one pattern, so the predictor cannot be learned well; 2) a newly arrived data instance may present old patterns while an old instance presents the new pattern; and 3) few data instances are available when a new concept is identified at its early stage. Three concept drift adaptation methods are designed to address the three situations separately. Situation 1) is solved by developing a - (FUZZ-CARE) approach. FUZZ-CARE can learn how many patterns exist in the training set and the membership degree of each instance belonging to each pattern. To learn the predictor with the most relevant data rather than the most recently arrived data, a - (SEGA) method is designed to sequentially pick out the best segments in the training data to update the predictors. This addresses situation 2). An (AFN) is designed to address situation 3) by generating samples of the new concept from the previous data instances.
To address issue ii), this research discusses the concept drift phenomenon under two more realistic scenarios. One is to solve the concept drift problem when data is noisy: a - (NoA) method is designed for handling concept drift when the data stream contains signal noise. The other is to solve the concept drift problem when data also contains temporal dependency: a theoretical study is conducted for the regression of data streams with concept drift and temporal dependency, and based on this study, a - (DAR) framework is established. To conclude, this thesis not only provides a set of effective drift adaptation methods for real-time prediction, but also contributes to the development of the concept drift area.

    A survey on detecting healthcare concept drift in AI/ML models from a finance perspective

    Data is incredibly significant in today's digital age because data represents facts and numbers from our regular life transactions. Data is no longer arriving in a static form; it is now arriving in a streaming fashion. Data streams are the arrival of limitless, continuous, and rapid data. The healthcare industry is a major generator of data streams. Processing data streams is extremely complex due to factors such as volume, pace, and variety. Data stream classification is difficult owing to concept drift. Concept drift occurs in supervised learning when the statistical properties of the target variable that the model predicts change unexpectedly. In this research, we focused on solving various forms of concept drift problems in healthcare data streams, and we outlined the existing statistical and machine learning methodologies for dealing with concept drift. The survey also emphasizes the use of deep learning algorithms for concept drift detection and describes the various healthcare datasets utilized for concept drift detection in data stream classification.

    Real-Time Machine Learning Models To Detect Cyber And Physical Anomalies In Power Systems

    A Smart Grid is a cyber-physical system (CPS) that tightly integrates computation and networking with physical processes to provide reliable two-way communication between electricity companies and customers. However, the grid's availability and integrity are constantly threatened by both physical faults and cyber-attacks, which may have a detrimental socio-economic impact. The frequency of faults and attacks is increasing every year due to extreme weather events and strong reliance on the open internet architecture, which is vulnerable to cyber-attacks. In May 2021, for instance, Colonial Pipeline, one of the largest pipeline operators in the U.S., which transports refined gasoline and jet fuel from Texas up the East Coast to New York, was forced to shut down after being attacked by ransomware, causing prices to rise at gasoline pumps across the country. Enhancing situational awareness within the grid can alleviate these risks and avoid their adverse consequences. As part of this process, phasor measurement units (PMUs) are among the suitable assets since they collect time-synchronized measurements of grid status (30-120 samples/s), enabling the operators to react rapidly to potential anomalies. However, it is still challenging to process and analyze the open-ended source of PMU data, as there are more than 2,500 PMUs distributed across the U.S. and Canada, each of which generates more than 1.5 TB/month of streamed data. Further, offline machine learning algorithms cannot be used in this scenario, as they require loading and scanning the entire dataset before processing. The ultimate objective of this dissertation is to develop early detection of cyber and physical anomalies in a real-time streaming environment by mining multivariate large-scale synchrophasor data. To accomplish this objective, we start by investigating the cyber and physical anomalies, analyzing their impact, and critically reviewing the current detection approaches.
Then, multiple machine learning models were designed to identify physical and cyber anomalies. The first is an artificial neural network-based approach for detecting the False Data Injection (FDI) attack. This attack was specifically selected as it poses a serious risk to the integrity and availability of the grid. Secondly, we extend this approach by developing a Random Forest Regressor-based model which not only detects anomalies, but also identifies their location and duration. Lastly, we develop a real-time Hoeffding tree-based model for detecting anomalies in streaming networks and explicitly handling concept drifts. These models have been tested, and the experimental results confirmed their superiority over the state-of-the-art models in terms of detection accuracy, false-positive rate, and processing time, making them potential candidates for strengthening the grid's security.