944 research outputs found
Autoencoder-based Anomaly Detection in Streaming Data with Incremental Learning and Concept Drift Adaptation
In today's digital universe, enormous amounts of data are produced in a
streaming manner in a variety of application areas. These data are often
unlabelled. In this case, identifying infrequent events, such as anomalies,
poses a great challenge. This problem becomes even more difficult in
non-stationary environments, which can cause deterioration of the predictive
performance of a model. To address the above challenges, the paper proposes an
autoencoder-based incremental learning method with drift detection
(strAEm++DD). The proposed method leverages the advantages of
both incremental learning and drift detection. We conduct an experimental study
using real-world and synthetic datasets with severe or extreme class imbalance,
and provide an empirical analysis of strAEm++DD. We further conduct a
comparative study, showing that the proposed method significantly outperforms
existing baseline and advanced methods. Comment: anomaly detection, concept drift,
incremental anomaly detection, incremental learning, autoencoders, data streams,
class imbalance, nonstationary environment
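The general idea of scoring streaming instances by reconstruction error while learning incrementally can be illustrated with a toy example. This is a minimal sketch and not the strAEm++DD method: the one-unit tied-weight linear autoencoder, the fixed `threshold`, the `warmup` length, and the synthetic stream are all assumptions made for illustration, and the drift detection component is omitted.

```python
import random


class TinyLinearAutoencoder:
    """One-unit, tied-weight linear autoencoder trained by per-instance SGD."""

    def __init__(self, weights, lr=0.05):
        self.w = list(weights)
        self.lr = lr

    def _residual(self, x):
        h = sum(wi * xi for wi, xi in zip(self.w, x))          # encode to one value
        return h, [xi - h * wi for wi, xi in zip(self.w, x)]   # x minus its reconstruction

    def error(self, x):
        """Squared reconstruction error, used here as the anomaly score."""
        _, r = self._residual(x)
        return sum(ri * ri for ri in r)

    def update(self, x):
        """One gradient step on ||x - (w.x) w||^2 for the tied weights w."""
        h, r = self._residual(x)
        rw = sum(ri * wi for ri, wi in zip(r, self.w))
        self.w = [wi + 2 * self.lr * (h * ri + rw * xi)
                  for wi, ri, xi in zip(self.w, r, x)]


rng = random.Random(0)
model = TinyLinearAutoencoder(weights=[0.5, 0.1])
threshold = 0.1   # hypothetical fixed threshold on the reconstruction error
warmup = 200      # hypothetical: the first instances are assumed normal
flags = []
for step in range(600):
    t = rng.uniform(-1.0, 1.0)
    # Normal data lie near the line x2 = x1; one anomaly is injected at step 550.
    x = [t + rng.gauss(0, 0.01), t + rng.gauss(0, 0.01)]
    if step == 550:
        x = [0.5, -0.5]
    is_anomaly = model.error(x) > threshold   # score first ...
    flags.append(is_anomaly)
    if step < warmup or not is_anomaly:
        model.update(x)                       # ... then learn from presumed-normal data
```

Scoring each instance before learning from it keeps the evaluation honest in a stream, and skipping updates on flagged instances avoids absorbing anomalies into the model of normal behaviour.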
Incremental learning of concept drift from imbalanced data
Learning from data sampled from a nonstationary distribution has been shown to be a very challenging problem in machine learning, because the joint probability distribution between the data and classes evolves over time. Thus learners must adapt their knowledge base, including their structure or parameters, to remain strong predictors. This phenomenon of learning from an evolving data source is akin to learning how to play a game while the rules of the game are changed, and it is traditionally referred to as learning concept drift. Climate data, financial data, epidemiological data, and spam detection are examples of applications that give rise to concept drift problems. An additional challenge arises when the classes to be learned are not represented (approximately) equally in the training data, as most machine learning algorithms work well only when the class distributions are balanced. However, rare categories are commonly faced in real-world applications, which leads to skewed or imbalanced datasets. Fraud detection, rare disease diagnosis, and anomaly detection are examples of applications that feature imbalanced datasets, where data from one category are severely underrepresented. Concept drift and class imbalance are traditionally addressed separately in machine learning, yet data streams can experience both phenomena. This work introduces Learn++.NIE (nonstationary & imbalanced environments) and Learn++.CDS (concept drift with SMOTE) as two new members of the Learn++ family of incremental learning algorithms that explicitly and simultaneously address the aforementioned phenomena. The former addresses concept drift and class imbalance through modified bagging-based sampling and by replacing a class-independent error weighting mechanism, which normally favors the majority class, with a set of measures that emphasize good predictive accuracy on all classes. The latter integrates Learn++.NSE, an algorithm for concept drift, with the synthetic sampling method known as SMOTE, to cope with class imbalance. This research also includes a thorough evaluation of Learn++.CDS and Learn++.NIE on several real and synthetic datasets and on several figures of merit, showing that both algorithms are able to learn in some of the most difficult learning environments.
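The synthetic sampling idea behind SMOTE, which Learn++.CDS relies on, can be sketched as interpolation between a minority example and one of its nearest minority neighbours. The code below is a simplified illustration under stated assumptions, not the Learn++.CDS implementation: the function name `smote_sample`, its parameters, and the toy data are hypothetical.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority, k=2, n_new=4)
```

Because every synthetic point lies on a segment between two real minority examples, the minority class is densified without simply duplicating existing instances.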
Request-and-Reverify: Hierarchical Hypothesis Testing for Concept Drift Detection with Expensive Labels
One important assumption underlying common classification models is the
stationarity of the data. However, in real-world streaming applications, the
data concept indicated by the joint distribution of feature and label is not
stationary but drifting over time. Concept drift detection aims to detect such
drifts and adapt the model so as to mitigate any deterioration in the model's
predictive performance. Unfortunately, most existing concept drift detection
methods rely on a strong and over-optimistic condition that the true labels are
available immediately for all already classified instances. In this paper, a
novel Hierarchical Hypothesis Testing framework with Request-and-Reverify
strategy is developed to detect concept drifts by requesting labels only when
necessary. Two methods, namely Hierarchical Hypothesis Testing with
Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with
Attribute-wise "Goodness-of-fit" (HHT-AG), are proposed respectively under the
novel framework. In experiments with benchmark datasets, our methods
demonstrate overwhelming advantages over state-of-the-art unsupervised drift
detectors. More importantly, our methods even outperform DDM (the widely used
supervised drift detector) when we use significantly fewer labels. Comment: Published as a conference paper at IJCAI 201
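For context, the supervised baseline named in this abstract, DDM, monitors the classifier's running error rate and signals a drift when it rises well above its recorded minimum. The following is a minimal sketch of that commonly described rule (warning at two and drift at three standard deviations above the minimum, after at least 30 instances). It is not the HHT-CU or HHT-AG method, and the class layout and the synthetic error stream are assumptions for illustration.

```python
import math


class DDM:
    """Sketch of the Drift Detection Method on a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warn=2.0, drift=3.0):
        self.min_samples = min_samples
        self.warn = warn
        self.drift = drift
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the recorded minimum of p + s
        self.s_min = math.inf   # standard deviation at that minimum

    def update(self, error):
        """error is 1 for a misclassification, 0 otherwise; returns the state."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift * self.s_min:
            self.reset()        # start monitoring the new concept from scratch
            return "drift"
        if self.p + s > self.p_min + self.warn * self.s_min:
            return "warning"
        return "stable"


detector = DDM()
stream = [1 if i % 10 == 0 else 0 for i in range(200)]   # about 10% errors
stream += [1 if i % 2 == 0 else 0 for i in range(200)]   # about 50% errors after the change
drift_points = [i for i, e in enumerate(stream) if detector.update(e) == "drift"]
```

Note that every call to `update` needs the true label of the instance to compute the 0/1 error, which is exactly the labelling cost that a request-only-when-necessary strategy aims to reduce.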
Evolving Ensemble Fuzzy Classifier
The concept of ensemble learning offers a promising avenue in learning from
data streams under complex environments because it addresses the bias and
variance dilemma better than its single model counterpart and features a
reconfigurable structure, which is well suited to the given context. While
various extensions of ensemble learning for mining non-stationary data streams
can be found in the literature, most of them are crafted under a static base
classifier and revisit preceding samples in the sliding window for a
retraining step. This feature causes computationally prohibitive complexity and
is not flexible enough to cope with rapidly changing environments. Their
complexity is often demanding because they involve a large collection of
offline classifiers, owing to the absence of structural complexity reduction
mechanisms and the lack of an online feature selection mechanism. A novel evolving
ensemble classifier, namely Parsimonious Ensemble (pENsemble), is proposed in
this paper. pENsemble differs from existing architectures in that it
is built upon an evolving classifier from data streams, termed Parsimonious
Classifier (pClass). pENsemble is equipped with an ensemble pruning mechanism,
which estimates a localized generalization error of a base classifier. A
dynamic online feature selection scenario is integrated into the pENsemble.
This method allows for dynamic selection and deselection of input features on
the fly. pENsemble adopts a dynamic ensemble structure to output a final
classification decision and features a novel drift detection scenario to
grow the ensemble structure. The efficacy of pENsemble has been
demonstrated through rigorous numerical studies with dynamic and evolving data
streams, where it delivers the most encouraging performance in attaining a
tradeoff between accuracy and complexity. Comment: this paper has been published by IEEE Transactions on Fuzzy System
COMPOSE: Compacted object sample extraction a framework for semi-supervised learning in nonstationary environments
An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this thesis, compacted object sample extraction (COMPOSE) is introduced: a computational geometry-based framework to learn from nonstationary streaming data where labels are unavailable (or presented very sporadically) after initialization. The feasibility and performance of the algorithm are evaluated on several synthetic and real-world data sets, which present various scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we also compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.
Data-efficient Online Classification with Siamese Networks and Active Learning
An ever-increasing volume of data is becoming available in a
streaming manner in many application areas, such as critical infrastructure
systems, finance and banking, security and crime, and web analytics. To meet
this new demand, predictive models need to be built online where learning
occurs on-the-fly. Online learning poses important challenges that affect the
deployment of online classification systems to real-life problems. In this
paper we investigate learning from limited labelled, nonstationary and
imbalanced data in online classification. We propose a learning method that
synergistically combines siamese neural networks and active learning. The
proposed method uses a multi-sliding window approach to store data, and
maintains separate and balanced queues for each class. Our study shows that the
proposed method is robust to data nonstationarity and imbalance, and
significantly outperforms baselines and state-of-the-art algorithms in terms of
both learning speed and performance. Importantly, it is effective even when
only 1% of the labels of the arriving instances are available. Comment: 2020 International Joint Conference on Neural Networks (IJCNN),
Glasgow, UK, 202
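The storage scheme mentioned in this abstract, separate and balanced queues per class, can be sketched with one bounded FIFO queue for each class. This is only an illustration of that memory structure under stated assumptions: the class name `BalancedQueues`, the queue length, and the toy stream are hypothetical, and the siamese network and active learning components are not shown.

```python
from collections import deque


class BalancedQueues:
    """One fixed-length FIFO queue per class, each holding at most `size` recent examples."""

    def __init__(self, classes, size):
        self.queues = {c: deque(maxlen=size) for c in classes}

    def add(self, x, label):
        # When a class queue is full, its oldest example is evicted automatically.
        self.queues[label].append(x)

    def training_set(self):
        """Return all stored (example, label) pairs across the class queues."""
        return [(x, c) for c, q in self.queues.items() for x in q]


memory = BalancedQueues(classes=[0, 1], size=3)
stream = [(0.1, 0), (0.2, 0), (0.3, 0), (0.9, 1), (0.4, 0), (0.5, 0), (0.8, 1)]
for x, y in stream:
    memory.add(x, y)
```

Because each class has its own bounded queue, a flood of majority-class arrivals cannot push minority examples out of memory, while old examples of every class are still forgotten as newer ones arrive.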
Incremental Learning on Non-stationary Data Stream using Ensemble Approach
Incremental learning on non-stationary distributions has been shown to be a very challenging problem in machine learning and data mining, because the joint probability distribution between the data and classes changes over time. Many real-time problems suffer from concept drift as they change over time. One example is an advertisement recommendation system, in which customers' behavior may change depending on the season of the year, on inflation, and on newly available products. An extra challenge arises when the classes to be learned are not represented equally in the training data, i.e. the classes are imbalanced, as most machine learning algorithms work well only when the training data is balanced. The objective of this paper is to develop an ensemble-based classification algorithm for non-stationary data streams (ENSDS) with a focus on two-class problems. In addition, we present an exhaustive comparison of the proposed algorithm with state-of-the-art classification approaches using different evaluation measures like recall, f-measure and g-mea
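Evaluation measures of this kind can be computed for a two-class problem as in the sketch below. It is an illustration, not the paper's evaluation code: the function name and toy labels are hypothetical, and the geometric mean of sensitivity and specificity is included on the assumption that it is the measure whose name is cut off in the abstract.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Recall, F-measure and geometric mean for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall of the negative class
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "f_measure": f_measure, "g_mean": g_mean}


# Imbalanced toy data: 2 positives and 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

On this toy data plain accuracy is 0.8 even though only half of the positives are found, which is why measures that account for each class separately are preferred under imbalance.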