14,544 research outputs found
Autoencoder-based Anomaly Detection in Streaming Data with Incremental Learning and Concept Drift Adaptation
In our digital universe nowadays, enormous amount of data are produced in a
streaming manner in a variety of application areas. These data are often
unlabelled. In this case, identifying infrequent events, such as anomalies,
poses a great challenge. This problem becomes even more difficult in
non-stationary environments, which can cause deterioration of the predictive
performance of a model. To address the above challenges, the paper proposes an
autoencoder-based incremental learning method with drift detection
(strAEm++DD). Our proposed method strAEm++DD leverages on the advantages of
both incremental learning and drift detection. We conduct an experimental study
using real-world and synthetic datasets with severe or extreme class imbalance,
and provide an empirical analysis of strAEm++DD. We further conduct a
comparative study, showing that the proposed method significantly outperforms
existing baseline and advanced methods.Comment: anomaly detection, concept drift, incremental anomaly detection,
concept drift, incremental learning, autoencoders, data streams, class
imbalance, nonstationary environment
Evolving Ensemble Fuzzy Classifier
The concept of ensemble learning offers a promising avenue in learning from
data streams under complex environments because it addresses the bias and
variance dilemma better than its single model counterpart and features a
reconfigurable structure, which is well suited to the given context. While
various extensions of ensemble learning for mining non-stationary data streams
can be found in the literature, most of them are crafted under a static base
classifier and revisits preceding samples in the sliding window for a
retraining step. This feature causes computationally prohibitive complexity and
is not flexible enough to cope with rapidly changing environments. Their
complexities are often demanding because it involves a large collection of
offline classifiers due to the absence of structural complexities reduction
mechanisms and lack of an online feature selection mechanism. A novel evolving
ensemble classifier, namely Parsimonious Ensemble pENsemble, is proposed in
this paper. pENsemble differs from existing architectures in the fact that it
is built upon an evolving classifier from data streams, termed Parsimonious
Classifier pClass. pENsemble is equipped by an ensemble pruning mechanism,
which estimates a localized generalization error of a base classifier. A
dynamic online feature selection scenario is integrated into the pENsemble.
This method allows for dynamic selection and deselection of input features on
the fly. pENsemble adopts a dynamic ensemble structure to output a final
classification decision where it features a novel drift detection scenario to
grow the ensemble structure. The efficacy of the pENsemble has been numerically
demonstrated through rigorous numerical studies with dynamic and evolving data
streams where it delivers the most encouraging performance in attaining a
tradeoff between accuracy and complexity.Comment: this paper has been published by IEEE Transactions on Fuzzy System
Concept Drift Detection in Data Stream Mining: The Review of Contemporary Literature
Mining process such as classification, clustering of progressive or dynamic data is a critical objective of the information retrieval and knowledge discovery; in particular, it is more sensitive in data stream mining models due to the possibility of significant change in the type and dimensionality of the data over a period. The influence of these changes over the mining process termed as concept drift. The concept drift that depict often in streaming data causes unbalanced performance of the mining models adapted. Hence, it is obvious to boost the mining models to predict and analyse the concept drift to achieve the performance at par best. The contemporary literature evinced significant contributions to handle the concept drift, which fall in to supervised, unsupervised learning, and statistical assessment approaches. This manuscript contributes the detailed review of the contemporary concept-drift detection models depicted in recent literature. The contribution of the manuscript includes the nomenclature of the concept drift models and their impact of imbalanced data tuples
Hellinger Distance Trees for Imbalanced Streams
Classifiers trained on data sets possessing an imbalanced class distribution
are known to exhibit poor generalisation performance. This is known as the
imbalanced learning problem. The problem becomes particularly acute when we
consider incremental classifiers operating on imbalanced data streams,
especially when the learning objective is rare class identification. As
accuracy may provide a misleading impression of performance on imbalanced data,
existing stream classifiers based on accuracy can suffer poor minority class
performance on imbalanced streams, with the result being low minority class
recall rates. In this paper we address this deficiency by proposing the use of
the Hellinger distance measure, as a very fast decision tree split criterion.
We demonstrate that by using Hellinger a statistically significant improvement
in recall rates on imbalanced data streams can be achieved, with an acceptable
increase in the false positive rate.Comment: 6 Pages, 2 figures, to be published in Proceedings 22nd International
Conference on Pattern Recognition (ICPR) 201
Evaluation methods and decision theory for classification of streaming data with temporal dependence
Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and needs to be mined in real time. In the stream setting the data distribution is often evolving over time, and models that update themselves during operation are becoming the state-of-the-art. This paper formalizes a learning and evaluation scheme of such predictive models. We theoretically analyze evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, therefore they should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors with datasets that have temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure for classification performance, that takes into account temporal dependence, and we recommend using it as the main performance measure in classification of streaming data
- …