Evaluation methods and decision theory for classification of streaming data with temporal dependence
Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and needs to be mined in real time. In the stream setting the data distribution often evolves over time, and models that update themselves during operation are becoming the state of the art. This paper formalizes a learning and evaluation scheme for such predictive models. We theoretically analyze the evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, and therefore should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors on datasets with temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. Finally, we propose a combined measure of classification performance that accounts for temporal dependence, and we recommend using it as the main performance measure in classification of streaming data.
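The failure mode this abstract describes can be sketched concretely. The snippet below is illustrative only (the function names and the exact measure are not taken from the paper): it computes a Kappa-style statistic not against a chance baseline but against a "no-change" classifier that always predicts the previous label. On a stream with strong temporal dependence, a classifier with a seemingly reasonable accuracy can score very poorly against that baseline.

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def kappa_statistic(p0, pe):
    """Generic Kappa: observed agreement p0 corrected for a reference accuracy pe."""
    return (p0 - pe) / (1 - pe) if pe < 1 else 0.0

def kappa_temporal(preds, labels):
    """Kappa computed against the persistent (previous-label) classifier.

    On temporally dependent streams the persistent baseline can be very
    accurate, so a high raw accuracy may still yield a low or negative
    value -- the diagnostic gap the paper points out.
    """
    p0 = accuracy(preds, labels)
    # Persistent classifier: at time t, predict the label seen at t - 1.
    p_per = accuracy(labels[:-1], labels[1:])
    return kappa_statistic(p0, p_per)

# A stream with strong temporal dependence: long runs of the same label.
labels = [0] * 50 + [1] * 50
always_zero = [0] * 100   # 50% accurate overall, yet useless on class 1
print(kappa_temporal(always_zero, labels))
```

Here the persistent baseline is right on 98 of 99 transitions, so the 50%-accurate predictor receives a strongly negative score, while a perfect predictor still scores 1.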
A Bi-Criteria Active Learning Algorithm for Dynamic Data Streams
Active learning (AL) is a promising way to efficiently build up training sets with minimal supervision. A learner deliberately queries specific instances to tune the classifier's model using as few labels as possible. The challenge for streaming is that the data distribution may evolve over time, and the model must therefore adapt. Another challenge is sampling bias, where the sampled training set does not reflect the underlying data distribution. In the presence of concept drift, sampling bias is more likely to occur, as the training set needs to represent the whole evolving data. To tackle these challenges, we propose a novel bi-criteria AL approach (BAL) that relies on two selection criteria, namely a label uncertainty criterion and a density-based criterion. While the first criterion selects instances that are the most uncertain in terms of class membership, the latter dynamically curbs the sampling bias by weighting the samples to reflect the true underlying distribution. To design and implement these two criteria for learning from streams, BAL adopts a Bayesian online learning approach and combines online classification and online clustering through the use of online logistic regression and online growing Gaussian mixture models, respectively. Empirical results obtained on standard synthetic and real-world benchmarks show the high performance of the proposed BAL method compared to state-of-the-art AL methods.
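A minimal sketch of a bi-criteria query rule of this kind (not the authors' code): label uncertainty comes from an online logistic model, and the density criterion, which the paper implements with online growing Gaussian mixture models, is replaced here by a single running diagonal Gaussian updated with Welford's algorithm, purely for illustration. All class and method names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class BiCriteriaSampler:
    def __init__(self, dim, threshold=0.5, lr=0.1):
        self.w = [0.0] * dim      # online logistic regression weights
        self.mean = [0.0] * dim   # running mean of seen instances
        self.m2 = [0.0] * dim     # sum of squared deviations (Welford)
        self.n = 0
        self.threshold = threshold
        self.lr = lr

    def _margin(self, x):
        return sum(wj * xj for wj, xj in zip(self.w, x))

    def _density(self, x):
        if self.n < 2:
            return 1.0            # no density estimate yet
        s = 0.0
        for xj, m, m2 in zip(x, self.mean, self.m2):
            var = max(m2 / self.n, 1e-6)
            s += (xj - m) ** 2 / var
        return math.exp(-0.5 * s)  # unnormalised diagonal Gaussian

    def query(self, x):
        """Decide whether to ask for the label of x."""
        p = sigmoid(self._margin(x))
        uncertainty = 1.0 - abs(2.0 * p - 1.0)   # peaks at p = 0.5
        score = uncertainty * self._density(x)   # the two criteria combined
        # Update the running density estimate with every seen instance.
        self.n += 1
        for j, xj in enumerate(x):
            delta = xj - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (xj - self.mean[j])
        return score >= self.threshold

    def learn(self, x, y):
        """Online logistic regression step on a purchased label y in {0, 1}."""
        p = sigmoid(self._margin(x))
        for j, xj in enumerate(x):
            self.w[j] += self.lr * (y - p) * xj
```

An untrained sampler queries everything (uncertainty is maximal at p = 0.5); once the model becomes confident on a region, instances there are no longer selected, and the density weight keeps far-from-typical points from dominating the labeled set.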
Clustering based active learning for evolving data streams
Data labeling is an expensive and time-consuming task, so choosing which instances to label is increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few exist for the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step that selects the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering step allows us to cover the whole data space and avoids oversampling examples from only a few areas. We compare our method with state-of-the-art active learning strategies on real datasets; the results highlight the performance improvement of our proposal. Experiments on parameter sensitivity are also reported.
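The batch-incremental idea can be illustrated with a toy sketch (the paper's actual clustering and selection rules may differ): cluster each incoming batch with a naive k-means, then pick one representative per cluster, here the instance nearest its centroid, so that the labeled set covers the whole data space rather than a few dense regions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Toy k-means with naive seeding (first k points as initial centroids)."""
    centroids = points[:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids, clusters

def select_for_labelling(batch, k):
    """Return one instance per cluster: the one closest to its centroid."""
    centroids, clusters = kmeans(batch, k)
    picks = []
    for c, cl in zip(centroids, clusters):
        if cl:
            picks.append(min(cl, key=lambda p: dist2(p, c)))
    return picks

# Two well-separated groups; one query is spent on each.
batch = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [0.05, 0.05]]
picks = select_for_labelling(batch, 2)
```

A pure uncertainty or random strategy could easily spend all its budget in one dense region; the per-cluster pick guarantees both groups contribute a labeled example.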
Adaptation Strategies for Automated Machine Learning on Evolving Data
Automated Machine Learning (AutoML) systems have been shown to efficiently build good models for new datasets. However, it is often not clear how well they can adapt when the data evolves over time. The main goal of this study is to understand the effect of data stream challenges, such as concept drift, on the performance of AutoML methods, and which adaptation strategies can be employed to make them more robust. To that end, we propose six concept drift adaptation strategies and evaluate their effectiveness on different AutoML approaches. We do this for a variety of AutoML approaches for building machine learning pipelines, including those that leverage Bayesian optimization, genetic programming, and random search with automated stacking. These are evaluated empirically on real-world and synthetic data streams with different types of concept drift. Based on this analysis, we propose ways to develop more sophisticated and robust AutoML techniques.
Comment: 12 pages, 7 figures (14 counting subfigures), submitted to TPAMI - AutoML Special Issue
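One family of adaptation strategies of the kind this abstract surveys is "detect and retrain". The sketch below is a hedged illustration, not one of the paper's six strategies: a drift monitor tracks windowed accuracy and refits the wrapped model on a recent window when accuracy drops sharply (the model here is a trivial majority-class predictor standing in for a full AutoML pipeline).

```python
from collections import Counter, deque

class MajorityClassModel:
    """Stand-in for an AutoML-built pipeline: predicts the majority label."""
    def fit(self, ys):
        self.label = Counter(ys).most_common(1)[0][0]
    def predict(self):
        return self.label

class DetectAndRetrain:
    def __init__(self, window=50, drop=0.5):
        self.window = window
        self.drop = drop                       # accuracy drop that signals drift
        self.recent = deque(maxlen=window)     # recent true labels
        self.hits = deque(maxlen=window)       # recent 0/1 correctness flags
        self.model = MajorityClassModel()
        self.model.fit([0])
        self.baseline = None
        self.retrains = 0

    def step(self, y):
        pred = self.model.predict()
        self.hits.append(1 if pred == y else 0)
        self.recent.append(y)
        if len(self.hits) == self.window:
            acc = sum(self.hits) / self.window
            if self.baseline is None:
                self.baseline = acc            # reference accuracy
            elif acc < self.baseline - self.drop:
                # Drift detected: refit on the recent window only.
                self.model.fit(list(self.recent))
                self.baseline = None
                self.hits.clear()
                self.retrains += 1
        return pred

# Abrupt concept drift halfway through the stream.
stream = [0] * 200 + [1] * 200
learner = DetectAndRetrain()
for y in stream:
    learner.step(y)
```

After the drift at t = 200, windowed accuracy collapses, the monitor fires once, and the refit model tracks the new concept; a static model would stay wrong for the rest of the stream.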