395 research outputs found
Request-and-Reverify: Hierarchical Hypothesis Testing for Concept Drift Detection with Expensive Labels
One important assumption underlying common classification models is the
stationarity of the data. However, in real-world streaming applications, the
data concept indicated by the joint distribution of feature and label is not
stationary but drifting over time. Concept drift detection aims to detect such
drifts and adapt the model so as to mitigate any deterioration in the model's
predictive performance. Unfortunately, most existing concept drift detection
methods rely on a strong and over-optimistic condition that the true labels are
available immediately for all already classified instances. In this paper, a
novel Hierarchical Hypothesis Testing framework with Request-and-Reverify
strategy is developed to detect concept drifts by requesting labels only when
necessary. Two methods, namely Hierarchical Hypothesis Testing with
Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with
Attribute-wise "Goodness-of-fit" (HHT-AG), are proposed respectively under the
novel framework. In experiments with benchmark datasets, our methods
demonstrate overwhelming advantages over state-of-the-art unsupervised drift
detectors. More importantly, our methods even outperform DDM (the widely used
supervised drift detector) when we use significantly fewer labels.Comment: Published as a conference paper at IJCAI 201
Adaptive Algorithms For Classification On High-Frequency Data Streams: Application To Finance
Mención Internacional en el título de doctorIn recent years, the problem of concept drift has gained importance in the financial
domain. The succession of manias, panics and crashes have stressed the nonstationary
nature and the likelihood of drastic structural changes in financial markets.
The most recent literature suggests the use of conventional machine learning and statistical
approaches for this. However, these techniques are unable or slow to adapt
to non-stationarities and may require re-training over time, which is computationally
expensive and brings financial risks.
This thesis proposes a set of adaptive algorithms to deal with high-frequency data
streams and applies these to the financial domain. We present approaches to handle
different types of concept drifts and perform predictions using up-to-date models.
These mechanisms are designed to provide fast reaction times and are thus applicable
to high-frequency data. The core experiments of this thesis are based on the prediction
of the price movement direction at different intraday resolutions in the SPDR S&P 500
exchange-traded fund. The proposed algorithms are benchmarked against other popular
methods from the data stream mining literature and achieve competitive results.
We believe that this thesis opens good research prospects for financial forecasting
during market instability and structural breaks. Results have shown that our proposed
methods can improve prediction accuracy in many of these scenarios. Indeed, the
results obtained are compatible with ideas against the efficient market hypothesis.
However, we cannot claim that we can beat consistently buy and hold; therefore, we
cannot reject it.Programa de Doctorado en Ciencia y Tecnología Informática por la Universidad Carlos III de MadridPresidente: Gustavo Recio Isasi.- Secretario: Pedro Isasi Viñuela.- Vocal: Sandra García Rodrígue
New perspectives and methods for stream learning in the presence of concept drift.
153 p.Applications that generate data in the form of fast streams from non-stationary environments, that is,those where the underlying phenomena change over time, are becoming increasingly prevalent. In thiskind of environments the probability density function of the data-generating process may change overtime, producing a drift. This causes that predictive models trained over these stream data become obsoleteand do not adapt suitably to the new distribution. Specially in online learning scenarios, there is apressing need for new algorithms that adapt to this change as fast as possible, while maintaining goodperformance scores. Examples of these applications include making inferences or predictions based onfinancial data, energy demand and climate data analysis, web usage or sensor network monitoring, andmalware/spam detection, among many others.Online learning and concept drift are two of the most hot topics in the recent literature due to theirrelevance for the so-called Big Data paradigm, where nowadays we can find an increasing number ofapplications based on training data continuously available, named as data streams. Thus, learning in nonstationaryenvironments requires adaptive or evolving approaches that can monitor and track theunderlying changes, and adapt a model to accommodate those changes accordingly. In this effort, Iprovide in this thesis a comprehensive state-of-the-art approaches as well as I identify the most relevantopen challenges in the literature, while focusing on addressing three of them by providing innovativeperspectives and methods.This thesis provides with a complete overview of several related fields, and tackles several openchallenges that have been identified in the very recent state of the art. Concretely, it presents aninnovative way to generate artificial diversity in ensembles, a set of necessary adaptations andimprovements for spiking neural networks in order to be used in online learning scenarios, and finally, adrift detector based on this former algorithm. All of these approaches together constitute an innovativework aimed at presenting new perspectives and methods for the field
Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review
Due to the advent and increase in the popularity of the Internet, people have
been producing and disseminating textual data in several ways, such as reviews,
social media posts, and news articles. As a result, numerous researchers have
been working on discovering patterns in textual data, especially because social
media posts function as social sensors, indicating peoples' opinions,
interests, etc. However, most tasks regarding natural language processing are
addressed using traditional machine learning methods and static datasets. This
setting can lead to several problems, such as an outdated dataset, which may
not correspond to reality, and an outdated model, which has its performance
degrading over time. Concept drift is another aspect that emphasizes these
issues, which corresponds to data distribution and pattern changes. In a text
stream scenario, it is even more challenging due to its characteristics, such
as the high speed and data arriving sequentially. In addition, models for this
type of scenario must adhere to the constraints mentioned above while learning
from the stream by storing texts for a limited time and consuming low memory.
In this study, we performed a systematic literature review regarding concept
drift adaptation in text stream scenarios. Considering well-defined criteria,
we selected 40 papers to unravel aspects such as text drift categories, types
of text drift detection, model update mechanism, the addressed stream mining
tasks, types of text representations, and text representation update mechanism.
In addition, we discussed drift visualization and simulation and listed
real-world datasets used in the selected papers. Therefore, this paper
comprehensively reviews the concept drift adaptation in text stream mining
scenarios.Comment: 49 page
Online adaptive decision trees based on concentration inequalities.
Classification trees are a powerful tool for mining non-stationary data streams. In these situations, mas sive data are constantly generated at high speed and the underlying target function can change over time.
The iadem family of algorithms is based on Hoeffding’s and Chernoff’s bounds and induces online deci sion trees from data streams, but is not able to handle concept drift. This study extends this family to
deal with time-changing data streams. The new online algorithm, named iadem-3, performs two main
actions in response to a concept drift. Firstly, it resets the variables affected by the change and main tains unbroken the structure of the tree, which allows for changes in which consecutive target functions
are very similar. Secondly, it creates alternative models that replace parts of the main tree when they
significantly improve the accuracy of the model, thereby rebuilding the main tree if needed. An online
change detector and a non-parametric statistical test based on Hoeffding’s bounds are used to guaran tee this significance. A new pruning method is also incorporated in iadem-3, making sure that all split
tests previously installed in decision nodes are useful. The learning model is also viewed as an ensem ble of classifiers, and predictions of the main and alternative models are combined to classify unlabeled
examples. iadem-3 is empirically compared with various well-known decision tree induction algorithms
for concept drift detection. We empirically show that our new algorithm often reaches higher levels of
accuracy with smaller decision tree models, maintaining the processing time bounded, irrespective of the
number of instances processed
Continual learning from stationary and non-stationary data
Continual learning aims at developing models that are capable of working on constantly evolving problems over a long-time horizon. In such environments, we can distinguish three essential aspects of training and maintaining machine learning models - incorporating new knowledge, retaining it and reacting to changes. Each of them poses its own challenges, constituting a compound problem with multiple goals.
Remembering previously incorporated concepts is the main property of a model that is required when dealing with stationary distributions. In non-stationary environments, models should be capable of selectively forgetting outdated decision boundaries and adapting to new concepts. Finally, a significant difficulty can be found in combining these two abilities within a single learning algorithm, since, in such scenarios, we have to balance remembering and forgetting instead of focusing only on one aspect.
The presented dissertation addressed these problems in an exploratory way. Its main goal was to grasp the continual learning paradigm as a whole, analyze its different branches and tackle identified issues covering various aspects of learning from sequentially incoming data. By doing so, this work not only filled several gaps in the current continual learning research but also emphasized the complexity and diversity of challenges existing in this domain. Comprehensive experiments conducted for all of the presented contributions have demonstrated their effectiveness and substantiated the validity of the stated claims
Bi-directional online transfer learning : a framework
Transfer learning uses knowledge learnt in source domains to aid predictions in a target domain. When source and target domains are online, they are susceptible to concept drift, which may alter the mapping of knowledge between them. Drifts in online environments can make additional information available in each domain, necessitating continuing knowledge transfer both from source to target and vice versa. To address this, we introduce the Bi-directional Online Transfer Learning (BOTL) framework, which uses knowledge learnt in each online domain to aid predictions in others. We introduce two variants of BOTL that incorporate model culling to minimise negative transfer in frameworks with high volumes of model transfer. We consider the theoretical loss of BOTL, which indicates that BOTL achieves a loss no worse than the underlying concept drift detection algorithm. We evaluate BOTL using two existing concept drift detection algorithms: RePro and ADWIN. Additionally, we present a concept drift detection algorithm, Adaptive Windowing with Proactive drift detection (AWPro), which reduces the computation and communication demands of BOTL. Empirical results are presented using two data stream generators: the drifting hyperplane emulator and the smart home heating simulator, and real-world data predicting Time To Collision (TTC) from vehicle telemetry. The evaluation shows BOTL and its variants outperform the concept drift detection strategies and the existing state-of-the-art online transfer learning technique
- …