Advances on Concept Drift Detection in Regression Tasks using Social Networks Theory
Mining data streams is one of the main topics in the machine learning area due to its application in many knowledge areas. One of the major challenges in mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been applied successfully to classification tasks, but they usually maintain a fixed-size ensemble of learners, running the risk of needlessly spending processing time and memory. In this paper, we present improvements to the Scale-free Network Regressor (SFNR), a dynamic ensemble-based method for regression that employs social networks theory. To detect concept drifts, SFNR uses the Adaptive Windowing (ADWIN) algorithm. Results show improvements in accuracy, especially in concept drift situations, and better performance than other state-of-the-art algorithms on both real and synthetic data.
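ADWIN detects drift by comparing statistics of sub-windows of recent data. As a rough illustration of that windowing idea only, not the actual ADWIN algorithm (which maintains exponential histograms and uses a Hoeffding-style statistical bound), a minimal two-window mean-comparison detector might look like the sketch below; the class name and threshold are illustrative.

```python
from collections import deque

class SimpleWindowDriftDetector:
    """Toy drift detector: compares the means of the older and the more
    recent half of a sliding window of errors. Illustrative only; the real
    ADWIN adaptively searches all split points with a statistical bound."""

    def __init__(self, window_size=200, threshold=0.1):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value):
        self.window.append(value)
        n = len(self.window)
        if n < 20:  # not enough data to compare halves
            return False
        half = n // 2
        older = list(self.window)[:half]
        recent = list(self.window)[half:]
        # Flag drift when the two halves disagree by more than the threshold
        return abs(sum(older) / len(older) - sum(recent) / len(recent)) > self.threshold

detector = SimpleWindowDriftDetector()
stream = [0.0] * 300 + [1.0] * 300  # abrupt change in the error rate
for i, x in enumerate(stream):
    if detector.update(x):
        print(f"possible drift around instance {i}")
        break
```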
A survey on feature drift adaptation: Definition, benchmark, challenges and future directions
Data stream mining is a fast-growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area, as even naive approaches produced gains in accuracy while reducing resource usage. Finally, we state current research topics, challenges, and future directions for feature drift adaptation.
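The naive feature drift detection approach benchmarked in the survey is not specified here; as one hypothetical example of such a naive scheme, the sketch below tracks a per-feature relevance score (absolute correlation with the target) over consecutive windows and flags features whose relevance shifts sharply. All names and thresholds are ours.

```python
import numpy as np

def feature_relevance(X, y):
    """Absolute Pearson correlation of each feature with the target."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    return np.abs(Xc.T @ yc) / denom

rng = np.random.default_rng(42)
n, d = 1000, 3
X = rng.normal(size=(2 * n, d))
# Concept 1: feature 0 drives y; Concept 2: feature 1 takes over (feature drift)
y = np.concatenate([X[:n, 0], X[n:, 1]]) + 0.1 * rng.normal(size=2 * n)

window = 200
prev = None
for start in range(0, 2 * n - window + 1, window):
    rel = feature_relevance(X[start:start + window], y[start:start + window])
    if prev is not None and np.any(np.abs(rel - prev) > 0.5):
        changed = np.where(np.abs(rel - prev) > 0.5)[0]
        print(f"window at {start}: relevance shift in features {changed.tolist()}")
    prev = rel
```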
Look At Me, No Replay! SurpriseNet: Anomaly Detection Inspired Class Incremental Learning
Continual learning aims to create artificial neural networks capable of accumulating knowledge and skills through incremental training on a sequence of tasks. The main challenge of continual learning is catastrophic interference, wherein new knowledge overrides or interferes with past knowledge, leading to forgetting. An associated issue is the problem of learning "cross-task knowledge," where models fail to acquire and retain knowledge that helps differentiate classes across task boundaries. A common solution to both problems is "replay," where a limited buffer of past instances is used to learn cross-task knowledge and mitigate catastrophic interference. However, a notable drawback of these methods is their tendency to overfit the limited replay buffer. In contrast, our proposed solution, SurpriseNet, addresses catastrophic interference by employing a parameter isolation method and learning cross-task knowledge using an auto-encoder inspired by anomaly detection. SurpriseNet is applicable to both structured and unstructured data, as it does not rely on image-specific inductive biases. We have conducted empirical experiments demonstrating the strengths of SurpriseNet on various traditional vision continual-learning benchmarks, as well as on structured datasets. The source code is available at https://doi.org/10.5281/zenodo.8247906 and https://github.com/tachyonicClock/SurpriseNet-CIKM-2
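To illustrate the anomaly-detection idea behind SurpriseNet, task inference by reconstruction error, here is a minimal sketch with one linear (PCA-style) auto-encoder per task; at test time an instance is routed to the task whose auto-encoder is least "surprised". The linear model choice and all names are our simplifications, not the paper's implementation (see the linked source code for that).

```python
import numpy as np

def fit_linear_autoencoder(X, k=2):
    """PCA-style linear auto-encoder: keep the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruction_error(x, mean, components):
    z = components @ (x - mean)          # encode
    x_hat = mean + components.T @ z      # decode
    return np.sum((x - x_hat) ** 2)      # "surprise" of this instance

rng = np.random.default_rng(0)
# Two "tasks" with different data distributions
task_data = [rng.normal(loc=0.0, size=(500, 8)),
             rng.normal(loc=5.0, size=(500, 8))]
models = [fit_linear_autoencoder(X) for X in task_data]

# At test time, route each instance to the model that is least surprised
x = rng.normal(loc=5.0, size=8)  # instance drawn from task 1
errors = [reconstruction_error(x, m, C) for m, C in models]
print("inferred task:", int(np.argmin(errors)))  # expected: 1
```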
SOKNL: A novel way of integrating K-nearest neighbours with adaptive random forest regression for data streams
Most research in machine learning for data streams has focused on classification algorithms, whereas regression methods have received far less attention. This paper proposes Self-Optimising K-Nearest Leaves (SOKNL), a novel forest-based algorithm for streaming regression problems. Specifically, Adaptive Random Forest Regression, a state-of-the-art online regression algorithm, is extended as follows: in each leaf, a representative data point, also called a centroid, is generated by compressing the information from all instances in that leaf. During the prediction step, instead of letting all trees in the forest participate, the distances between the input instance and all centroids from the relevant leaves are calculated, and only the k trees with the smallest distances are used for the prediction. Furthermore, we simplify usage of the algorithm by introducing a mechanism that tunes the k value dynamically and automatically based on historical information. This new algorithm produces promising predictive results and achieves a superior ranking according to statistical testing when compared with several standard stream regression methods over typical benchmark datasets. This improvement incurs only a small increase in runtime and memory consumption over the basic Adaptive Random Forest Regressor.
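A minimal sketch of SOKNL's prediction step, under the simplifying assumption that each tree contributes one reached-leaf centroid and one prediction: only the k trees whose centroids are closest to the query instance are averaged. Function and variable names are illustrative.

```python
import numpy as np

def soknl_predict(x, centroids, tree_predictions, k=3):
    """Average the predictions of the k trees whose reached-leaf
    centroids are closest to the query instance x."""
    distances = np.linalg.norm(centroids - x, axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k closest trees
    return tree_predictions[nearest].mean()

rng = np.random.default_rng(1)
n_trees = 10
centroids = rng.normal(size=(n_trees, 4))    # one leaf centroid per tree
tree_predictions = rng.normal(size=n_trees)  # each tree's leaf prediction
x = rng.normal(size=4)
print(soknl_predict(x, centroids, tree_predictions, k=3))
```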
Adaptive online domain incremental continual learning
Continual Learning (CL) problems pose significant challenges for Neural Networks (NNs). Online Domain Incremental Continual Learning (ODI-CL) refers to situations where the data distribution may change from one task to another. These changes can severely affect the learned model, causing it to focus too much on previous data and fail to properly learn and represent new concepts. Conversely, a model that constantly forgets previously learned knowledge may be deemed too unstable and unsuitable. This work proposes Online Domain Incremental Pool (ODIP), a novel method to cope with catastrophic forgetting. ODIP employs automatic concept drift detection and does not require task IDs during training. ODIP maintains a pool of learners, freezing and storing the best one after training on each task. An additional Task Predictor (TP) is trained to select the most appropriate NN from the frozen pool for prediction. We compare ODIP against regularization methods and observe that it yields competitive predictive performance.
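As a minimal sketch of ODIP's routing idea (our simplification, not the paper's method): a pool of frozen per-task models plus a task predictor, here reduced to a nearest-centroid rule over each task's mean input, that selects which frozen model handles each test instance.

```python
import numpy as np

class FrozenPool:
    """Pool of frozen per-task models plus a nearest-centroid task predictor."""

    def __init__(self):
        self.models = []     # frozen predictors, one per task
        self.centroids = []  # mean input per task, used by the task predictor

    def add_task(self, model, X_task):
        self.models.append(model)
        self.centroids.append(X_task.mean(axis=0))

    def predict(self, x):
        # Task predictor: route to the model whose task centroid is closest
        dists = [np.linalg.norm(x - c) for c in self.centroids]
        return self.models[int(np.argmin(dists))](x)

rng = np.random.default_rng(7)
X0 = rng.normal(loc=0.0, size=(200, 5))
X1 = rng.normal(loc=4.0, size=(200, 5))

pool = FrozenPool()
pool.add_task(lambda x: "model for task 0", X0)
pool.add_task(lambda x: "model for task 1", X1)
print(pool.predict(rng.normal(loc=4.0, size=5)))  # routed to task 1's model
```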
A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams
Unlabelled data appear in many domains and are particularly relevant to streaming applications, where even though data are abundant, labelled data are rare. To address the learning problems associated with such data, one can ignore the unlabelled data and focus only on the labelled data (supervised learning); use the labelled data and attempt to leverage the unlabelled data (semi-supervised learning); or assume some labels will be available on request (active learning). The first approach is the simplest, yet the amount of labelled data available limits the predictive performance. The second relies on finding and exploiting the underlying characteristics of the data distribution. The third depends on an external agent to provide the required labels in a timely fashion. This survey pays special attention to methods that leverage unlabelled data in a semi-supervised setting. We also discuss the delayed labelling issue, which impacts both fully supervised and semi-supervised methods. We propose a unified problem setting, discuss the learning guarantees and existing methods, and explain the differences between related problem settings. Finally, we review current benchmarking practices and propose adaptations to enhance them.
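As a concrete instance of the second approach (leveraging unlabelled data), the sketch below implements a simple self-training rule on a stream: a nearest-centroid classifier updates from the few labelled instances and pseudo-labels an unlabelled instance only when its prediction margin is large. The classifier, margin rule, and threshold are illustrative choices, not methods from the survey.

```python
import numpy as np

class SelfTrainingNearestCentroid:
    """Streaming nearest-centroid classifier that also learns from
    unlabelled instances via high-confidence pseudo-labels."""

    def __init__(self, n_classes, n_features, confidence=2.0):
        self.centroids = np.zeros((n_classes, n_features))
        self.counts = np.zeros(n_classes)
        self.confidence = confidence  # margin required to trust a pseudo-label

    def _update(self, x, y):
        self.counts[y] += 1
        self.centroids[y] += (x - self.centroids[y]) / self.counts[y]

    def learn(self, x, y=None):
        if y is not None:                  # labelled: supervised update
            self._update(x, y)
            return
        if self.counts.min() == 0:         # need every class seen once first
            return
        d = np.sort(np.linalg.norm(self.centroids - x, axis=1))
        if d[1] - d[0] > self.confidence:  # confident: accept pseudo-label
            self._update(x, self.predict(x))

    def predict(self, x):
        return int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))

rng = np.random.default_rng(3)
clf = SelfTrainingNearestCentroid(n_classes=2, n_features=4)
for i in range(2000):
    y = rng.integers(2)
    x = rng.normal(loc=5.0 * y, size=4)
    clf.learn(x, y if i % 20 == 0 else None)  # only 5% of labels arrive
print(clf.predict(rng.normal(loc=5.0, size=4)))  # expected: 1
```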
Balancing performance and energy consumption of bagging ensembles for the classification of data streams in edge computing
In recent years, the Edge Computing (EC) paradigm has emerged as an enabling factor for technologies like the Internet of Things (IoT) and 5G networks, bridging the gap between Cloud Computing services and end-users and supporting low latency, mobility, and location awareness for delay-sensitive applications. An increasing number of solutions in EC employ machine learning (ML) methods to perform data classification and other information processing tasks on continuous and evolving data streams. Usually, such solutions have to cope with vast amounts of data arriving as data streams while balancing energy consumption, latency, and the predictive performance of the algorithms. Ensemble methods achieve remarkable predictive performance when applied to evolving data streams due to their combination of several models and the possibility of selective resets. This work investigates a strategy that introduces short intervals to defer the processing of mini-batches, as sketched below. When well balanced, this strategy can improve performance (i.e., delay and throughput) and reduce the energy consumption of bagging ensembles that classify data streams. The experimental evaluation involved six state-of-the-art ensemble algorithms (OzaBag, OzaBag Adaptive Size Hoeffding Tree, Online Bagging ADWIN, Leveraging Bagging, Adaptive Random Forest, and Streaming Random Patches) using five widely used machine learning benchmark datasets with varied characteristics on three computer platforms. As a result, our strategy significantly reduced energy consumption in 96% of the experimental scenarios evaluated. Despite the trade-offs, it is possible to balance them to avoid significant losses in predictive performance.
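A minimal sketch of the deferral idea, under our own simplifying assumptions: buffer incoming instances and process them as a mini-batch only when a short interval elapses or the batch fills, trading a bounded delay for fewer, larger processing bursts during which the processor can otherwise idle. The interval, batch size, and function names are illustrative.

```python
import time

def process_batch(batch):
    print(f"processing {len(batch)} instances")

def run_deferred(stream, interval=0.05, max_batch=64):
    """Defer processing: accumulate instances, flush on interval or full batch."""
    batch, last_flush = [], time.monotonic()
    for instance in stream:
        batch.append(instance)
        now = time.monotonic()
        # Flushing in bursts lets the CPU idle (or batch-vectorise work) in
        # between, which is where the potential energy savings come from.
        if len(batch) >= max_batch or now - last_flush >= interval:
            process_batch(batch)
            batch, last_flush = [], now
    if batch:
        process_batch(batch)

run_deferred(range(300))
```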
Boosting decision stumps for dynamic feature selection on data streams
Feature selection identifies which features of a dataset are relevant to the learning task. It is widely used to improve computation times, reduce computational requirements, decrease the impact of the curse of dimensionality, and enhance the generalization rates of classifiers. In data streams, classifiers should benefit from all of the above and, more importantly, from the fact that the relevant subset of features may drift over time. In this paper, we propose a novel dynamic feature selection method for data streams called Adaptive Boosting for Feature Selection (ABFS). ABFS chains decision stumps and drift detectors and, as a result, identifies with reasonable success which features are relevant to the learning task as the stream progresses. In addition to our proposed algorithm, we bring feature selection-specific metrics from batch learning to streaming scenarios. Next, we evaluate ABFS according to these metrics in both synthetic and real-world scenarios. As a result, ABFS improves the classification rates of different types of learners and reduces their computational resource usage.
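A rough sketch of the stump-chaining idea, not the actual ABFS algorithm: each boosting round fits a one-feature decision stump on weighted instances, the features chosen by the stumps form the selected subset, and instance weights shift toward misclassified examples (AdaBoost-style). The candidate thresholds, round count, and names are illustrative.

```python
import numpy as np

def fit_stump(X, y, w):
    """Best single-feature threshold stump under instance weights w."""
    best = (None, None, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for t in np.percentile(X[:, j], [25, 50, 75]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) > 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def abfs_like_selection(X, y, rounds=3):
    """Boost stumps; the features the stumps pick are the selected subset."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    selected = []
    for _ in range(rounds):
        j, t, pol, err = fit_stump(X, y, w)
        selected.append(j)
        pred = np.where(pol * (X[:, j] - t) > 0, 1, -1)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)   # upweight misclassified instances
        w /= w.sum()
    return sorted(set(selected))

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 10))
y = np.where(X[:, 2] + 0.5 * X[:, 7] > 0, 1, -1)  # only features 2 and 7 matter
print(abfs_like_selection(X, y))  # expect a subset containing 2 (and often 7)
```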
Adaptive random forests for evolving data stream classification
Random Forests is currently one of the most widely used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and its low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random-forests algorithm that can be considered state-of-the-art in comparison to bagging- and boosting-based algorithms. In this work, we present the Adaptive Random Forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts to replicate random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drift without complex optimizations for different datasets. We present experiments with a parallel implementation of ARF, which shows no degradation in classification performance compared to a serial implementation, since trees and adaptive operators are independent of one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
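ARF's resampling follows online bagging, where each tree learns from each instance with a Poisson(lambda = 6) weight, and per-tree drift detection triggers tree replacement. The schematic sketch below keeps that structure but substitutes toy components: a majority-class counter stands in for a Hoeffding tree, and an exponentially smoothed error rate stands in for a proper drift detector such as ADWIN.

```python
import numpy as np

class MajorityLearner:
    """Stand-in for a Hoeffding tree: predicts its majority class."""
    def __init__(self, n_classes):
        self.counts = np.zeros(n_classes)

    def learn(self, y, weight=1):
        self.counts[y] += weight

    def predict(self):
        return int(np.argmax(self.counts))

def arf_like_step(ensemble, y, rng, error_rates, drift_threshold=0.6):
    """One train step: Poisson(6) resampling plus a crude per-model reset."""
    for i, model in enumerate(ensemble):
        k = rng.poisson(6)                 # ARF uses Poisson(lambda=6) weights
        if k > 0:
            wrong = model.predict() != y
            # Exponentially smoothed error as a stand-in for a drift detector
            error_rates[i] = 0.99 * error_rates[i] + 0.01 * wrong
            if error_rates[i] > drift_threshold:
                ensemble[i] = MajorityLearner(len(model.counts))  # reset on drift
                error_rates[i] = 0.0
            ensemble[i].learn(y, weight=k)

rng = np.random.default_rng(9)
ensemble = [MajorityLearner(2) for _ in range(5)]
errors = [0.0] * 5
stream = [0] * 500 + [1] * 500        # abrupt concept change at t=500
for y in stream:
    arf_like_step(ensemble, y, rng, errors)
votes = [m.predict() for m in ensemble]
print("ensemble vote:", max(set(votes), key=votes.count))  # expected: 1
```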