Advances on Concept Drift Detection in Regression Tasks using Social Networks Theory
Mining data streams is one of the main research topics in machine learning due
to its application in many knowledge areas. One of the major challenges in
mining data streams is concept drift, which requires the learner to discard the
current concept and adapt to a new one. Ensemble-based drift detection
algorithms have been applied successfully to classification tasks but usually
maintain a fixed-size ensemble of learners, running the risk of needlessly
spending processing time and memory. In this paper we present improvements to
the Scale-free Network Regressor (SFNR), a dynamic ensemble-based method for
regression that employs social networks theory. To detect concept drifts, SFNR
uses the Adaptive Window (ADWIN) algorithm. Results show improvements in
accuracy, especially in concept drift situations, and better performance
compared to other state-of-the-art algorithms on both real and synthetic data.
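The adaptive-window idea behind ADWIN can be illustrated with a minimal sketch. This is a simplified variant written for illustration, not the SFNR authors' code: it tests every split of the window instead of using ADWIN's compressed bucket structure, and the class name `SimpleAdwin` and the default `delta` are assumptions.

```python
import math
from collections import deque


class SimpleAdwin:
    """Simplified adaptive window (hypothetical name, illustration only).

    Keeps a window of recent values and drops the older part whenever two
    sub-windows have means that differ by more than a Hoeffding-style bound.
    """

    def __init__(self, delta=0.002):
        self.delta = delta      # confidence parameter (assumed default)
        self.window = deque()

    def _find_cut(self):
        """Return the size of the older sub-window to drop, or 0 if none."""
        values = list(self.window)
        n = len(values)
        total = sum(values)
        head_sum = 0.0
        for n0 in range(1, n):
            head_sum += values[n0 - 1]
            n1 = n - n0
            # Harmonic-mean term of the two sub-window sizes.
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) > eps:
                return n0
        return 0

    def update(self, value):
        """Add a value; return True if older data was discarded (drift)."""
        self.window.append(value)
        drift = False
        cut = self._find_cut()
        while cut:
            for _ in range(cut):
                self.window.popleft()
            drift = True
            cut = self._find_cut()
        return drift


detector = SimpleAdwin()
stream = [0.0] * 100 + [1.0] * 30   # abrupt change in the mean at index 100
detections = [i for i, v in enumerate(stream) if detector.update(v)]
```

In this toy stream the detector stays silent while the mean is stable and shrinks its window shortly after the change, which is the behaviour a drift-aware ensemble can use to decide when to replace or retrain members.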
A survey on feature drift adaptation: Definition, benchmark, challenges and future directions
Data stream mining is a fast-growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area, as even naive approaches produced gains in accuracy while reducing resource usage. Finally, we state current research topics, challenges and future directions for feature drift adaptation.
Evaluating k-NN in the Classification of Data Streams with Concept Drift
Data streams are often defined as large amounts of data flowing continuously
at high speed. Moreover, these data are likely subject to changes in data
distribution, known as concept drift. Given all the reasons mentioned above,
learning from streams is often online and under restrictions of memory
consumption and run-time. Although many classification algorithms exist, most
of the works published in the area use Naive Bayes (NB) and Hoeffding Trees
(HT) as base learners in their experiments. This article proposes an in-depth
evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data
streams subjected to concept drift. It also analyses the complexity in time and
the two main parameters of k-NN, i.e., the number of nearest neighbors used for
predictions (k), and window size (w). We compare different parameter values for
k-NN and contrast it to NB and HT both with and without a drift detector (RDDM)
in many datasets. We formulated and answered 10 research questions which led to
the conclusion that k-NN is a worthy candidate for data stream classification,
especially when the run-time constraint is not too restrictive.
Comment: 25 pages, 10 tables, 7 figures + 30 pages appendix
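The roles of the two k-NN parameters discussed above, k and w, can be seen in a minimal sliding-window sketch. The class name `WindowedKNN`, the Euclidean distance, and the majority vote are assumptions made for illustration; the article's experimental setup may differ.

```python
import math
from collections import Counter, deque


class WindowedKNN:
    """k-NN over the w most recent labelled instances (illustrative sketch)."""

    def __init__(self, k=3, w=100):
        self.k = k                      # number of neighbours used to vote
        self.window = deque(maxlen=w)   # oldest instances fall out automatically

    def predict(self, x):
        """Return the majority label among the k nearest stored instances."""
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]

    def learn(self, x, y):
        """Store a labelled instance, forgetting the oldest if the window is full."""
        self.window.append((x, y))


knn = WindowedKNN(k=1, w=2)
knn.learn((0.0,), "a")
knn.learn((1.0,), "b")
before = knn.predict((0.1,))   # nearest stored point is (0.0,) labelled "a"
knn.learn((0.0,), "b")         # the concept changes; the old "a" point is evicted
after = knn.predict((0.1,))
```

A small w forgets outdated concepts quickly but keeps fewer examples to vote with, while each prediction scans the whole window, which is why run-time grows with w.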
Adaptive random forests for evolving data stream classification
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts at replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
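The test-then-train evaluation mentioned above can be sketched as follows. This is a generic illustration, not ARF itself: the `MajorityClass` learner and the function name `prequential_accuracy` are hypothetical stand-ins, and any model exposing `predict` and `learn` could be substituted.

```python
from collections import Counter


class MajorityClass:
    """Trivial baseline learner used only to make the sketch runnable."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Predict the most frequent label seen so far (None before any training).
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test on each instance first, then train on it; return overall accuracy."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test step uses the model as it was
            correct += 1
        model.learn(x, y)           # train step happens only afterwards
        total += 1
    return correct / total if total else 0.0


stream = [((0,), "a"), ((0,), "a"), ((0,), "a"), ((0,), "b")]
accuracy = prequential_accuracy(stream, MajorityClass())
```

Because every instance is used for testing before the model sees its label, no separate hold-out set is needed and the accuracy reflects how the learner behaves as the stream evolves.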
Deep Single Models vs. Ensembles: Insights for a Fast Deployment of Parking Monitoring Systems
Searching for available parking spots in high-density urban centers is a
stressful task for drivers that can be mitigated by systems that know in
advance the nearest parking space available.
To this end, image-based systems offer cost advantages over other
sensor-based alternatives (e.g., ultrasonic sensors), requiring less physical
infrastructure for installation and maintenance.
Despite recent deep learning advances, deploying intelligent parking
monitoring is still a challenge since most approaches involve collecting and
labeling large amounts of data, which is laborious and time-consuming. Our
study aims to uncover the challenges in creating a global framework, trained
using publicly available labeled parking lot images, that performs accurately
across diverse scenarios, enabling parking space monitoring as a ready-to-use
system that can be deployed in a new environment. Through exhaustive
experiments involving different datasets and deep learning architectures,
including fusion strategies and ensemble methods, we found that models trained
on diverse datasets can achieve 95% accuracy without the burden of data
annotation and model training on the target parking lot.
Comment: An improved version of this manuscript was submitted to IEEE ICMLA
2023 (Dec/23)
Boosting decision stumps for dynamic feature selection on data streams
Feature selection targets the identification of which features of a dataset are relevant to the learning task. It is also widely known and used to improve computation times, reduce computation requirements, decrease the impact of the curse of dimensionality, and enhance the generalization rates of classifiers. In data streams, classifiers should benefit from all the items above, but more importantly, from the fact that the relevant subset of features may drift over time. In this paper, we propose a novel dynamic feature selection method for data streams called Adaptive Boosting for Feature Selection (ABFS). ABFS chains decision stumps and drift detectors and, as a result, identifies which features are relevant to the learning task as the stream progresses with reasonable success. In addition to our proposed algorithm, we bring feature selection-specific metrics from batch learning to streaming scenarios. Next, we evaluate ABFS according to these metrics in both synthetic and real-world scenarios. As a result, ABFS improves the classification rates of different types of learners and eventually enhances computational resource usage.
Random forest kernel for high-dimension low sample size classification
High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (RFD), that achieves state-of-the-art results for such problems. In this work, we transpose the core principle of this approach to solving HDLSS classification problems, by using the RF similarity measure as a learned precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly suited and accurate for this classification context. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods for the majority of HDLSS problems and remains at the same time very competitive for low or non-HDLSS problems.
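One way to picture a random-forest similarity used as a precomputed kernel is the sketch below. It assumes the common leaf co-occurrence definition (the fraction of trees in which two samples reach the same leaf); the abstract does not spell out the exact measure used in RFSVM, so this definition and the function name `rf_similarity_kernel` are assumptions.

```python
def rf_similarity_kernel(leaves_a, leaves_b):
    """Build a similarity matrix from per-tree leaf assignments.

    leaves_a[i][t] is the id of the leaf reached by sample i in tree t
    (same convention for leaves_b). Entry [i][j] of the result is the
    fraction of trees in which sample i and sample j share a leaf.
    """
    kernel = []
    for row_a in leaves_a:
        kernel_row = []
        for row_b in leaves_b:
            shared = sum(1 for la, lb in zip(row_a, row_b) if la == lb)
            kernel_row.append(shared / len(row_a))
        kernel.append(kernel_row)
    return kernel


# Toy leaf assignments for 3 samples in a 3-tree forest (made-up values).
leaves = [
    [0, 1, 2],
    [0, 1, 3],
    [4, 5, 6],
]
K = rf_similarity_kernel(leaves, leaves)
```

With scikit-learn, leaf assignments of this shape can be obtained from a fitted forest's `apply` method, and a matrix like `K` can be passed to `SVC(kernel="precomputed")`; whether this matches the authors' pipeline in detail is not stated in the abstract.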