598 research outputs found
A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework
Class imbalance poses new challenges when it comes to classifying data
streams. Many algorithms recently proposed in the literature tackle this
problem using a variety of data-level, algorithm-level, and ensemble
approaches. However, there is a lack of standardized and agreed-upon procedures
on how to evaluate these algorithms. This work presents a taxonomy of
algorithms for imbalanced data streams and proposes a standardized, exhaustive,
and informative experimental testbed to evaluate algorithms in a collection of
diverse and challenging imbalanced data stream scenarios. The experimental
study evaluates 24 state-of-the-art data streams algorithms on 515 imbalanced
data streams that combine static and dynamic class imbalance ratios,
instance-level difficulties, concept drift, real-world and semi-synthetic
datasets in binary and multi-class scenarios. This leads to the largest
experimental study conducted so far in the data stream mining domain. We
discuss the advantages and disadvantages of state-of-the-art classifiers in
each of these scenarios and we provide general recommendations to end-users for
selecting the best algorithms for imbalanced data streams. Additionally, we
formulate open challenges and future directions for this domain. Our
experimental testbed is fully reproducible and easy to extend with new methods.
This way we propose the first standardized approach to conducting experiments
in imbalanced data streams that can be used by other researchers to create
trustworthy and fair evaluation of newly proposed methods. Our experimental
framework can be downloaded from
https://github.com/canoalberto/imbalanced-streams
Ensemble deep learning: A review
Ensemble learning combines several individual models to obtain better
generalization performance. Currently, deep learning models with multilayer
processing architecture is showing better performance as compared to the
shallow or traditional classification models. Deep ensemble learning models
combine the advantages of both the deep learning models as well as the ensemble
learning such that the final model has better generalization performance. This
paper reviews the state-of-art deep ensemble models and hence serves as an
extensive summary for the researchers. The ensemble models are broadly
categorised into ensemble models like bagging, boosting and stacking, negative
correlation based deep ensemble models, explicit/implicit ensembles,
homogeneous /heterogeneous ensemble, decision fusion strategies, unsupervised,
semi-supervised, reinforcement learning and online/incremental, multilabel
based deep ensemble models. Application of deep ensemble models in different
domains is also briefly discussed. Finally, we conclude this paper with some
future recommendations and research directions
Gaining Insight into Determinants of Physical Activity using Bayesian Network Learning
Contains fulltext :
228326pre.pdf (preprint version ) (Open Access)
Contains fulltext :
228326pub.pdf (publisher's version ) (Open Access)BNAIC/BeneLearn 202
A survey on online active learning
Online active learning is a paradigm in machine learning that aims to select
the most informative data points to label from a data stream. The problem of
minimizing the cost associated with collecting labeled observations has gained
a lot of attention in recent years, particularly in real-world applications
where data is only available in an unlabeled form. Annotating each observation
can be time-consuming and costly, making it difficult to obtain large amounts
of labeled data. To overcome this issue, many active learning strategies have
been proposed in the last decades, aiming to select the most informative
observations for labeling in order to improve the performance of machine
learning models. These approaches can be broadly divided into two categories:
static pool-based and stream-based active learning. Pool-based active learning
involves selecting a subset of observations from a closed pool of unlabeled
data, and it has been the focus of many surveys and literature reviews.
However, the growing availability of data streams has led to an increase in the
number of approaches that focus on online active learning, which involves
continuously selecting and labeling observations as they arrive in a stream.
This work aims to provide an overview of the most recently proposed approaches
for selecting the most informative observations from data streams in the
context of online active learning. We review the various techniques that have
been proposed and discuss their strengths and limitations, as well as the
challenges and opportunities that exist in this area of research. Our review
aims to provide a comprehensive and up-to-date overview of the field and to
highlight directions for future work
Application of data analytics for predictive maintenance in aerospace: an approach to imbalanced learning.
The use of aircraft operational logs to predict potential failure that may lead to disruption poses many challenges and has yet to be fully explored. These logs are captured during each flight and contain streamed data from various aircraft subsystems relating to status and warning indicators. They may, therefore, be regarded as complex multivariate time-series data. Given that aircraft are high-integrity assets, failures are extremely rare, and hence the distribution of relevant data containing prior indicators will be highly skewed to the normal (healthy) case. This will present a significant challenge in using data-driven techniques to 'learning' relationships/patterns that depict fault scenarios since the model will be biased to the heavily weighted no-fault outcomes.
This thesis aims to develop a predictive model for aircraft component failure utilising data from the aircraft central maintenance system (ACMS). The initial objective is to determine the suitability of the ACMS data for predictive maintenance modelling. An exploratory analysis of the data revealed several inherent irregularities, including an extreme data imbalance problem, irregular patterns and trends, class overlapping, and small class disjunct, all of which are significant drawbacks for traditional machine learning algorithms, resulting in low-performance models. Four novel advanced imbalanced classification techniques are developed to handle the identified data irregularities. The first algorithm focuses on pattern extraction and uses bootstrapping to oversample the minority class; the second algorithm employs the balanced calibrated hybrid ensemble technique to overcome class overlapping and small class disjunct; the third algorithm uses a derived loss function and new network architecture to handle extremely imbalanced ratios in deep neural networks; and finally, a deep reinforcement learning approach for imbalanced classification problems in log-
based datasets is developed.
An ACMS dataset and its accompanying maintenance records were used to validate the proposed algorithms. The research's overall finding indicates that an advanced method for handling extremely imbalanced problems using the log-based ACMS datasets is viable for developing robust data-driven predictive maintenance models for aircraft component failure. When the four implementations were compared, deep reinforcement learning (DRL) strategies, specifically the proposed double deep State-action-reward-state-action with prioritised experience reply memory (DDSARSA+PER), outperformed other methods in terms of false-positive and false-negative rates for all the components considered. The validation result further suggests that the DDSARSA+PER model is capable of predicting around 90% of aircraft component replacements with a 0.005 false-negative rate in both A330 and A320 aircraft families studied in this researchPhD in Transport System
Video Deepfake Classification Using Particle Swarm Optimization-based Evolving Ensemble Models
The recent breakthrough of deep learning based generative models has led to the escalated generation of photo-realistic synthetic videos with significant visual quality. Automated reliable detection of such forged videos requires the extraction of fine-grained discriminative spatial-temporal cues. To tackle such challenges, we propose weighted and evolving ensemble models comprising 3D Convolutional Neural Networks (CNNs) and CNN-Recurrent Neural Networks (RNNs) with Particle Swarm Optimization (PSO) based network topology and hyper-parameter optimization for video authenticity classification. A new PSO algorithm is proposed, which embeds Muller’s method and fixed-point iteration based leader enhancement, reinforcement learning-based optimal search action selection, a petal spiral simulated search mechanism, and cross-breed elite signal generation based on adaptive geometric surfaces. The PSO variant optimizes the RNN topologies in CNN-RNN, as well as key learning configurations of 3D CNNs, with the attempt to extract effective discriminative spatial-temporal cues. Both weighted and evolving ensemble strategies are used for ensemble formulation with aforementioned optimized networks as base classifiers. In particular, the proposed PSO algorithm is used to identify optimal subsets of optimized base networks for dynamic ensemble generation to balance between ensemble complexity and performance. Evaluated using several well-known synthetic video datasets, our approach outperforms existing studies and various ensemble models devised by other search methods with statistical significance for video authenticity classification. The proposed PSO model also illustrates statistical superiority over a number of search methods for solving optimization problems pertaining to a variety of artificial landscapes with diverse geometrical layouts
Text Classification: A Review, Empirical, and Experimental Evaluation
The explosive and widespread growth of data necessitates the use of text
classification to extract crucial information from vast amounts of data.
Consequently, there has been a surge of research in both classical and deep
learning text classification methods. Despite the numerous methods proposed in
the literature, there is still a pressing need for a comprehensive and
up-to-date survey. Existing survey papers categorize algorithms for text
classification into broad classes, which can lead to the misclassification of
unrelated algorithms and incorrect assessments of their qualities and behaviors
using the same metrics. To address these limitations, our paper introduces a
novel methodological taxonomy that classifies algorithms hierarchically into
fine-grained classes and specific techniques. The taxonomy includes methodology
categories, methodology techniques, and methodology sub-techniques. Our study
is the first survey to utilize this methodological taxonomy for classifying
algorithms for text classification. Furthermore, our study also conducts
empirical evaluation and experimental comparisons and rankings of different
algorithms that employ the same specific sub-technique, different
sub-techniques within the same technique, different techniques within the same
category, and categorie
Data Balancing Techniques for Predicting Student Dropout Using Machine Learning
This research article was published MDPI, 2023Predicting student dropout is a challenging problem in the education sector. This is due to an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of dropout students. Developing a model without taking the data imbalance issue into account may lead to an ungeneralized model. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, Synthetic Minority Over Sampling, SMOTE with Edited Nearest Neighbor and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, Logistic Regression correctly classified the largest number of dropout students (57348 for the Uwezo dataset and 13430 for the India dataset) using the confusion matrix as the evaluation matrix. The applications of these models allow for the precise prediction of at-risk students and the reduction of dropout rates
Explainable adaptation of time series forecasting
A time series is a collection of data points captured over time, commonly found in many fields such as healthcare, manufacturing, and transportation. Accurately predicting the future behavior of a time series is crucial for decision-making, and several Machine Learning (ML) models have been applied to solve this task. However, changes in the time series, known as concept drift, can affect model generalization to future data, requiring thus online adaptive forecasting methods. This thesis aims to extend the State-of-the-Art (SoA) in the ML literature for time series forecasting by developing novel online adaptive methods.
The first part focuses on online time series forecasting, including a framework for selecting time series variables and developing ensemble models that are adaptive to changes in time series data and model performance. Empirical results show the usefulness and competitiveness of the developed methods and their contribution to the explainability of both model selection and ensemble pruning processes. Regarding the second part, the thesis contributes to the literature on online ML model-based quality prediction for three Industry 4.0 applications: NC-milling, bolt installation in the automotive industry, and Surface Mount Technology (SMT) in electronics manufacturing. The thesis shows how process simulation can be used to generate additional knowledge and how such knowledge can be integrated efficiently into the ML process. The thesis also presents two applications of explainable model-based quality prediction and their impact on smart industry practices
- …