598 research outputs found

    A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

    Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures for evaluating these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios, making it the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way, we propose the first standardized approach to conducting experiments on imbalanced data streams, which other researchers can use to create trustworthy and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
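
    As a concrete illustration of the evaluation protocol such a testbed relies on, the sketch below runs a prequential (test-then-train) loop over an imbalanced binary stream and tracks minority-class recall. The synthetic stream, the scikit-learn SGDClassifier stand-in, and the minority sample weight are illustrative assumptions, not the paper's framework or its 515 benchmark streams.

        # Prequential (test-then-train) evaluation of an incremental classifier
        # on an imbalanced binary stream. Stream, learner, and weights are
        # illustrative assumptions; the surveyed testbed is far larger.
        import numpy as np
        from sklearn.linear_model import SGDClassifier

        rng = np.random.default_rng(0)
        clf = SGDClassifier()
        classes = np.array([0, 1])
        tp = fn = 0

        for t in range(10_000):
            y = int(rng.random() < 0.05)              # ~5% minority class
            x = rng.normal(loc=2.0 * y, size=(1, 5))  # class-conditional features
            if t > 0:                                  # test first ...
                pred = clf.predict(x)[0]
                tp += int(pred == 1 and y == 1)
                fn += int(pred == 0 and y == 1)
            # ... then train, up-weighting the minority example (assumed weight)
            clf.partial_fit(x, [y], classes=classes, sample_weight=[20.0 if y else 1.0])

        print("minority-class recall:", tp / max(tp + fn, 1))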

    Ensemble deep learning: A review

    Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning models with multilayer processing architectures are showing better performance than shallow or traditional classification models. Deep ensemble learning models combine the advantages of both deep learning and ensemble learning, so that the final model has better generalization performance. This paper reviews state-of-the-art deep ensemble models and hence serves as an extensive summary for researchers. The ensemble models are broadly categorised into bagging, boosting, and stacking ensembles; negative-correlation-based deep ensemble models; explicit/implicit ensembles; homogeneous/heterogeneous ensembles; decision fusion strategies; and unsupervised, semi-supervised, reinforcement learning, online/incremental, and multilabel-based deep ensemble models. The application of deep ensemble models in different domains is also briefly discussed. Finally, we conclude the paper with some future recommendations and research directions.
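
    To make the simplest of the decision-fusion strategies listed above concrete, the sketch below performs unweighted soft voting: it averages the class-probability estimates of a few heterogeneous base learners. The shallow scikit-learn models and synthetic data are placeholders for the deep base learners discussed in the review.

        # Soft-voting fusion: average per-model class probabilities, take argmax.
        # Base models and data are illustrative stand-ins, not the review's models.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.neural_network import MLPClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        base_models = [
            LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=100, random_state=0),
            MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
        ]
        for m in base_models:
            m.fit(X_tr, y_tr)

        # Decision fusion: mean of per-model probability estimates
        proba = np.mean([m.predict_proba(X_te) for m in base_models], axis=0)
        y_pred = proba.argmax(axis=1)
        print("ensemble accuracy:", (y_pred == y_te).mean())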

    Gaining Insight into Determinants of Physical Activity using Bayesian Network Learning

    Open Access full text available (preprint and publisher's versions). BNAIC/BeneLearn.

    A survey on online active learning

    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data are only available in unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed over the last decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data and has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work provides an overview of the most recently proposed approaches for selecting the most informative observations from data streams in the context of online active learning. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research. Our review aims to provide a comprehensive and up-to-date overview of the field and to highlight directions for future work.
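
    A minimal sketch of the stream-based setting described above, using a fixed-uncertainty query rule: a label is requested only when the current model's decision margin on the incoming point is small. The simulated stream, oracle, margin threshold, and warm-start set are assumptions for illustration, not a method from the survey.

        # Stream-based active learning with a fixed-uncertainty query rule.
        # Stream, oracle, and threshold are illustrative assumptions.
        import numpy as np
        from sklearn.linear_model import SGDClassifier

        rng = np.random.default_rng(1)
        clf = SGDClassifier()
        classes = np.array([0, 1])
        labels_used, margin_threshold = 0, 0.5

        # Warm start on a handful of labeled points
        X0 = rng.normal(size=(10, 5))
        y0 = (X0[:, 0] > 0).astype(int)
        clf.partial_fit(X0, y0, classes=classes)

        for _ in range(5_000):
            x = rng.normal(size=(1, 5))
            margin = abs(clf.decision_function(x)[0])
            if margin < margin_threshold:        # uncertain -> query the oracle
                y = int(x[0, 0] > 0)             # simulated oracle label
                clf.partial_fit(x, [y])
                labels_used += 1

        print("labels requested:", labels_used)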

    Application of data analytics for predictive maintenance in aerospace: an approach to imbalanced learning.

    The use of aircraft operational logs to predict potential failures that may lead to disruption poses many challenges and has yet to be fully explored. These logs are captured during each flight and contain streamed data from various aircraft subsystems relating to status and warning indicators. They may, therefore, be regarded as complex multivariate time-series data. Given that aircraft are high-integrity assets, failures are extremely rare, and hence the distribution of relevant data containing prior indicators is highly skewed towards the normal (healthy) case. This presents a significant challenge for data-driven techniques that learn relationships or patterns depicting fault scenarios, since the model will be biased towards the heavily weighted no-fault outcomes. This thesis aims to develop a predictive model for aircraft component failure utilising data from the aircraft central maintenance system (ACMS). The initial objective is to determine the suitability of the ACMS data for predictive maintenance modelling. An exploratory analysis of the data revealed several inherent irregularities, including an extreme data imbalance problem, irregular patterns and trends, class overlapping, and small class disjuncts, all of which are significant drawbacks for traditional machine learning algorithms and result in low-performance models. Four novel advanced imbalanced classification techniques are developed to handle the identified data irregularities. The first algorithm focuses on pattern extraction and uses bootstrapping to oversample the minority class; the second employs a balanced calibrated hybrid ensemble technique to overcome class overlapping and small class disjuncts; the third uses a derived loss function and a new network architecture to handle extremely imbalanced ratios in deep neural networks; and finally, a deep reinforcement learning approach for imbalanced classification problems in log-based datasets is developed. An ACMS dataset and its accompanying maintenance records were used to validate the proposed algorithms. The research's overall finding indicates that an advanced method for handling extremely imbalanced problems using the log-based ACMS datasets is viable for developing robust data-driven predictive maintenance models for aircraft component failure. When the four implementations were compared, deep reinforcement learning (DRL) strategies, specifically the proposed double deep state-action-reward-state-action agent with prioritised experience replay memory (DDSARSA+PER), outperformed the other methods in terms of false-positive and false-negative rates for all the components considered. The validation results further suggest that the DDSARSA+PER model is capable of predicting around 90% of aircraft component replacements with a 0.005 false-negative rate in both the A330 and A320 aircraft families studied in this research.
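
    One widely used ingredient for the extreme-imbalance setting described above is a loss that up-weights the rare failure class. The sketch below shows this with PyTorch's BCEWithLogitsLoss and a pos_weight derived from the class ratio; the network, synthetic data, and weight are assumptions for illustration, not the thesis's derived loss function or its DDSARSA+PER agent.

        # Class-weighted binary cross-entropy for an extremely imbalanced target:
        # pos_weight scales the loss contribution of the rare (failure) class.
        # Network, data, and weight are illustrative assumptions.
        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        X = torch.randn(4096, 32)              # stand-in sensor/warning features
        y = (torch.rand(4096) < 0.01).float()  # ~1% failure events

        pos_weight = (y.numel() - y.sum()) / y.sum().clamp(min=1.0)
        criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

        model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)

        for _ in range(20):
            opt.zero_grad()
            loss = criterion(model(X).squeeze(1), y)
            loss.backward()
            opt.step()
        print("final weighted loss:", loss.item())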

    Video Deepfake Classification Using Particle Swarm Optimization-based Evolving Ensemble Models

    The recent breakthrough of deep learning based generative models has led to the escalated generation of photo-realistic synthetic videos with significant visual quality. Automated, reliable detection of such forged videos requires the extraction of fine-grained discriminative spatial-temporal cues. To tackle such challenges, we propose weighted and evolving ensemble models comprising 3D Convolutional Neural Networks (CNNs) and CNN-Recurrent Neural Networks (RNNs) with Particle Swarm Optimization (PSO) based network topology and hyper-parameter optimization for video authenticity classification. A new PSO algorithm is proposed, which embeds Muller's method and fixed-point iteration based leader enhancement, reinforcement learning-based optimal search action selection, a petal spiral simulated search mechanism, and cross-breed elite signal generation based on adaptive geometric surfaces. The PSO variant optimizes the RNN topologies in the CNN-RNNs, as well as key learning configurations of the 3D CNNs, in an attempt to extract effective discriminative spatial-temporal cues. Both weighted and evolving ensemble strategies are used for ensemble formulation with the aforementioned optimized networks as base classifiers. In particular, the proposed PSO algorithm is used to identify optimal subsets of optimized base networks for dynamic ensemble generation, balancing ensemble complexity and performance. Evaluated using several well-known synthetic video datasets, our approach outperforms existing studies and various ensemble models devised by other search methods, with statistical significance, for video authenticity classification. The proposed PSO model also illustrates statistical superiority over a number of search methods for solving optimization problems pertaining to a variety of artificial landscapes with diverse geometrical layouts.
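
    For readers unfamiliar with PSO, the sketch below is a bare-bones global-best PSO loop minimizing a toy sphere function; the swarm size, coefficients, and objective are standard textbook choices. The proposed variant layers leader enhancement, reinforcement-learning-based action selection, spiral search, and elite cross-breeding on top of a loop like this.

        # Canonical global-best PSO on a toy objective (sphere function).
        # Coefficients and objective are standard illustrative choices.
        import numpy as np

        rng = np.random.default_rng(0)
        dim, n_particles, iters = 10, 30, 200
        w, c1, c2 = 0.7, 1.5, 1.5                # inertia and acceleration terms

        def objective(x):
            return np.sum(x ** 2, axis=-1)       # sphere function

        pos = rng.uniform(-5, 5, size=(n_particles, dim))
        vel = np.zeros_like(pos)
        pbest, pbest_val = pos.copy(), objective(pos)
        gbest = pbest[pbest_val.argmin()].copy()

        for _ in range(iters):
            r1, r2 = rng.random((2, n_particles, dim))
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = pos + vel
            val = objective(pos)
            improved = val < pbest_val
            pbest[improved], pbest_val[improved] = pos[improved], val[improved]
            gbest = pbest[pbest_val.argmin()].copy()

        print("best objective value:", pbest_val.min())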

    Text Classification: A Review, Empirical, and Experimental Evaluation

    The explosive and widespread growth of data necessitates the use of text classification to extract crucial information from vast amounts of data. Consequently, there has been a surge of research in both classical and deep learning text classification methods. Despite the numerous methods proposed in the literature, there is still a pressing need for a comprehensive and up-to-date survey. Existing survey papers categorize algorithms for text classification into broad classes, which can lead to the misclassification of unrelated algorithms and incorrect assessments of their qualities and behaviors using the same metrics. To address these limitations, our paper introduces a novel methodological taxonomy that classifies algorithms hierarchically into fine-grained classes and specific techniques. The taxonomy includes methodology categories, methodology techniques, and methodology sub-techniques. Our study is the first survey to utilize this methodological taxonomy for classifying algorithms for text classification. Furthermore, our study conducts empirical evaluation and experimental comparisons and rankings of different algorithms that employ the same specific sub-technique, different sub-techniques within the same technique, different techniques within the same category, and different categories.
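
    To anchor the classical branch of such a taxonomy, the sketch below shows one specific technique: TF-IDF features feeding a linear classifier in scikit-learn. The toy corpus and labels are placeholders; the survey's evaluation covers many more categories and techniques.

        # A classical text classification pipeline: TF-IDF + linear classifier.
        # The toy corpus and labels are illustrative only.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        docs = [
            "the team won the championship game",
            "stocks fell sharply after the earnings report",
            "the striker scored twice in the final",
            "the central bank raised interest rates",
        ]
        labels = ["sports", "finance", "sports", "finance"]

        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                            LogisticRegression(max_iter=1000))
        clf.fit(docs, labels)
        print(clf.predict(["the goalkeeper saved a penalty"]))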

    Data Balancing Techniques for Predicting Student Dropout Using Machine Learning

    This research article was published by MDPI, 2023. Predicting student dropout is a challenging problem in the education sector. This is due to an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of dropout students. Developing a model without taking the data imbalance issue into account may lead to an ungeneralized model. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, Synthetic Minority Over Sampling (SMOTE), SMOTE with Edited Nearest Neighbor, and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of the balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, Logistic Regression correctly classified the largest number of dropout students (57,348 for the Uwezo dataset and 13,430 for the India dataset) using the confusion matrix as the evaluation metric. The application of these models allows for the precise prediction of at-risk students and the reduction of dropout rates.
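
    The balancing step compared in the study maps directly onto the imbalanced-learn package; the sketch below resamples a synthetic, skewed dataset with the same five strategies. The synthetic data stand in for the Uwezo and India datasets, which are not reproduced here.

        # Resampling strategies from imbalanced-learn on a synthetic skewed dataset;
        # the study itself uses Tanzanian (Uwezo) and Indian dropout data.
        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.over_sampling import RandomOverSampler, SMOTE
        from imblearn.under_sampling import RandomUnderSampler
        from imblearn.combine import SMOTEENN, SMOTETomek

        X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
        print("original:", Counter(y))

        samplers = {
            "Random Over Sampling": RandomOverSampler(random_state=0),
            "Random Under Sampling": RandomUnderSampler(random_state=0),
            "SMOTE": SMOTE(random_state=0),
            "SMOTE + ENN": SMOTEENN(random_state=0),
            "SMOTE + Tomek": SMOTETomek(random_state=0),
        }
        for name, sampler in samplers.items():
            X_res, y_res = sampler.fit_resample(X, y)
            print(name, Counter(y_res))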

    Explainable adaptation of time series forecasting

    A time series is a collection of data points captured over time, commonly found in many fields such as healthcare, manufacturing, and transportation. Accurately predicting the future behavior of a time series is crucial for decision-making, and several Machine Learning (ML) models have been applied to solve this task. However, changes in the time series, known as concept drift, can affect model generalization to future data, thus requiring online adaptive forecasting methods. This thesis aims to extend the State-of-the-Art (SoA) in the ML literature for time series forecasting by developing novel online adaptive methods. The first part focuses on online time series forecasting, including a framework for selecting time series variables and developing ensemble models that are adaptive to changes in time series data and model performance. Empirical results show the usefulness and competitiveness of the developed methods and their contribution to the explainability of both the model selection and ensemble pruning processes. In the second part, the thesis contributes to the literature on online ML model-based quality prediction for three Industry 4.0 applications: NC-milling, bolt installation in the automotive industry, and Surface Mount Technology (SMT) in electronics manufacturing. The thesis shows how process simulation can be used to generate additional knowledge and how such knowledge can be integrated efficiently into the ML process. The thesis also presents two applications of explainable model-based quality prediction and their impact on smart industry practices.
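
    A minimal sketch of the online adaptation idea underlying the first part of the thesis: forecast one step ahead, monitor the rolling error, and refit on a recent window when the error degrades (a simple proxy for concept drift). The autoregressive model, window sizes, and drift threshold are illustrative assumptions, not the thesis's adaptive ensembles or explainability components.

        # Online one-step forecasting with error-triggered retraining on a window.
        # Model, window sizes, and threshold are illustrative assumptions.
        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        t = np.arange(1000)
        series = np.sin(0.1 * t) + 0.1 * rng.normal(size=1000)
        series[500:] += 0.01 * (t[500:] - 500)   # abrupt drift at t = 500

        lags, window, threshold = 5, 100, 0.3

        def make_xy(s):
            X = np.stack([s[i:i + lags] for i in range(len(s) - lags)])
            return X, s[lags:]

        model = LinearRegression().fit(*make_xy(series[:window]))
        errors, retrains = [], 0
        for i in range(window, len(series) - 1):
            x = series[i - lags + 1:i + 1].reshape(1, -1)
            errors.append(abs(model.predict(x)[0] - series[i + 1]))
            if len(errors) >= 30 and np.mean(errors[-30:]) > threshold:
                model = LinearRegression().fit(*make_xy(series[i - window:i + 1]))
                errors, retrains = [], retrains + 1

        print("number of adaptations:", retrains)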