434 research outputs found
Oversampling technique in student performance classification from engineering course
The first year of an engineering degree is important for proper academic planning. All first-year subjects are essential to an engineering foundation. Student performance prediction helps academics support students in improving their performance. Students can also check their performance themselves; if they are aware that their performance is low, they can take steps to improve it. This research focused on combining oversampling of minority-class data with various classifier models. The oversampling techniques were SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN, and four classifiers were applied: MLP, gradient boosting, AdaBoost, and random forest. The results showed that Borderline-SMOTE gave the best minority-class prediction with several classifiers
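The oversampling idea shared by the SMOTE family can be sketched as follows. This is a minimal pure-Python illustration of plain SMOTE interpolation only, not the study's pipeline; the function name `smote`, its parameters, and the toy data are assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples with two features each
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Borderline-SMOTE, SVMSMOTE, and ADASYN differ mainly in which minority samples are chosen as `base` and how many synthetic points each one receives; the interpolation step stays the same.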
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is
considered the "de facto" standard in the framework of learning from imbalanced data. This
is due to its simplicity in the design of the procedure, as well as its robustness when applied
to different types of problems. Since its publication in 2002, SMOTE has proven
successful in a variety of applications from several different domains. SMOTE has also inspired
several approaches to counter the issue of class imbalance, and has also significantly
contributed to new supervised learning paradigms, including multilabel classification, incremental
learning, semi-supervised learning, and multi-instance learning, among others. It is
a standard benchmark for learning from imbalanced data. It is also featured in a number of
different software packages, from open source to commercial. In this paper, marking the
fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current
state of affairs with SMOTE, its applications, and also identify the next set of challenges
to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology
under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project
BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016;
and the National Science Foundation (NSF) Grant IIS-1447795
Advances in machine learning algorithms for financial risk management
In this thesis, three novel machine learning techniques are introduced to address distinct
yet interrelated challenges involved in financial risk management tasks. These approaches
collectively offer a comprehensive strategy, beginning with the precise classification of credit
risks, advancing through the nuanced forecasting of financial asset volatility, and ending
with the strategic optimisation of financial asset portfolios.
Firstly, a Hybrid Dual-Resampling and Cost-Sensitive technique has been proposed to combat the prevalent issue of class imbalance in financial datasets, particularly in credit risk
assessment. The key process involves the creation of heuristically balanced datasets to effectively address the problem. It uses a resampling technique based on Gaussian mixture
modelling to generate a synthetic minority class from the minority class data and concurrently uses k-means clustering on the majority class. Feature selection is then performed
using the Extra Tree Ensemble technique. Subsequently, a cost-sensitive logistic regression
model is applied to predict the probability of default using the heuristically balanced
datasets. The results underscore the effectiveness of our proposed technique, with superior
performance observed in comparison to other imbalanced preprocessing approaches. This
advancement in credit risk classification lays a solid foundation for understanding individual
financial behaviours, a crucial first step in the broader context of financial risk management.
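As a rough illustration of the cost-sensitive step described above, the sketch below fits a logistic regression by gradient descent on a class-weighted log-loss. It is a generic formulation, not the thesis's exact cost scheme; the function names, the cost values, and the one-feature toy data are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_cost_sensitive_logreg(X, y, cost_pos=5.0, cost_neg=1.0, lr=0.1, epochs=500):
    """Gradient descent on a class-weighted log-loss.
    Errors on the positive (default) class are scaled by cost_pos,
    errors on the negative class by cost_neg."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            weight = cost_pos if yi == 1 else cost_neg
            err = weight * (p - yi)  # weighted gradient of the log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of the positive (default) class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical imbalanced data: five non-defaults, one default
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]

w_cs, b_cs = fit_cost_sensitive_logreg(X, y, cost_pos=5.0)
w_plain, b_plain = fit_cost_sensitive_logreg(X, y, cost_pos=1.0)
p_cs = predict_proba(w_cs, b_cs, [0.9])
p_plain = predict_proba(w_plain, b_plain, [0.9])
```

Comparing `p_cs` with `p_plain` shows the effect of the weighting: penalising missed defaults more heavily pushes the model towards higher default probabilities for the rare class.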
Building on this foundation, the thesis then explores the forecasting of financial asset volatility, a critical aspect of understanding market dynamics. A novel model that combines a
Triple Discriminator Generative Adversarial Network with a continuous wavelet transform
is proposed. The proposed model has the ability to decompose volatility time series into
signal-like and noise-like frequency components, to allow the separate detection and monitoring of non-stationary volatility data. The network comprises a wavelet transform
component consisting of continuous wavelet transforms and inverse wavelet transform components, an auto-encoder component made up of encoder and decoder networks, and a
Generative Adversarial Network consisting of triple Discriminator and Generator networks.
The proposed Generative Adversarial Network employs an ensemble of unsupervised loss derived from the Generative Adversarial Network component during training, supervised
loss and reconstruction loss as part of its framework. Data from nine financial assets are
employed to demonstrate the effectiveness of the proposed model. This approach not only
enhances our understanding of market fluctuations but also bridges the gap between individual credit risk assessment and macro-level market analysis.
Finally, the thesis ends with a novel technique for portfolio optimisation. This involves the use of a model-free reinforcement learning strategy for portfolio
optimisation using historical Low, High, and Close prices of assets as input with weights of
assets as output. A deep Capsules Network is employed to simulate the investment strategy, which involves the reallocation of the different assets to maximise the expected return
on investment based on deep reinforcement learning. To provide more learning stability in
an online training process, a Markov Differential Sharpe Ratio reward function has been
proposed as the reinforcement learning objective function. Additionally, a Multi-Memory
Weight Reservoir has also been introduced to facilitate the learning process and optimisation of computed asset weights, helping to sequentially re-balance the portfolio throughout
a specified trading period. Incorporating the insights gained from volatility forecasting into
this strategy shows the interconnected nature of the financial markets. Comparative experiments with other models demonstrated that our proposed technique is capable of achieving
superior results based on risk-adjusted reward performance measures.
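For background on Sharpe-ratio-based rewards, the sketch below computes the standard differential Sharpe ratio of Moody and Saffell, which gives one reward per time step from exponential moving averages of returns and squared returns. It is not the Markov variant proposed in the thesis, whose definition is not given here; the function name, the adaptation rate `eta`, and the `eps` guard are assumptions made for the example.

```python
def differential_sharpe(returns, eta=0.05, eps=1e-12):
    """Per-step differential Sharpe ratio.

    a and b are exponential moving averages of the returns and of the
    squared returns. The reward at step t is
        (b * delta_a - 0.5 * a * delta_b) / (b - a**2) ** 1.5
    evaluated with the averages from step t-1.
    """
    a, b = 0.0, 0.0
    rewards = []
    for r in returns:
        delta_a = r - a
        delta_b = r * r - b
        var = b - a * a  # moving estimate of the return variance
        if var > eps:
            d = (b * delta_a - 0.5 * a * delta_b) / var ** 1.5
        else:
            d = 0.0  # reward left at zero until the variance estimate is positive
        rewards.append(d)
        a += eta * delta_a
        b += eta * delta_b
    return rewards


# Hypothetical per-period portfolio returns
rewards = differential_sharpe([0.01, 0.01, 0.0, -0.02, 0.015])
```

Because each reward depends only on the current return and two running averages, it can be evaluated online, which is why this family of rewards suits sequential re-balancing.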
In a nutshell, this thesis not only addresses individual challenges in financial risk management but also incorporates them into a comprehensive framework: from enhancing the
accuracy of credit risk classification, through the improvement and understanding of market
volatility, to optimisation of investment strategies. These methodologies collectively show
the potential of the use of machine learning to improve financial risk management
Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach
Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose different column-reduction methods such as multi-time-scale Time Series analysis for fraud forecasting, using binary labeled imbalanced datasets and hybrid filter-wrapper feature selection approaches
A Hybrid Ensemble Method for Multiclass Classification and Outlier Detection
The multiclass problem has continued to be an active research area due to the challenges posed by imbalanced datasets and the lack of a unifying classification algorithm. Real-world problems are of a multiclass nature with skewed representations. The study focused on the challenges of multiclass classification. Multiclass datasets were adopted from the UCI Machine Learning Repository. The research developed a heterogeneous ensemble model for multiclass classification and outlier detection that combined several strategies and ensemble techniques. Preprocessing involved filtering global outliers and resampling datasets using the synthetic minority oversampling technique (SMOTE) algorithm. Dataset binarization was done using the One-vs-One decomposition technique. The heterogeneous ensemble model was constructed using AdaBoost and random subspace algorithms, with random forest as the base classifier. The classifiers built were combined using the average-of-probabilities voting rule and evaluated using 10-fold stratified cross-validation. The model showed better performance in terms of outlier detection and classification prediction for the multiclass problem. The model outperformed other commonly used classical algorithms. The study findings established that proper preprocessing and decomposition of multiclass problems results in improved performance on minority outlier classes while safeguarding the integrity of the majority classes
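Two of the building blocks named in this abstract, One-vs-One decomposition and the average-of-probabilities voting rule, can be sketched in a few lines. The helper names, class labels, and probability vectors below are hypothetical and only illustrate the mechanics; they are not the study's implementation or data.

```python
from itertools import combinations


def one_vs_one_pairs(classes):
    """One-vs-One decomposition: one binary problem per unordered class pair."""
    return list(combinations(classes, 2))


def average_of_probabilities(prob_vectors):
    """Average the class-probability vectors of several classifiers and
    return (index of the winning class, averaged vector)."""
    n_classes = len(prob_vectors[0])
    avg = [
        sum(p[c] for p in prob_vectors) / len(prob_vectors)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg


# Four classes give 4 * 3 / 2 = 6 binary sub-problems
pairs = one_vs_one_pairs(["a", "b", "c", "d"])

# Hypothetical outputs of three ensemble members for one sample (three classes)
label, avg = average_of_probabilities([
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.30, 0.45, 0.25],
])
```

In this example the first classifier alone would pick class 0, but averaging lets the two others outvote it, so the combined prediction is class 1.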
The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management. In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for
various applications, promoting environmental sustainability and good resource management.
However, their production continues to be a challenging task. There are various factors
that contribute towards the difficulty of generating accurate, timely updated LULC maps,
whether via automatic or photo-interpreted LULC mapping. Data preprocessing, being a
crucial step for any Machine Learning task, is particularly important in the remote sensing
domain due to the overwhelming amount of raw, unlabeled data continuously gathered
from multiple remote sensing missions. However, a significant part of the state-of-the-art
focuses on scenarios with full access to labeled training data with relatively balanced class
distributions. This thesis focuses on the challenges found in automatic LULC classification
tasks, specifically in data preprocessing tasks. We focus on the development of novel
Active Learning (AL) and imbalanced learning techniques, to improve ML performance in
situations with limited training data and/or the existence of rare classes. We also show
that many of the contributions presented are not only successful in remote sensing problems,
but also in various other multidisciplinary classification problems. The work presented
in this thesis used open access datasets to test the contributions made in imbalanced
learning and AL. All the data pulling, preprocessing and experiments are made available at
https://github.com/joaopfonseca/publications. The algorithmic implementations are made
available in the Python package ml-research at https://github.com/joaopfonseca/ml-research
Transcriptomics in Toxicogenomics, Part III: Data Modelling for Risk Assessment
Transcriptomics data are relevant to address a number of challenges in Toxicogenomics (TGx). After careful planning of exposure conditions and data preprocessing, the TGx data can be used in predictive toxicology, where more advanced modelling techniques are applied. The large volume of molecular profiles produced by omics-based technologies allows the development and application of artificial intelligence (AI) methods in TGx. Indeed, the publicly available omics datasets are constantly increasing together with a plethora of different methods that are made available to facilitate their analysis, interpretation and the generation of accurate and stable predictive models. In this review, we present the state-of-the-art of data modelling applied to transcriptomics data in TGx. We show how the benchmark dose (BMD) analysis can be applied to TGx data. We review read across and adverse outcome pathways (AOP) modelling methodologies. We discuss how network-based approaches can be successfully employed to clarify the mechanism of action (MOA) or specific biomarkers of exposure. We also describe the main AI methodologies applied to TGx data to create predictive classification and regression models and we address current challenges. Finally, we present a short description of deep learning (DL) and data integration methodologies applied in these contexts. Modelling of TGx data represents a valuable tool for more accurate chemical safety assessment. This review is the third part of a three-article series on Transcriptomics in Toxicogenomics
Ensemble methods for meningitis aetiology diagnosis
In this work, we explore data-driven techniques for the fast and early diagnosis of the aetiological origin of meningitis, more specifically with regard to differentiating between viral and bacterial meningitis. We study how machine learning can be used to predict meningitis aetiology once a patient has been diagnosed with this disease. We have a dataset of 26,228 patients described by 19 attributes, mainly about the patient's observable symptoms and the early results of the cerebrospinal fluid analysis. Using this dataset, we have explored several techniques of dataset sampling, feature selection and classification models based both on ensemble methods and on simple techniques (mainly, decision trees). Experiments with 27 classification models (19 of them involving ensemble methods) have been conducted for this paper. Our main finding is that the combination of ensemble methods with decision trees leads to the best meningitis aetiology classifiers. The best performance indicator values (precision, recall and f-measure of 89% and an AUC value of 95%) have been achieved by the synergy between bagging and NBTrees. Nonetheless, our results also suggest that the combination of ensemble methods with certain decision trees clearly improves diagnostic performance compared with that obtained using only the corresponding decision tree. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. We would like to thank the Health Department of the Brazilian Government for providing the dataset and for authorizing its use in this study. We would also like to express our gratitude to the reviewers for their thoughtful comments and efforts towards improving our manuscript. Funding for open access charge: Universidad de Málaga / CBUA
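The performance indicators quoted above (precision, recall and f-measure) follow directly from confusion-matrix counts. The sketch below shows the arithmetic; the counts are hypothetical values chosen so that all three indicators come out at 0.89, and they are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure (F1) from true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for one class (e.g. bacterial meningitis)
precision, recall, f1 = precision_recall_f1(tp=89, fp=11, fn=11)
```

The F-measure is the harmonic mean of precision and recall, so it equals them whenever the two coincide, as in this example.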