Search CORE

2,170 research outputs found

On the suitability of resampling techniques for the class imbalance problem in credit scoring

Author: A I Marqués
Abrahams CR
Chawla NV
Demšar J
Hochberg Y
J S Sánchez
Japkowicz N
Pluto K
Thomas LC
V García
Vinciotti V
Yen S-J
Zar JH
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

In real-life credit scoring applications, the case in which the class of defaulters is under-represented in comparison with the class of non-defaulters is a very common situation, but it has still received little attention. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have artificially been modified to derive different imbalance ratios (proportion of defaulters and non-defaulters examples). Experimental results demonstrate that the use of resampling methods consistently improves the performance given by the original imbalanced data. Besides, it is also important to note that in general, over-sampling techniques perform better than any under-sampling approach.This work has partially been supported by the Spanish Ministry of Education and Science under grant TIN2009– 14205 and the Generalitat Valenciana under grant PROMETEO/2010/ 028

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositori Institucional de la Universitat Jaume I

An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis

Author: Chao Xiangrui
Fernández Hilario Alberto Luis
Publication venue: 'Elsevier BV'
Publication date: 22/06/2022
Field of study

Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics have a significant impact on the performance of imbalanced classifiers, which are generally neglected by existing evaluation methods. The objective of this study is to introduce a new criterion to comprehensively evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve that is established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefits of improved minority class accuracy and the cost of reduced majority class accuracy. In sequence, we analyze the impact of the imbalanced ratio and typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced data reveal that traditional classifiers such as C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective for overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically when the imbalanced ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps to select appropriate classifiers for imbalanced data based on data characteristics.National Natural Science Foundation of China (NSFC) 71874023 71725001 71771037 7197104

Repositorio Institucional Universidad de Granada

A New Improved Prediction of Software Defects Using Machine Learning-based Boosting Techniques with NASA Dataset

Author: Goyal Jayanti
Sinha Ripu Ranjan
Publication venue: Auricle Global Society of Education and Research
Publication date: 07/10/2023
Field of study

Predicting when and where bugs will appear in software may assist improve quality and save on software testing expenses. Predicting bugs in individual modules of software by utilizing machine learning methods. There are, however, two major problems with the software defect prediction dataset: Social stratification (there are many fewer faulty modules than non-defective ones), and noisy characteristics (a result of irrelevant features) that make accurate predictions difficult. The performance of the machine learning model will suffer greatly if these two issues arise. Overfitting will occur, and biassed classification findings will be the end consequence. In this research, we suggest using machine learning approaches to enhance the usefulness of the CatBoost and Gradient Boost classifiers while predicting software flaws. Both the Random Over Sampler and Mutual info classification methods address the class imbalance and feature selection issues inherent in software fault prediction. Eleven datasets from NASA's data repository, "Promise," were utilised in this study. Using 10-fold cross-validation, we classified these 11 datasets and found that our suggested technique outperformed the baseline by a significant margin. The proposed methods have been evaluated based on their abilities to anticipate software defects using the most important indices available: Accuracy, Precision, Recall, F1 score, ROC values, RMSE, MSE, and MAE parameters. For all 11 datasets evaluated, the suggested methods outperform baseline classifiers by a significant margin. We tested our model to other methods of flaw identification and found that it outperformed them all. The computational detection rate of the suggested model is higher than that of conventional models, as shown by the experiments.

International Journal on Recent and Innovation Trends in Computing and Communication

Predictive Framework for Imbalance Dataset

Author: Megat Norulazmi Megat Mohamed Noor
Publication venue
Publication date: 01/01/2012
Field of study

The purpose of this research is to seek and propose a new predictive maintenance framework which can be used to generate a prediction model for deterioration of process materials. Real yield data which was obtained from Fuji Electric Malaysia has been used in this research. The existing data pre-processing and classification methodologies have been adapted in this research. Properties of the proposed framework include; developing an approach to correlate materials defects, developing an approach to represent data attributes features, analyzing various ratio and types of data re-sampling, analyzing the impact of data dimension reduction for various data size, and partitioning data size and algorithmic schemes against the prediction performance. Experimental results suggested that the class probability distribution function of a prediction model has to be closer to a training dataset; less skewed environment enable learning schemes to discover better function F in a bigger Fall space within a higher dimensional feature space, data sampling and partition size is appear to proportionally improve the precision and recall if class distribution ratios are balanced. A comparative study was also conducted and showed that the proposed approaches have performed better. This research was conducted based on limited number of datasets, test sets and variables. Thus, the obtained results are applicable only to the study domain with selected datasets. This research has introduced a new predictive maintenance framework which can be used in manufacturing industries to generate a prediction model based on the deterioration of process materials. Consequently, this may allow manufactures to conduct predictive maintenance not only for equipments but also process materials. The major contribution of this research is a step by step guideline which consists of methods/approaches in generating a prediction for process materials

Universiti Utara Malaysia: UUM eTheses

Automating Fault Detection and Quality Control in PCBs: A Machine Learning Approach to Handle Imbalanced Data

Author: Mirzaei Mehrnaz
Publication venue
Publication date: 12/09/2023
Field of study

Printed Circuit Boards (PCBs) are fundamental to the operation of a wide array of electronic devices, from consumer electronics to sophisticated industrial machinery. Given this pivotal role, quality control and fault detection are especially significant, as they are essential for ensuring the devices' long-term reliability and efficiency. To address this, the thesis explores advancements in fault detection and quality control methods for PCBs, with a focus on Machine Learning (ML) and Deep Learning (DL) techniques. The study begins with an in-depth review of traditional approaches like visual and X-ray inspections, then delves into modern, data-driven methods, such as automated anomaly detection in PCB manufacturing using tabular datasets. The core of the thesis is divided into three specific tasks: firstly, applying ML and DL models for anomaly detection in PCBs, particularly focusing on solder-pasting issues and the challenges posed by imbalanced datasets; secondly, predicting human inspection labels through specially designed tabular models like TabNet; and thirdly, implementing multi-classification methods to automate repair labeling on PCBs. The study is structured to offer a comprehensive view, beginning with background information, followed by the methodology and results of each task, and concluding with a summary and directions for future research. Through this systematic approach, the research not only provides new insights into the capabilities and limitations of existing fault detection techniques but also sets the stage for more intelligent and efficient systems in PCB manufacturing and quality control

Concordia University Research Repository

Rails Quality Data Modelling via Machine Learning-Based Paradigms

Author: Zughrat Ali
Publication venue: 'University of Sheffield Conference Proceedings'
Publication date: 01/06/2015
Field of study

White Rose E-theses Online

Customer Shopping Behavior Analysis Using RFID and Machine Learning Models

Author: Alfian Ganjar
Atmaji Fransiskus Tatas Dwi
Farooq Umar
Fitriyani Norma Latif
Hilmy Farhan Mufti
Nguyen Dat Tien
Nurhaliza Rachma Aurya
Octava Muhammad Qois Huzyan
Putri Divi Galih Prasetyo
Saputra Yuris Mulya
Syafrudin Muhammad
Syahrian Firma
Publication venue
Publication date: 08/10/2023
Field of study

Coventry University Pure Portal

Ensemble diversity for class imbalance learning

Author: Wang Shuo
Publication venue
Publication date: 01/01/2011
Field of study

This thesis studies the diversity issue of classification ensembles for class imbalance learning problems. Class imbalance learning refers to learning from imbalanced data sets, in which some classes of examples (minority) are highly under-represented comparing to other classes (majority). The very skewed class distribution degrades the learning ability of many traditional machine learning methods, especially in the recognition of examples from the minority classes, which are often deemed to be more important and interesting. Although quite a few ensemble learning approaches have been proposed to handle the problem, no in-depth research exists to explain why and when they can be helpful. Our objectives are to understand how ensemble diversity affects the classification performance for a class imbalance problem according to single-class and overall performance measures, and to make best use of diversity to improve the performance. As the first stage, we study the relationship between ensemble diversity and generalization performance for class imbalance problems. We investigate mathematical links between single-class performance and ensemble diversity. It is found that how the single-class measures change along with diversity falls into six different situations. These findings are then verified in class imbalance scenarios through empirical studies. The impact of diversity on overall performance is also investigated empirically. Strong correlations between diversity and the performance measures are found. Diversity shows a positive impact on the recognition of the minority class and benefits the overall performance of ensembles in class imbalance learning. Our results help to understand if and why ensemble diversity can help to deal with class imbalance problems. Encouraged by the positive role of diversity in class imbalance learning, we then focus on a specific ensemble learning technique, the negative correlation learning (NCL) algorithm, which considers diversity explicitly when creating ensembles and has achieved great empirical success. We propose a new learning algorithm based on the idea of NCL, named AdaBoost.NC, for classification problems. An ``ambiguity" term decomposed from the 0-1 error function is introduced into the training framework of AdaBoost. It demonstrates superiority in both effectiveness and efficiency. Its good generalization performance is explained by theoretical and empirical evidences. It can be viewed as the first NCL algorithm specializing in classification problems. Most existing ensemble methods for class imbalance problems suffer from the problems of overfitting and over-generalization. To improve this situation, we address the class imbalance issue by making use of ensemble diversity. We investigate the generalization ability of NCL algorithms, including AdaBoost.NC, to tackle two-class imbalance problems. We find that NCL methods integrated with random oversampling are effective in recognizing minority class examples without losing the overall performance, especially the AdaBoost.NC tree ensemble. This is achieved by providing smoother and less overfitting classification boundaries for the minority class. The results here show the usefulness of diversity and open up a novel way to deal with class imbalance problems. Since the two-class imbalance is not the only scenario in real-world applications, multi-class imbalance problems deserve equal attention. To understand what problems multi-class can cause and how it affects the classification performance, we study the multi-class difficulty by analyzing the multi-minority and multi-majority cases respectively. Both lead to a significant performance reduction. The multi-majority case appears to be more harmful. The results reveal possible issues that a class imbalance learning technique could have when dealing with multi-class tasks. Following this part of analysis and the promising results of AdaBoost.NC on two-class imbalance problems, we apply AdaBoost.NC to a set of multi-class imbalance domains with the aim of solving them effectively and directly. Our method shows good generalization in minority classes and balances the performance across different classes well without using any class decomposition schemes. Finally, we conclude this thesis with how the study has contributed to class imbalance learning and ensemble learning, and propose several possible directions for future research that may improve and extend this work

University of Birmingham Research Archive, E-theses Repository

OpenGrey Repository

Towards Enhancement of Machine Learning Techniques Using CSE-CIC-IDS2018 Cybersecurity Dataset

Author: Ravikumar Dharshini
Publication venue: RIT Scholar Works
Publication date: 01/01/2021
Field of study

In machine learning, balanced datasets play a crucial role in the bias observed towards classification and prediction. The CSE-CIC IDS datasets published in 2017 and 2018 have both attracted considerable scholarly attention towards research in intrusion detection systems. Recent work published using this dataset indicates little attention paid to the imbalance of the dataset. The study presented in this paper sets out to explore the degree to which imbalance has been treated and provide a taxonomy of the machine learning approaches developed using these datasets. A survey of published works related to these datasets was done to deliver a combined qualitative and quantitative methodological approach for our analysis towards deriving a taxonomy. The research presented here confirms that the impact of bias due to the imbalance datasets is rarely addressed. This data supports further research and development of supervised machine learning techniques that reduce bias in classification or prediction due to these imbalance datasets. This study\u27s experiment is to train the model using the train, and test split function from sci-kit learn library on the CSE-CIC-IDS2018. The system needs to be trained by a learning algorithm to accomplish this. There are many machine learning algorithms available and presented by the literature. Among which there are three types of classification based Supervised ML techniques which are used in our study: 1) KNN, 2) Random Forest (RF) and 3) Logistic Regression (LR). This experiment also determines how each of the dataset\u27s 67 preprocessed features affects the ML model\u27s performance. Feature drop selection is performed in two ways, independent and group drop. Experimental results generate the threshold values for each classifier and performance metric values such as accuracy, precision, recall, and F1-score. Also, results are generated from the comparison of manual feature drop methods. A good amount of drop is noticed in the group for most of the classifiers

RIT Scholar Works