
    Improving Risk Predictions by Preprocessing Imbalanced Credit Data

    Imbalanced credit data sets refer to databases in which the class of defaulters is heavily under-represented in comparison to the class of non-defaulters. This is a very common situation in real-life credit scoring applications, but it has so far received little attention. This paper investigates whether data resampling can be used to improve the performance of learners built from imbalanced credit data sets, and whether the effectiveness of resampling is related to the type of classifier. Experimental results demonstrate that learning with the resampled sets consistently outperforms the use of the original imbalanced credit data, independently of the classifier used.
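    The paper's point is procedural: resample the training data first, then fit any classifier. Below is a minimal sketch of that recipe using imbalanced-learn's SMOTE as a stand-in for the resampling methods evaluated; the synthetic data, the logistic-regression learner, and all parameters are illustrative assumptions, not the study's setup.

        # Hedged sketch: resample an imbalanced "credit" training set, then train.
        # SMOTE stands in for the paper's resampling methods; data is synthetic.
        from imblearn.over_sampling import SMOTE
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        # 5% "defaulters" vs. 95% "non-defaulters", mimicking credit-scoring imbalance
        X, y = make_classification(n_samples=5000, n_features=20,
                                   weights=[0.95, 0.05], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        # Resample only the training split; the test split keeps the real imbalance
        X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))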

    Predicting Pancreatic Cancer Using Support Vector Machine

    This report presents an approach to predicting pancreatic cancer with a Support Vector Machine (SVM) classification algorithm. The research objective of this project is to predict pancreatic cancer from genomic data alone, clinical data alone, and a combination of genomic and clinical data. We used real genomic data with 22,763 samples and 154 features per sample. We also created synthetic clinical data with 400 samples and 7 features per sample in order to assess prediction with clinical data alone. To validate the hypothesis, we combined the synthetic clinical data with a subset of features from the real genomic data. In our results, we observed that prediction accuracy, precision, and recall with genomic data alone are 80.77%, 20%, and 4%; with synthetic clinical data alone they are 93.33%, 95%, and 30%; and for the combination of real genomic and synthetic clinical data they are 90.83%, 10%, and 5%. The combination of real genomic and synthetic clinical data decreased accuracy, since the genomic data is weakly correlated. We therefore conclude that the combination of genomic and clinical data does not improve pancreatic cancer prediction accuracy. A dataset with more significant genomic features might help to predict pancreatic cancer more accurately.
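    A minimal sketch of the report's classification setup: an RBF-kernel SVM evaluated on accuracy, precision, and recall. The generated data merely mimics the shape of the synthetic clinical set (400 samples, 7 features); it is not the report's data, and the scaling and kernel choices are assumptions.

        # Hedged sketch: SVM classification scored on accuracy/precision/recall.
        # Data is a synthetic placeholder shaped like the report's clinical set.
        from sklearn.datasets import make_classification
        from sklearn.metrics import accuracy_score, precision_score, recall_score
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        X, y = make_classification(n_samples=400, n_features=7,
                                   weights=[0.9, 0.1], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        # Standardize features before the RBF kernel, a common SVM practice
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        print(accuracy_score(y_te, pred),
              precision_score(y_te, pred, zero_division=0),
              recall_score(y_te, pred))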

    QAmplifyNet: Pushing the Boundaries of Supply Chain Backorder Prediction Using Interpretable Hybrid Quantum-Classical Neural Network

    Supply chain management relies on accurate backorder prediction for optimizing inventory control, reducing costs, and enhancing customer satisfaction. However, traditional machine-learning models struggle with large-scale datasets and complex relationships, hindering real-world data collection. This research introduces a novel methodological framework for supply chain backorder prediction, addressing the challenge of handling large datasets. Our proposed model, QAmplifyNet, employs quantum-inspired techniques within a quantum-classical neural network to predict backorders effectively on short and imbalanced datasets. Experimental evaluations on a benchmark dataset demonstrate QAmplifyNet's superiority over classical models, quantum ensembles, quantum neural networks, and deep reinforcement learning. Its proficiency in handling short, imbalanced datasets makes it an ideal solution for supply chain management. To enhance model interpretability, we use Explainable Artificial Intelligence techniques. Practical implications include improved inventory control, reduced backorders, and enhanced operational efficiency. QAmplifyNet seamlessly integrates into real-world supply chain management systems, enabling proactive decision-making and efficient resource allocation. Future work involves exploring additional quantum-inspired techniques, expanding the dataset, and investigating other supply chain applications. This research unlocks the potential of quantum computing in supply chain optimization and paves the way for further exploration of quantum-inspired machine learning models in supply chain management. Our framework and QAmplifyNet model offer a breakthrough approach to supply chain backorder prediction, providing superior performance and opening new avenues for leveraging quantum-inspired techniques in supply chain management.
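    The QAmplifyNet architecture itself is not reproduced here. As a loose illustration of the hybrid quantum-classical idea, the sketch below uses PennyLane (an assumed dependency) to embed features into a small parameterized circuit and feeds the measured expectation values to a classical read-out model; the circuit shape, toy data, and every name are assumptions, not the paper's design.

        # Hedged sketch of a hybrid quantum-classical classifier (not QAmplifyNet).
        # Requires PennyLane and scikit-learn.
        import numpy as np
        import pennylane as qml
        from sklearn.linear_model import LogisticRegression

        n_qubits = 4
        dev = qml.device("default.qubit", wires=n_qubits)

        @qml.qnode(dev)
        def circuit(inputs, weights):
            # Encode classical features as rotation angles, then entangle
            qml.AngleEmbedding(inputs, wires=range(n_qubits))
            qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
            return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

        rng = np.random.default_rng(0)
        weights = rng.uniform(0, 2 * np.pi, size=(2, n_qubits))  # 2 layers
        X = rng.uniform(0, np.pi, size=(64, n_qubits))  # toy feature vectors
        y = rng.integers(0, 2, size=64)                 # toy backorder labels

        # Quantum feature map feeding a classical read-out layer
        Q = np.array([circuit(x, weights) for x in X])
        clf = LogisticRegression().fit(Q, y)
        print("train accuracy:", clf.score(Q, y))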

    Credit Risk Analysis in a Peer-to-Peer Lending Dataset: Lending Club

    This project studies the classification variable 'default' in the Peer-to-Peer lending dataset known as Lending Club. The project improves on existing work in terms of accuracy, F1 measure, precision, recall, and root mean squared error. We explored balancing techniques such as oversampling the minority class, undersampling the majority class, and random forests with balanced bootstraps. We also analyzed and proposed new features that improve learner performance.
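    Of the balancing techniques listed, "random forests with balanced bootstraps" has a direct library counterpart. A minimal sketch using imbalanced-learn's BalancedRandomForestClassifier follows; the stand-in data and parameters are assumptions, not the Lending Club features.

        # Hedged sketch: random forest with balanced bootstraps, via imbalanced-learn.
        # Synthetic data stands in for the Lending Club 'default' problem.
        from imblearn.ensemble import BalancedRandomForestClassifier
        from sklearn.datasets import make_classification
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=10000, n_features=25,
                                   weights=[0.85, 0.15], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        # Each tree trains on a bootstrap sample balanced by undersampling the majority
        brf = BalancedRandomForestClassifier(n_estimators=200, random_state=0)
        brf.fit(X_tr, y_tr)
        print(classification_report(y_te, brf.predict(X_te)))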

    Credit Risk Scoring: A Stacking Generalization Approach

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Risk Analysis and Management.
    Credit risk regulation has been receiving tremendous attention as a result of the effects of the latest global financial crisis. According to the developments made in the Internal Rating Based approach under the Basel guidelines, banks are allowed to use internal risk measures as key drivers to assess the possibility of granting a loan to an applicant. Credit scoring is a statistical approach used for evaluating potential loan applications in both financial and banking institutions. When applying for a loan, an applicant must fill out an application form detailing their characteristics (e.g., income, marital status, and loan purpose) that serve as inputs to a credit scoring model, which produces a score used to determine whether a loan should be granted. This enables faster and more consistent credit approvals and the reduction of bad debt. Currently, many machine learning and statistical approaches, such as logistic regression and tree-based algorithms, have been used individually for credit scoring models. Newer contemporary machine learning techniques can outperform classic methods by simply combining models. This dissertation is an empirical study on a publicly available bank loan dataset to study banking loan default, using ensemble-based techniques to increase model robustness and predictive power. The proposed ensemble method is based on stacking generalization, an extension of various preceding studies that used different techniques to further enhance the model's predictive capabilities. The results show that combining different models provides a great deal of flexibility to credit scoring models.
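    A minimal stacked-generalization sketch in scikit-learn: base learners produce out-of-fold predictions, and a meta-learner combines them. The dissertation's actual base learners, meta-learner, and dataset are not specified here, so these choices are illustrative assumptions.

        # Hedged sketch of stacking generalization for a loan-default classifier.
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier, StackingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=3000, n_features=15,
                                   weights=[0.8, 0.2], random_state=0)

        stack = StackingClassifier(
            estimators=[("rf", RandomForestClassifier(random_state=0)),
                        ("dt", DecisionTreeClassifier(random_state=0))],
            final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
            cv=5,  # out-of-fold base predictions feed the meta-learner
        )
        print("AUC:", cross_val_score(stack, X, y, scoring="roc_auc").mean())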

    A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces

    Mutemi, A., & Bacao, F. (2023). A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces. Scientific Reports, 13(1), 1-16. https://doi.org/10.1038/s41598-023-38304-5
    Organized retail crime (ORC) is a significant issue for retailers, marketplace platforms, and consumers. Its prevalence and influence have increased fast in lockstep with the expansion of online commerce, digital devices, and communication platforms. Today, it is a costly affair, wreaking havoc on enterprises' overall revenues and continually jeopardizing community security. These negative consequences are set to rocket to unprecedented heights as more people and devices connect to the Internet. Detecting and responding to these acts as early as possible is critical for protecting consumers and businesses while also keeping an eye on rising patterns and fraud. The issue of detecting fraud in general has been studied widely, especially in financial services, but studies focusing on organized retail crimes are extremely rare in the literature. To contribute to the knowledge base in this area, we present a scalable machine learning strategy for detecting and isolating ORC listings on a prominent marketplace platform by merchants committing organized retail crimes or fraud. We employ a supervised learning approach to classify postings as fraudulent or real based on past data from buyer and seller behaviors and transactions on the platform. The proposed framework combines bespoke data preprocessing procedures, feature selection methods, and state-of-the-art class asymmetry resolution techniques to search for aligned classification algorithms capable of discriminating between fraudulent and legitimate listings in this context. Our best detection model obtains a recall score of 0.97 on the holdout set and 0.94 on the out-of-sample testing data set. We achieve these results based on a select set of 45 features out of 58.
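    A hedged sketch of the general recipe the abstract describes: preprocessing, feature selection down to 45 of 58 features, class-asymmetry resolution, and a classifier scored on recall, chained in one imbalanced-learn pipeline. The concrete components below are assumptions, not the authors' configuration.

        # Hedged sketch: preprocessing + feature selection + imbalance handling
        # in one pipeline, evaluated on recall (the paper's headline metric).
        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline
        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.preprocessing import StandardScaler

        X, y = make_classification(n_samples=5000, n_features=58,
                                   n_informative=20, weights=[0.97, 0.03],
                                   random_state=0)

        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("select", SelectKBest(f_classif, k=45)),  # 45 of 58, as in the abstract
            ("balance", SMOTE(random_state=0)),        # class-asymmetry resolution
            ("clf", GradientBoostingClassifier(random_state=0)),
        ])
        print("recall:", cross_val_score(pipe, X, y, scoring="recall").mean())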

    Evaluating Sampling Techniques for Healthcare Insurance Fraud Detection in Imbalanced Dataset

    Detecting fraud in healthcare insurance datasets is challenging due to severe class imbalance, where fraud cases are rare compared to non-fraud cases. Various techniques have been applied to address this problem, such as oversampling and undersampling methods; however, there is a lack of comparison and evaluation of these sampling methods. The research contribution of this study is therefore a comprehensive evaluation of the different sampling methods under different class distributions, utilizing multiple evaluation metrics, including Precision and Recall. In addition, a model evaluation approach is proposed to address the issue of inconsistent scores across different metrics. This study employs a real-world dataset with the XGBoost algorithm utilized alongside widely used data sampling techniques such as Random Oversampling and Undersampling, SMOTE, and Instance Hardness Threshold. Results indicate that Random Oversampling and Undersampling perform well at the 50% class distribution, while SMOTE and the Instance Hardness Threshold method are more effective at the 70% distribution, and Instance Hardness Threshold performs best at the 90% distribution. The 70% distribution is most robust with SMOTE and Instance Hardness Threshold, particularly in terms of consistent scores across metrics, although these methods have longer computation times. These models consistently performed well across all evaluation metrics, indicating their ability to generalize to new unseen data in both the minority and majority classes. The study also identifies key features, such as costs, diagnosis codes, type of healthcare service, gender, and severity level of diseases, which are important for accurate healthcare insurance fraud detection. These findings could be valuable for healthcare providers to make informed decisions with lower risks. A well-performing fraud detection model ensures the accurate classification of fraud and non-fraud cases, and the findings can also be used by healthcare insurance providers to develop more effective fraud detection and prevention strategies.
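    A hedged sketch of the comparison setup: the four samplers named in the abstract, each feeding XGBoost, scored on precision and recall. How the study's 50%/70%/90% class distributions map to sampling_strategy values is my assumption; sampling_strategy=1.0 below targets a balanced (50/50) training set, and the data is synthetic.

        # Hedged sketch comparing the study's sampling methods with XGBoost.
        from imblearn.over_sampling import RandomOverSampler, SMOTE
        from imblearn.under_sampling import (InstanceHardnessThreshold,
                                             RandomUnderSampler)
        from sklearn.datasets import make_classification
        from sklearn.metrics import precision_score, recall_score
        from sklearn.model_selection import train_test_split
        from xgboost import XGBClassifier

        X, y = make_classification(n_samples=8000, weights=[0.98, 0.02],
                                   random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        samplers = {
            "random_over": RandomOverSampler(sampling_strategy=1.0, random_state=0),
            "random_under": RandomUnderSampler(sampling_strategy=1.0, random_state=0),
            "smote": SMOTE(sampling_strategy=1.0, random_state=0),
            "iht": InstanceHardnessThreshold(random_state=0),
        }
        for name, sampler in samplers.items():
            X_res, y_res = sampler.fit_resample(X_tr, y_tr)
            model = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)
            pred = model.predict(X_te)
            print(name, precision_score(y_te, pred, zero_division=0),
                  recall_score(y_te, pred))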

    Explainable product backorder prediction exploiting CNN: Introducing explainable models in businesses

    Due to expected positive impacts on business, the application of artificial intelligence has widely increased. The decision-making procedures of these models are often complex and not easily understandable to a company's stakeholders, i.e., the people who have to follow up on recommendations or make sense of a system's automated decisions. This opaqueness and black-box nature might hinder adoption, as users struggle to make sense of and trust the predictions of AI models. Recent research on eXplainable Artificial Intelligence (XAI) has focused mainly on explaining models to AI experts for the purpose of debugging and improving model performance. In this article, we explore how such systems could be made explainable to stakeholders. To do so, we propose a new convolutional neural network (CNN)-based explainable predictive model for product backorder prediction in inventory management. Backorders are orders that customers place for products that are currently not in stock. The company then takes the risk of producing or acquiring the backordered products, while customers may cancel their orders in the meantime if fulfillment takes too long, leaving the company with unsold items in its inventory. Hence, for strategic inventory management, companies need to make decisions based on assumptions. Our argument is that these tasks can be improved by offering explanations for AI recommendations. Our research therefore investigates how such explanations could be provided, employing Shapley additive explanations to explain the models' overall priorities in decision-making. Besides that, we introduce locally interpretable surrogate models that can explain any individual prediction of a model. The experimental results demonstrate effectiveness in predicting backorders in terms of standard evaluation metrics, outperforming known related work with an AUC of 0.9489. Our approach demonstrates how current limitations of predictive technologies can be addressed in the business domain.
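    A hedged sketch of the explanation layer only: SHAP values for a global view of feature priority plus a single-instance (local) explanation, which is the same role the article's surrogate models play per prediction. The article's CNN is replaced here by a small gradient-boosted model purely to keep the example self-contained; all data and choices are assumptions.

        # Hedged sketch: global and local SHAP explanations for a stand-in model.
        # Requires the shap package; a tree model replaces the article's CNN.
        import shap
        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier

        X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
        model = GradientBoostingClassifier(random_state=0).fit(X, y)

        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)

        # Global view: mean absolute SHAP value ranks feature priority overall
        print(abs(shap_values).mean(axis=0))

        # Local view: per-feature contributions to one individual prediction
        print(explainer.shap_values(X[:1]))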