
    Intelligent Financial Fraud Detection Practices: An Investigation

    Financial fraud is an issue with far-reaching consequences in the finance industry, government, corporate sectors, and for ordinary consumers. Increasing dependence on new technologies such as cloud and mobile computing in recent years has compounded the problem. Traditional methods of detection involve extensive use of auditing, where a trained individual manually observes reports or transactions in an attempt to discover fraudulent behaviour. This method is not only time-consuming, expensive, and inaccurate, but in the age of big data it is also impractical. Not surprisingly, financial institutions have turned to automated processes using statistical and computational methods. This paper presents a comprehensive investigation of financial fraud detection practices using such data mining methods, with a particular focus on computational intelligence-based techniques. The practices are classified by key aspects such as the detection algorithm used, the fraud type investigated, and the success rate. Issues and challenges associated with current practices, and potential future directions of research, are also identified. Comment: Proceedings of the 10th International Conference on Security and Privacy in Communication Networks (SecureComm 2014).

    An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection

    Fraud detection has received considerable attention from academic researchers and industries worldwide. Insurance datasets are enormous, with skewed distributions and high dimensionality. Skewed class distribution and data volume are significant problems when analyzing insurance datasets, as they increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade model performance, so more sophisticated techniques are needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve prediction performance and the learner's understanding of the minority fraudulent class compared to other sampling techniques. The proposed model works in three steps. First, PCA is applied to extract the necessary features and reduce the dimensionality of the data. Second, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Because oversampling introduces considerable noise, in the third step the balanced data are thoroughly cleaned with the Tomek Link algorithm, removing the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, making the data more precise and freer from noise. The resulting dataset is used by four different classification algorithms: Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks, using repeated 5-fold cross-validation. Compared to the other classifiers, Neural Networks with FusedRCE had the highest average prediction rate of 98.9%. The results were also measured using F1 score, Precision, Recall, and AUC. The proposed method performed significantly better than other fraud detection approaches in health insurance, predicting more fraudulent cases with greater accuracy and a 3x increase in training speed.
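    A minimal sketch of the three-step pipeline described above (PCA, then k-means-guided SMOTE, then Tomek-link cleaning), using imbalanced-learn. The classifier, the synthetic data, and all parameter values are illustrative assumptions, not the authors' exact configuration; in particular, `cluster_balance_threshold` is lowered pragmatically so that KMeansSMOTE finds usable clusters on small synthetic data.

    ```python
    # Sketch of a FusedRCE-style resampling pipeline, assuming scikit-learn
    # and imbalanced-learn. Parameters are placeholders, not the paper's.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from imblearn.over_sampling import KMeansSMOTE
    from imblearn.under_sampling import TomekLinks
    from imblearn.pipeline import Pipeline

    # Stand-in for an imbalanced insurance dataset (10% "fraud").
    X, y = make_classification(n_samples=4000, n_features=20,
                               weights=[0.9, 0.1], random_state=42)

    pipeline = Pipeline(steps=[
        ("pca", PCA(n_components=0.95)),                       # step 1: reduce dimensions
        ("oversample", KMeansSMOTE(cluster_balance_threshold=0.05,
                                   random_state=42)),          # step 2: cluster-guided SMOTE
        ("clean", TomekLinks()),                               # step 3: remove boundary noise
        ("clf", MLPClassifier(max_iter=500, random_state=42)), # neural network learner
    ])

    # Repeated 5-fold cross-validation, as in the study.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
    print(f"mean F1 over repeated folds: {scores.mean():.3f}")
    ```

    Putting the samplers inside an imbalanced-learn Pipeline ensures resampling happens only on each training fold, never on the held-out data.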

    Automobile Insurance Fraud Detection Using Data Mining: A Systematic Literature Review

    Insurance is a pivotal element in modern society, but insurers face a persistent challenge from fraudulent behaviour performed by policyholders. This behaviour is detrimental to both insurance companies and their honest customers, but the intricate nature of insurance fraud severely complicates its efficient, automated detection. This study surveys fifty recent publications on automobile insurance fraud detection, published between January 2019 and March 2023, and presents the most commonly used data sets and methods for resampling and detection, as well as interesting, novel approaches. The study adopts the highly cited Systematic Literature Review (SLR) methodology for software engineering research proposed by Kitchenham and Charters and collected studies from four online databases. The findings indicate limited public availability of automobile insurance fraud data sets. In terms of detection methods, the prevailing approach involves supervised machine learning methods that utilise structured, intrinsic features of claims or policies and that lack consideration of an example-dependent cost of misclassification. However, alternative techniques are also explored, including graph-based methods, unstructured textual data, and cost-sensitive classifiers. The most common resampling approach was found to be oversampling. This SLR identifies commonly used methods in recent automobile insurance fraud detection research as well as interesting directions for future research. It adds value over a related review by also including studies published from 2021 onward and by detailing the methodology used. Limitations of this SLR include its restriction to a small number of publication years and limited validation of the choices made during the process.

    Ensemble Learning for fraud detection in Online Payment System: Fraud Detection in Online Payment System

    The imbalanced problem in fraud detection systems refers to the unequal distribution of fraud and non-fraud cases in the data used to train machine learning models, which can make it difficult to detect fraudulent activity accurately. As a general rule, fraud occurs much less frequently than other types of events, resulting in a highly unbalanced dataset. This imbalance presents challenges for machine learning algorithms, as they may become biased towards the majority class (non-fraud cases) and fail to detect fraud accurately. In such situations, models may have high overall accuracy but low recall for the minority class (fraud cases), meaning many instances of fraud are misclassified and never found. In this study, the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the dataset, and decision trees (DT), Naive Bayes (NB), and enhanced logistic regression (ELR) are used to classify it. A majority voting mechanism is used to ensemble the DT, NB, and ELR methods and to analyze the performance of the model. The ensemble was superior to the individual algorithms in terms of accuracy (98.62%), F1 score (95.21%), precision (98.02%), and recall (96.75%).
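    A minimal sketch of the setup described above: SMOTE to balance the training split, then a hard (majority) vote over a decision tree, naive Bayes, and logistic regression. Plain LogisticRegression stands in for the paper's "enhanced" variant, whose details are not given here; data and parameters are illustrative.

    ```python
    # Sketch: SMOTE-balanced training data fed to a majority-voting ensemble.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from imblearn.over_sampling import SMOTE

    # Stand-in for an online-payment dataset (2% fraud).
    X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Balance only the training split, never the test set.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

    ensemble = VotingClassifier(
        estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                    ("nb", GaussianNB()),
                    ("lr", LogisticRegression(max_iter=1000))],
        voting="hard",  # each model gets one vote; majority wins
    )
    ensemble.fit(X_res, y_res)
    print(classification_report(y_te, ensemble.predict(X_te)))
    ```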

    Predicting automobile insurance fraud using classical and machine learning models

    Insurance fraud claims have become a major problem in the insurance industry. Several investigations have been carried out to eliminate negative impacts on the insurance industry, as this immoral act has caused the loss of billions of dollars. In this paper, a comparative study was carried out to assess the performance of various classification models, namely logistic regression, neural network (NN), support vector machine (SVM), tree-augmented naïve Bayes (NB), decision tree (DT), random forest (RF), and AdaBoost with different model settings, for predicting automobile insurance fraud claims. Results reveal that the tree-augmented NB outperformed the other models on several performance metrics, with accuracy of 79.35%, sensitivity of 44.70%, misclassification rate of 20.65%, area under the curve of 0.81, and Gini of 0.62. In addition, the results show that the AdaBoost algorithm can improve the classification performance of the decision tree. These findings are useful for insurance professionals in identifying potential insurance fraud claim cases.
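    An illustrative sketch of the paper's observation that AdaBoost can improve a decision tree: the same shallow tree is evaluated alone and as the base estimator inside AdaBoost. Data and parameters are placeholder assumptions, not the study's settings.

    ```python
    # Sketch: boosting a shallow decision tree with AdaBoost.
    # Note: the base learner is passed via `estimator` (scikit-learn >= 1.2);
    # older versions call this parameter `base_estimator`.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=1)

    stump = DecisionTreeClassifier(max_depth=2, random_state=1)
    boosted = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=1)

    print("tree alone:", cross_val_score(stump, X, y, cv=5, scoring="roc_auc").mean())
    print("AdaBoost  :", cross_val_score(boosted, X, y, cv=5, scoring="roc_auc").mean())
    ```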

    Explainable Artificial Intelligence and Causal Inference based ATM Fraud Detection

    Gaining customers' trust and treating them with empathy are critical in the financial domain, and frequent fraudulent activity undermines both, so financial organizations and banks must take the utmost care to mitigate it. ATM fraud in particular is a common problem faced by banks. Fraud datasets pose critical challenges: they are highly imbalanced, the fraud pattern changes over time, and so on. Owing to the rarity of fraudulent activity, fraud detection can be formulated either as a binary classification problem or as one-class classification (OCC). In this study, we applied both formulations to an ATM transactions dataset collected from India. For binary classification, we investigated the effectiveness of various oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants and Generative Adversarial Networks (GAN). Further, we employed various machine learning techniques, viz. Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Tree (GBT), and Multi-layer Perceptron (MLP). GBT outperformed the rest of the models with 0.963 AUC, and DT came second with 0.958 AUC; DT is the winner if complexity and interpretability are considered. Among the oversampling approaches, SMOTE and its variants were observed to perform better. In OCC, IForest attained 0.959 CR, and OCSVM secured second place with 0.947 CR. Further, we incorporated explainable artificial intelligence (XAI) and causal inference (CI) into the fraud detection framework and studied them through various analyses. Comment: 34 pages; 21 figures; 8 tables.
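    A minimal sketch of the one-class formulation discussed above: the detector is fit only on legitimate transactions, and anything it flags as an outlier is treated as potential fraud. Isolation Forest and One-Class SVM mirror the study's OCC models, but the data, contamination rate, and thresholds are illustrative assumptions.

    ```python
    # Sketch: one-class fraud detection trained on non-fraud transactions only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import IsolationForest
    from sklearn.svm import OneClassSVM

    # Stand-in for ATM transactions (3% fraud).
    X, y = make_classification(n_samples=4000, weights=[0.97, 0.03], random_state=2)
    X_normal = X[y == 0]  # fit on the legitimate class only

    for name, model in [("IForest", IsolationForest(contamination=0.03, random_state=2)),
                        ("OCSVM", OneClassSVM(nu=0.03))]:
        model.fit(X_normal)
        pred = (model.predict(X) == -1).astype(int)  # -1 = outlier -> flagged as fraud
        recall = (pred[y == 1] == 1).mean()
        print(f"{name}: flagged {pred.mean():.1%} of transactions, fraud recall {recall:.2f}")
    ```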

    Comparing the performance of oversampling techniques for imbalanced learning in insurance fraud detection

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Although the current trend of data production generates enormous volumes of data every second, there are situations where the target category is represented extremely unequally, giving rise to imbalanced datasets; analyzing such data correctly can lead to relevant decisions and appropriate business strategies. Fraud modeling is one example: fraudulent transactions are expected to be far rarer than reliable ones, and predicting them can be crucial for improving a company's decisions and processes. However, class imbalance has a negative effect on traditional techniques; many remedies have been proposed, oversampling being one of them. This work analyses the behavior of different oversampling techniques, such as random oversampling, SOMO, and SMOTE, across different classifiers and evaluation metrics. The exercise is done with real data from an insurance company in Colombia, predicting fraudulent claims for its compulsory auto product. The conclusions demonstrate the advantages of using oversampling in imbalanced circumstances, but also the importance of comparing different evaluation metrics and classifiers to obtain appropriate conclusions and comparable results.
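    A minimal sketch of the comparison this dissertation carries out: oversamplers crossed with classifiers, scored on metrics suited to imbalance. Random oversampling and SMOTE come from imbalanced-learn; SOMO is omitted here because it requires a separate implementation outside that library. All data and settings are illustrative assumptions.

    ```python
    # Sketch: grid of oversamplers x classifiers, compared on F1 and ROC-AUC.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score, roc_auc_score
    from imblearn.over_sampling import RandomOverSampler, SMOTE

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=3)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

    for s_name, sampler in [("none", None),
                            ("random", RandomOverSampler(random_state=3)),
                            ("SMOTE", SMOTE(random_state=3))]:
        # Resample the training split only; "none" is the untouched baseline.
        Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
        for c_name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                            ("forest", RandomForestClassifier(random_state=3))]:
            p = clf.fit(Xr, yr).predict(X_te)
            auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
            print(f"{s_name:>6} + {c_name}: F1={f1_score(y_te, p):.3f}, AUC={auc:.3f}")
    ```

    Comparing F1 against AUC side by side illustrates the dissertation's point that metric choice can change which sampler/classifier pair looks best.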

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and an uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling even more difficult. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Most traditional approaches to classification perform poorly on imperfect data. We propose using CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the performance measures best suited to imperfect data and address real problems in quality control and business analytics.
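    A minimal sketch of cost-sensitive learning with an SVM: misclassifying the rare class is made more expensive via class weights, one common way to realise CSL in scikit-learn. The 10:1 cost ratio and the data are illustrative; the paper's exact cost formulation may differ.

    ```python
    # Sketch: cost-sensitive SVM via per-class misclassification penalties.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=4)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

    # class_weight scales the SVM penalty C per class: errors on the minority
    # class (1) cost 10x as much as errors on the majority class (0).
    svm = SVC(class_weight={0: 1, 1: 10})
    svm.fit(X_tr, y_tr)
    print(classification_report(y_te, svm.predict(X_te)))
    ```

    Passing class_weight="balanced" instead derives the weights automatically from the inverse class frequencies, which is a sensible default when costs are unknown.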

    Leveraging augmentation techniques for tasks with unbalancedness within the financial domain: a two-level ensemble approach

    Modern financial markets produce massive datasets that need to be analysed with new modelling techniques from (deep) Machine Learning and Artificial Intelligence. The common goal of these techniques is to forecast the behaviour of the market, which can be translated into various classification tasks, such as predicting the likelihood of companies' bankruptcy or detecting fraud. However, real-world financial data are often unbalanced, meaning that the classes are not equally represented in such datasets. This is a central issue, since a Machine Learning model trained on such data fits mainly the majority class, leading to inaccurate predictions. In this paper, we explore different data augmentation techniques for dealing with very unbalanced financial data. We consider a number of publicly available datasets, apply state-of-the-art augmentation strategies to them, and evaluate the results for several Machine Learning models trained on the sampled data. The performance of the various approaches is evaluated by accuracy, micro and macro F1 score, and the precision and recall over the minority class. We show that a consistent and accurate improvement is achieved when data augmentation is employed; the obtained classification results look promising and indicate the efficiency of augmentation strategies on financial tasks. On the basis of these results, we present an approach for classification tasks within the financial domain that takes a dataset as input, identifies which kind of augmentation technique to use, and then applies an ensemble of all the augmentation techniques of the identified type to the input dataset, along with an ensemble of different methods to tackle the underlying classification, as sketched below.
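    A rough sketch of the two-level idea: one model is trained per augmentation technique (level one), and their predictions are combined, here by averaging predicted probabilities (level two). The specific augmenters, the base model, and the combination rule are stand-in assumptions, not the authors' exact ensemble.

    ```python
    # Sketch: ensemble over augmentation techniques for unbalanced data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score, recall_score
    from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=5)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

    probas = []
    for sampler in [RandomOverSampler(random_state=5),
                    SMOTE(random_state=5),
                    ADASYN(random_state=5)]:
        Xr, yr = sampler.fit_resample(X_tr, y_tr)   # level 1: one augmented view per technique
        model = LogisticRegression(max_iter=1000).fit(Xr, yr)
        probas.append(model.predict_proba(X_te)[:, 1])

    pred = (np.mean(probas, axis=0) >= 0.5).astype(int)  # level 2: average the members
    print("macro F1       :", f1_score(y_te, pred, average="macro"))
    print("minority recall:", recall_score(y_te, pred))
    ```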

    Comparison of Classification Algorithms and Undersampling Methods on Employee Churn Prediction: A Case Study of a Tech Company

    Churn prediction is a common data mining problem that companies face across industries. Customer churn has been studied most extensively within the telecommunications industry, where customer retention is low due to high market competition. Like customer churn, employee churn is very costly to a company, and without proper risk mitigation strategies, profits cannot be maximized and valuable employees may leave. Replacing an employee costs far more than retaining one, so it is in any company's best interest to prioritize employee retention. This research combines machine learning techniques with undersampling to identify employees at risk of churn so that retention strategies can be implemented before it is too late. Four classification algorithms are tested on a variety of undersampled datasets in order to find the most effective undersampling and classification method for predicting employee churn. Statistical analysis is conducted on the appropriate evaluation metrics to find the most significant methods. The results of this study can be used by the company to target individuals at risk of churn so that risk mitigation strategies are effective in retaining valuable employees. The methods and results can be tested and applied across different industries and companies.
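    A minimal sketch of the experimental grid described above: several undersampling methods crossed with several classifiers for churn prediction. The particular samplers, classifiers, and data are illustrative assumptions, not the study's exact setup.

    ```python
    # Sketch: comparing undersampling methods x classifiers for churn prediction.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks

    # Stand-in for employee records with a rare "churned" label (10%).
    X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=6)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

    for s_name, sampler in [("random", RandomUnderSampler(random_state=6)),
                            ("NearMiss", NearMiss()),
                            ("Tomek", TomekLinks())]:
        Xr, yr = sampler.fit_resample(X_tr, y_tr)  # undersample the training split only
        for c_name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                            ("forest", RandomForestClassifier(random_state=6))]:
            p = clf.fit(Xr, yr).predict(X_te)
            print(f"{s_name:>8} + {c_name}: minority F1 = {f1_score(y_te, p):.3f}")
    ```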