759 research outputs found

    A Comprehensive Survey of Data Mining-based Fraud Detection Research

    This survey paper categorises, compares, and summarises almost all published technical and review articles in automated fraud detection within the last 10 years. It defines the professional fraudster, formalises the main types and subtypes of known fraud, and presents the nature of data evidence collected within affected industries. Within the business context of mining the data to achieve higher cost savings, this research presents methods and techniques together with their problems. Compared to all related reviews on fraud detection, this survey covers many more technical articles and is the only one, to the best of our knowledge, which proposes alternative data and solutions from related domains. Comment: 14 pages

    One-Class Classification: Taxonomy of Study and Review of Techniques

    One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers, because the class boundary must be defined with knowledge of the positive class alone. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, which is based on the availability of training data, the algorithms used and the application domains. We further delve into each of the categories of the proposed taxonomy and present a comprehensive literature review of the OCC algorithms, techniques and methodologies with a focus on their significance, limitations and applications. We conclude our paper by discussing some open research problems in the field of OCC and present our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures

    Explainable Artificial Intelligence and Causal Inference based ATM Fraud Detection

    Gaining the trust of customers and showing them empathy are critical in the financial domain. The frequent occurrence of fraudulent activities affects both factors. Hence, financial organizations and banks must take the utmost care to mitigate them. Among these activities, fraudulent ATM transactions are a common problem faced by banks. The following are the critical challenges involved in fraud datasets: the dataset is highly imbalanced, the fraud pattern is changing, etc. Owing to the rarity of fraudulent activities, fraud detection can be formulated as either a binary classification problem or a one-class classification (OCC) problem. In this study, we applied both formulations to an ATM transactions dataset collected from India. In binary classification, we investigated the effectiveness of various oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants, and Generative Adversarial Networks (GAN). Further, we employed various machine learning techniques, viz. Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Tree (GBT) and Multi-layer Perceptron (MLP). GBT outperformed the rest of the models by achieving 0.963 AUC, and DT stands second with 0.958 AUC. DT is the winner if the complexity and interpretability aspects are considered. Among all the oversampling approaches, SMOTE and its variants were observed to perform better. In OCC, IForest attained 0.959 CR, and OCSVM secured second place with 0.947 CR. Further, we incorporated explainable artificial intelligence (XAI) and causal inference (CI) in the fraud detection framework and studied it through various analyses. Comment: 34 pages; 21 Figures; 8 Tables

    Differential evolution technique on weighted voting stacking ensemble method for credit card fraud detection

    Differential Evolution is a population-based stochastic search optimization technique, which is powerful and efficient over a continuous space for solving differentiable and non-linear optimization problems. The weighted voting stacking ensemble method is an important technique that combines various classifier models. However, selecting the appropriate weights of classifier models for the correct classification of transactions is a problem. This research study is therefore aimed at exploring whether the Differential Evolution optimization method is a good approach for defining the weighting function. Manual and random selection of weights for voting on credit card transactions has previously been carried out. However, a large number of fraudulent transactions were not detected by the classifier models, which means that a technique to overcome the weaknesses of the classifier models is required. Thus, the problem of selecting the appropriate weights was viewed as a weight optimization problem in this study. The dataset was downloaded from the Kaggle competition data repository. Various machine learning algorithms were used to cast weighted votes on the class of a transaction. The Differential Evolution optimization technique was used as the weighting function. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and Safe Level Synthetic Minority Oversampling Technique (SL-SMOTE) oversampling algorithms were modified to preserve the definition of SMOTE while improving the performance. The results generated from this research study showed that the Differential Evolution optimization method is a good weighting function, which can be adopted as a systematic weight function for the weighted voting stacking ensemble method of various classification methods. School of Computing. M. Sc. (Computing)

    Credit Card Fraud Detection Using Machine Learning Algorithms

    One of the main challenges to the security of an online business is credit card fraud. For this reason, algorithms based on artificial intelligence and machine learning are being introduced to enable the most accurate and fast detection of card fraud. This paper presents an approach to the detection of card fraud based on machine learning algorithms, more specifically a multilayer perceptron (MLP) and a Decision tree. The aforementioned algorithms were trained and tested using a publicly available data set on card fraud. The data set used consists of 7 characteristics of the card transaction and information on whether there was card fraud or not. In total, the data set contains information on 1,000,000 transactions, and it is highly imbalanced. To handle the class imbalance, random undersampling, SMOTE, and SMOTE-Tomek algorithms were proposed. The results show that the highest performances are achieved if MLP (AUC = 0.99, f1 = 0.99, MCC = 0.98, and Kappa = 0.98) and Decision tree (AUC = 0.99, f1 = 0.99, MCC = 0.99, and Kappa = 0.98) are trained on the data set re-sampled with the SMOTE-Tomek algorithm. If the performance of the mentioned algorithms is examined using fewer characteristics of the transaction, reducing the number of characteristics leads to a significant decrease in classification performance when a Decision tree in combination with SMOTE-Tomek is used. However, if an MLP in combination with SMOTE-Tomek is used, a significantly smaller decrease in performance can be observed, pointing to higher robustness to input vector dimension reduction. Such a robust system can provide information about transaction validity even when the input data is limited to a few input variables. From the achieved results, it can be concluded that MLP in combination with the SMOTE-Tomek algorithm can be used for credit card fraud detection, even in conditions with a lower number of input variables.

    A comparative analysis of classifiers in cancer prediction using multiple data mining techniques

    In recent years, the application of data mining methods in the health industry has received increased attention from both health professionals and scholars. This paper presents a data mining framework for detecting breast cancer based on real data from one of Iran's hospitals by applying association rules and the most commonly used classifiers. The former were adopted for reducing the size of datasets, while the latter were chosen for cancer prediction. A k-fold cross validation procedure was included for evaluating the performance of the proposed classifiers. Among the six classifiers used in this paper, support vector machine achieved the best results, with an accuracy of 93%. It is worth mentioning that the proposed approach can be applied for detecting other diseases as well.

    An enhanced resampling technique for imbalanced data sets

    A data set is considered imbalanced if the instances in one class (the majority class) outnumber those in the other class (the minority class). The main problem related to binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques such as undersampling, oversampling, and a combination of both techniques have been widely used. However, the undersampling and oversampling techniques suffer from elimination and addition of relevant data, which may lead to poor classification results. Hence, this study aims to increase classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, a Fuzzy Distance-based Undersampling (FDUS) technique is proposed. Entropy estimation is used to produce fuzzy thresholds to categorise the instances in the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE), a combination known as FDUS+SMOTE, which is executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. From the results, FDUS achieved better classification accuracy, F-measure and G-mean compared to the other techniques, with an average of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated with the Distance-based Undersampling technique, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combination of SMOTE and Tomek Links, and of SMOTE and Edited Nearest Neighbour, on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets with approximately 100 to 800 instances.