A Comprehensive Survey of Data Mining-based Fraud Detection Research
This survey paper categorises, compares, and summarises almost all
published technical and review articles in automated fraud detection within the
last 10 years. It defines the professional fraudster, formalises the main types
and subtypes of known fraud, and presents the nature of data evidence collected
within affected industries. Within the business context of mining the data to
achieve higher cost savings, this research presents methods and techniques
together with their problems. Compared to all related reviews on fraud
detection, this survey covers many more technical articles and is the only one,
to the best of our knowledge, which proposes alternative data and solutions
from related domains. Comment: 14 pages
One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This unique situation constrains the learning of efficient classifiers, since
the class boundary must be defined using knowledge of the positive class alone. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, which is based on the availability of training data, the
algorithms used and the application domains. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and presenting
our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures
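As a rough illustration of the positive-only setting described in this abstract, the sketch below fits a boundary from positive samples alone. It is not one of the surveyed algorithms: the centroid-plus-radius rule and the names `fit_one_class`, `is_positive` and `quantile` are assumptions chosen for brevity.

```python
import math

def fit_one_class(positives, quantile=0.95):
    """Fit a centroid-plus-radius boundary using positive samples only."""
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    distances = sorted(math.dist(p, centroid) for p in positives)
    # Radius: the distance that encloses roughly `quantile` of the positives.
    index = min(len(distances) - 1, int(quantile * len(distances)))
    return centroid, distances[index]

def is_positive(model, x):
    """Accept x when it falls inside the learned radius around the centroid."""
    centroid, radius = model
    return math.dist(x, centroid) <= radius

# Hypothetical training data: no negative examples are available.
train = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8)]
model = fit_one_class(train, quantile=1.0)
near = is_positive(model, (1.0, 1.05))  # close to the training cloud
far = is_positive(model, (5.0, 5.0))    # far from anything seen in training
```

Lowering `quantile` tightens the boundary, which trades rejected positives for fewer accepted outliers.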
Explainable Artificial Intelligence and Causal Inference based ATM Fraud Detection
Gaining the trust of customers and providing them empathy are very critical
in the financial domain. Frequent occurrence of fraudulent activities affects
these two factors. Hence, financial organizations and banks must take utmost
care to mitigate them. Among these activities, fraudulent ATM transactions are a
common problem faced by banks. Fraud datasets pose several critical challenges:
the data are highly imbalanced, fraud patterns change over time, and so on.
Owing to the rarity of fraudulent activities, fraud detection
can be formulated as either a binary classification problem or a one-class
classification (OCC) problem. In this study, we applied both formulations to an ATM
transactions dataset collected from India. In binary classification, we
investigated the effectiveness of various over-sampling techniques, such as the
Synthetic Minority Oversampling Technique (SMOTE) and its variants, and Generative
Adversarial Networks (GAN). Further, we employed
various machine learning techniques viz., Naive Bayes (NB), Logistic Regression
(LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF),
Gradient Boosting Tree (GBT), and Multi-layer perceptron (MLP). GBT outperformed
the rest of the models with an AUC of 0.963, and DT came second with an AUC of
0.958. DT is the preferred model when complexity and interpretability are also
considered. Among all the oversampling approaches, SMOTE and its variants were
observed to perform better. In OCC, IForest attained 0.959 CR, and OCSVM
secured second place with 0.947 CR. Further, we incorporated explainable
artificial intelligence (XAI) and causal inference (CI) in the fraud detection
framework and studied it through various analyses. Comment: 34 pages; 21 Figures; 8 Tables
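The core idea behind SMOTE-style oversampling, which this abstract relies on, can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. This is a minimal sketch and not the authors' pipeline: the function name `smote`, its parameters and the toy points are assumptions for illustration.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class (fraud) feature vectors.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies between two existing minority samples, the new data stay inside the region the minority class already occupies.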
Differential evolution technique on weighted voting stacking ensemble method for credit card fraud detection
Differential Evolution is a population-based stochastic search technique for optimization, which is powerful and efficient over a continuous space for solving differentiable and non-linear optimization problems. The weighted voting stacking ensemble method is an important technique that combines various classifier models. However, selecting the appropriate weights of classifier models for the correct classification of transactions is a problem. This research study is therefore aimed at exploring whether the Differential Evolution optimization method is a good approach for defining the weighting function. Manual and random selection of weights for voting on credit card transactions has previously been carried out. However, a large number of fraudulent transactions were not detected by the classifier models, which means that a technique to overcome the weaknesses of the classifier models is required. Thus, the problem of selecting the appropriate weights was viewed as a weight optimization problem in this study. The dataset was downloaded from the Kaggle competition data repository. Various machine learning algorithms were used to cast weighted votes on the class of each transaction. The differential evolution optimization technique was used as a weighting function. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and Safe Level Synthetic Minority Oversampling Technique (SL-SMOTE) oversampling algorithms were modified to preserve the definition of SMOTE while improving the performance. The results of this research study showed that the Differential Evolution optimization method is a good weighting function, which can be adopted as a systematic weight function for the weighted voting stacking ensemble method of various classification methods. School of Computing. M. Sc. (Computing)
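To make the weight-selection idea concrete, the following sketch runs a plain DE/rand/1/bin loop over the voting weights of three classifiers. It assumes a validation error rate as the objective, toy classifier scores, and hypothetical names (`weighted_vote`, `error_rate`, `differential_evolution`); the thesis's actual objective, data and settings are not given in the abstract.

```python
import random

def weighted_vote(weights, scores):
    """Predict 1 when the weight-averaged score reaches 0.5."""
    total = sum(weights) or 1e-12
    return int(sum(w * s for w, s in zip(weights, scores)) / total >= 0.5)

def error_rate(weights, score_rows, labels):
    wrong = sum(weighted_vote(weights, row) != y for row, y in zip(score_rows, labels))
    return wrong / len(labels)

def differential_evolution(objective, dim, pop_size=12, f=0.8, cr=0.9,
                           generations=40, seed=0):
    """DE/rand/1/bin over the box [0, 1]^dim, minimising `objective`."""
    rng = random.Random(seed)
    # Start from equal weights plus random candidates (an assumption).
    population = [[1.0 / dim] * dim]
    population += [[rng.random() for _ in range(dim)] for _ in range(pop_size - 1)]
    fitness = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    gene = population[a][d] + f * (population[b][d] - population[c][d])
                    gene = min(1.0, max(0.0, gene))  # clip to the bounds
                else:
                    gene = population[i][d]
                trial.append(gene)
            trial_fitness = objective(trial)
            if trial_fitness <= fitness[i]:  # greedy selection
                population[i], fitness[i] = trial, trial_fitness
    best = min(range(pop_size), key=fitness.__getitem__)
    return population[best], fitness[best]

# Toy validation scores from three hypothetical classifiers (one row per transaction).
score_rows = [(0.9, 0.4, 0.1), (0.8, 0.3, 0.6), (0.2, 0.7, 0.7),
              (0.1, 0.6, 0.7), (0.7, 0.2, 0.3), (0.3, 0.8, 0.5)]
labels = [1, 1, 0, 0, 1, 0]

equal_error = error_rate([1 / 3] * 3, score_rows, labels)
best_weights, best_error = differential_evolution(
    lambda w: error_rate(w, score_rows, labels), dim=3)
```

Since selection is greedy and the equal-weight vector is in the initial population, the returned error can never exceed the equal-weight baseline on the data used for optimization. Whether the gain carries over to unseen transactions has to be checked on held-out data.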
Credit Card Fraud Detection Using Machine Learning Algorithms
One of the main challenges to the security of an online business is credit card fraud. For this reason, algorithms based on artificial intelligence and machine learning are being introduced to enable the most accurate and fast detection of card fraud. This paper presents an approach to the detection of card fraud based on machine learning algorithms, specifically a multilayer perceptron (MLP) and a decision tree. These algorithms were trained and tested using a publicly available data set on card fraud. The data set consists of 7 characteristics of the card transaction and information on whether there was card fraud or not. In total, the data set contains information on 1,000,000 transactions, and it is highly imbalanced. To handle the class imbalance, random undersampling, SMOTE, and SMOTE-Tomek algorithms were applied. The results show that the highest performance is achieved when the MLP (AUC = 0.99, f1 = 0.99, MCC = 0.98, and Kappa = 0.98) and the decision tree (AUC = 0.99, f1 = 0.99, MCC = 0.99, and Kappa = 0.98) are trained on the data set re-sampled with the SMOTE-Tomek algorithm. When fewer transaction characteristics are used, classification performance decreases significantly for the decision tree combined with SMOTE-Tomek. For the MLP combined with SMOTE-Tomek, however, the decrease is significantly smaller, pointing to higher robustness to reduction of the input vector dimension. Such a robust system can provide information about transaction validity even when the input data is limited to a few input variables. From the achieved results, it can be concluded that the MLP in combination with the SMOTE-Tomek algorithm can be used for credit card fraud detection, even in conditions with a lower number of input variables.
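The f1, MCC and Kappa scores quoted in this abstract can all be derived from a binary confusion matrix. The sketch below uses the standard definitions; the function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cohen_kappa(tp, fp, fn, tn):
    """Agreement between predictions and labels, corrected for chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0

# Hypothetical confusion counts for a balanced test set.
tp, fp, fn, tn = 40, 10, 10, 40
scores = (f1_score(tp, fp, fn), mcc(tp, fp, fn, tn), cohen_kappa(tp, fp, fn, tn))
```

Unlike plain accuracy, MCC and Kappa both account for agreement expected by chance, which is why they are commonly reported for imbalanced problems.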
A comparative analysis of classifiers in cancer prediction using multiple data mining techniques
In recent years, the application of data mining methods in the health industry has received increased attention from both health professionals and scholars. This paper presents a data mining framework for detecting breast cancer based on real data from a hospital in Iran by applying association rules and the most commonly used classifiers. The former were adopted for reducing the size of the datasets, while the latter were chosen for cancer prediction. A k-fold cross validation procedure was included for evaluating the performance of the proposed classifiers. Among the six classifiers used in this paper, the support vector machine achieved the best results, with an accuracy of 93%. It is worth mentioning that the proposed approach can be applied for detecting other diseases as well.
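The k-fold cross validation procedure mentioned in this abstract splits the data into k folds and uses each fold once for testing. A minimal index-splitting sketch follows; the name `k_fold_indices` and the contiguous, unshuffled folds are assumptions, and the abstract does not state the value of k that was used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
```

In practice the sample order is usually shuffled (and often stratified by class) before splitting, so that each fold reflects the overall class distribution.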
An enhanced resampling technique for imbalanced data sets
A data set is considered imbalanced if the instances of one class (the majority class) outnumber those of the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques, such as undersampling, oversampling, and combinations of both, have been widely used. However, undersampling and oversampling techniques suffer, respectively, from the elimination and addition of relevant data, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, Fuzzy Distance-based Undersampling (FDUS) is proposed. Entropy estimation is used to produce fuzzy thresholds to categorise the instances in the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE), a combination known as FDUS+SMOTE, which is executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. From the results, FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated into the Distance-based Undersampling technique, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combinations of SMOTE and Tomek Links, and SMOTE and Edited Nearest Neighbour, on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets with instance counts in the range of approximately 100 to 800.
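The abstract reports G-mean alongside accuracy. The sketch below shows, on made-up labels, why G-mean is informative for imbalanced data: a classifier that ignores the minority class still scores high accuracy but a G-mean of zero. The function names and the 95/5 class split are assumptions for illustration.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the minority class label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical imbalanced test set: 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100                   # ignores the minority class
balanced_guess = [1] * 10 + [0] * 85 + [1] * 4 + [0]

acc_majority = accuracy(y_true, always_majority)   # 0.95
gm_majority = g_mean(y_true, always_majority)      # 0.0
acc_balanced = accuracy(y_true, balanced_guess)
gm_balanced = g_mean(y_true, balanced_guess)
```

The second classifier has lower accuracy than the first but a much higher G-mean, because it recovers most of the minority class.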
- …