24 research outputs found

    Fraud and anomaly detection in healthcare: an unsupervised machine learning approach

    Internship report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. Fraud and abuse in healthcare are critical and cause significant damage. However, the auditing of healthcare encounters is cumbersome, and the detection of fraud and abuse is challenging and ties up capacity. Data-driven fraud and anomaly detection models can help to overcome these issues. This work proposes several unsupervised learning methods to understand patterns and detect abnormal healthcare encounters that might be fraudulent or abusive. The ensemble of models is split into sub-processes and applied to a healthcare dataset belonging to the Future Healthcare group, a Portuguese group operating in health insurance. A major part of the ensemble is the implementation of the Isolation Forest algorithm, which achieves good results in precision and recall and detects new potentially fraudulent abnormal behaviour. Because the data are unlabelled and unsupervised learning methods are applied, the proposed model detects new fraudulent patterns instead of learning from existing ones. Besides the model that predicts whether new incoming medical encounters are fraudulent or abusive, this work illustrates a visual method for detecting suspicious networks among medical providers. In addition, it includes an approach to predict whether a customer will cancel the insurance based on anomalous behaviour. This internship report aims to contribute to science and to be publicly available, even though some parts could not be explained in detail due to confidentiality.
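
    The report's code and feature set are not public; the following is a minimal sketch of Isolation Forest-based anomaly scoring for encounter-level data using scikit-learn. The feature names, the synthetic data and the contamination value are assumptions for illustration, not the report's actual pipeline.

```python
# Minimal sketch of Isolation Forest anomaly scoring for healthcare encounters
# (illustrative only; features and threshold are assumptions, not the report's pipeline).
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Hypothetical encounter-level features -- replace with the real feature set.
rng = np.random.default_rng(42)
encounters = pd.DataFrame({
    "claim_amount": rng.gamma(shape=2.0, scale=80.0, size=1000),
    "num_procedures": rng.poisson(lam=3, size=1000),
    "days_since_last_visit": rng.exponential(scale=30.0, size=1000),
})

X = StandardScaler().fit_transform(encounters)

# contamination is the assumed share of anomalous encounters (a tuning choice,
# not a value reported in the abstract).
iso = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
iso.fit(X)

encounters["anomaly_score"] = -iso.score_samples(X)   # higher = more anomalous
encounters["flagged"] = iso.predict(X) == -1          # -1 marks outliers

print(encounters.sort_values("anomaly_score", ascending=False).head())
```

    In practice the flagged encounters would be routed to auditors, whose feedback can then be used to tune the contamination rate and the feature set.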

    The classification performance of ensemble decision tree classifiers: a case study of detecting fraud in credit card transactions

    In this dissertation, we propose ensemble decision tree classifiers as an ideal classification technique for solving the problem of fraud in the domain of credit card transactions. Ensemble tree classifiers have been applied in many areas, such as speech recognition, image recognition and medical diagnostics, and have shown excellent results. At the centre of fraud, credit card fraud has been a major concern. The rise in credit card fraud is largely attributed to the nature in which it can be committed: a fraudster does not always need to be physically present, making it a prime target for criminals. Card-Not-Present fraud refers to this type, in which an electronic transaction can be conducted without the client being present, for example via telephone or the web. To build better classifiers, it was important to first investigate what causes misclassifications in fraud detection systems. A systematic literature review was conducted to uncover the factors that have been identified as causes of misclassification. It revealed that many factors lead to misclassifications, and several authors have proposed techniques to handle them. However, there is no universal technique for addressing these factors, as different domains have different datasets that require different techniques. This study investigates how parameters involved in modelling fraud detection systems impact the classification performance of ensemble decision tree classifiers. The factors investigated include sample size, sampling technique, learning method and choice of split criterion, and how they affect classification performance. A series of experiments was conducted to investigate how these factors contributed to better classifiers. E-commerce data from Vesta Corporation, made available on Kaggle, was used in the experiments. The data was split into two sets, one for training the models and the other for testing their performance. Accuracy, the confusion matrix, precision and recall were used as performance measures. Our results showed that a larger sample size resulted in better classifiers, attributed to the models having more instances to learn from, covering more patterns of fraudulent transactions. The sampling technique proved pivotal: undersampling greatly reduced performance, achieving a maximum accuracy of 89.6223%, while oversampling increased performance to a maximum accuracy of 99.9531%. Furthermore, the choice of split criterion impacts the performance of ensemble tree classifiers: entropy resulted in better classifiers than the Gini index, although entropy requires more time to execute. Lastly, the learning method affected performance: models using supervised learning outperformed those using unsupervised learning in detecting credit card fraud. The conclusions from this research are insightful when designing fraud detection systems that use ensemble decision tree classifiers as base learners. Thesis (MSci) -- Faculty of Science and Agriculture, 202
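
    The comparison of split criteria and sampling strategies described above can be reproduced in spirit with standard libraries. The sketch below is an assumption-laden stand-in: it uses a synthetic imbalanced dataset rather than the Vesta/Kaggle data, a random forest as the ensemble tree classifier, and imbalanced-learn's RandomOverSampler for oversampling.

```python
# Illustrative sketch (not the dissertation's code): compare entropy vs Gini
# split criteria for an ensemble tree classifier trained on oversampled data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import RandomOverSampler  # assumed dependency

# Synthetic imbalanced stand-in for the Vesta e-commerce data.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Oversample only the training split to avoid leaking test information.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

for criterion in ("entropy", "gini"):
    clf = RandomForestClassifier(n_estimators=200, criterion=criterion,
                                 random_state=0, n_jobs=-1)
    clf.fit(X_res, y_res)
    print(criterion)
    print(classification_report(y_test, clf.predict(X_test), digits=4))
```

    Note that oversampling is applied after the train/test split, so the reported precision and recall reflect the untouched test distribution rather than the rebalanced one.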

    Statistical Challenges and Methods for Missing and Imbalanced Data

    Missing data remains a prevalent issue in every area of research. The impact of missing data, if not carefully handled, can be detrimental to any statistical analysis. Some statistical challenges associated with missing data include loss of information, reduced statistical power and non-generalizability of a study's findings. It is therefore crucial that researchers pay close attention when dealing with missing data. This multi-paper dissertation provides insight into missing data across different fields of study and addresses some of the above-mentioned challenges through simulation studies and application to real datasets. The first paper addresses the dropout phenomenon in single-cell RNA (scRNA) sequencing through a comparative analysis of existing scRNA sequencing techniques. The second paper uses simulation studies to assess whether it is appropriate to address non-detects in data using a traditional substitution approach, imputation, or a non-imputation-based approach. The final paper presents an efficient strategy for addressing imbalance in data at any degree (whether moderate or highly imbalanced) by combining random undersampling with different weighting strategies. Based on the findings of this dissertation, we conclude that missingness is not always a lack of information but an interestingness that needs to be investigated.
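
    As a hedged illustration of the final paper's general idea (combining random undersampling with a weighting strategy), the sketch below undersamples the majority class only part-way and lets class weights compensate for the residual imbalance. The sampling ratio, model and synthetic data are assumptions, not the dissertation's actual method.

```python
# Generic sketch: random undersampling combined with class weighting
# (an illustration of the strategy, not the dissertation's exact procedure).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from imblearn.under_sampling import RandomUnderSampler  # assumed dependency

X, y = make_classification(n_samples=50000, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Undersample the majority class only part-way (the 0.5 ratio is an assumption),
# then let class weights handle the remaining imbalance.
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=1)
X_res, y_res = rus.fit_resample(X_tr, y_tr)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_res, y_res)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```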

    A new Monte Carlo sampling method based on Gaussian Mixture Model for imbalanced data classification

    Imbalanced data classification has been a major topic in the machine learning community. Various approaches have been proposed to address the issue in recent years, with researchers paying considerable attention to data-level and algorithm-level techniques. However, existing methods often generate samples in specific regions without considering the complexity of imbalanced distributions, which can lead learning models to overemphasize certain difficult factors in the minority data. In this paper, a Monte Carlo sampling algorithm based on a Gaussian Mixture Model (MCS-GMM) is proposed. In MCS-GMM, we use the Gaussian mixture model to fit the distribution of the imbalanced data and apply the Monte Carlo algorithm to generate new data. Then, to reduce the impact of data overlap, the three-sigma rule is used to divide the data into four types, and the weight of each minority-class instance is computed based on its neighbours and probability density function. Experiments conducted on the Knowledge Extraction based on Evolutionary Learning (KEEL) datasets show that our method is effective and outperforms existing approaches such as the Synthetic Minority Over-sampling TEchnique (SMOTE).
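
    The core mechanism described here, fitting a Gaussian mixture to the minority class and drawing synthetic samples from it, can be sketched with scikit-learn as below. This is a simplified stand-in: the paper's three-sigma partitioning and neighbour-based weighting are omitted, and the number of mixture components is a guess.

```python
# Simplified sketch of the core MCS-GMM idea: fit a Gaussian mixture to the
# minority class and Monte Carlo-sample synthetic points from it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_min = X[y == 1]

# Fit the mixture on minority samples; n_components=3 is an assumption.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_min)

# Monte Carlo step: sample enough synthetic points to balance the classes.
n_needed = (y == 0).sum() - (y == 1).sum()
X_syn, _ = gmm.sample(n_needed)

X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(n_needed, dtype=int)])
print(X_bal.shape, np.bincount(y_bal))
```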

    Computational intelligent hybrid model for detecting disruptive trading activity

    The term “disruptive trading behaviour” was first proposed by the U.S. Commodity Futures Trading Commission and is now widely used in US and EU regulation (MiFID II) to describe activities that create a misleading appearance of market liquidity or depth, or an artificial upward or downward price movement, to serve the perpetrators' own purposes. Such activities, identified as a new form of financial fraud in EU regulations, damage the proper functioning and integrity of capital markets and are hence extremely harmful. While existing studies have explored this issue, they have in most cases either focused on empirical analysis of such cases or proposed detection models based on certain assumptions about the market. Effective methods that can analyse and detect such disruptive activities based on direct study of trading behaviours have not been developed to date; there is, accordingly, a knowledge gap in the literature. This paper seeks to address that gap and provides a hybrid model composed of two data-mining-based detection modules that effectively identify disruptive trading behaviours. The hybrid model is designed to work in an on-line scheme: the limit order stream is transformed, calculated and extracted as a feature stream. One detection module, “Single Order Detection,” detects disruptive behaviours by identifying abnormal patterns in every single trading order. The other module, “Order Sequence Detection,” approaches the problem by examining the contextual relationships of a sequence of trading orders using an extended hidden Markov model, which identifies whether sequential changes in the extracted features are manipulative activities (or not). Both models were evaluated using large volumes of real tick data from NASDAQ, which demonstrated that both are able to identify a range of disruptive trading behaviours and that they outperform the selected traditional benchmark models. This hybrid model thus makes a substantial contribution to the literature on financial market surveillance and offers a practical and effective approach for the identification of disruptive trading behaviour.
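    The paper's extended hidden Markov model is not available off the shelf; the sketch below only illustrates the general shape of the “Order Sequence Detection” idea using a standard Gaussian HMM from hmmlearn, where sequences of order-level features with unusually low likelihood are flagged. The feature choices, synthetic data and percentile threshold are assumptions.

```python
# Rough sketch of sequence-level anomaly scoring with a standard Gaussian HMM
# (the paper uses an *extended* HMM; features and threshold are assumptions).
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed dependency

rng = np.random.default_rng(0)

# Hypothetical per-order features: [normalised price delta, size, inter-arrival time]
normal_seqs = [rng.normal(0.0, 1.0, size=(50, 3)) for _ in range(200)]

# Train the HMM on sequences of presumed-normal trading activity.
X_train = np.vstack(normal_seqs)
lengths = [len(s) for s in normal_seqs]
hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=50,
                  random_state=0)
hmm.fit(X_train, lengths)

def sequence_score(seq):
    """Per-step log-likelihood; unusually low values suggest manipulation."""
    return hmm.score(seq) / len(seq)

scores = np.array([sequence_score(s) for s in normal_seqs])
threshold = np.percentile(scores, 1)   # flag the lowest 1% (arbitrary cut-off)

suspicious = rng.normal(3.0, 2.0, size=(50, 3))  # injected abnormal sequence
print("flagged:", sequence_score(suspicious) < threshold)
```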

    Credit Card Fraud Detection using One-Class Classification Algorithms

    Similar to most things in everyday life, the advent of payment cards has both good and bad sides. It undoubtedly made life easier by bringing the whole payment system to a single card, but it also paved the way for a new set of illegal activities and frauds. Credit card fraud has been carried out since payment cards came into existence, and the trend in such frauds has been increasing ever since. A quest therefore began to attenuate the losses caused by such frauds. Many preventive and detective measures have been taken in the past, and new ways are sought to further improve these policies. These measures, however, reduce the losses only temporarily and have not yet succeeded in converting the uptrend in losses into a downtrend, because fraudsters always come up with new ways of tricking people and the system. Thus, a new way of solving this ever-existing challenge is needed, one that can detect even those fraudulent instances executed by techniques and methods yet to be invented by fraudsters. Moreover, normal (non-fraudulent) credit card transactions occur far more often than fraudulent ones, so the data for credit card fraud detection is highly imbalanced. Another challenge in credit card fraud detection systems is the high dimensionality of the datasets. Therefore, to address the imbalanced nature of the data, to cope with the curse of dimensionality by enabling the model to regulate and extract discriminative features, and to detect fraud carried out by yet-to-be-invented techniques, we implemented a set of novel, state-of-the-art subspace learning-based one-class classification algorithms. We experimented with integrating a projection matrix and geometric data information in the training phase to improve credit card fraud detection. We also experimented with using a maximization-update rule for the projection matrix instead of the classical minimization-update rule in subspace learning-based data description. We found that the linear version of Graph-embedded Subspace Support Vector Data Description with a kNN graph, a gradient-based solution and the minimization-update rule works better than all other models.
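
    The Graph-embedded Subspace Support Vector Data Description used in the thesis has no standard library implementation; the sketch below only shows the general one-class classification setup for fraud detection, training on legitimate transactions alone, using scikit-learn's OneClassSVM as a hedged stand-in with synthetic data.

```python
# Generic one-class classification setup for fraud detection (OneClassSVM as a
# stand-in; not the thesis's Graph-embedded Subspace SVDD).
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(2000, 15))     # legitimate transactions
X_fraud = rng.normal(2.5, 1.5, size=(20, 15))        # held-out fraud cases

# Train only on the (abundant) normal class, as in one-class classification.
occ = make_pipeline(StandardScaler(),
                    OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"))
occ.fit(X_normal)

# predict() returns +1 for inliers and -1 for outliers (potential fraud).
print("fraud flagged:", (occ.predict(X_fraud) == -1).mean())
print("false alarms :", (occ.predict(X_normal) == -1).mean())
```

    Because only the normal class is modelled, this setup can in principle flag fraud patterns that did not exist at training time, which is the motivation given in the abstract.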

    Improved adaptive semi-unsupervised weighted oversampling (IA-SUWO) using sparsity factor for imbalanced datasets

    The imbalanced data problem is common in data mining nowadays due to the skewed nature of data, which negatively impacts the classification process in machine learning. As a preprocessing step, oversampling techniques have significantly benefitted the imbalanced domain: artificial data is generated in the minority class to increase the number of samples and balance the class distribution. However, existing oversampling techniques suffer from overfitting and over-generalization problems, which degrade classifier performance. Although many clustering-based oversampling techniques largely overcome these problems, most are unable to produce an appropriate number of synthetic samples in the minority clusters. This study proposes an Improved Adaptive Semi-unsupervised Weighted Oversampling (IA-SUWO) technique that uses a sparsity factor to determine the sparse minority samples in each minority cluster. The technique considers sparse minority samples that lie far from the decision boundary. These samples also carry important information for learning the minority class; if they are also considered for oversampling, the imbalance ratio is further reduced and the learnability of the classifiers can be enhanced. The outcomes of the proposed approach have been compared with existing oversampling techniques such as SMOTE, Borderline-SMOTE, Safe-level SMOTE and the standard A-SUWO technique in terms of accuracy. The comparative analysis revealed that the performance of the proposed oversampling approach increased on average by 5%, from 85% to 90%, over the existing comparative techniques.
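
    A hedged sketch of the general idea behind cluster-weighted oversampling follows: cluster the minority class, compute a simple sparsity measure per cluster, and allocate more synthetic samples to sparser clusters. The sparsity definition (mean distance to the cluster centre), cluster count and interpolation step are assumptions for illustration, not the exact IA-SUWO procedure.

```python
# Sketch of cluster-weighted minority oversampling: sparser minority clusters
# receive a larger share of synthetic samples (not the exact IA-SUWO algorithm).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=4000, n_features=8,
                           weights=[0.93, 0.07], random_state=3)
X_min = X[y == 1]
n_needed = (y == 0).sum() - (y == 1).sum()

km = KMeans(n_clusters=4, n_init=10, random_state=3).fit(X_min)

# Sparsity factor per cluster: mean distance to the cluster centre (assumed definition).
sparsity = np.array([
    np.linalg.norm(X_min[km.labels_ == c] - km.cluster_centers_[c], axis=1).mean()
    for c in range(km.n_clusters)
])
quota = np.round(n_needed * sparsity / sparsity.sum()).astype(int)

rng = np.random.default_rng(3)
synthetic = []
for c, n_c in enumerate(quota):
    members = X_min[km.labels_ == c]
    # Interpolate between random pairs of cluster members (SMOTE-style step).
    i = rng.integers(0, len(members), size=n_c)
    j = rng.integers(0, len(members), size=n_c)
    lam = rng.random((n_c, 1))
    synthetic.append(members[i] + lam * (members[j] - members[i]))

X_syn = np.vstack(synthetic)
print("generated:", X_syn.shape[0], "synthetic minority samples")
```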