
    Can Generative Adversarial Networks Help Us Fight Financial Fraud?

    Transactional fraud datasets exhibit extreme class imbalance, and learners cannot make accurate generalizations without sufficient data. Researchers can account for imbalance at the data level, the algorithmic level, or both. This paper focuses on techniques at the data level: we evaluate the evidence for the optimal technique and potential enhancements. Global fraud losses totalled more than 80% of the UK’s GDP in 2019, so improving preprocessing is inherently valuable in fighting these losses. The synthetic minority oversampling technique (SMOTE) and its extensions are currently the most common preprocessing strategies. SMOTE oversamples the minority classes by randomly generating a point between a minority instance and one of its nearest neighbours (a sketch is given below). Recent papers adopt generative adversarial networks (GANs) for synthetic data creation, and since 2014 there have been several GAN extensions, from improved training mechanisms to frameworks designed specifically for tabular data. The primary aim of this research is to understand the benefit of GANs built specifically for tabular data on supervised classifiers’ performance, and to determine whether such a framework outperforms traditional methods and more common GAN frameworks. Secondly, we propose a framework that allows individuals to test the impact of imbalance ratios on classifier performance. Finally, we investigate the use of clustering and determine whether this information can help GANs create better synthetic data. We explore this in the context of commonly used supervised classifiers and ensemble methods.
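    A minimal sketch of the SMOTE-style interpolation described above, assuming a NumPy array X_min that holds only minority-class rows; the function name and parameters are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating each chosen
    instance towards one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority instance
        j = idx[i, rng.integers(1, k + 1)]   # pick one of its neighbours
        gap = rng.random()                   # random point on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

    In practice the imbalanced-learn library provides a maintained SMOTE implementation with this same basic behaviour.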

    Outlier Mining Methods Based on Graph Structure Analysis

    Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph, where the nodes are the elements of the dataset, and the links have associated weights that are the distances between the nodes. The first method then assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform, or perform on par with, other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; on the other hand, the IsoMap method has two integer parameters, and when they are appropriately selected, the method performs similarly to or better than all the other methods tested.
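    As a rough illustration of the second method, one plausible reading is to compare graph geodesic distances with pairwise distances in the reduced space and flag points with large disagreement. The sketch below follows that reading using scikit-learn's Isomap; the parameter values and the mean-absolute-disagreement score are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
from sklearn.manifold import Isomap
from scipy.spatial.distance import pdist, squareform

def isomap_outlier_scores(X, n_neighbors=10, n_components=2):
    """Score each point by how much its geodesic distances disagree with its
    distances in the IsoMap-reduced space (larger score = more outlying)."""
    iso = Isomap(n_neighbors=n_neighbors, n_components=n_components)
    Z = iso.fit_transform(X)
    geodesic = iso.dist_matrix_            # graph shortest-path distances
    embedded = squareform(pdist(Z))        # pairwise distances after reduction
    return np.abs(geodesic - embedded).mean(axis=1)
```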

    SMOTified-GAN for class imbalanced pattern classification problems

    Class imbalance in a dataset is a major problem for classifiers: it results in poor prediction, with a high true positive rate (TPR) but a low true negative rate (TNR), on a majority-positive training dataset. Generally, the preprocessing technique of oversampling the minority class(es) is used to overcome this deficiency. Our focus is on using a hybridization of the Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to address class imbalance problems. We propose a novel two-phase oversampling approach, involving knowledge transfer, that combines the strengths of SMOTE and GAN. The unrealistic or over-generalized samples produced by SMOTE are transformed into a realistic distribution by the GAN, which could not otherwise be trained effectively on the limited minority-class data alone. We name it SMOTified-GAN because the GAN works on pre-sampled minority data produced by SMOTE rather than generating samples from random noise itself. The experimental results show that the quality of the minority-class samples is improved across a variety of benchmark datasets, with F1-score gains of up to 9% over the next best algorithm tested. Its time complexity is also reasonable, at around O(N²d²T) for a sequential algorithm.
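    A hedged sketch of the two-phase idea follows: the SMOTE output (x_smote) is fed to the GAN generator in place of random noise, so the generator learns to map those points onto the real minority distribution. The network sizes, optimizer settings, and helper names (make_mlp, smotified_gan) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

def make_mlp(d_in, d_out, hidden=64, out_act=None):
    layers = [nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

def smotified_gan(x_real_min, x_smote, epochs=200, lr=1e-3):
    """Refine SMOTE-generated minority samples with a small GAN.
    x_real_min, x_smote: float tensors of shape (n, d)."""
    d = x_real_min.shape[1]
    G = make_mlp(d, d)                          # refines SMOTE samples
    D = make_mlp(d, 1, out_act=nn.Sigmoid())    # real vs refined
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCELoss()
    ones_real = torch.ones(len(x_real_min), 1)
    zeros_fake = torch.zeros(len(x_smote), 1)
    ones_fake = torch.ones(len(x_smote), 1)
    for _ in range(epochs):
        # Discriminator step: real minority data vs refined SMOTE data.
        fake = G(x_smote).detach()
        loss_d = bce(D(x_real_min), ones_real) + bce(D(fake), zeros_fake)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # Generator step: push refined SMOTE samples towards "real".
        loss_g = bce(D(G(x_smote)), ones_fake)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return G(x_smote).detach()                  # refined minority samples
```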