9 research outputs found

    Generating Multi-Categorical Samples with Generative Adversarial Networks

    Get PDF
    We propose a method to train generative adversarial networks on mutivariate feature vectors representing multiple categorical values. In contrast to the continuous domain, where GAN-based methods have delivered considerable results, GANs struggle to perform equally well on discrete data. We propose and compare several architectures based on multiple (Gumbel) softmax output layers taking into account the structure of the data. We evaluate the performance of our architecture on datasets with different sparsity, number of features, ranges of categorical values, and dependencies among the features. Our proposed architecture and method outperforms existing models

    Machine Learning Techniques for Suspicious Transaction Detection and Analysis

    Get PDF
    Financial services must monitor their transactions to prevent being used for money laundering and combat the financing of terrorism. Initially, organizations in charge of fraud regulation were only concerned about financial institutions such as banks. However, nowadays, the Fintech industry, online businesses, or platforms involving virtual assets can also be affected by similar criminal schemes. Regardless of the differences between the entities mentioned above, malicious activities affecting them share many common patterns. This dissertation's first goal is to compile and compare existing studies involving machine learning to detect and analyze suspicious transactions. The second goal is to synthesize methodologies from the last goal for tackling different use cases in an organized manner. Finally, the third goal is to assess the applicability of deep generative models for enhancing existing solutions. In the first part of the thesis, we propose an unsupervised methodology for detecting suspicious transactions applied to two case studies. One is related to transactions from a money remittance network, and the other is related to a novel payment network based on distributed ledger technologies. Anomaly detection algorithms are applied to rank user accounts based on recency, frequency, and monetary features. The results are manually validated by domain experts, confirming known scenarios and finding unexpected new cases. In the second part, we carry out an analogous analysis employing supervised methods, along with a case study where we classify Ethereum smart contracts into honeypots and non-honeypots. We take features from the source code, the transaction data, and the funds' flow characterization. The proposed classification models proved to generalize well to unseen honeypot instances and techniques and allowed us to characterize previously unknown techniques. In the third part, we analyze the challenges that tabular data brings into the domain of deep generative models, a particular type of data used to represent financial transactions in the previous two parts. We propose a new model architecture by adapting state-of-the-art methods to output multiple variables from mixed types distributions. Additionally, we extend the evaluation metrics used in the literature to the multi-output setting, and we show empirically that our approach outperforms the existing methods. Finally, in the last part, we extend the work from the third part by applying the presented models to enhance classification tasks from the second part, commonly containing a severe class imbalance. We introduce the multi-input architecture to expand models alongside our previously proposed multi-output architecture. We compare three techniques to sample from deep generative models defining a transparent and fair large-scale experimental protocol and interesting visual analysis tools. We showed that general machine learning detection and visualization techniques could help address the fraud detection domain's many challenges. In particular, deep generative models can add value to the classification task given the imbalanced nature of the fraudulent class, in exchange for implementation and time complexity. Future and promising applications for deep generative models include missing data imputation and sharing synthetic data or data generators preserving privacy constraints

    Working with Deep Generative Models and Tabular Data Imputation

    Get PDF
    Datasets with missing values are very common in industry applications. Missing data typically have a negative impact on machine learning models. With the rise of generative models in deep learning, recent studies proposed solutions to the problem of imputing missing values based various deep generative models. Previous experiments with Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) showed promising results in this domain. Initially, these results focused on imputation in image data, e.g. filling missing patches in images. Recent proposals addressed missing values in tabular data. For these data, the case for deep generative models seems to be less clear. In the process of providing a fair comparison of proposed methods, we uncover several issues when assessing the status quo: the use of under-specified and ambiguous dataset names, the large range of parameters and hyper-parameters to tune for each method, and the use of different metrics and evaluation methods

    Finding Suspicious Activities in Financial Transactions and Distributed Ledgers

    No full text
    Banks and financial institutions around the world must comply with several policies for the prevention of money laundering and in order to combat the financing of terrorism. Nowadays, there is a raise in the popularity of novel financial technologies such as digital currencies, social trading platforms and distributed ledger payments, but there is a lack of approaches to enforce the aforementioned regulations accordingly. Software tools are developed to detect suspicious transactions usually based on knowledge from experts in the domain, but as new criminal tactics emerge, detection mechanisms must be updated. Suspicious activity examples are scarce or nonexistent, hindering the use of supervised machine learning methods. In this paper, we describe a methodology for analyzing financial information without the use of ground truth. A user suspicion ranking is generated in order to facilitate human expert validation using an ensemble of anomaly detection algorithms. We apply our procedure over two case studies: one related to bank fund movements from a private company and the other concerning Ripple network transactions. We illustrate how both examples share interesting similarities and that the resulting user ranking leads to suspicious findings, showing that anomaly detection is a must in both traditional and modern payment systems

    A Data Science Approach for Honeypot Detection in Ethereum

    No full text
    Ethereum smart contracts have recently drawn a considerable amount of attention from the media, the financial industry and academia. With the increase in popularity, malicious users found new opportunities to profit by deceiving newcomers. Consequently, attackers started luring other attackers into contracts that seem to have exploitable flaws, but that actually contain a complex hidden trap that in the end benefits the contract creator. In the blockchain community, these contracts are known as honeypots. A recent study presented a tool called HONEYBADGER that uses symbolic execution to detect honeypots by analyzing contract bytecode. In this paper, we present a data science detection approach based foremost on the contract transaction behavior. We create a partition of all the possible cases of fund movements between the contract creator, the contract, the transaction sender and other participants. To this end, we add transaction aggregated features, such as the number of transactions and the corresponding mean value and other contract features, for example compilation information and source code length. We find that all aforementioned categories of features contain useful information for the detection of honeypots. Moreover, our approach allows us to detect new, previously undetected honeypots of already known techniques. We furthermore employ our method to test the detection of unknown honeypot techniques by sequentially removing one technique from the training set. We show that our method is capable of discovering the removed honeypot techniques. Finally, we discovered two new techniques that were previously not known

    On non-parametric models for detecting outages in the mobile network

    No full text
    The wireless/cellular communications network is composed of a complex set of interconnected computation units that form the mobile core network. The mobile core network is engineered to be fault tolerant and redundant; small errors that manifest themselves in the network are usually resolved automatically. However, some errors remain latent, and if discovered early enough can provide warnings to the network operator about a pending service outage. For mobile network operators, it is of high interest to detect these minor anomalies near real-time. In this work we use performance data from a 4G-LTE network carrier to train two parameter-free models. A first model relies on isolation forests, and the second is histogram based. The trained models represent the data characteristics for normal periods; new data is matched against the trained models to classify the new time period as being normal or abnormal. We show that the proposed methods can gauge the mobile network state with more subtlety than standard success/failure thresholds used in real-world networks today
    corecore