12 research outputs found

    Applying Supervised Machine Learning Algorithms for Fraud Detection in Anti-Money Laundering

    As international money transfers become more automated, it becomes easier for criminals to move money across borders in a fraction of a second, while it also becomes easier for regulators to inspect and monitor international money movement and identify unusual patterns. Machine learning algorithms may be a useful tool for addressing current problems in money laundering detection. This research empirically tested four machine learning algorithms (logistic regression, SVM, Random Forest, and ANN) using a synthetic dataset that closely matches regular transaction behavior. Of the four algorithms, Random Forest provided the best accuracy. The least accurate approach was the Artificial Neural Network (ANN).

    Randomized outlier detection with trees

    Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed in the literature, including the extended isolation forest (EIF) and the SCiForest. However, we find a lack of theoretical explanation of why IF, EIF, and SCiForest offer such good practical performance. In this paper, we present a theoretical framework that views these approaches from a distributional viewpoint. Using this viewpoint, we show that isolation-based approaches first accurately approximate the data distribution and then approximate the coefficients of mixture components using the average path length. Using this framework, we derive the generalized isolation forest (GIF), which also trains random isolation trees but combines them in a way that moves beyond the average path length. That is, GIF splits the data into multiple sub-spaces by sampling random splits, as the original IF variants do, and directly estimates the mixture coefficients of a mixture distribution to score the outlierness of entire regions of data. In an extensive evaluation, we compare GIF with 18 state-of-the-art outlier detection methods on 14 different datasets. We show that GIF outperforms three competing tree-based methods and is competitive with other nearest-neighbor approaches while having a lower runtime. Last, we highlight a use-case study that uses GIF to detect transaction fraud in financial data.

    Realistic Synthetic Financial Transactions for Anti-Money Laundering Models

    With the widespread digitization of finance and the increasing popularity of cryptocurrencies, the sophistication of fraud schemes devised by cybercriminals is growing. Money laundering, the movement of illicit funds to conceal their origins, can cross bank and national boundaries, producing complex transaction patterns. The UN estimates that 2-5% of global GDP, or $0.8-$2.0 trillion, is laundered globally each year. Unfortunately, real data to train machine learning models to detect laundering is generally not available, and previous synthetic data generators have had significant shortcomings. A realistic, standardized, publicly available benchmark is needed for comparing models and for the advancement of the area. To this end, this paper contributes a synthetic financial transaction dataset generator and a set of synthetically generated AML (Anti-Money Laundering) datasets. We have calibrated this agent-based generator to match real transactions as closely as possible and made the datasets public. We describe the generator in detail and demonstrate how the generated datasets can help compare different Graph Neural Networks in terms of their AML abilities. In a key way, using synthetic data in these comparisons can be even better than using real data: the ground truth labels are complete, whilst many laundering transactions in real data are never detected.

    A deep generative model framework for creating high quality synthetic transaction sequences

    Synthetic data are artificially generated data that closely model real-world measurements, and can be a valuable substitute for real data in domains where real data is costly to obtain or privacy concerns exist. Synthetic data has traditionally been generated using computational simulations, but deep generative models (DGMs) are increasingly used to generate high-quality synthetic data. In this thesis, we create a framework which employs DGMs for generating high-quality synthetic transaction sequences. Transaction sequences, such as those we may see in an online banking platform or credit card statement, are an important type of financial data for gaining insight into financial systems. However, research involving this type of data is typically limited to large financial institutions, as privacy concerns often prevent academic researchers from accessing this kind of data. Our work represents a step towards creating shareable synthetic transaction sequence datasets, containing data not connected to any actual humans. To achieve this goal, we begin by developing Banksformer, a DGM based on the transformer architecture, which is able to generate high-quality synthetic transaction sequences. Throughout the remainder of the thesis, we develop extensions to Banksformer that further improve the quality of the data we generate. Additionally, we perform an extensive examination of the quality of the synthetic data produced by our method, with both qualitative visualizations and quantitative metrics.

    Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods

    Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are detecting unknown attacks and reducing the false positive rate. The aim of this research work was to target both problems by leveraging four artificial intelligence techniques. The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns. A high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns. The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payments fraud, IoT malware traffic). A high accuracy was achieved in the three scenarios, even though the malicious activity differs significantly from one domain to the other. The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. The results showed that this technique can outperform other similar techniques. The fourth technique is based on an MLP neural network, and is applied to alert reduction in fraud prevention. This method automates manual reviews previously done by human experts, without significantly impacting accuracy.

    Effective Data Analytics and Security Strategies in Internal Audit Organizations

    The digitization of the corporate and regulatory environment presents an opportunity for internal audit organizations to change their audit techniques and increase their value to corporations. Audit functions have not kept pace with these advancements, as evidenced by the massive frauds in recent years, and current audit methodology does not robustly incorporate analytics and security of data. Grounded in agency theory, the purpose of this qualitative case study was to explore successful strategies business leaders use to implement data analytics and security for internal auditing and fraudulent activity. The participants comprised three audit leaders in Pennsylvania who effectively used data analytics and security strategies to promote quality audits and detect fraud. Data collected from semistructured interviews and company documents facilitated thematic analysis. Four themes emerged: data analytics framework, human capital, technology, and stakeholder engagement. A key recommendation for successful implementation of data analytics is a strategy to staff the audit function with the appropriate level of IT and finance skills, which would reduce the overall cost of implementation. The implications for positive social change include the potential to increase confidence in financial statements and the potential for job opportunities and support of economic growth in the local communities.

    Bridging the Domain-Gap in Computer Vision Tasks


    AHEAD: Adaptive Hierarchical Decomposition for Range Query under Local Differential Privacy

    To protect users' private data, local differential privacy (LDP) has been leveraged to provide privacy-preserving range queries, thus supporting further statistical analysis. However, existing LDP-based range query approaches are limited by their properties, i.e., collecting user data according to a pre-defined structure. These static frameworks incur excessive noise in the aggregated data, especially in the low privacy budget setting. In this work, we propose an Adaptive Hierarchical Decomposition (AHEAD) protocol, which adaptively and dynamically controls the built tree structure, so that the injected noise is well controlled for maintaining high utility. Furthermore, we derive a guideline for properly choosing parameters for AHEAD so that the overall utility can be consistently competitive while rigorously satisfying LDP. Leveraging multiple real and synthetic datasets, we extensively show the effectiveness of AHEAD in both low and high dimensional range query scenarios, as well as its advantages over the state-of-the-art methods. In addition, we provide a series of useful observations for deploying AHEAD in practice.