
    A Comprehensive Survey on Rare Event Prediction

    Rare event prediction involves identifying and forecasting events that occur with low probability using machine learning and data analysis. Because of imbalanced data distributions, in which common events vastly outnumber rare ones, it requires specialized methods at each step of the machine learning pipeline, from data processing to algorithms to evaluation protocols. Predicting the occurrence of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistics and machine learning. This paper comprehensively reviews current approaches for rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature, highlight the challenges of predicting rare events, and suggest potential research directions that can help guide practitioners and researchers.
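The survey's point about specialized handling at each pipeline step can be illustrated with a minimal, hypothetical scikit-learn sketch; the dataset and model choices here are assumptions for illustration, not taken from the paper. Class weighting is one of the simplest algorithmic adjustments for a rare positive class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic "rare event" problem: roughly 1% positives (illustrative only).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)

# Plain model vs. one whose loss reweights the rare class.
plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Recall on the rare class typically improves with class weighting,
# at the cost of more false positives.
print(recall_score(y, plain.predict(X)),
      recall_score(y, weighted.predict(X)))
```

Evaluation choices matter just as much: accuracy is near-meaningless at a 1% base rate, which is one reason the survey treats evaluation protocols as a dimension of their own.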

    Show Me Your Claims and I'll Tell You Your Offenses: Machine Learning-Based Decision Support for Fraud Detection on Medical Claim Data

    Health insurance claim fraud is a serious issue for the healthcare industry, as it drives up costs and inefficiency. Claim fraud must therefore be detected effectively to provide economical, high-quality healthcare. In practice, however, fraud detection is performed mainly by domain experts, resulting in significant cost and resource consumption. This paper presents a novel Convolutional Neural Network-based fraud detection approach that was developed, implemented, and evaluated on Medicare Part B records. The model aids manual fraud detection by classifying potential types of fraud, which can then be analyzed specifically. Our model is the first of its kind for Medicare data, yields an AUC of 0.7 for selected fraud types, and provides an applicable method for medical claim fraud detection.

    Exploration of Data Science Toolbox and Predictive Models to Detect and Prevent Medicare Fraud, Waste, and Abuse

    The Federal Department of Health and Human Services spends approximately $830 billion annually on Medicare, of which an estimated $30 to $110 billion is some form of fraud, waste, or abuse (FWA). Despite the Federal Government's ongoing auditing efforts, fraud, waste, and abuse remain rampant and require modern machine learning approaches to generalize and detect such patterns. New and novel machine learning algorithms offer hope for detecting fraud, waste, and abuse. Publicly accessible datasets compiled by the Centers for Medicare & Medicaid Services (CMS) contain vast quantities of structured data. This data, coupled with industry-standardized billing codes, provides many opportunities for the application of machine learning to fraud, waste, and abuse detection. This research aims to develop a new model utilizing machine learning to generalize the patterns of fraud, waste, and abuse in Medicare. This task is accomplished by linking provider and payment data with the list of excluded individuals and entities to train an Isolation Forest algorithm on previously fraudulent behavior. Results indicate anomalous instances occurring in 0.2% of all analyzed claims, demonstrating machine learning models' predictive ability to detect FWA.
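An Isolation Forest run of the kind the study describes can be sketched as follows; the two-feature claim representation and the synthetic numbers are illustrative assumptions, not the study's actual data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative claim features: (payment amount, service volume).
rng = np.random.default_rng(42)
normal = rng.normal(loc=[100.0, 5.0], scale=[20.0, 1.0], size=(1000, 2))
outliers = np.array([[900.0, 40.0], [850.0, 35.0]])  # inflated claims
claims = np.vstack([normal, outliers])

# contamination sets the expected anomaly fraction (0.2%, as in the abstract).
model = IsolationForest(contamination=0.002, random_state=0).fit(claims)
labels = model.predict(claims)  # -1 = anomalous, 1 = normal

print((labels == -1).sum())
```

Because the forest isolates points that are few and different, the two inflated claims are scored as the most anomalous and fall under the contamination threshold.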

    Unsupervised learning for anomaly detection in Australian medical payment data

    Fraudulent or wasteful medical insurance claims made by health care providers are costly for insurers. Typically, OECD healthcare organisations lose 3-8% of total expenditure to fraud. As Australia's universal public health insurer, Medicare Australia spends approximately A$34 billion per annum on the Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme, so wasted spending of A$1–2.7 billion could be expected. However, fewer than 1% of claims to Medicare Australia are detected as fraudulent, below international benchmarks. Variation is common in medicine, and health conditions, along with their presentation and treatment, are heterogeneous by nature. Increasing volumes of data and rapidly changing patterns bring challenges that require novel solutions. Machine learning and data mining are becoming commonplace in this field, but no gold standard is yet available. In this project, requirements are developed for real-world application to compliance analytics at the Australian Government Department of Health and Aged Care (DoH), covering: unsupervised learning; problem generalisation; human interpretability; context discovery; and cost prediction. Three novel methods are presented which rank providers by potentially recoverable costs. These methods use association analysis, topic modelling, and sequential pattern mining to provide interpretable, expert-editable models of typical provider claims. Anomalous providers are identified through comparison to the typical models, using metrics based on the costs of excess or upgraded services. Domain knowledge is incorporated in a machine-friendly way in two of the methods through the use of the MBS as an ontology. Validation by subject-matter experts and comparison to existing techniques shows that the methods perform well.
The methods are implemented in a software framework which enables rapid prototyping and quality assurance. The code is implemented at the DoH, and further applications as decision-support systems are in progress. The developed requirements will apply to future work in this field.

    Literature Review of Credit Card Fraud Detection with Machine Learning

    This thesis presents a comprehensive examination of the field of credit card fraud detection, aiming to offer a thorough understanding of its evolution and nuances. Through a synthesis of various studies, methodologies, and technologies, this research strives to provide a holistic perspective on the subject, shedding light on both its strengths and limitations. In the realm of credit card fraud detection, a range of methods and combinations have been explored to enhance effectiveness. This research reviews several noteworthy approaches, including Genetic Algorithms (GA) coupled with Random Forest (GA-RF), Decision Trees (GA-DT), and Artificial Neural Networks (GA-ANN). Additionally, the study delves into outlier score definitions, considering different levels of granularity, and their integration into a supervised framework. Moreover, it discusses the utilization of Artificial Neural Networks (ANNs) in federated learning and the incorporation of Generative Adversarial Networks (GANs) with Modified Focal Loss and Random Forest as the base machine learning algorithm. These methods, either independently or in combination, represent some of the most recent developments in credit card fraud detection, showcasing their potential to address the evolving landscape of digital financial threats. The scope of this literature review encompasses a wide range of sources, including research articles, academic papers, and industry reports, spanning multiple disciplines such as computer science, data science, artificial intelligence, and cybersecurity. The review is organized to guide readers through the progression of credit card fraud detection, commencing with foundational concepts and advancing toward the most recent developments. In today's digital financial landscape, the need for robust defense mechanisms against credit card fraud is undeniable. 
By critically assessing the existing literature, recognizing emerging trends, and evaluating the effectiveness of various detection methods, this thesis aims to contribute to the body of knowledge in the credit card fraud detection domain. The insights gleaned from this comprehensive review will benefit researchers and practitioners and serve as a roadmap for building more adaptive and resilient fraud detection systems. As the ongoing battle between fraudsters and defenders in the financial realm continues to evolve, a deep understanding of the current landscape becomes an asset. This literature review aspires to equip readers with the insights needed to address the dynamic challenges of credit card fraud detection, fostering innovation and resilience in the pursuit of secure and trustworthy financial transactions.

    Learning With An Insufficient Supply Of Data Via Knowledge Transfer And Sharing

    As machine learning methods extend to more complex and diverse sets of problems, situations arise in which the complexity and scarcity of the data leave the information source inadequate for generating a representative hypothesis. Learning from multiple sources of data is a promising research direction as researchers leverage ever more diverse sources of information. Since data is not readily available, knowledge has to be transferred from other sources, and new methods (both supervised and unsupervised) have to be developed to selectively share and transfer knowledge. In this dissertation, we present both supervised and unsupervised techniques to tackle problems where learning algorithms cannot generalize and require an extension to leverage knowledge from different sources of data. Knowledge transfer is a difficult problem, as diverse sources of data can overwhelm each individual dataset's distribution, and a careful set of transformations has to be applied to increase the relevant knowledge, at the risk of biasing a dataset's distribution and inducing negative transfer that can degrade a learner's performance. We give an overview of the issues encountered when the learning dataset does not have a sufficient supply of training examples. We categorize the structure of small datasets and highlight the need for further research. We present an instance-transfer supervised classification algorithm to improve classification performance in a target dataset via knowledge transfer from an auxiliary dataset. The improved classification performance of our algorithm is demonstrated with several real-world experiments. We extend the instance-transfer paradigm to supervised classification with 'Absolute Rarity', where a dataset has an insufficient supply of training examples and a skewed class distribution.
We demonstrate one solution with a transfer learning approach and another with an imbalanced learning approach, and we demonstrate the effectiveness of our algorithms on several real-world text and demographics classification problems (among others). We present an unsupervised multi-task clustering algorithm in which several small datasets are simultaneously clustered and knowledge is transferred between the datasets to improve clustering performance on each individual dataset, and we demonstrate the improved clustering performance with an extensive set of experiments.

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is a significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and an uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling even more difficult. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Most traditional approaches to classification perform poorly on imperfect data. We propose using CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust to imperfect data. Furthermore, we explore the best performance measures for handling imperfect data and address real problems in quality control and business analytics.
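The cost-sensitive SVM idea can be sketched minimally with scikit-learn; the synthetic data and the specific cost ratio below are illustrative assumptions, not the paper's experimental setup:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# Imbalanced toy problem: about 5% minority class (illustrative only).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=1)

# class_weight raises the misclassification cost of the minority class,
# which is the essence of cost-sensitive learning with an SVM.
standard = SVC().fit(X, y)
cost_sensitive = SVC(class_weight={0: 1, 1: 10}).fit(X, y)

print(recall_score(y, standard.predict(X)),
      recall_score(y, cost_sensitive.predict(X)))
```

The cost ratio shifts the decision boundary away from the minority class, trading some false positives for fewer missed minority instances; in practice the ratio is tuned against a cost-aware measure rather than accuracy.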

    Learning To Detect And Localize Anomaly Using Thin Plate Spline Transformation

    Detecting and localizing anomalies in vision applications is a topic of interest in computer vision and machine learning. Detecting defective products on a factory production line by analyzing images of final products, recognizing injuries and diseases in medical images, and video surveillance are some of the applications in which irregular patterns that differ significantly from normal ones are detected by appropriate anomaly detection methods. Although much recent research has focused on developing data-driven methods to find visual defects, these methods face challenges arising from inherent properties of abnormalities such as unknownness, rarity, and diversity. Because any type of irregularity can be considered an anomaly, and anomalies are unknown until they occur in the real world, it is challenging to develop a generalized model that can detect all types of unknown anomalies precisely. The main goal of this thesis is to investigate these challenges in more detail and to develop a generalized model that can detect and locate various types of subtle and large-size anomalies. We find that using simulated anomalies that resemble real defects during training helps to develop more generalized detectors. The most important requirement for creating artificial anomalies is that they be as similar to real defects as possible and random in size and location, to meet the diversity and unknownness properties of real anomalies.
In this regard, we develop a two-stage self-supervised learning approach. In the first stage, a pre-trained neural network is optimized with the help of artificial anomalies of various sizes and shapes, created by applying a random thin-plate spline (TPS) transformation to prominent areas of normal images selected by the Canny edge detector; in the second stage, the optimized model is used to separate anomalous data from normal data. We evaluate the proposed method on the MVTec dataset and find that it outperforms previous anomaly detection methods, owing to the ability of the TPS transformation to simulate various types of fine-grained and large-size defects that are monolithic in their borders and more similar to real defects. Utilizing the Canny edge detector also helps the method create anomalies in random prominent areas of an image instead of background areas, which itself leads to better results. Moreover, the method is computationally efficient in both the training and testing phases, since it fine-tunes a pre-trained model instead of training one from scratch. These features make our approach a suitable candidate for detecting and localizing anomalies in real-world applications, as we discuss in this thesis.
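The anomaly-simulation step can be sketched with SciPy's thin-plate-spline interpolator. This is a simplified stand-in (a random image, hand-picked control points, and nearest-neighbour sampling), not the thesis's actual implementation, which additionally selects regions with the Canny edge detector:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Stand-in "normal" image (illustrative; a real pipeline would load one).
rng = np.random.default_rng(0)
img = rng.random((64, 64)).astype(np.float32)

# Control points and randomly perturbed targets define the TPS warp;
# the random displacements make the simulated defect's size/shape vary.
src = np.array([[16, 16], [16, 48], [48, 16], [48, 48], [32, 32]], dtype=float)
dst = src + rng.normal(scale=3.0, size=src.shape)

# Fit a smooth TPS mapping src -> dst and apply it to every pixel coordinate.
tps = RBFInterpolator(src, dst, kernel="thin_plate_spline")
yy, xx = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
coords = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
warped = tps(coords)

# Resample the image at the warped coordinates (nearest-neighbour for brevity)
# to obtain a locally distorted, "anomalous" version of the normal image.
wy = np.clip(np.round(warped[:, 0]).astype(int), 0, 63)
wx = np.clip(np.round(warped[:, 1]).astype(int), 0, 63)
anomalous = img[wy, wx].reshape(64, 64)
```

Pairs of (normal, anomalous) images produced this way can then supervise the first-stage fine-tuning described above.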

    Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review

    This systematic literature review comprehensively examines the application of Large Language Models (LLMs) in forecasting and anomaly detection, highlighting the current state of research, inherent challenges, and prospective future directions. LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous behavior across various domains. However, this review identifies several critical challenges that impede their broader adoption and effectiveness, including the reliance on vast historical datasets, issues with generalizability across different contexts, the phenomenon of model hallucination, limitations within the models' knowledge boundaries, and the substantial computational resources required. Through detailed analysis, this review discusses potential solutions and strategies to overcome these obstacles, such as integrating multimodal data, advancements in learning methodologies, and emphasizing model explainability and computational efficiency. Moreover, this review outlines critical trends that are likely to shape the evolution of LLMs in these fields, including the push toward real-time processing, the importance of sustainable modeling practices, and the value of interdisciplinary collaboration. In conclusion, this review underscores the transformative impact LLMs could have on forecasting and anomaly detection while emphasizing the need for continuous innovation, ethical considerations, and practical solutions to realize their full potential.