42 research outputs found
An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
Fraud detection has received considerable attention from academia and industry worldwide due to the growing prevalence of fraud. Insurance datasets are enormous, with skewed distributions and high dimensionality. Skewed class distribution and data volume are significant problems when analyzing insurance datasets, as they increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade the model's performance, so more sophisticated techniques are needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve prediction performance and the learner's understanding of the minority fraudulent classes compared to other sampling techniques. The proposed model works in three steps. First, PCA is applied to extract the necessary features and reduce the dimensionality of the data. Second, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Because oversampling introduces considerable noise, the third step performs a thorough cleaning of the balanced data using the Tomek Link algorithm to remove the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, making the data more precise and freer of noise. The resulting dataset is used by four different classification algorithms: Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks, using repeated 5-fold cross-validation.
Compared to the other classifiers, Neural Networks with FusedRCE had the highest average prediction rate, at 98.9%. Performance was also measured using F1 score, Precision, Recall, and AUC. The results show that the proposed method performed significantly better than other fraud detection approaches in health insurance, predicting more fraudulent cases with greater accuracy and a 3x speedup during training.
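The resampling core of the pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration of SMOTE-style interpolation and Tomek-link cleaning under simplifying assumptions (brute-force neighbour search, binary labels with 0 as the majority class); it is not the authors' FusedRCE implementation, and the function names are ours.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Step 2 (sketch): create synthetic minority samples by interpolating
    between a minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]    # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                          # interpolation factor in [0, 1)
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

def tomek_link_cleaning(X, y, majority=0):
    """Step 3 (sketch): find Tomek links (mutual nearest neighbours of
    opposite classes) and return indices of the majority members to drop."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                           # nearest neighbour of each point
    drop = {i if y[i] == majority else int(nn[i])
            for i in range(len(X))
            if nn[nn[i]] == i and y[i] != y[nn[i]]}
    return sorted(drop)
```

In the full pipeline these two steps would run after PCA, with SMOTE applied per k-means cluster rather than globally as here.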
Detection of Vehicle Insurance Claim Fraud: A Fraud Detection Use-Case for the Vehicle Insurance Industry
Insurance fraud has accompanied insurance since its inception, but these practices and their methods of operation have evolved over time, and the volume and frequency of insurance fraud incidents have recently increased. Vehicle insurance fraud involves conspiring to make false or exaggerated claims involving property damage or personal injuries following an accident. Common examples include staged accidents, where fraudsters deliberately "arrange" for accidents to occur; phantom passengers, where people who were not even at the scene of the accident claim to have suffered grievous injury; and false personal injury claims, where personal injuries are grossly exaggerated. The typical analysis of these datasets uses algorithms implemented in the Weka tool, operating on real data drawn from Oracle databases. This paper focuses on detecting vehicle fraud using machine learning algorithms; the final analysis, based on performance measures, revealed that J48 is more accurate than Random Forest, Random Tree, Bayes Net, and Naïve Bayes, while Random Tree has the lowest classification accuracy.
Interactive Learning in Decision Support
According to the Priberam dictionary of the Portuguese language, the concept of fraud can be defined as an "unlawful act, punishable by law, that seeks to deceive someone or some entity or to evade legal obligations". This topic has gained increasing relevance in recent times, with new cases frequently becoming public. There is therefore a continuous search for solutions that, in a first phase, prevent fraud from occurring or, if it has already occurred, detect it as quickly as possible. This represents a great challenge: first, technological evolution allows increasingly complex and effective fraudulent schemes to be devised, which are therefore harder to detect and stop. Moreover, data, and the information that can be extracted from it, is seen as increasingly important in the social context. Consequently, individuals and companies have begun to collect and store large amounts of data of all kinds. This is the concept of Big Data: large amounts of data of different types, with different degrees of complexity, produced at different rates and coming from different sources. This, in turn, has made it impractical to use traditional fraud detection technologies and algorithms, since they lack the capacity to process such a large and diverse set of data. It is in this context that the field of Machine Learning has been increasingly explored, in the search for solutions to this problem.
Usually, Machine Learning systems are seen as something fully autonomous. In recent years, however, interactive systems in which human experts actively contribute to the learning process have shown superior performance compared to fully automated systems. This can be seen in scenarios where there is a large set of data of diverse types and origins (Big Data), scenarios where the input is a data stream, or when there is a change in the context in which the data is embedded, a phenomenon known as concept drift.
With this in mind, this document describes a project on the use of interactive learning in decision support, addressing digital audits and, more concretely, the case of tax fraud detection. The proposed solution is the development of an interactive and dynamic Machine Learning system, in that one of its main objectives is to allow a human domain expert not only to contribute their knowledge to the system's learning process, but also to contribute new knowledge, by suggesting a new variable or a new value for an existing variable, at any time. The system must then be able to integrate the new knowledge autonomously and continue its normal operation. This is, in fact, the main innovative feature of the proposed solution, since in traditional Machine Learning systems this is not possible: they require a rigid dataset structure, and any such change would require restarting the entire model training process with the new dataset.

Machine Learning has been evolving rapidly over the past years, with new algorithms and approaches being devised to solve the challenges that the new properties of data pose. Specifically, algorithms must now learn continuously and in real time, from very large and possibly distributed datasets.
Usually, Machine Learning systems are seen as something fully automatic. Recently, however, interactive systems in which human experts actively contribute to the learning process have shown improved performance compared to fully automated ones. This may be the case in Big Data scenarios, in scenarios where the input is a data stream, or when there is concept drift.
In this paper, we present a system that learns and adapts in real time by continuously incorporating user feedback, in a fully autonomous way. Moreover, it allows users to manage variables (e.g. add, edit, remove), reflecting these changes on-the-fly in the Machine Learning pipeline. This paper describes the main functionalities of the system, which, despite being general-purpose, is being developed in the context of a project in the domain of financial fraud detection.
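To make the idea of on-the-fly variable management concrete, here is a toy sketch, not the system described in the paper: a hypothetical wrapper that refits a stand-in classifier whenever the expert adds a variable, so the pipeline keeps running on the new schema without a restart. All class and method names are ours.

```python
import numpy as np

class CentroidModel:
    """Trivial nearest-centroid classifier, a stand-in so the sketch
    stays self-contained."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

class AdaptivePipeline:
    """Hypothetical wrapper: the expert may add a variable at any time and
    the model is refit on the new schema without restarting the system."""
    def __init__(self):
        self.columns, self.model = [], None

    def fit(self, table):                      # table: dict column -> 1-D array
        self.columns = sorted(c for c in table if c != "label")
        X = np.column_stack([table[c] for c in self.columns])
        self.model = CentroidModel().fit(X, np.asarray(table["label"]))
        return self

    def add_variable(self, table, name, values):
        table[name] = values                   # expert contributes new knowledge
        return self.fit(table)                 # integrate it autonomously
```

A production system would of course retrain incrementally and validate the new variable, but the essential point, the schema not being frozen at training time, is the same.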
Imbalance Learning and Its Application on Medical Datasets
To gain more valuable information from the increasingly large amounts of data available, data mining has been a hot topic attracting growing attention over the past two decades. One of the challenges in data mining is imbalance learning, which refers to learning from imbalanced datasets. An imbalanced dataset is dominated by some classes (the majority) while other classes are under-represented (the minority). Imbalanced datasets degrade the learning ability of traditional methods, which are designed on the assumption that all classes are balanced and have equal misclassification costs, leading to poor performance on the minority classes. This phenomenon is usually called the class imbalance problem. However, it is usually the minority classes that are of more interest and importance, such as sick cases in a medical dataset. Additionally, traditional methods are optimized to achieve maximum accuracy, which is not a suitable measure of performance on imbalanced datasets. From the view of the data space, class imbalance can be classified as extrinsic or intrinsic. Extrinsic imbalance is caused by external factors, such as data transmission or data storage, while intrinsic imbalance means the dataset is inherently imbalanced due to its nature. As extrinsic imbalance can be fixed by collecting more samples, this thesis mainly focuses on two scenarios of intrinsic imbalance: machine learning for imbalanced structured datasets and deep learning for imbalanced image datasets.
Solutions to the class imbalance problem are known as imbalance learning methods and can be grouped into data-level methods (re-sampling), algorithm-level methods (re-weighting), and hybrid methods. Data-level methods modify the class distribution of the training dataset to create balanced training sets; typical examples are over-sampling and under-sampling. Instead of modifying the data distribution, algorithm-level methods adjust the misclassification costs to alleviate the class imbalance problem; one typical example is cost-sensitive methods. Hybrid methods usually combine data-level and algorithm-level methods. However, existing imbalance learning methods encounter various problems. Over-sampling methods increase the minority samples to create balanced training sets, which might lead the trained model to overfit the minority class. Under-sampling methods create balanced training sets by discarding majority samples, which leads to information loss and poor performance of the trained model. Cost-sensitive methods usually need assistance from domain experts to define the misclassification costs, which are task-specific, so their generalization ability is poor. In particular, for deep learning methods under class imbalance, re-sampling methods may introduce a large computational cost, and existing re-weighting methods can lead to poor performance. The objective of this dissertation is to understand feature differences under class imbalance and to improve classification performance on structured and image datasets. This thesis proposes two machine learning methods for imbalanced structured datasets and one deep learning method for imbalanced image datasets. The proposed methods are evaluated on several medical datasets, which are intrinsically imbalanced.
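The simplest data-level baseline mentioned above is easy to state in code. The sketch below shows random over-sampling (duplicating minority samples with replacement until all classes match the largest one); under-sampling would instead discard majority samples. This is a generic illustration, not a method from the thesis.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate samples of the smaller classes (with replacement) until
    every class matches the size of the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.flatnonzero(y == c)
        idx.append(c_idx)                    # keep all original samples
        if n < n_max:                        # top up by resampling with replacement
            idx.append(rng.choice(c_idx, size=n_max - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```

Because the duplicates are exact copies, this baseline illustrates the overfitting risk the text describes: the model may memorise repeated minority points.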
Firstly, we study the feature difference between the majority class and the minority class of an imbalanced medical dataset collected from a Chinese hospital. After data cleaning and structuring, we obtained 3292 kidney stone cases treated by percutaneous nephrolithotomy from 2012 to 2019. There are 651 (19.78%) cases with postoperative complications, which makes complication prediction an imbalanced classification task. We propose a sampling-based method, SMOTE-XGBoost, and implement it to build a postoperative complication prediction model. Experimental results show that the proposed method outperforms classic machine learning methods. Furthermore, traditional prediction models for percutaneous nephrolithotomy are designed to predict kidney stone status and overlook complication-related features, which can degrade their performance on complication prediction tasks. To this end, we merge more features into the proposed sampling-based method and further improve the classification performance. Overall, SMOTE-XGBoost achieves an AUC of 0.7077, which is 41.54% higher than that of S.T.O.N.E. nephrolithometry, a traditional prediction model for percutaneous nephrolithotomy.
After reviewing the existing machine learning methods under class imbalance, we propose a novel ensemble learning approach called Multiple bAlance Subset Stacking (MASS). MASS first cuts the majority class into multiple subsets of the size of the minority set, and combines each majority subset with the minority set to form one balanced subset. In this way, MASS overcomes the problem of information loss because it does not discard any majority sample. Each balanced subset is used to train one base classifier. Then, the original dataset is fed to all the trained base classifiers, whose outputs are used to generate the stacking dataset. A stack model is trained on the stacking dataset to obtain the optimal weights for the base classifiers. Because the stacking dataset keeps the same labels as the original dataset, the overfitting problem is avoided. Finally, we obtain an ensembled strong model based on the trained base classifiers and the stacking model. Extensive experimental results on three medical datasets show that MASS outperforms baseline methods. The robustness of MASS is demonstrated by implementing different base classifiers. We also design a parallel version of MASS to reduce the training time cost. The speedup analysis shows that Parallel MASS can greatly reduce training time when applied to large datasets; in our experiments, Parallel MASS reduced training time by up to 101.8% compared with MASS.
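The subset-construction step of MASS, as described above, can be sketched as follows. This is our reading of the description, not the authors' code: the majority class is cut into minority-sized chunks, and each chunk plus the full minority set forms one balanced subset; each subset would then train one base classifier, whose predictions on the original dataset form the stacking dataset for the meta-model.

```python
import numpy as np

def balanced_subsets(X, y, minority=1, seed=0):
    """Cut the majority class into chunks the size of the minority set and
    pair each chunk with the full minority set -> one balanced subset each.
    Leftover majority samples (if the split is uneven) are dropped here."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = rng.permutation(np.flatnonzero(y != minority))
    k = len(min_idx)
    subsets = []
    for start in range(0, len(maj_idx) - k + 1, k):
        idx = np.concatenate([maj_idx[start:start + k], min_idx])
        subsets.append((X[idx], y[idx]))   # train one base classifier per subset
    return subsets
```

Note that every majority sample (up to the remainder) appears in exactly one subset, which is how MASS avoids the information loss of plain under-sampling.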
When it comes to the class imbalance problem in image datasets, existing imbalance learning methods suffer from large training costs and poor performance. After introducing the problems of applying resampling methods to image classification tasks, we demonstrate the issues of re-weighting strategies that use class frequencies through experimental results on a medical image dataset. We propose a novel re-weighting method, Hardness Aware Dynamic (HAD) loss, to solve the class imbalance problem in image datasets. After each training epoch of the deep neural network, we compute the classification hardness of each class, and in the next epoch assign higher class weights to the classes with large classification hardness values, and vice versa. In this way, HAD tunes the weight of each sample in the loss function dynamically during the training process. The experimental results show that HAD significantly outperforms state-of-the-art methods. Moreover, HAD greatly improves the classification accuracy of minority classes while making only a small compromise in majority class accuracy. In particular, HAD loss improves average precision by 10.04% compared with the best baseline, Focal loss, on the HAM10000 dataset.
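A minimal sketch of the hardness-aware re-weighting idea: after an epoch, estimate each class's hardness and turn that into class weights for the next epoch's loss. The exact hardness measure and weighting scheme of HAD are not specified in the abstract, so the formula below (hardness as per-class error rate, normalized so weights average to 1) is an illustrative assumption.

```python
import numpy as np

def class_weights_from_hardness(y_true, y_pred, n_classes, eps=1e-6):
    """Assumed hardness measure: a class's error rate in the last epoch.
    Harder classes get proportionally larger loss weights next epoch."""
    hardness = np.zeros(n_classes)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            hardness[c] = 1.0 - (y_pred[mask] == c).mean()
    w = hardness + eps                      # keep perfectly-learned classes non-zero
    return w * n_classes / w.sum()          # normalise: weights average to 1
```

In a training loop these values would be passed as per-class weights to the loss (e.g. weighted cross-entropy) and recomputed after every epoch, which is what makes the weighting dynamic.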
Finally, I conclude this dissertation with our contributions to imbalance learning and provide an overview of potential directions for future research, which include extensions of the three proposed methods, development of task-specific algorithms, and addressing the challenges of within-class imbalance.
Monitoring and optimization of an autonomous learning system
Master's dissertation in Informatics Engineering. In recent years, the number of Machine Learning algorithms and their parameters has increased significantly.
This allows for more accurate models to be found, but it also increases the complexity of the task of training a
model, as the search space expands significantly.
As datasets keep growing in size, traditional approaches based on extensive search start to become costly
in terms of computational resources and time, especially in data streaming scenarios. With this growth, new
challenges in Machine Learning started to appear. The speed at which data arrives and different ways of storing
data are forcing organizations to address and explore new ways of adapting fast enough so their ML models
don’t become obsolete.
This dissertation aims to develop an approach based on meta-learning that tackles two main challenges: predicting the performance metrics of a future model and recommending the best algorithm/configuration for training
a model for a specific Machine Learning problem. Throughout this dissertation, all the study objectives and
questions, along with the relevant contextualization, will be presented.
The proposed solution, when compared to an AutoML approach, is up to 130x faster and only 2% worse in terms
of average model quality, showing it is a good solution for scenarios in which models need to be updated regularly,
such as in streaming scenarios with Big Data, in which some accuracy can be traded for a much shorter model
training time.
Can adverse childhood experiences predict chronic health conditions? Development of trauma-informed, explainable machine learning models
Introduction: Decades of research have established the association between adverse childhood experiences (ACEs) and adult onset of chronic diseases, influenced by health behaviors and social determinants of health (SDoH). Machine Learning (ML) is a powerful tool for computing these complex associations and accurately predicting chronic health conditions.
Methods: Using the 2021 Behavioral Risk Factor Surveillance Survey, we developed several ML models (random forest, logistic regression, support vector machine, Naïve Bayes, and K-Nearest Neighbor) over data from a sample of 52,268 respondents. We predicted 13 chronic health conditions based on ACE history, health behaviors, SDoH, and demographics. We further assessed each variable's importance in outcome prediction for model interpretability. We evaluated model performance via the Area Under the Curve (AUC) score.
Results: With the inclusion of data on ACEs, our models outperformed or demonstrated similar accuracy to existing models in the literature that used SDoH to predict health outcomes. The most accurate models predicted diabetes, pulmonary diseases, and heart attacks. The random forest model was the most effective for diabetes (AUC = 0.784) and heart attacks (AUC = 0.732), and the logistic regression model most accurately predicted pulmonary diseases (AUC = 0.753). The strongest predictors across models were age, ever having monitored blood sugar or blood pressure, count of monitoring behaviors for blood sugar or blood pressure, BMI, time of last cholesterol check, employment status, income, count of vaccines received, health insurance status, and total ACEs. A cumulative measure of ACEs was a stronger predictor than individual ACEs.
Discussion: Our models can provide an interpretable, trauma-informed framework to identify and intervene with at-risk individuals early to prevent chronic health conditions and address their inequalities in the U.S.
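Since AUC is the headline metric in this and several of the abstracts above, it may help to recall how it is computed. The sketch below uses the Mann-Whitney formulation: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting one half.

```python
import numpy as np

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (Mann-Whitney statistic); ties count one half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # positive scored above negative
    ties = (pos[:, None] == neg[None, :]).sum()     # exact score ties
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of 0.784, as reported for the diabetes model, therefore means a randomly drawn case outranks a randomly drawn non-case about 78% of the time; 0.5 is chance level.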
Sequential learning and shared representation for sensor-based human activity recognition
Human activity recognition based on sensor data has rapidly attracted considerable research attention due to its wide range of applications, including senior monitoring, rehabilitation, and healthcare. These applications require accurate human activity recognition systems to track and understand human behaviour. Yet developing such accurate systems poses critical challenges, as models struggle to learn from temporal sequential sensor data due to the variations and complexity of human activities. The main challenges in developing human activity recognition are accuracy and robustness, owing to the diversity and similarity of human activities, the skewed distribution of human activities, and the lack of a rich quantity of well-curated human activity data. This thesis addresses these challenges by developing robust deep sequential learning models to boost the performance of human activity recognition, handle imbalanced class problems, and reduce the need for a large amount of annotated data.
This thesis develops a set of new networks specifically designed for the challenges in building better HAR systems compared to the existing methods. First, this thesis proposes robust and sequential deep learning models to accurately recognise human activities and boost the performance of the human activity recognition systems against the current methods from smart home and wearable sensors collected data. The proposed methods integrate convolutional neural networks and different attention mechanisms to efficiently process human activity data and capture significant information for recognising human activities.
Next, the thesis proposes methods to address the imbalanced class problems of human activity recognition systems. Joint learning of sequential deep learning algorithms, i.e., long short-term memory and convolutional neural networks, is proposed to boost the performance of human activity recognition, particularly for infrequent human activities. In addition, we propose a data-level solution to imbalanced class problems by extending the synthetic minority over-sampling technique (SMOTE), which we name iSMOTE, to accurately label the generated synthetic samples. These methods have enhanced the results for minority human activities and outperformed the current state-of-the-art methods.
In this thesis, sequential deep learning networks are proposed to boost the performance of human activity recognition while reducing the dependency on a rich quantity of well-curated human activity data through transfer learning techniques. A multi-domain learning network is proposed to process data from multiple domains, transfer knowledge across different but related domains of human activities, and mitigate isolated learning paradigms using a shared representation. The first advantage of the proposed method is that it reduces the need and effort for labelled data in the target domain. The proposed network uses training data of the target domain with restricted size and the full training data of the source domain, yet provides better performance than using the full training data in a single-domain setting. Secondly, the proposed method can be used for small datasets. Lastly, the proposed multi-domain learning network reduces training time by rendering a generic model for related domains, compared to fitting a model for each domain separately.
In addition, the thesis proposes a self-supervised model to reduce the need for a considerable amount of annotated human activity data. The self-supervised method is pre-trained on unlabeled data and fine-tuned on a small amount of labelled data for supervised learning. The proposed self-supervised pre-training network renders human activity representations that are semantically meaningful and provides a good initialization for supervised fine-tuning. The developed network enhances the performance of human activity recognition while minimizing the need for a considerable amount of labelled data.
The proposed models are evaluated on multiple public benchmark datasets of sensor-based human activities and compared with existing state-of-the-art methods. The experimental results show that the proposed networks boost the performance of human activity recognition systems.
A Learning Health System for Radiation Oncology
The proposed research aims to address the challenges faced by clinical data science researchers in radiation oncology accessing, integrating, and analyzing heterogeneous data from various sources. The research presents a scalable intelligent infrastructure, called the Health Information Gateway and Exchange (HINGE), which captures and structures data from multiple sources into a knowledge base with semantically interlinked entities. This infrastructure enables researchers to mine novel associations and gather relevant knowledge for personalized clinical outcomes.
The dissertation discusses the design framework and implementation of HINGE, which abstracts structured data from treatment planning systems, treatment management systems, and electronic health records. It utilizes disease-specific smart templates for capturing clinical information in a discrete manner. HINGE performs data extraction, aggregation, and quality and outcome assessment functions automatically, connecting seamlessly with local IT/medical infrastructure.
Furthermore, the research presents a knowledge graph-based approach to map radiotherapy data to an ontology-based data repository using FAIR (Findable, Accessible, Interoperable, Reusable) concepts. This approach ensures that the data is easily discoverable and accessible for clinical decision support systems. The dissertation explores the ETL (Extract, Transform, Load) process, data model frameworks, ontologies, and provides a real-world clinical use case for this data mapping.
To improve the efficiency of retrieving information from large clinical datasets, a search engine based on ontology-based keyword searching and synonym-based term matching was developed. The hierarchical nature of ontologies is leveraged to retrieve patient records based on parent and child classes. Additionally, patient similarity analysis is conducted using vector embedding models (Word2Vec, Doc2Vec, GloVe, and FastText) to identify similar patients based on text corpus creation methods. Results from the analysis using these models are presented.
The implementation of a learning health system for predicting radiation pneumonitis following stereotactic body radiotherapy is also discussed. 3D convolutional neural networks (CNNs) are utilized with radiographic and dosimetric datasets to predict the likelihood of radiation pneumonitis. DenseNet-121 and ResNet-50 models are employed for this study, along with integrated gradient techniques to identify salient regions within the input 3D image dataset. The predictive performance of the 3D CNN models is evaluated based on clinical outcomes.
Overall, the proposed Learning Health System provides a comprehensive solution for capturing, integrating, and analyzing heterogeneous data in a knowledge base. It offers researchers the ability to extract valuable insights and associations from diverse sources, ultimately leading to improved clinical outcomes. This work can serve as a model for implementing LHS in other medical specialties, advancing personalized and data-driven medicine.
Rationalising health care provision under market incentives: experimental evidence from South Africa
Unnecessary medical treatments place a significant burden on health systems striving for universal health coverage (UHC). This thesis studies inappropriate treatment incentives in the private sector in South Africa, where plans to implement a national health insurance system (NHI) foresee the contracting of private physicians to deliver publicly funded health care. Private providers are increasingly recognized as necessary partners for UHC success in many low- and middle-income countries (LMICs). However, aligning the incentives of these actors with UHC and public health goals requires a better understanding of incentive effects in these settings.
I conduct two field experiments with incognito standardized patients (SPs), to both evaluate appropriate care provision and experimentally vary the treatment incentives facing private physicians. First, I run a within-subject experiment with 89 private primary care physicians (GPs) in Johannesburg, to investigate the causal impact of improving patients' financial protection (insurance cover) on physicians' quality of care delivery. The results suggest that more insured patients receive a higher level of visible clinical effort, but a lower level of technical care quality, including a higher likelihood of inappropriate antibiotic treatment. Second, I use data from the same experiment to evaluate the impact of patient insurance on the quantity and costs of care. I find that more insured patients are more likely to receive unnecessary diagnostic tests and treatment procedures, and receive more, and more expensive, branded drugs, resulting in significantly higher care costs. The results on antibiotic treatment and drug treatment quantity and costs occurred despite the absence of any financial incentives attached to drug prescribing for GPs, which suggests the presence of alternative motives for physicians' treatment decisions that might vary with patient insurance, including intrinsic or altruistic motives. Third, I explore the scope for leveraging such intrinsic motivations to improve physicians' treatment choices. I conduct a randomized (between-subject) experiment with 80 GPs, to evaluate the impact of intrinsic, informational incentives from private performance audit and feedback (A&F) on physicians' antibiotic treatment choices and care costs. The findings suggest that private A&F can significantly reduce the likelihood of inappropriate antibiotic treatment for common viral infections that present in primary care, without simultaneously reducing appropriate antibiotic use for bacterial infections or increasing other inappropriate drug treatments.
However, improved performance on antibiotic use does not coincide with significantly lower treatment costs or any improvement in measured diagnostic effort or accuracy. There is indicative evidence that prescribing norms and perceived patient expectations may play an important role in mediating private physicians' treatment choices in all three empirical chapters.