
    Boosting insights in insurance tariff plans with tree-based machine learning methods

    Pricing actuaries typically operate within the framework of generalized linear models (GLMs). With the upswing of data analytics, our study focuses on machine learning methods to develop full tariff plans built from both the frequency and severity of claims. We adapt the loss functions used in the algorithms so that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros and varying exposure on the frequency side, combined with scarce but potentially long-tailed data on the severity side. A key requirement is the need for transparent and interpretable pricing models which are easily explainable to all stakeholders. We therefore focus on machine learning with decision trees: starting from simple regression trees, we work towards more advanced ensembles such as random forests and boosted trees. We show how to choose the optimal tuning parameters for these models in an elaborate cross-validation scheme, present visualization tools to obtain insights from the resulting models, and evaluate the economic value of these new modeling approaches. Boosted trees outperform the classical GLMs, allowing the insurer to form profitable portfolios and to guard against potential adverse risk selection.
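The loss adaptation the abstract describes for the frequency side can be sketched in a few lines: the Poisson deviance with policy exposure entering as an offset, so that expected claim counts scale with the time insured. The data and the constant-rate baseline model below are illustrative, not taken from the study.

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance; the y*log(y/mu) term is defined as 0 when y == 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

# Observed claim counts with varying exposure (fraction of the year insured):
# excess zeros and short exposures are typical of insurance frequency data.
counts = np.array([0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0])
exposure = np.array([1.0, 0.5, 1.0, 1.0, 0.25, 1.0, 0.75])

# A model predicts a claim rate per unit of exposure; the expected count is
# rate * exposure, so exposure acts as a multiplicative offset in the loss.
rate = np.full_like(exposure, counts.sum() / exposure.sum())
dev = poisson_deviance(counts, rate * exposure)
```

In a tree ensemble, this deviance (rather than squared error) would be the loss minimized at each boosting step, which is the kind of adaptation the study makes.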

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
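Of the five challenges the review names, class imbalance is the simplest to illustrate: one common remedy is to reweight samples by inverse class frequency so each class contributes equally to the training loss. This is a generic sketch, not a method from the review; the 9:1 label split is invented for illustration.

```python
import numpy as np

def balanced_class_weights(labels):
    """Inverse-frequency weights: each class contributes equal total weight."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# E.g. 90 controls vs. 10 cases, a typical biomedical imbalance.
y = np.array([0] * 90 + [1] * 10)
w = balanced_class_weights(y)
```

The resulting per-sample weights can be passed to most classifiers' fitting routines so the minority class is not drowned out.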

    Random projection to preserve patient privacy

    With the availability of accessible and widely used cloud services, it is natural that large components of healthcare systems migrate to them; for example, patient databases can be stored and processed in the cloud. Such cloud services provide enhanced flexibility and additional gains, such as availability, ease of data sharing, and so on. This trend poses serious threats regarding the privacy of the patients and the trust that an individual must put in the healthcare system itself. Thus, there is a strong need for privacy preservation, achieved through a variety of different approaches. In this paper, we study the application of a random projection-based approach to patient data as a means to achieve two goals: (1) provably mask the identity of users under some adversarial-attack settings, and (2) preserve enough information to allow for aggregate data analysis and application of machine-learning techniques. As far as we know, such approaches have not previously been applied and tested on medical data. We analyze the tradeoff between the loss of accuracy on the outcome of machine-learning algorithms and the resilience against an adversary. We show that random projections prove to be strong against known input/output attacks while offering high-quality data, as long as the projected space is smaller than the original space, and as long as the amount of leaked data available to the adversary is limited.
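The core mechanism is easy to demonstrate: a Gaussian random projection maps each record to a lower-dimensional space while approximately preserving pairwise Euclidean distances (the Johnson-Lindenstrauss property), which is what keeps aggregate analysis and ML usable after the transform. The patient matrix below is synthetic and the dimensions are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 synthetic "patient" records with 1,000 features (the original space).
X = rng.normal(size=(200, 1000))

# Project to k << 1000 dimensions; scaling by 1/sqrt(k) makes the expected
# squared distance in the projected space match the original one.
k = 64
R = rng.normal(size=(1000, k)) / np.sqrt(k)
X_proj = X @ R

# Distance between two patients before and after projection: approximately
# preserved, even though individual feature values are no longer recoverable.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(X_proj[0] - X_proj[1])
ratio = d_proj / d_orig
```

The privacy argument in the paper rests on the projection being non-invertible when k is smaller than the original dimension, which is exactly the regime shown here.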

    Machine Learning in Population Health: Frequent Emergency Department Utilization Pattern Identification and Prediction

    Emergency Department (ED) overcrowding is an emerging risk to patient safety and may significantly affect chronically ill people. For instance, overcrowding in an ED may cause delays in patient transportation or revenue loss for hospitals due to hospital diversion. Frequent users with avoidable visits play a significant role in imposing such challenges on ED settings. Non-urgent or "avoidable" ED use induces overcrowding and cost increases due to unnecessary tests and treatment. It is, therefore, valuable to understand the pattern of ED visits in a population and to prospectively identify frequent ED users, in order to provide stratified care management and resource allocation. Although most current models use classical methods such as descriptive analysis or regression modelling, more sophisticated techniques may be needed to increase the accuracy of outcomes where big data is in use. This study focuses on Machine Learning (ML) techniques to identify the ED usage pattern among frequent users and to evaluate the predictive ability of the models. I performed an extensive literature review to generate a list of potential predictors of frequent ED use. For this thesis, I used Korean Health Panel data from 2008 to 2015. Individuals with at least one ED visit were included, among whom those with four or more visits per year were considered frequent ED users. Demographic and clinical data were collected. The relationship between predictors and frequent ED use was examined through multivariable analysis. A K-modes clustering algorithm was applied to identify ED utilization patterns among frequent users. Finally, the performance of four machine learning classification algorithms was assessed and compared to logistic regression. The classification algorithms used in my thesis were Random Forest, Support Vector Machine (SVM), Bagging, and Voting. 
The models' performance was evaluated based on Positive Predictive Value (PPV), sensitivity, Area Under the Curve (AUC), and classification error. A total of 9,348 individuals with 15,627 ED visits were eligible for this study. Frequent ED users accounted for 2.4% of all ED visits. Frequent ED users tended to be older, male, and more likely to use an ambulance as a mode of transport than non-frequent ED users. In the cluster analysis, we identified three subgroups among frequent ED users: (i) older patients with respiratory system complaints and the highest discharge rates, who were more likely to visit in spring and winter; (ii) older patients with the highest rate of hospitalization, who were also more likely to have used an ambulance and visited the ED due to circulatory system complaints; (iii) younger patients, mostly female, with the highest rate of ED visits in summer and the lowest rate of ambulance use, who visited the ED mostly due to injuries, poisoning, and similar causes. The ML classification algorithms predicted frequent ED users with high precision (90%-98%) and sensitivity (87%-91%), and also showed high AUC scores, from 89% for SVM to 96% for Random Forest. The classification error varied among algorithms; logistic regression had the highest classification error (34.9%) while Random Forest had the lowest (3.8%). According to the Random Forest importance scores, the top five factors predicting frequent use were disease category, age, day of the week, season, and sex. In this thesis, I showed how ML methods apply to ED users in population health. The study results show that ML classification algorithms are robust techniques with predictive power for future ED visit identification and prediction. As more data are collected and data availability increases, machine learning approaches are a promising tool for advancing the understanding of such 'Big' data.
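The three evaluation metrics the thesis reports (PPV, sensitivity, AUC) can be computed from scratch in a few lines; the labels and scores below are a toy example, not the thesis data, and AUC is computed via its rank interpretation (the probability that a random positive outscores a random negative).

```python
import numpy as np

def ppv_sensitivity(y_true, y_pred):
    """PPV = TP/(TP+FP); sensitivity (recall) = TP/(TP+FN)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy labels (1 = frequent ED user) and classifier scores.
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])
pred = (scores >= 0.5).astype(int)

ppv, sens = ppv_sensitivity(y, pred)
area = auc(y, scores)
```

Unlike PPV and sensitivity, AUC is threshold-free, which is why the thesis can report it alongside the thresholded metrics.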

    Machine learning approach for credit score analysis: a case study of predicting mortgage loan defaults

    Dissertation submitted in partial fulfilment of the requirements for the degree of Statistics and Information Management, specialized in Risk Analysis and Management.
    To effectively manage credit score analysis, financial institutions have developed techniques and models designed mainly to improve the assessment of creditworthiness during the credit evaluation process. The foremost objective is to classify their clients – borrowers – into either the non-defaulter group, which is more likely to meet its financial obligations, or the defaulter group, which has a higher probability of failing to pay its debts. In this paper, we apply machine learning models to the prediction of mortgage defaults. This study employs several single-classifier machine learning methodologies, including Logistic Regression, Classification and Regression Trees, Random Forest, K-Nearest Neighbors, and Support Vector Machine. To further improve the predictive power, a meta-algorithm ensemble approach – stacking – is introduced to combine the outputs – probabilities – of the aforementioned methods. The sample for this study is based solely on the dataset publicly provided by Freddie Mac. With this approach, we achieve an improvement in the models' predictive performance. We then compare the performance of each model, and of the meta-learner, by plotting the ROC curve and computing the AUC. This study extends various preceding studies that used different techniques to further enhance predictive power. Finally, our results are compared with the work of different authors.
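The stacking step the abstract describes can be sketched minimally: base classifiers emit probabilities, which become the features of a meta-learner (here a plain gradient-descent logistic regression). The two "base models" below are simulated as noisy views of the true label, standing in for the trained classifiers of the study; everything about the data is invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Logistic regression by gradient descent, used as the stacking meta-learner."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Simulated out-of-fold probabilities from two base classifiers of
# different quality (e.g. a tree ensemble and an SVM in the study).
base1 = np.clip(y + rng.normal(0, 0.4, size=200), 0, 1)
base2 = np.clip(y + rng.normal(0, 0.6, size=200), 0, 1)
Z = np.column_stack([base1, base2])   # meta-features = base model outputs

w, b = fit_logistic(Z, y)
stacked = sigmoid(Z @ w + b)          # combined probability per borrower
acc = np.mean((stacked >= 0.5) == y)
```

In a faithful implementation the meta-features must be out-of-fold predictions, as sketched here, so the meta-learner never sees probabilities produced on the base models' own training data.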