35 research outputs found

    Boosting insights in insurance tariff plans with tree-based machine learning methods

    Full text link
    Pricing actuaries typically operate within the framework of generalized linear models (GLMs). With the upswing of data analytics, our study focuses on machine learning methods to develop full tariff plans built from both the frequency and severity of claims. We adapt the loss functions used in the algorithms such that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros and varying exposure on the frequency side, combined with scarce but potentially long-tailed data on the severity side. A key requirement is the need for transparent and interpretable pricing models which are easily explainable to all stakeholders. We therefore focus on machine learning with decision trees: starting from simple regression trees, we work towards more advanced ensembles such as random forests and boosted trees. We show how to choose the optimal tuning parameters for these models in an elaborate cross-validation scheme, we present visualization tools to obtain insights from the resulting models, and we evaluate the economic value of these new modeling approaches. Boosted trees outperform the classical GLMs, allowing the insurer to form profitable portfolios and to guard against potential adverse risk selection.
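The adapted frequency loss mentioned above can be made concrete with a minimal sketch: the Poisson deviance with varying exposure-to-risk, where the expected claim count for a record is its exposure times its predicted claim frequency. This is an illustrative stand-alone function, not the authors' implementation; names are made up.

```python
import math

def poisson_deviance(y, mu, exposure):
    """Total Poisson deviance for claim counts y, predicted claim
    frequencies mu and exposure-to-risk e; expected count is e * mu."""
    dev = 0.0
    for yi, mi, ei in zip(y, mu, exposure):
        lam = ei * mi          # expected number of claims for this record
        if yi > 0:
            dev += 2.0 * (yi * math.log(yi / lam) - (yi - lam))
        else:                  # y * log(y) -> 0 as y -> 0
            dev += 2.0 * lam
    return dev

# A constant frequency minimizing this loss is the exposure-weighted
# claim rate sum(y) / sum(e) -- the value a leaf of a frequency tree
# would predict under this deviance.
y = [0, 1, 0, 2, 0]
e = [0.5, 1.0, 0.8, 1.0, 0.3]
rate = sum(y) / sum(e)
```

Using this deviance as the split criterion in a tree (instead of squared error) is one way the algorithms can be tailored to unbalanced count data with varying exposure.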

    Modeling the number of hidden events subject to observation delay

    Full text link
    This paper considers the problem of predicting the number of events that have occurred in the past, but which are not yet observed due to a delay. Such delayed events are relevant in predicting the future cost of warranties, pricing maintenance contracts, determining the number of unreported claims in insurance and in modeling the outbreak of diseases. Disregarding these unobserved events results in a systematic underestimation of the event occurrence process. Our approach puts emphasis on modeling the time between the occurrence and observation of the event, the so-called observation delay. We propose a granular model for the heterogeneity in this observation delay based on the occurrence day of the event and on calendar day effects in the observation process, such as weekday and holiday effects. We illustrate this approach on a European general liability insurance data set where the occurrence of an accident is reported to the insurer with delay.
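A back-of-the-envelope version of the underlying correction, far simpler than the granular model in the paper and with made-up names: if an event occurring on day t has been observed by the evaluation date with probability p_t, the observed count underestimates the true count by exactly that factor, so a naive nowcast rescales it by 1/p_t.

```python
def reporting_probability(days_elapsed, delay_pmf):
    """P(observation delay <= days_elapsed) for a discrete delay
    distribution given as a list: delay_pmf[d] = P(delay == d days)."""
    return sum(delay_pmf[: days_elapsed + 1])

def nowcast(observed, delay_pmf, eval_day):
    """Rescale the observed count per occurrence day by the probability
    that an event from that day is already observed at eval_day."""
    estimates = []
    for t, n in enumerate(observed):
        p = reporting_probability(eval_day - t, delay_pmf)
        estimates.append(n / p if p > 0 else float("inf"))
    return estimates

# Example: half of events observed the same day, the rest one day later.
pmf = [0.5, 0.5]
obs = [10, 10, 5]                      # counts observed by the end of day 2
print(nowcast(obs, pmf, eval_day=2))   # -> [10.0, 10.0, 10.0]
```

The paper goes further by letting the delay distribution itself depend on the occurrence day and on calendar effects such as weekdays and holidays, rather than using one fixed pmf as here.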

    Modelling Censored Losses Using Splicing: a Global Fit Strategy With Mixed Erlang and Extreme Value Distributions

    Full text link
    In risk analysis, a global fit that appropriately captures the body and the tail of the distribution of losses is essential. Modelling the whole range of the losses using a standard distribution is usually very hard and often impossible due to the specific characteristics of the body and the tail of the loss distribution. A possible solution is to combine two distributions in a splicing model: a light-tailed distribution for the body which covers light and moderate losses, and a heavy-tailed distribution for the tail to capture large losses. We propose a splicing model with a mixed Erlang (ME) distribution for the body and a Pareto distribution for the tail. This combines the flexibility of the ME distribution with the ability of the Pareto distribution to model extreme values. We extend our splicing approach to censored and/or truncated data. Relevant examples of such data can be found in financial risk analysis. We illustrate the flexibility of this splicing model using practical examples from risk measurement.
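The splicing construction itself is short to write down. The sketch below uses a single Erlang body instead of the paper's mixed Erlang (purely for brevity) and a Pareto tail above a splicing point t; the names and parameter choices are illustrative.

```python
import math

def erlang_pdf(x, shape, scale):
    """Density of an Erlang(shape, scale) distribution."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (scale ** shape * math.factorial(shape - 1)))

def erlang_cdf(x, shape, scale):
    """Erlang cdf via the truncated Poisson series."""
    s = sum((x / scale) ** k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-x / scale) * s

def spliced_pdf(x, t, w, shape, scale, gamma):
    """Splicing density: weight w on an Erlang body truncated to (0, t],
    weight 1 - w on a Pareto tail with index gamma above threshold t."""
    if x <= 0:
        return 0.0
    if x <= t:
        # body renormalized so that its mass on (0, t] equals w
        return w * erlang_pdf(x, shape, scale) / erlang_cdf(t, shape, scale)
    # Pareto tail density, carrying the remaining mass 1 - w
    return (1.0 - w) * gamma * t ** gamma / x ** (gamma + 1)
```

By construction the body contributes mass w and the tail mass 1 - w, so the spliced density integrates to one; the modelling work in the paper lies in estimating t, w and the component parameters jointly, also under censoring and truncation.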

    Sparse Regression with Multi-type Regularized Feature Modeling

    Full text link
    Within the statistical and machine learning literature, regularization techniques are often used to construct sparse (predictive) models. Most regularization strategies only work for data where all predictors are treated identically, such as Lasso regression for (continuous) predictors treated as linear effects. However, many predictive problems involve different types of predictors and require a tailored regularization term. We propose a multi-type Lasso penalty that acts on the objective function as a sum of subpenalties, one for each type of predictor. As such, we allow for predictor selection and level fusion within a predictor in a data-driven way, simultaneous with the parameter estimation process. We develop a new estimation strategy for convex predictive models with this multi-type penalty. Using the theory of proximal operators, our estimation procedure is computationally efficient, partitioning the overall optimization problem into easier-to-solve subproblems, specific for each predictor type and its associated penalty. Earlier research applies approximations to non-differentiable penalties to solve the optimization problem. The proposed SMuRF algorithm removes the need for approximations and achieves a higher accuracy and computational efficiency. This is demonstrated with an extensive simulation study and the analysis of a case study on insurance pricing analytics.
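The proximal-operator idea behind this strategy can be sketched in a few lines for the simplest subpenalty, the ordinary Lasso: a gradient step on the smooth loss followed by the penalty's proximal operator (here soft-thresholding), with a per-coefficient penalty vector standing in for one subpenalty per predictor type. This is a didactic sketch, not the SMuRF algorithm itself, which handles several penalty types (e.g. fused Lasso for level fusion) with their own proximal operators.

```python
import math

def soft_threshold(v, lam):
    """Proximal operator of lam * |v|, the Lasso subpenalty."""
    return math.copysign(max(abs(v) - lam, 0.0), v)

def proximal_gradient_lasso(X, y, lams, step, iters=500):
    """Minimize 0.5 * ||y - X beta||^2 + sum_j lams[j] * |beta_j|
    by proximal gradient descent on lists-of-lists data."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # gradient step on the smooth loss, then the penalty's prox
        beta = [soft_threshold(beta[j] - step * grad[j], step * lams[j])
                for j in range(p)]
    return beta

# On an orthonormal design the solution is soft-thresholding of y.
beta = proximal_gradient_lasso([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5],
                               [1.0, 1.0], step=0.5)  # beta ~ [2.0, 0.0]
```

Because the prox of a sum of separable subpenalties splits coordinate-wise (or group-wise), each predictor type can be handled by its own closed-form operator, which is what makes the partitioned approach efficient.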

    Modeling the occurrence of events subject to a reporting delay via an EM algorithm

    Full text link
    A delay between the occurrence and the reporting of events often has practical implications such as for the amount of capital to hold for insurance companies, or for taking preventive actions in case of infectious diseases. The accurate estimation of the number of incurred but not (yet) reported events forms an essential part of properly dealing with this phenomenon. We review the current practice for analysing such data and we present a flexible regression framework to jointly estimate the occurrence and reporting of events. By linking this setting to an incomplete data problem, estimation is performed by the expectation-maximization algorithm. The resulting method is elegant, easy to understand and implement, and provides refined insights into the nowcasts. The proposed methodology is applied to a European general liability portfolio in insurance.
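A stripped-down instance of the incomplete-data link, purely illustrative and much simpler than the paper's joint regression framework: with a single daily occurrence rate and known reporting probabilities, day-t counts observed so far are Poisson with mean lambda * p_t. The E-step imputes the expected not-yet-reported events, the M-step re-estimates the rate from the completed data.

```python
def em_occurrence_rate(observed, p_reported, iters=100):
    """EM for a common daily occurrence rate lam when events from day t
    have been reported by now only with probability p_reported[t]."""
    lam = sum(observed) / len(observed)   # naive starting value
    for _ in range(iters):
        # E-step: expected total per day = observed + expected unreported
        totals = [n + lam * (1.0 - p) for n, p in zip(observed, p_reported)]
        # M-step: Poisson MLE of the rate given the completed data
        lam = sum(totals) / len(totals)
    return lam

obs = [8, 9, 4]            # most recent day only half reported so far
p = [1.0, 1.0, 0.5]
rate = em_occurrence_rate(obs, p)
```

In this toy case the EM fixed point is the closed-form thinned-Poisson MLE sum(obs) / sum(p); the value of the EM formulation in the paper is that it extends cleanly to regression structure on both the occurrence and the reporting process.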

    Data analytics for insurance loss modelling, telematics pricing and claims reserving.

    No full text
    Today's society generates data more rapidly than ever before, creating many opportunities as well as challenges for statisticians. Many industries become increasingly dependent on high-quality data, and the demand for sound statistical analysis of these data is rising accordingly. In the insurance sector, data have always played a major role. When selling a contract to a client, the insurance company is liable for the claims arising from this contract and will hold capital aside to meet these future liabilities. As such, the insurance premium has to be paid before the real costs are known. This is referred to as the inversion of the production cycle. It implies that the activities of pricing and reserving are strongly interconnected in actuarial practice. On the one hand, pricing actuaries have to determine a fair price for the insurance products they want to sell. Setting the premium levels charged to the insureds is done in a data-driven way where statistical models are essential. Risk-based pricing is crucial in a competitive and well-functioning insurance market. On the other hand, an insurance company must safeguard its solvency and reserve capital to fulfill outstanding liabilities. Reserving actuaries thus must predict, with maximum accuracy, the total amount needed to pay claims that the insurer has legally committed itself to cover. These reserves form the main item on the liability side of the balance sheet of the insurance company and therefore have an important economic impact. The ambition of this research is the development of new, accurate predictive models for the insurance work field. The overall objective is to improve actuarial practices for pricing and reserving by using sound and flexible statistical methods shaped for the actuarial data at hand.
This thesis focuses on three related research avenues in the domain of non-life insurance: (1) flexible univariate and multivariate loss modeling in the presence of censoring and truncation, (2) car insurance pricing using telematics data and (3) claims reserving using micro-level data. After an introductory chapter, we study mixtures of Erlang distributions with a common scale parameter in Chapter 2. These distributions form a very versatile, yet analytically tractable, class of distributions making them suitable for loss modeling purposes. We develop a parameter estimation procedure using the EM algorithm that is able to fit a mixture of Erlang distributions under censoring and truncation, which is omnipresent in an actuarial context. Chapter 3 extends the estimation procedure to multivariate mixtures of Erlang distributions. This multivariate distribution generalizes the univariate mixture of Erlang distributions while preserving its flexibility and analytical tractability. When modeling multivariate insurance losses or dependent risks from different portfolios or lines of business, the inherent shape versatility of multivariate mixtures of Erlangs allows one to adequately capture both the marginals and the dependence structure. Moreover, its desirable analytical properties are particularly convenient in a wide variety of insurance related modelling situations. In Chapter 4 we explore the vast potential of telematics insurance from a statistical point of view. We analyze a unique Belgian portfolio of young drivers who signed up for a telematics product. Through telematics technology, driving behavior data were collected between 2010 and 2014 on when, where and for how long the insured car was being used.
The aim of our contribution is to develop the statistical methodology to incorporate this telematics information in statistical rating models, where we focus on predicting the number of claims, in order to adequately set premium levels based on individual policyholders' driving habits. Chapter 5 presents a new technique to predict the number of incurred but not reported claim counts. Due to time delays between the occurrence of the insured event and the notification of the claim to the insurer, not all of the claims that occurred in the past have been observed when the reserve needs to be calculated. We propose a flexible regression framework to model and jointly estimate the occurrence and reporting of claims on a daily basis. The last chapter concludes our work by presenting several suggestions for future research related to the topics covered.

    Fitting mixtures of Erlangs to uncensored and untruncated data using the EM algorithm - Addendum

    No full text
    In this addendum, we present the EM algorithm of Lee and Lin (2010) customized for fitting mixtures of Erlang distributions with a common scale parameter to uncensored and untruncated data. We work out the details with zero-one component indicators inspired by McLachlan and Peel (2001) and Lee and Scott (2012).
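The uncensored, untruncated case described here admits a compact EM loop. The sketch below is a minimal rendering under simplifying assumptions (fixed component shapes, no shape adjustment step); the E-step computes the expectations of the zero-one component indicators, and the M-step has closed-form updates for the mixing weights and the common scale.

```python
import math

def erlang_pdf(x, r, theta):
    """Density of an Erlang distribution with shape r and scale theta."""
    return (x ** (r - 1) * math.exp(-x / theta)
            / (theta ** r * math.factorial(r - 1)))

def fit_erlang_mixture(data, shapes, theta, iters=200):
    """EM for a mixture of Erlangs with fixed integer shapes and a
    common scale theta; returns mixing weights and the fitted scale."""
    m = len(shapes)
    alpha = [1.0 / m] * m
    for _ in range(iters):
        # E-step: posterior probability (expected zero-one indicator)
        # that observation i was generated by component j
        z = []
        for x in data:
            num = [a * erlang_pdf(x, r, theta) for a, r in zip(alpha, shapes)]
            tot = sum(num)
            z.append([v / tot for v in num])
        # M-step: closed-form updates for weights and common scale
        alpha = [sum(zi[j] for zi in z) / len(data) for j in range(m)]
        denom = sum(zi[j] * shapes[j] for zi in z for j in range(m))
        theta = sum(data) / denom
    return alpha, theta

data = [1.0, 1.5, 2.0, 9.0, 10.0, 12.0]
alpha, theta = fit_erlang_mixture(data, shapes=[1, 4], theta=2.0)
```

The full procedure in the referenced work additionally selects the number of components and their shapes; extensions to censored and truncated data are treated in the companion papers listed above.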

    Boosting Insights in Insurance Tariff Plans with Tree-Based Machine Learning Methods

    No full text

    Loss modeling using mixtures of Erlangs

    No full text
    At the 2nd R in Insurance conference on July 14, 2014 at Cass Business School in London, Roel Verbelen presented his research findings on a class of flexible distributions called mixtures of Erlangs. In several applications in the context of loss modelling, he demonstrated his fitting procedure and graphical tools built in R.

    A data driven binning strategy for the construction of insurance tariff classes

    No full text
    We present a fully data-driven strategy to incorporate continuous risk factors and geographical information in an insurance tariff. A framework is developed that aligns flexibility with the practical requirements of an insurance company, its policyholders and the regulator. Our strategy is illustrated with an example from property and casualty (P&C) insurance, namely a motor insurance case study. We start by fitting generalized additive models (GAMs) to the number of reported claims and their corresponding severity. These models allow for flexible statistical modeling in the presence of different types of risk factors: categorical, continuous and spatial risk factors. The goal is to bin the continuous and spatial risk factors such that the resulting categorical risk factors capture the effect of the covariate on the response in an accurate way, while being easy to use in a generalized linear model (GLM). This is in line with the requirement of an insurance company to construct a practical and interpretable tariff that can be explained easily to stakeholders. We propose to bin the spatial risk factor using Fisher's natural breaks algorithm and the continuous risk factors using evolutionary trees. GLMs are fitted to the claims data with the resulting categorical risk factors. We find that the resulting GLMs approximate the original GAMs closely, and lead to a very similar premium structure.
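The Fisher's natural breaks step used for the spatial factor can be written as a small dynamic program on sorted values: choose k contiguous classes minimizing the within-class sum of squared deviations. This is a plain didactic implementation (the evolutionary-tree binning of the continuous factors is not reproduced here); names are illustrative.

```python
def natural_breaks(values, k):
    """Partition 1-D values into k contiguous classes minimizing the
    within-class sum of squared deviations (Fisher's natural breaks).
    Returns the upper edge of each class."""
    xs = sorted(values)
    n = len(xs)
    pre = [0.0] * (n + 1)    # prefix sums of xs
    pre2 = [0.0] * (n + 1)   # prefix sums of xs squared
    for i, x in enumerate(xs):
        pre[i + 1] = pre[i] + x
        pre2[i + 1] = pre2[i] + x * x

    def ssd(i, j):           # within-class SSD of xs[i..j], inclusive
        s = pre[j + 1] - pre[i]
        s2 = pre2[j + 1] - pre2[i]
        cnt = j - i + 1
        return s2 - s * s / cnt

    INF = float("inf")
    # cost[i][j]: best SSD for the first i values split into j classes
    cost = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for t in range(j - 1, i):   # last class is xs[t..i-1]
                c = cost[t][j - 1] + ssd(t, i - 1)
                if c < cost[i][j]:
                    cost[i][j] = c
                    cut[i][j] = t
    bounds, i = [], n                   # backtrack the class boundaries
    for j in range(k, 0, -1):
        bounds.append(xs[i - 1])
        i = cut[i][j]
    return sorted(bounds)
```

Once the breaks are found, each spatial effect value is mapped to its class and the resulting categorical factor enters the GLM alongside the binned continuous factors.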