Boosting insights in insurance tariff plans with tree-based machine learning methods
Pricing actuaries typically operate within the framework of generalized
linear models (GLMs). With the upswing of data analytics, our study focuses
on machine learning methods to develop full tariff plans built from both the
frequency and severity of claims. We adapt the loss functions used in the
algorithms such that the specific characteristics of insurance data are
carefully incorporated: highly unbalanced count data with excess zeros and
varying exposure on the frequency side combined with scarce, but potentially
long-tailed data on the severity side. A key requirement is the need for
transparent and interpretable pricing models which are easily explainable to
all stakeholders. We therefore focus on machine learning with decision trees:
starting from simple regression trees, we work towards more advanced ensembles
such as random forests and boosted trees. We show how to choose the optimal
tuning parameters for these models in an elaborate cross-validation scheme,
present visualization tools to obtain insights from the resulting models, and
evaluate the economic value of these new modeling approaches. Boosted trees
outperform the classical GLMs, allowing the insurer to form profitable
portfolios and to guard against potential adverse risk selection.
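For illustration, the frequency-side loss adaptation described above amounts to evaluating a Poisson deviance in which the exposure enters as a multiplicative offset. The sketch below is ours, not the paper's code; the function name and setup are assumptions.

```python
import numpy as np

def poisson_deviance(y, mu, exposure):
    """Mean Poisson deviance for claim counts y, with mu the expected
    frequency per unit of exposure; exposure acts as an offset.
    A hypothetical sketch of the loss adaptation, not the paper's code."""
    y = np.asarray(y, dtype=float)
    lam = np.asarray(mu, dtype=float) * np.asarray(exposure, dtype=float)
    # y * log(y / lam) is taken as 0 when y == 0 (its limiting value)
    term = np.where(y > 0, y * np.log(np.where(y > 0, y / lam, 1.0)), 0.0)
    return 2.0 * np.mean(term - (y - lam))
```

A perfect fit gives zero deviance, and zero-count observations contribute only through the predicted claim count, which is how excess zeros and varying exposure are accommodated.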
Modeling the number of hidden events subject to observation delay
This paper considers the problem of predicting the number of events that have
occurred in the past, but which are not yet observed due to a delay. Such
delayed events are relevant in predicting the future cost of warranties,
pricing maintenance contracts, determining the number of unreported claims in
insurance and in modeling the outbreak of diseases. Disregarding these
unobserved events results in a systematic underestimation of the event
occurrence process. Our approach puts emphasis on modeling the time between the
occurrence and observation of the event, the so-called observation delay. We
propose a granular model for the heterogeneity in this observation delay based
on the occurrence day of the event and on calendar day effects in the
observation process, such as weekday and holiday effects. We illustrate this
approach on a European general liability insurance data set where the
occurrence of an accident is reported to the insurer with delay.
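A minimal sketch of such a granular delay model: the daily probability of observing a not-yet-observed event depends on the calendar weekday of the candidate observation day. The logistic form and all names here are our illustration, not the paper's exact specification.

```python
import numpy as np

def delay_probabilities(occurrence_day, weekday_effect, max_delay):
    """Distribution of the observation delay for an event occurring on
    `occurrence_day`, with a calendar (weekday) effect on the daily
    observation probability. A hypothetical sketch, not the paper's model."""
    days = occurrence_day + np.arange(max_delay + 1)
    # daily observation probability driven by the weekday of that day
    hazard = 1.0 / (1.0 + np.exp(-weekday_effect[days % 7]))
    # P(delay = d) = P(not observed before day d) * P(observed on day d)
    surv = np.concatenate([[1.0], np.cumprod(1.0 - hazard[:-1])])
    return surv * hazard
```

With all weekday effects at zero the daily observation probability is 0.5 and the delay distribution reduces to a geometric one; nonzero weekend or holiday effects shift reporting mass across days.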
Modelling Censored Losses Using Splicing: a Global Fit Strategy With Mixed Erlang and Extreme Value Distributions
In risk analysis, a global fit that appropriately captures the body and the
tail of the distribution of losses is essential. Modelling the whole range of
the losses using a standard distribution is usually very hard and often
impossible due to the specific characteristics of the body and the tail of the
loss distribution. A possible solution is to combine two distributions in a
splicing model: a light-tailed distribution for the body which covers light and
moderate losses, and a heavy-tailed distribution for the tail to capture large
losses. We propose a splicing model with a mixed Erlang (ME) distribution for
the body and a Pareto distribution for the tail. This combines the flexibility
of the ME distribution with the ability of the Pareto distribution to model
extreme values. We extend our splicing approach for censored and/or truncated
data. Relevant examples of such data can be found in financial risk analysis.
We illustrate the flexibility of this splicing model using practical examples
from risk measurement.
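The splicing construction can be sketched as follows: weight p goes to a body distribution renormalised to [0, t], and weight 1 - p to a Pareto tail above the splicing point t. This simplified sketch uses a generic body density in place of the mixed Erlang of the paper, and ignores censoring and truncation.

```python
import numpy as np

def spliced_pdf(x, body_pdf, body_cdf, t, p, gamma):
    """Density of a splicing model: weight p on a body distribution
    truncated to [0, t], weight 1 - p on a Pareto(t, gamma) tail.
    `body_pdf`/`body_cdf` stand in for the mixed Erlang body."""
    x = np.asarray(x, dtype=float)
    body = p * body_pdf(x) / body_cdf(t)                   # renormalised body
    tail = (1 - p) * (gamma / t) * (t / x) ** (gamma + 1)  # Pareto tail density
    return np.where(x <= t, body, tail)
```

By construction the body integrates to p on [0, t] and the Pareto tail to 1 - p on (t, ∞), so the spliced density integrates to one.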
Sparse Regression with Multi-type Regularized Feature Modeling
Within the statistical and machine learning literature, regularization
techniques are often used to construct sparse (predictive) models. Most
regularization strategies only work for data where all predictors are treated
identically, such as Lasso regression for (continuous) predictors treated as
linear effects. However, many predictive problems involve different types of
predictors and require a tailored regularization term. We propose a multi-type
Lasso penalty that acts on the objective function as a sum of subpenalties, one
for each type of predictor. As such, we allow for predictor selection and level
fusion within a predictor in a data-driven way, simultaneous with the parameter
estimation process. We develop a new estimation strategy for convex predictive
models with this multi-type penalty. Using the theory of proximal operators,
our estimation procedure is computationally efficient, partitioning the overall
optimization problem into easier-to-solve subproblems, specific to each
predictor type and its associated penalty. Earlier research applies
approximations to non-differentiable penalties to solve the optimization
problem. The proposed SMuRF algorithm removes the need for approximations and
achieves higher accuracy and computational efficiency. This is demonstrated
with an extensive simulation study and the analysis of a case study on
insurance pricing analytics.
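To give one concrete instance of the proximal-operator machinery: for the ordinary Lasso subpenalty, the proximal operator has the well-known closed form of elementwise soft-thresholding. This is only the simplest of the subpenalties in the multi-type setting, shown as a sketch.

```python
import numpy as np

def prox_lasso(beta, step, lam):
    """Proximal operator of the Lasso subpenalty lam * ||beta||_1 at step
    size `step`: elementwise soft-thresholding. In the multi-type setting
    one such operator is applied per predictor type; this sketch covers
    only the Lasso case for linear effects."""
    return np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
```

Coefficients whose magnitude falls below the threshold are set exactly to zero, which is what produces predictor selection without any smooth approximation of the non-differentiable penalty.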
Modeling the occurrence of events subject to a reporting delay via an EM algorithm
A delay between the occurrence and the reporting of events often has
practical implications such as for the amount of capital to hold for insurance
companies, or for taking preventive actions in case of infectious diseases. The
accurate estimation of the number of incurred but not (yet) reported events
forms an essential part of properly dealing with this phenomenon. We review the
current practice for analysing such data and we present a flexible regression
framework to jointly estimate the occurrence and reporting of events. By
linking this setting to an incomplete data problem, estimation is performed by
the expectation-maximization algorithm. The resulting method is elegant, easy
to understand and implement, and provides refined insights into the nowcasts.
The proposed methodology is applied to a European general liability insurance
portfolio.
Data analytics for insurance loss modelling, telematics pricing and claims reserving.
Today's society generates data more rapidly than ever before, creating many opportunities as well as challenges for statisticians. Many industries are becoming increasingly dependent on high-quality data, and the demand for sound statistical analysis of these data is rising accordingly.
In the insurance sector, data have always played a major role. When selling a contract to a client, the insurance company is liable for the claims arising from this contract and will hold capital aside to meet these future liabilities. As such, the insurance premium has to be paid before the real costs are known. This is referred to as the inversion of the production cycle. It implies that the activities of pricing and reserving are strongly interconnected in actuarial practice. On the one hand, pricing actuaries have to determine a fair price for the insurance products they want to sell. Setting the premium levels charged to the insureds is done in a data-driven way where statistical models are essential. Risk-based pricing is crucial in a competitive and well-functioning insurance market. On the other hand, an insurance company must safeguard its solvency and reserve capital to fulfill outstanding liabilities. Reserving actuaries thus must predict, with maximum accuracy, the total amount needed to pay claims that the insurer is legally committed to cover. These reserves form the main item on the liability side of the balance sheet of the insurance company and therefore have an important economic impact.
The ambition of this research is the development of new, accurate predictive models for the insurance work field. The overall objective is to improve actuarial practices for pricing and reserving by using sound and flexible statistical methods shaped for the actuarial data at hand. This thesis focuses on three related research avenues in the domain of non-life insurance: (1) flexible univariate and multivariate loss modeling in the presence of censoring and truncation, (2) car insurance pricing using telematics data and (3) claims reserving using micro-level data.
After an introductory chapter, we study mixtures of Erlang distributions with a common scale parameter in Chapter 2. These distributions form a very versatile, yet analytically tractable, class of distributions making them suitable for loss modeling purposes. We develop a parameter estimation procedure using the EM algorithm that is able to fit a mixture of Erlang distributions under censoring and truncation, which is omnipresent in an actuarial context.
Chapter 3 extends the estimation procedure to multivariate mixtures of Erlang distributions. This multivariate distribution generalizes the univariate mixture of Erlang distributions while preserving its flexibility and analytical tractability. When modeling multivariate insurance losses or dependent risks from different portfolios or lines of business, the inherent shape versatility of multivariate mixtures of Erlangs allows one to adequately capture both the marginals and the dependence structure. Moreover, its desirable analytical properties are particularly convenient in a wide variety of insurance related modelling situations.
In Chapter 4 we explore the vast potential of telematics insurance from a statistical point of view. We analyze a unique Belgian portfolio of young drivers who signed up for a telematics product. Through telematics technology, driving behavior data were collected between 2010 and 2014 on when, where and for how long the insured car is used. The aim of our contribution is to develop the statistical methodology to incorporate this telematics information in statistical rating models, where we focus on predicting the number of claims, in order to adequately set premium levels based on individual policyholders' driving habits.
Chapter 5 presents a new technique to predict the number of incurred but not reported claim counts. Due to time delays between the occurrence of the insured event and the notification of the claim to the insurer, not all of the claims that occurred in the past have been observed when the reserve needs to be calculated. We propose a flexible regression framework to model and jointly estimate the occurrence and reporting of claims on a daily basis.
The last chapter concludes our work by presenting several suggestions for future research related to the topics covered.
Fitting mixtures of Erlangs to uncensored and untruncated data using the EM algorithm - Addendum
In this addendum, we present the EM algorithm of Lee and Lin (2010) customized for fitting mixtures of Erlang distributions with a common scale parameter to uncensored and untruncated data. We work out the details with zero-one component indicators inspired by McLachlan and Peel (2001) and Lee and Scott (2012).
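The core EM iteration for this uncensored, untruncated case can be sketched as follows. For simplicity the shape parameters are held fixed (the full procedure also adjusts them); the function names and initialization are ours.

```python
import numpy as np
from math import factorial, log

def erlang_logpdf(x, r, theta):
    """Log-density of an Erlang(r, theta) distribution (integer shape r)."""
    return (r - 1) * np.log(x) - x / theta - r * np.log(theta) - log(factorial(r - 1))

def fit_erlang_mixture(x, shapes, n_iter=200):
    """EM sketch for a mixture of Erlangs with a common scale parameter,
    fitted to uncensored and untruncated data. Shapes are held fixed here
    for simplicity; the full algorithm also adjusts them."""
    x = np.asarray(x, dtype=float)
    shapes = np.asarray(shapes)
    M = len(shapes)
    weights = np.full(M, 1.0 / M)
    theta = x.mean() / shapes.mean()    # crude initial common scale
    for _ in range(n_iter):
        # E-step: posterior component probabilities z[i, j]
        logp = np.array([erlang_logpdf(x, r, theta) for r in shapes]).T
        logp += np.log(weights)
        z = np.exp(logp - logp.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for weights and the common scale
        weights = z.mean(axis=0)
        theta = x.sum() / (z * shapes).sum()
    return weights, theta
```

The common scale makes the M-step a single closed-form update, which is what keeps the class analytically tractable.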
Loss modeling using mixtures of Erlangs
At the 2nd R in Insurance conference on July 14, 2014 at Cass Business School in London, Roel Verbelen presented his research findings on a class of flexible distributions called mixtures of Erlangs. In several applications in the context of loss modelling, he demonstrated his fitting procedure and graphical tools implemented in R.
A data driven binning strategy for the construction of insurance tariff classes
We present a fully data-driven strategy to incorporate continuous risk factors and geographical information in an insurance tariff. A framework is developed that aligns flexibility with the practical requirements of an insurance company, its policyholders and the regulator. Our strategy is illustrated with an example from property and casualty (P&C) insurance, namely a motor insurance case study. We start by fitting generalized additive models (GAMs) to the number of reported claims and their corresponding severity. These models allow for flexible statistical modeling in the presence of different types of risk factors: categorical, continuous and spatial risk factors. The goal is to bin the continuous and spatial risk factors such that the resulting categorical risk factors capture the effect of the covariate on the response in an accurate way, while being easy to use in a generalized linear model (GLM). This is in line with the requirement of an insurance company to construct a practical and interpretable tariff that can be explained easily to stakeholders. We propose to bin the spatial risk factor using Fisher's natural breaks algorithm and the continuous risk factors using evolutionary trees. GLMs are fitted to the claims data with the resulting categorical risk factors. We find that the resulting GLMs approximate the original GAMs closely, and lead to a very similar premium structure.
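Fisher's natural breaks algorithm mentioned above partitions sorted one-dimensional values into contiguous classes minimising the within-class sum of squared deviations. A minimal dynamic-programming sketch (O(k·n²), our own implementation, not the one used in the paper):

```python
import numpy as np

def natural_breaks(values, k):
    """Fisher's natural breaks sketch: split sorted 1-D values into k
    contiguous classes minimising within-class SSE; returns the upper
    bound of each class."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    pref = np.concatenate([[0.0], np.cumsum(x)])
    pref2 = np.concatenate([[0.0], np.cumsum(x ** 2)])
    def sse(i, j):                      # within-class SSE of x[i:j]
        s, s2, m = pref[j] - pref[i], pref2[j] - pref2[i], j - i
        return s2 - s * s / m
    INF = float("inf")
    cost = np.full((k + 1, n + 1), INF)
    cut = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for c in range(1, k + 1):           # number of classes used so far
        for j in range(c, n + 1):       # first j values partitioned
            for i in range(c - 1, j):   # start index of the last class
                v = cost[c - 1, i] + sse(i, j)
                if v < cost[c, j]:
                    cost[c, j], cut[c, j] = v, i
    bounds, j = [], n                   # backtrack the class boundaries
    for c in range(k, 0, -1):
        bounds.append(x[j - 1])
        j = cut[c, j]
    return sorted(bounds)
```

Because the classes must be contiguous in the sorted order, the resulting breaks translate directly into bins of a continuous or spatial risk factor that a GLM can use as a categorical covariate.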