Boosting insights in insurance tariff plans with tree-based machine learning methods

Author: Antonio Katrien
Côté Marie-Pier
Henckaerts Roel
Verbelen Roel
Publication date: 02/03/2020

Pricing actuaries typically operate within the framework of generalized linear models (GLMs). With the upswing of data analytics, our study focuses on machine learning methods to develop full tariff plans built from both the frequency and severity of claims. We adapt the loss functions used in the algorithms such that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros and varying exposure on the frequency side combined with scarce, but potentially long-tailed data on the severity side. A key requirement is the need for transparent and interpretable pricing models which are easily explainable to all stakeholders. We therefore focus on machine learning with decision trees: starting from simple regression trees, we work towards more advanced ensembles such as random forests and boosted trees. We show how to choose the optimal tuning parameters for these models in an elaborate cross-validation scheme, present visualization tools to obtain insights from the resulting models, and evaluate the economic value of these new modeling approaches. Boosted trees outperform the classical GLMs, allowing the insurer to form profitable portfolios and to guard against potential adverse risk selection.

arXiv.org e-Print Archive

Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

Author: Marko Nicholas
Razzaghi Talayeh
Roderick Oleg
Safro Ilya
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 07/04/2016

This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, with imbalance in classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for development of specialized techniques of data-preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results.
Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

The Potential for Student Performance Prediction in Small Cohorts with Minimal Available Attributes

Author: Ashraf A.
Choi S. P.
Corrin L.
HESA
Heuer H.
Horning N.
Quinlan J. R.
Rienties B.
Rienties B.
Sclater N.
Slade S.
Wakelam E.
Publication venue: 'Wiley'
Publication date: 25/06/2019

The measurement of student performance during their progress through university study provides academic leadership with critical information on each student’s likelihood of success. Academics have traditionally used their interactions with individual students through class activities and interim assessments to identify those “at risk” of failure/withdrawal. However, modern university environments, offering easy on-line availability of course material, may see reduced lecture/tutorial attendance, making such identification more challenging. Modern data mining and machine learning techniques provide increasingly accurate predictions of student examination assessment marks, although these approaches have focussed upon large student populations and wide ranges of data attributes per student. However, many university modules comprise relatively small student cohorts, with institutional protocols limiting the student attributes available for analysis. It appears that very little research attention has been devoted to this area of analysis and prediction. We describe an experiment conducted on a final-year university module student cohort of 23, where individual student data are limited to lecture/tutorial attendance, virtual learning environment accesses and intermediate assessments. We found potential for predicting individual student interim and final assessment marks in small student cohorts with very limited attributes and that these predictions could be useful to support module leaders in identifying students potentially “at risk.”

Support Vector Machines for Credit Scoring and discovery of significant features

Author: Baesens
Cristianini
Duda
Gayler
Guyon
Hand
Hand
Henley
Huang
Huang
Joachims
Jonathan Crook
Lee
Li
Schebesch
Thomas
Tony Bellotti
Van Gestel
Vapnik
Publication venue: 'Elsevier BV'
Publication date: 02/04/2008

The assessment of risk of default on credit is important for financial institutions. Logistic regression and discriminant analysis are techniques traditionally used in credit scoring for determining likelihood to default based on consumer application and credit reference agency data. We test support vector machines against these traditional methods on a large credit card database. We find that they are competitive and can be used as the basis of a feature selection method to discover those features that are most significant in determining risk of default.

CiteSeerX