25,489 research outputs found

Boosting insights in insurance tariff plans with tree-based machine learning methods

Author: Antonio Katrien
Côté Marie-Pier
Henckaerts Roel
Verbelen Roel
Publication date: 02/03/2020

Pricing actuaries typically operate within the framework of generalized linear models (GLMs). With the upswing of data analytics, our study focuses on machine learning methods to develop full tariff plans built from both the frequency and severity of claims. We adapt the loss functions used in the algorithms so that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros and varying exposure on the frequency side, combined with scarce but potentially long-tailed data on the severity side. A key requirement is transparent and interpretable pricing models that are easily explainable to all stakeholders. We therefore focus on machine learning with decision trees: starting from simple regression trees, we work towards more advanced ensembles such as random forests and boosted trees. We show how to choose the optimal tuning parameters for these models in an elaborate cross-validation scheme, present visualization tools to obtain insights from the resulting models, and evaluate the economic value of these new modeling approaches. Boosted trees outperform the classical GLMs, allowing the insurer to form profitable portfolios and to guard against potential adverse risk selection.
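The abstract mentions adapting loss functions to count data with varying exposure but does not name the loss. A minimal sketch, assuming the Poisson deviance is the frequency-side loss (a common choice for claim counts, not confirmed by the abstract); the function names and the toy portfolio are hypothetical:

```python
import math

# Assumption: Poisson deviance as the frequency loss; the abstract does not name it.
def poisson_deviance(claims, exposure, rate):
    """Mean Poisson deviance for claim counts with varying exposure.

    claims   -- observed claim counts per policy
    exposure -- fraction of the year each policy was in force
    rate     -- predicted annual claim frequency per policy
    """
    total = 0.0
    for y, e, lam in zip(claims, exposure, rate):
        mu = e * lam  # expected count scales with exposure
        # y * log(y / mu) is taken as 0 when y == 0 (its limiting value)
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - (y - mu))
    return total / len(claims)


def best_constant_rate(claims, exposure):
    """Constant rate that minimises the deviance: total claims / total exposure."""
    return sum(claims) / sum(exposure)


# Hypothetical toy portfolio: mostly zero claims, unequal exposure.
claims = [0, 0, 1, 0, 2]
exposure = [1.0, 0.5, 1.0, 0.25, 1.0]
rate = best_constant_rate(claims, exposure)
deviance = poisson_deviance(claims, exposure, [rate] * len(claims))
```

In a regression tree fitted with this loss, `best_constant_rate` would play the role of the prediction in a leaf: policies observed for only part of the year contribute less exposure and therefore pull the estimate less.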

arXiv.org e-Print Archive

Lirias

International Migration, Integration and Social Cohesion online publications

VPRSM Based Decision Tree Classifier

Author: Liu Da You
Wang Ming Yang
Wang Shu Qin
Wei Jin Mao
You Jun Ping
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 30/01/2012

A new approach for inducing decision trees is proposed based on the Variable Precision Rough Set Model (VPRSM). From the rough set theory point of view, when decision trees are induced by evaluating candidate attributes, methods based on purity measurements, such as information entropy based methods, emphasize the effect of class distribution: the more unbalanced the class distribution is, the more favorable it is. Rough set based approaches emphasize the effect of certainty: the more certain it is, the better. The criterion for node selection in the new method is based on the measurement of the variable precision explicit regions corresponding to candidate attributes. We compared the presented approach with C4.5 on some data sets from the UCI machine learning repository, which demonstrates the feasibility of the proposed method.
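The node-selection criterion can be illustrated with a short sketch. It assumes that the explicit region of an attribute is the set of rows whose attribute-value block is dominated by one decision class with proportion at least beta; this is one reading of the abstract, not the authors' exact definition. The function names and the toy table are hypothetical:

```python
from collections import Counter, defaultdict

# Assumption: "explicit region" = rows in blocks where one class reaches precision beta.
def explicit_region_size(rows, attribute, label, beta=0.8):
    """Count rows whose block under `attribute` has a class with proportion >= beta."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[attribute]].append(row[label])
    size = 0
    for labels in blocks.values():
        majority = Counter(labels).most_common(1)[0][1]
        if majority / len(labels) >= beta:
            size += len(labels)
    return size


def select_attribute(rows, attributes, label, beta=0.8):
    """Pick the attribute with the largest explicit region."""
    return max(attributes, key=lambda a: explicit_region_size(rows, a, label, beta))


# Hypothetical toy decision table.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
chosen = select_attribute(rows, ["outlook", "windy"], "play", beta=0.8)
```

Lowering `beta` tolerates more mixed blocks, so the ranking of attributes can change with the chosen precision.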

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Random Forests for Big Data

Author: Genuer Robin
Poggi Jean-Michel
Tuleau-Malot Christine
Villa-Vialaneix Nathalie
Publication date: 19/11/2015

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single and versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one of real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
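Two of the ingredients named in this abstract, the bootstrap with its out-of-bag observations and the "divide-and-conquer" split of the data, can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not any of the five variants evaluated in the paper:

```python
import random


def bootstrap_with_oob(n, rng):
    """Draw n indices with replacement; return them with the out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


def divide_and_conquer_chunks(n, n_chunks, rng):
    """Shuffle the indices and split them into disjoint, near-equal chunks."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[k::n_chunks] for k in range(n_chunks)]


rng = random.Random(0)
in_bag, oob = bootstrap_with_oob(1000, rng)       # one tree's training sample
chunks = divide_and_conquer_chunks(1000, 4, rng)  # one chunk per worker
```

Roughly a third of the observations end up out-of-bag for each tree, which is what makes the out-of-bag error estimate possible. In a divide-and-conquer scheme each chunk would train its own sub-forest, so an observation is out-of-bag only relative to the trees of its own chunk.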

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

ProdInra

Hal-Diderot

Customer churn prediction in telecom using machine learning and social network analysis in big data platform

Author: Ahmad Abdelrahim Kasem
Aljoumaa Kadan
Jafar Assef
Publication venue: Springer Science and Business Media LLC
Publication date: 01/03/2019

Customer churn is a major problem and one of the most important concerns for large companies. Because of its direct effect on revenues, especially in the telecom field, companies are seeking to develop means to predict which customers are likely to churn. Finding the factors that increase customer churn is therefore important for taking the actions needed to reduce it. The main contribution of our work is a churn prediction model that helps telecom operators predict the customers who are most likely to churn. The model developed in this work uses machine learning techniques on a big data platform and builds a new approach to feature engineering and selection. To measure the performance of the model, the standard Area Under Curve (AUC) measure is adopted, and the AUC value obtained is 93.3%. Another main contribution is the use of the customer social network in the prediction model by extracting Social Network Analysis (SNA) features. The use of SNA improved the performance of the model from 84% to 93.3% AUC. The model was prepared and tested in a Spark environment by working on a large dataset created by transforming big raw data provided by the SyriaTel telecom company. The dataset contained all customers' information over 9 months, and was used to train, test, and evaluate the system at SyriaTel. The model was tested with four algorithms: Decision Tree, Random Forest, Gradient Boosted Machine Tree (GBM) and Extreme Gradient Boosting (XGBOOST). The best results were obtained with the XGBOOST algorithm, which was used for classification in this churn prediction model.

Comment: 24 pages, 14 figures. PDF https://rdcu.be/budK

arXiv.org e-Print Archive

Directory of Open Access Journals

An AUC-based Permutation Variable Importance Measure for Random Forests

Author: Anne-Laure Boulesteix
Carolin Strobl
Silke Janitza
Publication date: 01/11/2012

The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
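The idea of an AUC-based permutation importance can be sketched for a single scoring function. This is a simplified illustration, not the implementation in the R package party: the published measure works per tree on out-of-bag observations, whereas the sketch below, with hypothetical function names and toy data, permutes one column of a fixed dataset and compares the AUC before and after:

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_permutation_importance(predict, X, y, column, rng):
    """AUC on the original data minus AUC after shuffling one predictor column."""
    baseline = auc([predict(row) for row in X], y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
    return baseline - auc([predict(row) for row in X_perm], y)


# Hypothetical unbalanced toy data: 2 positives among 20 rows.
rng = random.Random(1)
y = [1, 1] + [0] * 18
X = [[float(label) + rng.random() * 0.5, rng.random()] for label in y]
predict = lambda row: row[0]  # stand-in model: uses only the first predictor

vim_informative = auc_permutation_importance(predict, X, y, 0, rng)
vim_noise = auc_permutation_importance(predict, X, y, 1, rng)
```

Because the AUC compares every positive with every negative, its value does not depend on the class proportions in the way the error rate does, which is the motivation the abstract gives for using it under class imbalance.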

CiteSeerX

Crossref

Springer - Publisher Connector

Open Access LMU

PubMed Central

ZORA

Passport: Enabling Accurate Country-Level Router Geolocation using Inaccurate Sources

Author: Choffnes David
Goldberg Sharon
Rehman Muzammil Abdul
Publication date: 23/07/2019

When does Internet traffic cross international borders? This question has major geopolitical, legal and social implications and is surprisingly difficult to answer. A critical stumbling block is a dearth of tools that accurately map routers traversed by Internet traffic to the countries in which they are located. This paper presents Passport: a new approach for efficient, accurate country-level router geolocation and a system that implements it. Passport provides location predictions with limited active measurements, using machine learning to combine information from IP geolocation databases, router hostnames, whois records, and ping measurements. We show that Passport substantially outperforms existing techniques, and identify cases where paths traverse countries with implications for security, privacy, and performance.

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

Passport: enabling accurate country-level router geolocation using inaccurate sources

Author: Choffnes David
Goldberg Sharon
Rehman Muzammil Abdul
Publication date: 30/01/2020

Boston University Institutional Repository (OpenBU)

Training Big Random Forests with Little Resources

Author: Beazley David M.
Coates Adam
Lakshminarayanan Balaji
Murphy Kevin P.
Toby Sharp
Publication date: 01/01/2018

Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is particularly the case if fully-grown trees are desired. We propose a simple yet effective framework that allows the efficient construction of ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is a multi-level construction scheme, which builds top trees for small random subsets of the available data and subsequently distributes all training instances to the top trees' leaves for further processing. While conceptually simple, the overall efficiency depends crucially on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.

Comment: 9 pages, 9 figures
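The multi-level scheme described in this abstract (a top tree built on a small random subset, followed by distribution of all instances to its leaves) can be sketched as follows. The median splits that cycle through features are a stand-in chosen for brevity, not the authors' split rule, and all function names are hypothetical:

```python
import random
from collections import defaultdict

# Assumption: median splits cycling through features stand in for the real split rule.
def build_top_tree(sample, depth, feature=0):
    """Return (feature, threshold, left, right), or None for a leaf."""
    if depth == 0 or len(sample) < 2:
        return None
    values = sorted(p[feature] for p in sample)
    threshold = values[len(values) // 2]
    left = [p for p in sample if p[feature] < threshold]
    right = [p for p in sample if p[feature] >= threshold]
    nxt = (feature + 1) % len(sample[0])
    return (feature, threshold,
            build_top_tree(left, depth - 1, nxt),
            build_top_tree(right, depth - 1, nxt))


def leaf_id(tree, point):
    """Route a point down the top tree; the L/R path identifies its leaf."""
    path = ""
    while tree is not None:
        feature, threshold, left, right = tree
        if point[feature] < threshold:
            path, tree = path + "L", left
        else:
            path, tree = path + "R", right
    return path


def distribute(tree, data):
    """Assign the index of every training instance to a top-tree leaf."""
    buckets = defaultdict(list)
    for i, point in enumerate(data):
        buckets[leaf_id(tree, point)].append(i)
    return buckets


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(10000)]
top_tree = build_top_tree(rng.sample(data, 100), depth=2)  # small random subset
buckets = distribute(top_tree, data)                       # all instances
```

Each bucket is far smaller than the full dataset, so the "further processing" of a leaf (for instance, growing a subtree on its instances, which is an assumption here) only needs that bucket in memory.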

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Random Forest as a tumour genetic marker extractor

Author: Pérez Arnal Raquel Leandra
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019

Identifying tumour genetic markers is an essential task for biomedicine. In this thesis, we analyse a dataset of chromosomal rearrangements of cancer samples and present a methodology for extracting genetic markers from this dataset by using a Random Forest as a feature selection tool.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC