5,441 research outputs found

The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial

Author: Crowley Mark
Ghojogh Benyamin
Publication date: 28/05/2019

In this tutorial paper, we first define mean squared error, variance, covariance, and bias of both random variables and classification/predictor models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances where we make use of Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples which are $K$-fold and leave-one-out cross validations. We briefly introduce generalized cross validation and then move on to regularization where we use the SURE again. We work on both $\ell_2$ and $\ell_1$ norm regularizations. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and it is explained as both an additive model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents overfitting. As examples of regularization, the theory of ridge and lasso regressions, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting. Comment: 23 pages, 9 figures
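
As a rough illustration of the K-fold and leave-one-out cross validation mentioned in this abstract, the sketch below estimates the validation error of a one-feature ridge (\ell_2-regularized) regression. It is not taken from the paper: the function names `k_fold_splits`, `ridge_1d`, and `cv_mse`, the no-intercept model, and the toy data are all assumptions made for illustration.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Return k (train_indices, validation_indices) pairs covering 0..n-1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    splits = []
    for fold in folds:
        held_out = set(fold)
        train = [j for j in idx if j not in held_out]
        splits.append((train, fold))
    return splits

def ridge_1d(xs, ys, lam):
    """Closed-form ridge weight for y ~ w * x (single feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k):
    """Average validation mean squared error over the k folds."""
    fold_errors = []
    for train, val in k_fold_splits(len(xs), k):
        w = ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_errors.append(sum((w * xs[i] - ys[i]) ** 2 for i in val) / len(val))
    return sum(fold_errors) / len(fold_errors)

if __name__ == "__main__":
    xs = [float(i) for i in range(1, 11)]
    ys = [2.0 * x for x in xs]               # noiseless toy data, true weight 2
    for lam in (0.0, 1.0, 100.0):
        print(lam, cv_mse(xs, ys, lam, k=5))  # 5-fold cross validation
    print(cv_mse(xs, ys, 1.0, k=len(xs)))     # leave-one-out: k equals n
```

Setting `k` equal to the number of samples gives leave-one-out cross validation; comparing `cv_mse` across values of `lam` is one simple way to choose the regularization strength.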

arXiv.org e-Print Archive

AI Education Matters: Lessons from a Kaggle Click-Through Rate Prediction Competition

Author: Neller Todd W.
Publication venue: The Cupola: Scholarship at Gettysburg College
Publication date: 01/01/2018

In this column, we will look at a particular Kaggle.com click-through rate (CTR) prediction competition, observe what the winning entries teach about this part of the machine learning landscape, and then discuss the valuable opportunities and resources this commends to AI educators and their students. [excerpt]

Gettysburg College

Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost

Author: Hofner Benjamin
Mayr Andreas
Robinzonov Nikolay
Schmid Matthias
Publication date: 14/02/2012

We provide a detailed hands-on tutorial for the R add-on package mboost. The package implements boosting for optimizing general risk functions utilizing component-wise (penalized) least squares estimates as base-learners for fitting various kinds of generalized linear and generalized additive models to potentially high-dimensional data. We give a theoretical background and demonstrate how mboost can be used to fit interpretable models of different complexity. As an example, we use mboost to predict body fat based on anthropometric measurements throughout the tutorial.

Crossref

Open Access LMU

Forecasting Player Behavioral Data and Simulating in-Game Events

Author: A Natekin
AJ Fox
C Bauckhage
Colin Chen
DH Ackley
G Ridgeway
G Schwarz
G Zhang
GE Box
GE Hinton
H Akaike
JG Cragg
JG Gooijer De
JH Friedman
KD Lawrence
L Deng
L Dwyer
M Gilliland
M Längkvist
MS El-Nasr
N Srivastava
NE Breslow
PH Eilers
PJ Brockwell
RJ Hyndman
S Asmussen
S Hochreiter
S Makridakis
SN Wood
T Hastie
T Zhang
TJ Hastie
Y Bengio
Publication venue: Springer Science and Business Media LLC
Publication date: 05/10/2017

Understanding player behavior is fundamental in game data science. Video games evolve as players interact with the game, so being able to foresee player experience would help to ensure successful game development. In particular, game developers need to evaluate beforehand the impact of in-game events. Simulation optimization of these events is crucial to increase player engagement and maximize monetization. We present an experimental analysis of several methods to forecast game-related variables, with two main aims: to obtain accurate predictions of in-app purchases and playtime in an operational production environment, and to perform simulations of in-game events in order to maximize sales and playtime. Our ultimate purpose is to take a step towards the data-driven development of games. The results suggest that, even though the performance of traditional approaches such as ARIMA is still better, the outcomes of state-of-the-art techniques like deep learning are promising. Deep learning emerges as a well-suited general model that could be used to forecast a variety of time series with different dynamic behaviors.

arXiv.org e-Print Archive

Crossref

Tree Boosting Data Competitions with XGBoost

Author: Bort Escabias Carlos
Publication venue: Edicions de la Universitat de Barcelona
Publication date: 01/01/2017

The objective of this Master's Degree Thesis is to provide an understanding of how to approach a supervised learning predictive problem and to illustrate it using a statistical/machine learning algorithm, Tree Boosting. A review of tree methodology is introduced in order to understand its evolution, from Classification and Regression Trees, through Bagging and Random Forest, to present-day Tree Boosting. The methodology is explained following the XGBoost implementation, which achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is explained with its proper concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross validation and feature engineering. All these concepts are illustrated with a real dataset of videogame churn used in a datathon competition.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations

Author: Mayr Andreas
Schmid Matthias
Publication venue: Public Library of Science (PLoS)
Publication date: 25/10/2013

The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation and evaluation steps. This might result in marker combinations that are only suboptimal regarding the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, which is a non-parametric measure to quantify the discriminatory power of a prediction rule. Specifically, we propose a component-wise boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to a higher discriminatory power than traditional approaches for the derivation of gene signatures. Comment: revised manuscript - added simulation study, additional results
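
To make the concordance index named in this abstract concrete, here is a minimal sketch of a plain Harrell-style estimate for right-censored data. It is an assumption-laden illustration and not the smoothed criterion the paper optimizes: the function name `concordance_index`, the convention that a higher score means higher risk, and the half credit for tied scores are all choices made here.

```python
def concordance_index(times, events, scores):
    """Fraction of comparable pairs that the scores rank correctly.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the observation was censored
    scores : predicted risk, where a higher score is assumed to mean
             an earlier event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event
            # strictly before the observed time of j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

if __name__ == "__main__":
    times = [1, 2, 3, 4]
    events = [1, 1, 1, 1]
    print(concordance_index(times, events, [4, 3, 2, 1]))  # perfectly ranked
    print(concordance_index(times, events, [1, 2, 3, 4]))  # perfectly reversed
```

A value of 1 means every comparable pair is ordered correctly, 0.5 corresponds to uninformative scores, and 0 means every pair is reversed.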

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

Open Access LMU

PubMed Central

FigShare

Analyzing First-Person Stories Based on Socializing, Eating and Sedentary Patterns

Author: A Cartas
A Natekin
A Torralba
AR Doherty
BC Russell
CJ Burges
E Talavera
F Pedregosa
M Bolanos
M Dimiccoli
N Srivastava
O Kramer
O Russakovsky
Publication date: 25/07/2017

First-person stories can be analyzed by means of egocentric pictures acquired throughout the whole active day with wearable cameras. This manuscript presents an egocentric dataset with more than 45,000 pictures from four people in different environments such as working or studying. All the images were manually labeled to identify three patterns of interest regarding people's lifestyle: socializing, eating and sedentary. Additionally, two different approaches are proposed to classify egocentric images into one of the 12 target categories defined to characterize these three patterns. The approaches are based on machine learning and deep learning techniques, including traditional classifiers and state-of-the-art convolutional neural networks. The experimental results obtained when applying these methods to the egocentric dataset demonstrated their adequacy for the problem at hand. Comment: Accepted at First International Workshop on Social Signal Processing and Beyond, 19th International Conference on Image Analysis and Processing (ICIAP), September 201

arXiv.org e-Print Archive

Crossref

Analyzing First-Person Stories Based on Socializing, Eating and Sedentary Patterns

Author: Herruzo Pedro
Portell Laura
Soto Alberto
Remeseiro Beatriz
Publication date: 01/03/2001

arXiv.org e-Print Archive

Kansai Gaidai University Repository