The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial
In this tutorial paper, we first define mean squared error, variance,
covariance, and bias of both random variables and classification/predictor
models. Then, we formulate the true and generalization errors of the model for
both training and validation/test instances, where we make use of Stein's
Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and
generalization using the obtained true and generalization errors. We introduce
cross validation and two well-known examples, which are k-fold and
leave-one-out cross validation. We briefly introduce generalized cross
validation and then move on to regularization, where we use SURE again. We
work on both ℓ1 and ℓ2 norm regularizations. Then, we show that
bootstrap aggregating (bagging) reduces the variance of estimation. Boosting,
specifically AdaBoost, is introduced and it is explained as both an additive
model and a maximum margin model, i.e., Support Vector Machine (SVM). The upper
bound on the generalization error of boosting is also provided to show why
boosting prevents overfitting. As examples of regularization, the theory
of ridge and lasso regressions, weight decay, noise injection to input/weights,
and early stopping are explained. Random forest, dropout, histogram of oriented
gradients, and single shot multi-box detector are explained as examples of
bagging in machine learning and computer vision. Finally, boosting tree and SVM
models are mentioned as examples of boosting.
Comment: 23 pages, 9 figures
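The k-fold and leave-one-out procedures surveyed above are easy to make concrete; the following is an illustrative plain-Python sketch (not code from the tutorial), with a hypothetical `train_and_score` callback standing in for any model-fitting routine:

```python
import random

def k_fold_cv(data, k, train_and_score):
    """Estimate generalization error by k-fold cross validation.
    Each fold serves once as the validation set while the remaining
    k-1 folds form the training set; errors are averaged over folds."""
    data = list(data)
    random.shuffle(data)                     # random fold assignment
    folds = [data[i::k] for i in range(k)]   # k roughly equal folds
    errors = []
    for i in range(k):
        valid = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        errors.append(train_and_score(train, valid))
    return sum(errors) / k

# Toy "model": predict the training mean; score with mean squared error.
def mean_model_mse(train, valid):
    mu = sum(train) / len(train)
    return sum((v - mu) ** 2 for v in valid) / len(valid)

cv_error = k_fold_cv(range(100), k=5, train_and_score=mean_model_mse)
```

Leave-one-out cross validation is then just the special case k = n, i.e. `k_fold_cv(data, k=len(data), ...)`.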
AI Education Matters: Lessons from a Kaggle Click-Through Rate Prediction Competition
In this column, we will look at a particular Kaggle.com click-through rate (CTR) prediction competition, observe what the winning entries teach about this part of the machine learning landscape, and then discuss the valuable opportunities and resources this commends to AI educators and their students. [excerpt]
Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost
We provide a detailed hands-on tutorial for the R add-on package mboost. The package implements boosting for optimizing general risk functions, utilizing component-wise (penalized) least squares estimates as base-learners for fitting various kinds of generalized linear and generalized additive models to potentially high-dimensional data. We give a theoretical background and demonstrate how mboost can be used to fit interpretable models of different complexity. As a running example throughout the tutorial, we use mboost to predict body fat from anthropometric measurements.
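The component-wise least-squares boosting that mboost implements can be sketched in plain Python for intuition; this toy version (my illustration, not the package's code) assumes centered covariates, squared-error loss, and a fixed step length ν:

```python
def componentwise_l2_boost(X, y, n_iter=100, nu=0.1):
    """Component-wise L2 boosting: each iteration regresses the current
    residuals on every single covariate (simple least squares without
    intercept), keeps the best-fitting component, and takes a damped
    step of length nu. X is a list of rows; assumes centered columns."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n            # offset: the mean of y
    coef = [0.0] * p
    resid = [yi - intercept for yi in y]
    for _ in range(n_iter):
        best_j, best_b, best_rss = None, 0.0, float("inf")
        for j in range(p):
            xj = [row[j] for row in X]
            sxx = sum(x * x for x in xj)
            if sxx == 0:
                continue
            b = sum(x * r for x, r in zip(xj, resid)) / sxx   # LS slope
            rss = sum((r - b * x) ** 2 for x, r in zip(xj, resid))
            if rss < best_rss:
                best_j, best_b, best_rss = j, b, rss
        if best_j is None:
            break
        coef[best_j] += nu * best_b
        resid = [r - nu * best_b * row[best_j] for r, row in zip(resid, X)]
    return intercept, coef
```

Because only one coefficient moves per iteration, stopping the loop early performs implicit variable selection, which is what makes this family of models interpretable even in high dimensions.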
Forecasting Player Behavioral Data and Simulating in-Game Events
Understanding player behavior is fundamental in game data science. Video
games evolve as players interact with the game, so being able to foresee player
experience helps to ensure successful game development. In particular,
game developers need to evaluate beforehand the impact of in-game events.
Simulation optimization of these events is crucial to increase player
engagement and maximize monetization. We present an experimental analysis of
several methods to forecast game-related variables, with two main aims: to
obtain accurate predictions of in-app purchases and playtime in an operational
production environment, and to perform simulations of in-game events in order
to maximize sales and playtime. Our ultimate purpose is to take a step towards
the data-driven development of games. The results suggest that, even though the
performance of traditional approaches such as ARIMA is still better, the
outcomes of state-of-the-art techniques such as deep learning are promising. Deep
learning emerges as a well-suited general model that could be used to forecast
a variety of time series with different dynamic behaviors.
Tree Boosting Data Competitions with XGBoost
The objective of this Master's thesis is to explain how to approach a supervised learning prediction problem and to illustrate the process with a statistical/machine learning algorithm, Tree Boosting. A review of tree methodology is given to trace its evolution, from Classification and Regression Trees, through Bagging and Random Forests, to present-day Tree Boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is presented together with its key concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross validation, and feature engineering. All these concepts are illustrated on a real dataset of video game churn used in a datathon competition.
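The additive-model framework underlying tree boosting can be made concrete with one-feature regression stumps; the sketch below (an illustration only, far simpler than XGBoost's regularized second-order method) fits each new stump to the current residuals and shrinks it by a learning rate:

```python
def fit_stump(x, residuals):
    """Best single-split regression stump (squared-error criterion)."""
    best = None
    for split in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        if not left or not right:
            continue
        lval, rval = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lval) ** 2 for r in left)
               + sum((r - rval) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lval, rval)
    return best[1:]   # (split, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, lr=0.3):
    """Additive model: each round fits a stump to the residuals (the
    negative gradient of squared error) and adds it with shrinkage."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        split, lval, rval = fit_stump(x, resid)
        pred = [p + lr * (lval if xi <= split else rval)
                for p, xi in zip(pred, x)]
    return pred

preds = gradient_boost([0, 1, 2, 3, 4, 5], [0, 0, 0, 5, 5, 5])
```

The learning rate plays the same shrinkage role as XGBoost's `eta`: smaller steps need more rounds but generalize better, which is exactly the overfitting/regularization trade-off the thesis discusses.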
Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations
The development of molecular signatures for the prediction of time-to-event
outcomes is a methodologically challenging task in bioinformatics and
biostatistics. Although there are numerous approaches for the derivation of
marker combinations and their evaluation, the underlying methodology often
suffers from the problem that different optimization criteria are mixed during
the feature selection, estimation and evaluation steps. This might result in
marker combinations that are only suboptimal regarding the evaluation criterion
of interest. To address this issue, we propose a unified framework to derive
and evaluate biomarker combinations. Our approach is based on the concordance
index for time-to-event data, which is a non-parametric measure to quantify the
discriminatory power of a prediction rule. Specifically, we propose a
component-wise boosting algorithm that results in linear biomarker combinations
that are optimal with respect to a smoothed version of the concordance index.
We investigate the performance of our algorithm in a large-scale simulation
study and in two molecular data sets for the prediction of survival in breast
cancer patients. Our numerical results show that the new approach is not only
methodologically sound but can also lead to a higher discriminatory power than
traditional approaches for the derivation of gene signatures.
Comment: revised manuscript; added simulation study and additional results
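For reference, the (unsmoothed) concordance index underlying the proposed objective can be computed directly; the minimal sketch below is Harrell's estimator for right-censored data, not the authors' smoothed boosting objective:

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data. A pair (i, j)
    is comparable when the subject with the shorter observed time had the
    event (events[i] == 1); it is concordant when that subject also has
    the higher predicted risk. Tied risks count 0.5. Assumes at least
    one comparable pair exists."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

Because the pairwise indicator functions make this criterion non-differentiable, the paper boosts a smoothed surrogate of it rather than the raw index itself.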
Analyzing First-Person Stories Based on Socializing, Eating and Sedentary Patterns
First-person stories can be analyzed by means of egocentric pictures acquired
throughout the whole active day with wearable cameras. This manuscript presents
an egocentric dataset with more than 45,000 pictures from four people in
different environments such as working or studying. All the images were
manually labeled to identify three patterns of interest regarding people's
lifestyle: socializing, eating and sedentary. Additionally, two different
approaches are proposed to classify egocentric images into one of the 12 target
categories defined to characterize these three patterns. The approaches are
based on machine learning and deep learning techniques, including traditional
classifiers and state-of-art convolutional neural networks. The experimental
results obtained when applying these methods to the egocentric dataset
demonstrated their adequacy for the problem at hand.
Comment: Accepted at the First International Workshop on Social Signal Processing
and Beyond, 19th International Conference on Image Analysis and Processing
(ICIAP), September 201