Finding Influential Training Samples for Gradient Boosted Decision Trees
We address the problem of finding influential training samples for a
particular case of tree ensemble-based models, e.g., Random Forest (RF) or
Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this
problem is studying how the model's predictions change upon leave-one-out
retraining, leaving out each individual training sample. Recent work has shown
that, for parametric models, this analysis can be conducted in a
computationally efficient way. We propose several ways of extending this
framework to non-parametric GBDT ensembles under the assumption that tree
structures remain fixed. Furthermore, we introduce a general scheme of
obtaining further approximations to our method that balance the trade-off
between performance and computational complexity. We evaluate our approaches on
various experimental setups and use-case scenarios and demonstrate both the
quality of our approach to finding influential training samples in comparison
to the baselines and its computational efficiency.
Comment: Added the "Acknowledgements" section
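The leave-one-out formalization described in this abstract can be sketched by brute force. The example below is only an illustration of the definition, not the paper's method: it swaps in a one-dimensional least-squares line as a stand-in model (an assumption, chosen because it needs no libraries), and all function names and the toy data are hypothetical. Each sample's influence is the change in a test prediction when the model is refit without that sample.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def predict(params, x):
    a, b = params
    return a + b * x


def loo_influence(xs, ys, x_test):
    """Change in the prediction at x_test when each training sample is left out."""
    full_pred = predict(fit_line(xs, ys), x_test)
    influences = []
    for i in range(len(xs)):
        # Refit from scratch without sample i (the expensive baseline).
        xs_loo = xs[:i] + xs[i + 1:]
        ys_loo = ys[:i] + ys[i + 1:]
        loo_pred = predict(fit_line(xs_loo, ys_loo), x_test)
        influences.append(loo_pred - full_pred)
    return influences


# Hypothetical toy data: the last point is an outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 10.0]
influences = loo_influence(xs, ys, x_test=5.0)
most_influential = max(range(len(xs)), key=lambda i: abs(influences[i]))
```

Refitting once per training sample is what makes this baseline costly for large ensembles, which is the motivation for the approximations the abstract mentions.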
Predicting time to graduation at a large enrollment American university
The time it takes a student to graduate with a university degree is influenced
by a variety of factors such as their background, their academic performance at
university, and their integration into the social communities of the university
they attend. Different universities have different populations, student
services, instruction styles, and degree programs; however, they all collect
institutional data. This study presents data for 160,933 students attending a
large American research university. The data includes performance, enrollment,
demographics, and preparation features. Discrete time hazard models for the
time-to-graduation are presented in the context of Tinto's Theory of Drop Out.
Additionally, a novel machine learning method, gradient boosted trees, is
applied and compared to the typical maximum likelihood method. We demonstrate
that enrollment factors (such as changing a major) lead to greater increases in
the model's performance in predicting when a student graduates than performance
factors (such as grades) or preparation factors (such as high school GPA).
Comment: 28 pages, 11 figures
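As a rough illustration of what a discrete-time hazard model estimates, the sketch below computes the covariate-free hazard h(t) = P(T = t | T >= t) and the survival curve S(t) = prod_{j<=t} (1 - h(j)) from toy data. The data, function names, and the use of academic terms as the time unit are all hypothetical. The study's models also include covariates, which this sketch omits.

```python
def discrete_hazard(durations, events, max_t):
    """Life-table estimate: h(t) = graduates at t / students still at risk at t."""
    hazards = []
    for t in range(1, max_t + 1):
        at_risk = sum(1 for d in durations if d >= t)
        graduated = sum(1 for d, e in zip(durations, events) if d == t and e)
        hazards.append(graduated / at_risk if at_risk else 0.0)
    return hazards


def survival(hazards):
    """S(t) = product over j <= t of (1 - h(j)): probability of not yet graduating."""
    curve, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        curve.append(s)
    return curve


# Hypothetical toy data: terms enrolled, and whether the student graduated
# (True) or left the data without graduating, i.e. was censored (False).
durations = [8, 8, 9, 10, 6, 8]
events = [True, True, True, True, False, True]

hazards = discrete_hazard(durations, events, max_t=10)
surv = survival(hazards)
```

Without covariates, this ratio of events to students at risk is the maximum likelihood estimate of the hazard in each period. Covariate-dependent versions are commonly fit as a binary classifier on person-period records.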
Scalable Privacy-Compliant Virality Prediction on Twitter
The digital town hall of Twitter has become a preferred medium of communication
for individuals and organizations across the globe. Some of them reach
audiences of millions, while others struggle to get noticed. Given the impact
of social media, the question remains more relevant than ever: how to model the
dynamics of attention on Twitter. Researchers around the world turn to machine
learning to predict the most influential tweets and authors, navigating the
volume, velocity, and variety of social big data, with many compromises. In
this paper, we revisit content popularity prediction on Twitter. We argue that
strict alignment of data acquisition, storage and analysis algorithms is
necessary to avoid the common trade-offs between scalability, accuracy and
privacy compliance. We propose a new framework for the rapid acquisition of
large-scale datasets, a high-accuracy supervisory signal, and multilingual
sentiment prediction while respecting every applicable privacy request. We then
apply a novel gradient boosting framework to achieve state-of-the-art results
in virality ranking, even before including a tweet's visual or propagation
features. Our Gradient Boosted Regression Tree is the first to offer
explainable, strong ranking performance on benchmark datasets. Since the
analysis focused on features available early, the model is immediately
applicable to incoming tweets in 18 languages.
Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective
Content Analysis
Boosting insights in insurance tariff plans with tree-based machine learning methods
Pricing actuaries typically operate within the framework of generalized
linear models (GLMs). With the upswing of data analytics, our study focuses
on machine learning methods to develop full tariff plans built from both the
frequency and severity of claims. We adapt the loss functions used in the
algorithms such that the specific characteristics of insurance data are
carefully incorporated: highly unbalanced count data with excess zeros and
varying exposure on the frequency side combined with scarce, but potentially
long-tailed data on the severity side. A key requirement is the need for
transparent and interpretable pricing models which are easily explainable to
all stakeholders. We therefore focus on machine learning with decision trees:
starting from simple regression trees, we work towards more advanced ensembles
such as random forests and boosted trees. We show how to choose the optimal
tuning parameters for these models in an elaborate cross-validation scheme,
present visualization tools to obtain insights from the resulting models, and
evaluate the economic value of these new modeling approaches. Boosted trees
outperform the classical GLMs, allowing the insurer to form profitable
portfolios and to guard against potential adverse risk selection.
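One way a loss function can account for varying exposure on the frequency side is a Poisson deviance in which each policy's expected claim count is its exposure times a claim rate. The abstract does not name the exact loss used, so the Poisson form here is an assumption, and the function name and toy portfolio are hypothetical.

```python
import math


def poisson_deviance(claims, exposure, rate):
    """Mean Poisson deviance; the expected count for a policy is exposure * rate."""
    total = 0.0
    for y, e, r in zip(claims, exposure, rate):
        mu = e * r
        # By convention y * log(y / mu) is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(claims)


# Hypothetical toy portfolio: mostly zero claims, exposure in policy-years.
claims = [0, 0, 1, 0, 2]
exposure = [1.0, 0.5, 1.0, 0.25, 1.0]

# Best constant rate: total claims divided by total exposure.
exposure_rate = sum(claims) / sum(exposure)  # 3 / 3.75 = 0.8
# Naive rate that ignores exposure: claims per policy.
naive_rate = sum(claims) / len(claims)  # 3 / 5 = 0.6

dev_exposure = poisson_deviance(claims, exposure, [exposure_rate] * 5)
dev_naive = poisson_deviance(claims, exposure, [naive_rate] * 5)
```

In this toy case the exposure-aware rate gives a lower deviance than the per-policy rate, which shows why policies observed for only part of a year should not be treated like full-year policies.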
European day-ahead electricity price forecasting
In the context of the increase in the fraction of power generation coming from unpredictable renewable sources, electricity prices are as volatile as ever. This volatility makes forecasting future prices more difficult yet more valuable. In this research, a benchmark of 8 forecasting models is conducted on the task of predicting day-ahead wholesale electricity prices in France, Germany, Belgium and the Netherlands. The methodology used to produce the forecasts is explained in detail. The differences in forecast accuracy between the models are tested for statistical significance. Gradient boosting produced the most accurate forecasts, closely followed by an ensemble method.
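The abstract does not say which significance test was used. One common choice for comparing two forecasters is the Diebold-Mariano test, sketched below under simplifying assumptions: squared-error loss, one-step-ahead forecasts, no autocorrelation correction, and a normal approximation. The function name and error series are hypothetical.

```python
import math


def diebold_mariano(errors_a, errors_b):
    """Return (statistic, two-sided p-value) for equal squared-error accuracy."""
    # Loss differential: positive values mean model A had the larger error.
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    stat = mean_d / math.sqrt(var_d / n)
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(stat) / math.sqrt(2.0)))
    return stat, 2.0 * (1.0 - cdf)


# Hypothetical forecast errors (actual minus predicted price) for two models.
errors_a = [1.0, -2.0, 1.5, -1.0, 2.0, -1.5]
errors_b = [0.5, -1.0, 0.5, -0.5, 1.0, -1.0]

stat, p_value = diebold_mariano(errors_a, errors_b)
```

A positive statistic with a small p-value would suggest that model B is the more accurate of the two. Six observations are far too few for the normal approximation to be reliable, so the numbers here only show the mechanics.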