Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost
We provide a detailed hands-on tutorial for the R add-on package mboost. The package implements boosting for optimizing general risk functions, utilizing component-wise (penalized) least-squares estimates as base-learners for fitting various kinds of generalized linear and generalized additive models to potentially high-dimensional data. We give a theoretical background and demonstrate how mboost can be used to fit interpretable models of different complexity. As a running example throughout the tutorial, we use mboost to predict body fat from anthropometric measurements.
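The component-wise boosting idea the abstract describes can be sketched in a few lines. The following is an illustrative Python analog of L2 boosting with univariate least-squares base-learners, not the mboost R API; the toy data and all names are made up for the example:

```python
import numpy as np

def l2_boost(X, y, n_steps=300, nu=0.1):
    """Component-wise L2 boosting: at each step, fit a univariate least-squares
    base-learner to the current residuals for every column, then update only
    the coefficient of the best-fitting column by a shrunken amount nu."""
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()            # offset: mean of the response
    resid = y - intercept
    for _ in range(n_steps):
        best_j, best_b, best_rss = 0, 0.0, np.inf
        for j in range(p):
            xj = X[:, j]
            b = xj @ resid / (xj @ xj)          # univariate LS fit to residuals
            rss = np.sum((resid - b * xj) ** 2)
            if rss < best_rss:
                best_j, best_b, best_rss = j, b, rss
        coef[best_j] += nu * best_b             # shrunken update of one coordinate
        resid -= nu * best_b * X[:, best_j]
    return intercept, coef

# toy data: the response depends on only 2 of 10 columns,
# so boosting should leave the other coefficients near zero
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(200)
b0, b = l2_boost(X, y)
```

The shrinkage parameter `nu` and the number of boosting steps play the roles of mboost's step length and early-stopping iteration: small steps with early stopping give the implicit variable selection that makes the fitted models interpretable.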
Interpretable statistics for complex modelling: quantile and topological learning
As the complexity of our data has increased exponentially over the last decades, so has our
need for interpretable features. This thesis revolves around two paradigms for approaching
this quest for insights.
In the first part we focus on parametric models, where the problem of interpretability
can be seen as a “parametrization selection”. We introduce a quantile-centric
parametrization and we show the advantages of our proposal in the context of regression,
where it allows us to bridge the gap between classical generalized linear (mixed)
models and increasingly popular quantile methods.
The second part of the thesis, concerned with topological learning, tackles the
problem from a non-parametric perspective. As topology can be thought of as a way
of characterizing data in terms of their connectivity structure, it allows complex
and possibly high-dimensional data to be represented through a few features, such as
the number of connected components, loops and voids. We illustrate how the emerging branch of
statistics devoted to recovering topological structures in the data, Topological Data
Analysis, can be exploited both for exploratory and inferential purposes with a special
emphasis on kernels that preserve the topological information in the data.
Finally, we show with an application how these two approaches can borrow strength
from one another in the identification and description of brain activity through fMRI
data from the ABIDE project.
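The simplest of the connectivity features mentioned above, the number of connected components, can be computed directly from a neighborhood graph. The sketch below is a minimal illustration of that idea, not the thesis' method: it counts the components of the ε-neighborhood graph of a point cloud with a union-find structure (all names and data are hypothetical):

```python
import numpy as np

def n_components(points, eps):
    """Number of connected components of the eps-neighborhood graph:
    the simplest topological summary of a point cloud (0-dim connectivity)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= eps:
                parent[find(i)] = find(j)   # merge the two clusters
    return len({find(i) for i in range(n)})

# two well-separated pairs of points
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
```

Tracking how this count (and its higher-dimensional analogs for loops and voids) changes as ε grows is the core idea behind the persistence-based summaries that Topological Data Analysis builds on.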
Modeling Persistent Trends in Distributions
We present a nonparametric framework to model a short sequence of probability
distributions that vary both due to underlying effects of sequential
progression and confounding noise. To distinguish between these two types of
variation and estimate the sequential-progression effects, our approach
leverages an assumption that these effects follow a persistent trend. This work
is motivated by the recent rise of single-cell RNA-sequencing experiments over
a brief time course, which aim to identify genes relevant to the progression of
a particular biological process across diverse cell populations. While
classical statistical tools focus on scalar-response regression or
order-agnostic differences between distributions, it is desirable in this
setting to consider both the full distributions as well as the structure
imposed by their ordering. We introduce a new regression model for ordinal
covariates where responses are univariate distributions and the underlying
relationship reflects consistent changes in the distributions over increasing
levels of the covariate. This concept is formalized as a "trend" in
distributions, which we define as an evolution that is linear under the
Wasserstein metric. Implemented via a fast alternating projections algorithm,
our method exhibits numerous strengths in simulations and analyses of
single-cell gene expression data.

Comment: To appear in: Journal of the American Statistical Association
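The notion of an evolution that is "linear under the Wasserstein metric" has a concrete form in one dimension, where the geodesic between two distributions is linear interpolation of their quantile functions. The following Python sketch is illustrative only, not the paper's algorithm, and assumes equal-size empirical samples:

```python
import numpy as np

def wasserstein1d(a, b):
    """2-Wasserstein distance between two equal-size 1-D samples:
    the L2 distance between their sorted values (empirical quantile functions)."""
    qa, qb = np.sort(a), np.sort(b)
    return np.sqrt(np.mean((qa - qb) ** 2))

def geodesic_point(a, b, t):
    """Point at fraction t along the Wasserstein geodesic from a to b:
    linear interpolation of quantile functions (McCann interpolation)."""
    qa, qb = np.sort(a), np.sort(b)
    return (1 - t) * qa + t * qb

# a distribution and a shifted copy: the geodesic between them is a pure shift
a = np.random.default_rng(1).normal(0.0, 1.0, 500)
b = a + 4.0
mid = geodesic_point(a, b, 0.5)   # halfway point: shifted by 2
```

A "trend" in the paper's sense is a sequence of distributions that lies along such a line: the halfway point sits at exactly half the total Wasserstein distance, which is what makes linear structure in this geometry well defined.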