Functional Regression
Functional data analysis (FDA) involves the analysis of data whose ideal
units of observation are functions defined on some continuous domain, and the
observed data consist of a sample of functions taken from some population,
sampled on a discrete grid. Ramsay and Silverman's 1997 textbook sparked the
development of this field, which has accelerated in the past 10 years to become
one of the fastest growing areas of statistics, fueled by the growing number of
applications yielding this type of data. One unique characteristic of FDA is
the need to combine information both across and within functions, which Ramsay
and Silverman called replication and regularization, respectively. This article
will focus on functional regression, the area of FDA that has received the most
attention in applications and methodological development. First will be an
introduction to basis functions, key building blocks for regularization in
functional regression methods, followed by an overview of functional regression
methods, split into three types: [1] functional predictor regression
(scalar-on-function), [2] functional response regression (function-on-scalar)
and [3] function-on-function regression. For each, the role of replication and
regularization will be discussed and the methodological development described
in a roughly chronological manner, at times deviating from the historical
timeline to group together similar methods. The primary focus is on modeling
and methodology, highlighting the modeling structures that have been developed
and the various regularization approaches employed. At the end is a brief
discussion describing potential areas of future development in this field
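The role of basis functions in scalar-on-function regression can be illustrated with a minimal sketch. It assumes a small Fourier basis, simulated curves on a common grid, and trapezoid weights for the integral; none of these choices come from the article, and truncating the basis is only one of several possible regularization approaches. All names (`fourier_basis`, `Phi`, `b_true`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 150, 101                      # number of curves, grid points per curve
t = np.linspace(0.0, 1.0, m)

def fourier_basis(t, n_pairs):
    """Constant term plus n_pairs sine/cosine pairs, evaluated on the grid t."""
    cols = [np.ones_like(t)]
    for k in range(1, n_pairs + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

Phi = fourier_basis(t, 2)            # basis for the coefficient function, shape (m, 5)
Psi = fourier_basis(t, 4)            # richer basis used only to simulate the curves

# Simulated functional predictors x_i(t), observed on the discrete grid.
X = rng.normal(size=(n, Psi.shape[1])) @ Psi.T

# Trapezoid quadrature weights approximating the integral over [0, 1].
dt = t[1] - t[0]
w = np.full(m, dt)
w[0] = w[-1] = 0.5 * dt

# Assumed true coefficient function, chosen to lie in the span of Phi.
b_true = np.array([0.0, 1.0, 0.0, 0.0, -0.5])
beta_true = Phi @ b_true
y = (X * w) @ beta_true + rng.normal(scale=0.01, size=n)

# The basis expansion turns the functional model into ordinary least squares:
# y_i ~ sum_k b_k * integral of x_i(t) * phi_k(t) dt.
Z = (X * w) @ Phi
b_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta_hat = Phi @ b_hat               # estimated coefficient function on the grid
```

Here regularization comes only from keeping the basis small; a roughness penalty on `b_hat` would be a natural alternative under the same setup.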
A Framework for Unbiased Model Selection Based on Boosting
Variable selection and model choice are of major concern in many statistical applications, especially in high-dimensional regression models. Boosting is a convenient statistical method that combines model fitting with intrinsic model selection.
We investigate the impact of base-learner specification on the performance of boosting as a model selection procedure.
We show that variable selection may be biased if the covariates differ in nature.
Important examples are models combining continuous and categorical covariates, especially if the number of categories is large. In this case, least squares base-learners offer increased flexibility for the categorical covariate and lead to a preference for it even if the categorical covariate is non-informative.
Similar difficulties arise when comparing linear and nonlinear base-learners for a continuous covariate. The additional flexibility in the nonlinear base-learner again yields a preference for the more complex modeling alternative.
We investigate these problems from a theoretical perspective and suggest a framework for unbiased model selection based on a general class of penalized least squares base-learners.
Making all base-learners comparable in terms of their degrees of freedom strongly reduces the selection bias observed in naive boosting specifications. The importance of unbiased model selection is demonstrated in simulations and an application to forest health models
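The idea of making base-learners comparable in their degrees of freedom can be sketched as follows. This is a rough illustration, not the authors' implementation: it assumes a ridge penalty on a dummy-coded categorical covariate and measures degrees of freedom as the trace of the hat matrix, which may differ from the definition used in the paper. The function names `hat_matrix_df` and `lambda_for_df` are hypothetical.

```python
import numpy as np

def hat_matrix_df(X, K, lam):
    """Effective degrees of freedom trace(S), with S = X (X'X + lam*K)^{-1} X'."""
    A = X.T @ X + lam * K
    return float(np.trace(X @ np.linalg.solve(A, X.T)))

def lambda_for_df(X, K, target_df, lo=1e-8, hi=1e8, iters=200):
    """Bisection on a log scale; df decreases monotonically as lam grows."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if hat_matrix_df(X, K, mid) > target_df:
            lo = mid                 # still too flexible: increase the penalty
        else:
            hi = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
n, levels = 200, 10

# Categorical covariate with 10 levels, dummy coded (one column per level).
codes = rng.integers(0, levels, size=n)
X_cat = np.eye(levels)[codes]
K_cat = np.eye(levels)               # ridge penalty matrix (an assumption)

# Centered linear base-learner for a continuous covariate: df = 1 without penalty.
x = rng.normal(size=n)
X_lin = (x - x.mean())[:, None]

lam = lambda_for_df(X_cat, K_cat, target_df=1.0)
df_cat = hat_matrix_df(X_cat, K_cat, lam)
df_lin = hat_matrix_df(X_lin, np.zeros((1, 1)), 0.0)
```

With the penalty chosen this way, both base-learners enter each boosting iteration with one effective degree of freedom, so the categorical covariate no longer wins simply by being more flexible.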
Model-Based Clustering and Classification of Functional Data
The problem of complex data analysis is a central topic of modern statistical
science and learning systems and is becoming of broader interest with the
increasing prevalence of high-dimensional data. The challenge is to develop
statistical models and autonomous algorithms that are able to acquire knowledge
from raw data, either for exploratory analysis, which can be achieved through
clustering techniques, or for predicting future data via classification
(i.e., discriminant analysis) techniques. Latent data models, including mixture
model-based approaches, are among the most popular and successful approaches in
both the unsupervised context (i.e., clustering) and the supervised one (i.e.,
classification or discrimination). Although traditionally tools of multivariate
analysis, they are growing in popularity when considered in the framework of
functional data analysis (FDA). FDA is the data analysis paradigm in which the
individual data units are functions (e.g., curves, surfaces), rather than
simple vectors. In many areas of application, the analyzed data are indeed
often available in the form of discretized values of functions or curves (e.g.,
time series, waveforms) and surfaces (e.g., 2d-images, spatio-temporal data).
This functional aspect of the data introduces additional difficulties compared
with classical multivariate (non-functional) data analysis. We review and
present approaches for model-based clustering and classification of functional
data. We derive well-established statistical models along with efficient
algorithmic tools to address problems regarding the clustering and the
classification of these high-dimensional data, including their heterogeneity,
missing information, and dynamical hidden structure. The presented models and
algorithms are illustrated on real-world functional data analysis problems from
several application area
Bayesian semiparametric inference for multivariate doubly-interval-censored data
Based on a data set obtained in a dental longitudinal study, conducted in
Flanders (Belgium), the joint time to caries distribution of permanent first
molars was modeled as a function of covariates. This involves an analysis of
multivariate continuous doubly-interval-censored data since: (i) the emergence
time of a tooth and the time it experiences caries were recorded yearly, and
(ii) events on teeth of the same child are dependent. To model the joint
distribution of the emergence times and the times to caries, we propose a
dependent Bayesian semiparametric model. A major feature of the proposed
approach is that survival curves can be estimated without imposing assumptions
such as proportional hazards, additive hazards, proportional odds or
accelerated failure time. Comment: Published at http://dx.doi.org/10.1214/10-AOAS368 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Parametric Regression on the Grassmannian
We address the problem of fitting parametric curves on the Grassmann manifold
for the purpose of intrinsic parametric regression. As customary in the
literature, we start from the energy minimization formulation of linear
least-squares in Euclidean spaces and generalize this concept to general
nonflat Riemannian manifolds, following an optimal-control point of view. We
then specialize this idea to the Grassmann manifold and demonstrate that it
yields a simple, extensible and easy-to-implement solution to the parametric
regression problem. In fact, it allows us to extend the basic geodesic model to
(1) a time-warped variant and (2) cubic splines. We demonstrate the utility of
the proposed solution on different vision problems, such as shape regression as
a function of age, traffic-speed estimation and crowd-counting from
surveillance video clips. Most notably, these problems can be conveniently
solved within the same framework without any specifically-tailored steps along
the processing pipeline. Comment: 14 pages, 11 figure
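As a small illustration of the basic geodesic model mentioned in this abstract, the sketch below evaluates a geodesic on the Grassmann manifold using the standard SVD-based closed form for a point represented by an orthonormal basis `Y0` and a tangent direction `H` with `Y0.T @ H = 0`. It is not the paper's regression procedure, which fits such curves to data; the function name `grassmann_geodesic` and the random inputs are assumptions for illustration only.

```python
import numpy as np

def grassmann_geodesic(Y0, H, t):
    """Point at time t on the geodesic starting at span(Y0) with velocity H.

    Uses the thin SVD H = U diag(s) V' and the closed form
    Y(t) = Y0 V cos(s t) V' + U sin(s t) V'.
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return (Y0 @ Vt.T) @ np.diag(np.cos(s * t)) @ Vt + U @ np.diag(np.sin(s * t)) @ Vt

rng = np.random.default_rng(1)
n, p = 6, 2                          # 2-dimensional subspaces of R^6

# Orthonormal basis of a random starting subspace.
Y0, _ = np.linalg.qr(rng.normal(size=(n, p)))

# Random direction projected onto the tangent space, so that Y0' H = 0.
G = rng.normal(size=(n, p))
H = G - Y0 @ (Y0.T @ G)

Y_start = grassmann_geodesic(Y0, H, 0.0)
Y_half = grassmann_geodesic(Y0, H, 0.5)
```

The returned matrices keep orthonormal columns for every `t`, so each one is a valid representative of a subspace along the curve.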