9 research outputs found
An Alternative Approach to Functional Linear Partial Quantile Regression
We previously proposed the partial quantile regression (PQR) prediction
procedure for functional linear models using partial quantile covariance
techniques, and developed the simple partial quantile regression (SIMPQR)
algorithm to efficiently extract the PQR basis for estimating functional
coefficients. However, although the PQR approach is considered an attractive
alternative to projections onto the principal component basis, there are
certain limitations to uncovering its asymptotic properties, mainly because
of its iterative nature and the non-differentiability of the quantile loss
function. In this article, we propose and implement an alternative
formulation of partial quantile regression (APQR) for functional linear
models using a block relaxation method and finite smoothing techniques. The
proposed reformulation leads to insightful results and motivates new theory,
demonstrating consistency and establishing convergence rates by applying
advanced techniques from empirical process theory. Two simulations and two
real data sets, from the ADHD-200 sample and ADNI, are investigated to show
the superiority of our proposed methods.
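The finite-smoothing idea behind this line of work (replace the non-differentiable quantile loss with a smooth surrogate so that gradient-based updates apply) can be sketched in a few lines. The Huber-type smoothing and the plain gradient loop below are illustrative choices, not the paper's exact block relaxation algorithm:

```python
import numpy as np

def smoothed_pinball(u, tau, h):
    """Quantile (pinball) loss smoothed on [-h, h]: quadratic near zero,
    linear outside, continuous at the knots (one common smoothing choice)."""
    return np.where(np.abs(u) <= h,
                    u**2 / (4 * h) + (tau - 0.5) * u + h / 4,
                    np.where(u > 0, tau * u, (tau - 1) * u))

def fit_smoothed_qr(X, y, tau=0.5, h=0.1, lr=0.1, iters=3000):
    """Linear quantile regression by gradient descent on the smoothed loss."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        u = y - X @ beta
        # derivative of the smoothed loss with respect to the residual u
        g = np.where(np.abs(u) <= h, u / (2 * h) + (tau - 0.5),
                     np.where(u > 0, tau, tau - 1))
        beta += lr * X.T @ g / len(y)   # descent step on beta
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=200)
beta_hat = fit_smoothed_qr(X, y)
print(beta_hat)  # roughly [1, 2] at the median
```

Because the smoothed loss is differentiable everywhere, standard convergence arguments apply, which is the practical point of trading the exact pinball loss for its smoothed version.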
Feature Selection for Functional Data
In this paper we address the problem of feature selection when the data are
functional, studying several statistical procedures including classification,
regression and principal components. One advantage of the blinding procedure is
that it is very flexible, since the features are defined by a set of functions,
relevant to the problem being studied, proposed by the user. Our method is
consistent under a set of quite general assumptions, and produces good results
in the real data examples that we analyze.
Comment: 22 pages, 4 figures
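The feature-construction step described above (user-supplied functions defining features of each curve) can be sketched as below. The inner-product feature map and the correlation screening are a minimal illustration under assumed names, not the authors' exact procedure:

```python
import numpy as np

def extract_features(curves, grid, feature_fns):
    """Feature j of curve X_i is the inner product <X_i, f_j>,
    approximated by a Riemann sum on the sampling grid."""
    F = np.stack([fn(grid) for fn in feature_fns])        # (p, m)
    dt = grid[1] - grid[0]
    return curves @ F.T * dt                              # (n, p)

def rank_features(features, y):
    """Rank features by absolute correlation with the scalar response."""
    corrs = np.array([abs(np.corrcoef(features[:, j], y)[0, 1])
                      for j in range(features.shape[1])])
    return np.argsort(corrs)[::-1]

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 101)
a = rng.normal(size=80)                                   # latent coefficients
curves = a[:, None] * np.sin(2 * np.pi * grid)[None, :] \
         + 0.1 * rng.normal(size=(80, 101))
fns = [lambda t: np.sin(2 * np.pi * t), lambda t: np.cos(2 * np.pi * t)]
order = rank_features(extract_features(curves, grid, fns), a)
print(order[0])  # the sine feature (index 0) ranks first
```

The flexibility claimed in the abstract corresponds to the `feature_fns` list: any user-chosen functions relevant to the problem can slot in without changing the rest of the pipeline.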
Functional Regression
Functional data analysis (FDA) involves the analysis of data whose ideal
units of observation are functions defined on some continuous domain, and the
observed data consist of a sample of functions taken from some population,
sampled on a discrete grid. Ramsay and Silverman's 1997 textbook sparked the
development of this field, which has accelerated in the past 10 years to become
one of the fastest growing areas of statistics, fueled by the growing number of
applications yielding this type of data. One unique characteristic of FDA is
the need to combine information both across and within functions, which Ramsay
and Silverman called replication and regularization, respectively. This article
will focus on functional regression, the area of FDA that has received the most
attention in applications and methodological development. First will be an
introduction to basis functions, key building blocks for regularization in
functional regression methods, followed by an overview of functional regression
methods, split into three types: [1] functional predictor regression
(scalar-on-function), [2] functional response regression (function-on-scalar)
and [3] function-on-function regression. For each, the role of replication and
regularization will be discussed and the methodological development described
in a roughly chronological manner, at times deviating from the historical
timeline to group together similar methods. The primary focus is on modeling
and methodology, highlighting the modeling structures that have been developed
and the various regularization approaches employed. At the end is a brief
discussion describing potential areas of future development in this field.
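The basis-function machinery central to this survey can be made concrete for case [1], scalar-on-function regression: expand the coefficient function beta(t) in a fixed basis, which reduces the functional model y_i = integral X_i(t) beta(t) dt + eps_i to ordinary ridge regression on basis scores. The Fourier basis and ridge penalty below are illustrative choices among many discussed in the literature:

```python
import numpy as np

def fourier_basis(grid, K):
    """Constant function plus the first K sine/cosine pairs on [0, 1]."""
    B = [np.ones_like(grid)]
    for k in range(1, K + 1):
        B.append(np.sqrt(2) * np.sin(2 * np.pi * k * grid))
        B.append(np.sqrt(2) * np.cos(2 * np.pi * k * grid))
    return np.stack(B)                                 # (2K+1, m)

def scalar_on_function_fit(curves, y, grid, K=3, lam=1e-3):
    """Fit y_i = integral X_i(t) beta(t) dt + eps_i by expanding beta
    in the Fourier basis; the ridge penalty lam regularizes the solve."""
    B = fourier_basis(grid, K)
    dt = grid[1] - grid[0]
    Z = curves @ B.T * dt                              # scores <X_i, b_k>
    coef = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
    return coef @ B                                    # beta(t) on the grid

rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 201)
B = fourier_basis(grid, 3)
curves = rng.normal(size=(100, B.shape[0])) @ B        # random smooth curves
beta_true = np.sin(2 * np.pi * grid)
y = curves @ beta_true * (grid[1] - grid[0]) + 0.01 * rng.normal(size=100)
beta_hat = scalar_on_function_fit(curves, y, grid)     # close to beta_true
```

This illustrates the "regularization" half of Ramsay and Silverman's dichotomy: the basis truncation and penalty borrow strength within each function, while the least-squares fit pools information across the replicated curves.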
Estimates and bootstrap calibration for functional regression with scalar response
The author proposes new presmoothed FPCA estimators and bootstrap methods for functional linear regression with scalar response, and a thresholding procedure, which detects hidden patterns, for nonparametric functional regression with scalar response.
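The bootstrap-calibration idea (resample residuals around a fitted model to calibrate uncertainty for a scalar response) can be sketched generically. This is a plain residual bootstrap applicable to any fitted regression, not the author's presmoothed FPCA construction; `fit` and `predict` are placeholder callables:

```python
import numpy as np

def residual_bootstrap_ci(X, y, fit, predict, x_new, B=500, alpha=0.1, seed=0):
    """Percentile interval for the prediction at x_new via residual bootstrap:
    refit on y* = fitted + resampled (centred) residuals, B times."""
    rng = np.random.default_rng(seed)
    model = fit(X, y)
    fitted = predict(model, X)
    resid = y - fitted
    resid = resid - resid.mean()
    preds = [predict(fit(X, fitted + rng.choice(resid, size=len(y))), x_new)[0]
             for _ in range(B)]
    lo, hi = np.quantile(preds, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Toy use, with ordinary least squares standing in for the fitted model.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda b, X: X @ b
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -1.0]) + 0.5 * rng.normal(size=100)
x_new = np.array([[1.0, 1.0]])
point = predict(fit(X, y), x_new)[0]
lo, hi = residual_bootstrap_ci(X, y, fit, predict, x_new)
```

Swapping an FPCA-based estimator into `fit`/`predict` gives the functional-regression version of the same calibration scheme.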
On the theory and practice of variable selection for functional data
Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Facultad de Ciencias, Departamento de Matemáticas. Date of defense: 17-12-2015.
Functional Data Analysis (FDA) might be seen as a partial aspect of the modern mainstream
paradigm generally known as Big Data Analysis. The study of functional data requires new
methodologies that take into account their special features (e.g. infinite dimension and a high
level of redundancy). Hence, the use of variable selection methods appears as a particularly
appealing choice in this context. Throughout this work, variable selection is considered in the
setting of supervised binary classification with functional data {X(t), t ∈ [0, 1]}. By variable
selection we mean any dimension-reduction method which replaces the whole trajectory
{X(t), t ∈ [0, 1]} with a low-dimensional vector (X(t1), ..., X(td)) while still keeping a similar classification
error. In this thesis we have addressed "functional variable selection" in classification
problems from both theoretical and empirical perspectives.
We first restrict ourselves to the standard situation in which our functional data are generated
from Gaussian processes, with distributions P0 and P1 in both populations under study. The
classical Hájek-Feldman dichotomy establishes that P0 and P1 are either mutually absolutely
continuous (so there is a Radon-Nikodym (RN) density of each measure with respect to the
other) or mutually singular. Unlike the case of finite-dimensional Gaussian
measures, there are non-trivial examples of mutually singular distributions when dealing with
Gaussian stochastic processes. This work provides explicit expressions for the optimal (Bayes)
rule in several relevant problems of supervised binary (functional) classification under the absolutely
continuous case. Our approach relies on some classical results in the theory of stochastic
processes where the so-called Reproducing Kernel Hilbert Spaces (RKHS) play a special role.
This RKHS framework allows us also to give an interpretation, in terms of mutual singularity, for
the “near perfect classification” phenomenon described by Delaigle and Hall (2012a). We show
that the asymptotically optimal rule proposed by these authors can be identified with the sequence
of optimal rules for an approximating sequence of classification problems in the absolutely continuous
case.
The methodological contributions of this thesis are centred on three variable selection methods.
The obvious general criterion for variable selection is to choose the “most representative” or “most
relevant” variables. However, it is also clear that a purely relevance-oriented criterion could lead
to select many redundant variables. First, we provide a new model-based method for variable
selection in binary classification problems, which arises in a very natural way from the explicit
knowledge of the RN-derivatives and the underlying RKHS structure. As a consequence, the
optimal classifier in a wide class of functional classification problems can be expressed in terms
of a classical, linear finite-dimensional Fisher rule.
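The finite-dimensional Fisher rule mentioned above, applied to selected coordinates (X(t1), ..., X(td)), can be sketched as follows; the two-Gaussian demo data are made up for illustration:

```python
import numpy as np

def fisher_rule(X0, X1):
    """Fisher's linear discriminant from two training samples (rows are
    observations). Classify x into class 1 iff w @ x > c."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # pooled scatter
    w = np.linalg.solve(Sw, m1 - m0)
    c = w @ (m0 + m1) / 2
    return w, c

rng = np.random.default_rng(4)
X0 = rng.normal(size=(200, 2))                         # class 0: N(0, I)
X1 = rng.normal(size=(200, 2)) + np.array([3.0, 3.0])  # class 1: shifted mean
w, c = fisher_rule(X0, X1)
test0 = rng.normal(size=(200, 2))
test1 = rng.normal(size=(200, 2)) + np.array([3.0, 3.0])
acc = ((test0 @ w <= c).mean() + (test1 @ w > c).mean()) / 2
```

In the thesis setting, the columns of `X0` and `X1` would be the curve values at the selected points t1, ..., td, so the whole functional classifier collapses to this classical linear rule.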
Our second proposal for variable selection is based on the idea of selecting the local maxima
(t1, ..., td) of the function V²_X(t) = V²(X(t), Y), where V denotes the distance covariance
association measure for random variables due to Székely et al. (2007). This method provides a
simple natural way to deal with the relevance vs. redundancy trade-off which typically appears
in variable selection. This proposal is backed by a result of consistent estimation for the maxima
of V²_X. We also show different models for the underlying process X(t) under which the relevant
information is concentrated on the maxima of V²_X.
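The selection idea can be sketched directly: compute the sample V²(X(t), Y) of Székely et al. (2007) at each grid point and keep its maxima. The V-statistic formula below is the standard sample version; the toy data, with signal concentrated at t = 0.5, are made up:

```python
import numpy as np

def dcov2(x, y):
    """Squared sample distance covariance of Szekely et al. (2007),
    V-statistic form, for univariate samples x and y."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

def dcov_curve(curves, y):
    """Evaluate t -> V^2(X(t), Y) at every grid point."""
    return np.array([dcov2(curves[:, j], y) for j in range(curves.shape[1])])

rng = np.random.default_rng(5)
grid = np.linspace(0.0, 1.0, 21)
s = rng.normal(size=150)                        # latent signal
bump = np.exp(-50 * (grid - 0.5) ** 2)          # influence peaks at t = 0.5
curves = s[:, None] * bump[None, :] + 0.2 * rng.normal(size=(150, 21))
v = dcov_curve(curves, s)
print(grid[np.argmax(v)])  # near 0.5, where the information lives
```

Keeping only the local maxima of this curve, rather than all high-dcov points, is what controls redundancy: neighbouring grid points carry nearly the same information, and a single peak represents each cluster.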
Our third proposal for variable selection consists of a new version of the minimum Redundancy
Maximum Relevance (mRMR) procedure proposed by Ding and Peng (2005) and Peng
et al. (2005). It is an algorithm to systematically perform variable selection, achieving a reasonable
trade-off between relevance and redundancy. In its original form, this procedure is based on
the use of the so-called mutual information criterion to assess relevance and redundancy. Keeping
the focus on functional data problems, we propose here a modified version of the mRMR
method, obtained by replacing mutual information with the distance correlation measure in
the general implementation of the method.
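The modified mRMR step can be sketched as a greedy loop trading distance-correlation relevance against distance-correlation redundancy. The simple difference score used below is one common mRMR variant (the thesis may combine the two terms differently), and the toy data are made up:

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation (Szekely et al., 2007) for univariate x, y."""
    def centred(d):
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A = centred(np.abs(x[:, None] - x[None, :]))
    B = centred(np.abs(y[:, None] - y[None, :]))
    dcov2 = max((A * B).mean(), 0.0)          # clip tiny negative round-off
    dvar = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / dvar) if dvar > 0 else 0.0

def mrmr_dcor(X, y, k):
    """Greedy mRMR: repeatedly add the variable with the best
    relevance-minus-mean-redundancy score, both measured by dcor."""
    p = X.shape[1]
    relevance = np.array([dcor(X[:, j], y) for j in range(p)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            redundancy = np.mean([dcor(X[:, j], X[:, s]) for s in selected])
            if relevance[j] - redundancy > best_score:
                best, best_score = j, relevance[j] - redundancy
        selected.append(best)
    return selected

rng = np.random.default_rng(6)
x0 = rng.normal(size=150)
x2 = rng.normal(size=150)
X = np.column_stack([x0,
                     x0 + 0.01 * rng.normal(size=150),  # near-copy of x0
                     x2,
                     rng.normal(size=150)])             # pure noise
y = x0 + x2
sel = sorted(mrmr_dcor(X, y, 2))
```

On this toy example the redundancy penalty keeps the near-copy of `x0` out of the selection: the second pick is the independently informative `x2`, never both copies of the same signal.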
The performance of the new proposals is assessed through an extensive empirical study, including
about 400 simulated models (100 functional models × 4 sample sizes) and real data examples,
aimed at comparing our variable selection methods with other standard procedures for
dimension reduction. The comparison involves different classifiers. A real problem with biomedical
data is also analysed in collaboration with researchers of Hospital Vall d'Hebron (Barcelona).
The overall conclusions of the empirical experiments are quite positive in favour of the proposed
methodologies.
The means to carry out this research were provided by the Departamento de Matemáticas, the Instituto de Ingeniería del Conocimiento, and the FPI programme of MICIN.