Fast Covariance Estimation for High-dimensional Functional Data
For smoothing covariance functions, we propose two fast algorithms that scale
linearly with the number of observations per function. Most available methods
and software cannot smooth covariance matrices beyond moderate dimensions; the
recently introduced sandwich smoother is an exception, but it is not adapted to
smoothing covariance matrices of very large dimension. Such very large
covariance matrices are becoming increasingly common, e.g., in 2- and
3-dimensional medical imaging and
high-density wearable sensor data. We introduce two new algorithms that can
handle very large covariance matrices: 1) FACE: a fast implementation of the
sandwich smoother and 2) SVDS: a two-step procedure that first applies singular
value decomposition to the data matrix and then smoothes the eigenvectors.
Compared to existing techniques, these new algorithms are at least an order of
magnitude faster in high dimensions and drastically reduce memory requirements.
The new algorithms provide near-instantaneous (a few seconds) smoothing for
large covariance matrices and very fast (on the order of minutes) smoothing for
the largest dimensions considered. Although SVDS is simpler than FACE, we
provide ready-to-use, scalable R software for FACE. When incorporated into the
R package refund,
FACE improves the speed of penalized functional regression by an order of
magnitude, even for data of normal size. We recommend that FACE be
used in practice for the analysis of noisy and high-dimensional functional
data.
Comment: 35 pages, 4 figures
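As a rough illustration of the SVDS idea described in this abstract (take an SVD of the data matrix, then smooth the leading eigenvectors), here is a minimal R sketch; the simulated data, the rank K, and the use of stats::smooth.spline as the smoother are our own assumptions, not the FACE/refund implementation.

```r
# Illustrative SVDS-style sketch (not the authors' FACE/refund code):
# 1) SVD of the centered data matrix, 2) smooth the leading eigenvectors
# over the functional domain, 3) rebuild a smooth covariance estimate.
set.seed(1)
n <- 50; J <- 2000                        # subjects x grid points (assumed sizes)
tt <- seq(0, 1, length.out = J)           # functional domain
X  <- outer(rnorm(n), sin(2 * pi * tt)) + matrix(rnorm(n * J, sd = 0.5), n, J)
Xc <- sweep(X, 2, colMeans(X))            # center each grid point
K  <- 3                                   # number of components kept (assumption)
sv <- svd(Xc, nu = 0, nv = K)             # only the first K right singular vectors
smooth_v <- sapply(seq_len(K), function(k)
  smooth.spline(tt, sv$v[, k])$y)         # smooth each eigenvector over tt
lambda   <- sv$d[seq_len(K)]^2 / (n - 1)  # eigenvalue estimates
C_smooth <- smooth_v %*% diag(lambda, K) %*% t(smooth_v)  # J x J smooth covariance
```

Because only K smoothed eigenvectors are stored, the cost scales with J * K rather than J^2, which is the point of the two-step construction.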
Fast, Exact Bootstrap Principal Component Analysis for p>1 million
Many have suggested a bootstrap procedure for estimating the sampling
variability of principal component analysis (PCA) results. However, when the
number of measurements per subject (p) is much larger than the number of
subjects (n), the challenge of calculating and storing the leading principal
components from each bootstrap sample can be computationally infeasible. To
address this, we outline methods for fast, exact calculation of bootstrap
principal components, eigenvalues, and scores. Our methods leverage the fact
that all bootstrap samples occupy the same n-dimensional subspace as the
original sample. As a result, all bootstrap principal components are limited to
the same n-dimensional subspace and can be efficiently represented by their
low dimensional coordinates in that subspace. Several uncertainty metrics can
be computed solely based on the bootstrap distribution of these low dimensional
coordinates, without calculating or storing the p-dimensional bootstrap
components. Fast bootstrap PCA is applied to a dataset of sleep
electroencephalogram (EEG) recordings and to a dataset of brain magnetic
resonance images (MRIs) with p of approximately 3 million. For the
brain MRI dataset, our method allows for standard errors for the first 3
principal components based on 1000 bootstrap samples to be calculated on a
standard laptop in 47 minutes, as opposed to approximately 4 days with standard
methods.
Comment: 25 pages, including 9 figures and link to R package. 2014-05-14
update: final formatting edits for journal submission, condensed figure
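A minimal R sketch of the low-dimensional bootstrap idea summarized above: every resample of the columns (subjects) of a p x n matrix stays in the span of the n left singular vectors, so each bootstrap replicate only needs an SVD of an n x n matrix. The simulated sizes, the sign-alignment rule, and the omission of within-resample recentering are simplifying assumptions; this is not the authors' package.

```r
# Illustrative fast bootstrap PCA sketch (assumed sizes, simplified):
set.seed(1)
p <- 5000; n <- 60
X  <- matrix(rnorm(p * n), p, n)          # columns are subjects
Xc <- X - rowMeans(X)                     # center each measurement across subjects
sv <- svd(Xc)                             # Xc = U D V'
U  <- sv$u                                # p x n basis of the sample subspace
DV <- diag(sv$d) %*% t(sv$v)              # n x n low-dimensional representation
B   <- 200                                # bootstrap replicates
ref <- c(1, rep(0, n - 1))                # coordinates of the original first PC
A1  <- matrix(NA_real_, n, B)             # bootstrap coordinates of the first PC
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)        # resample subjects
  a <- svd(DV[, idx])$u[, 1]              # SVD of an n x n matrix only
  if (sum(a * ref) < 0) a <- -a           # fix the arbitrary sign
  A1[, b] <- a                            # (within-resample recentering omitted)
}
# Map back to p dimensions only once, to get pointwise standard errors of PC1:
Ac <- A1 - rowMeans(A1)
PC1_se <- sqrt(rowSums((U %*% Ac)^2) / (B - 1))
```

All heavy p-dimensional work happens in the single final mapping, which is why the per-replicate cost no longer depends on p.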
Covariance Estimation and Principal Component Analysis for Mixed-Type Functional Data with application to mHealth in Mood Disorders
Mobile digital health (mHealth) studies often collect multiple within-day
self-reported assessments of participants' behaviour and health. Indexed by
time of day, these assessments can be treated as functional observations of
continuous, truncated, ordinal, and binary type. We develop covariance
estimation and principal component analysis for mixed-type functional data of
this kind. We propose a semiparametric Gaussian copula model in which a
generalized latent non-paranormal process generates the observed mixed-type
functional data and defines temporal dependence via a latent covariance. A
smooth estimate of the latent covariance is constructed via a Kendall's tau
bridging method that incorporates smoothness within the bridging step. The
approach is then extended with methods for handling both dense and sparse
sampling designs and for calculating subject-specific latent representations of
the observed data, latent principal components, and principal component scores.
Importantly, the proposed
framework handles all four mixed types in a unified way. Simulation studies
show competitive performance of the proposed method under both dense and
sparse sampling designs. The method is applied to data from 497 participants of the
National Institute of Mental Health Family Study of the Mood Disorder Spectrum
to characterize differences in within-day temporal patterns of mood across the
major mood disorder subtypes, including Major Depressive Disorder and Type 1
and Type 2 Bipolar Disorder.
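To make the bridging step concrete for the simplest case, here is a hedged R sketch for continuous margins only, using the classical relationship r = sin(pi * tau / 2) between Kendall's tau and the latent Pearson correlation under a Gaussian copula. The type-specific bridges for binary, ordinal, and truncated margins, the smoothing within the bridging step, and the sparse-design machinery described above are not shown; Matrix::nearPD is our own choice for repairing the estimate.

```r
# Continuous-margin bridging sketch (simplified, not the paper's estimator):
set.seed(1)
n <- 100; J <- 40                         # subjects x time-of-day grid (assumed)
Y <- matrix(rnorm(n * J), n, J)           # placeholder functional observations
tau <- cor(Y, method = "kendall")         # J x J Kendall's tau matrix
R <- sin(pi * tau / 2)                    # bridge: tau -> latent Pearson correlation
R <- as.matrix(Matrix::nearPD(R, corr = TRUE)$mat)  # project to a valid correlation
eig <- eigen(R, symmetric = TRUE)         # latent principal components
```

The appeal of the bridging route is that Kendall's tau depends only on ranks, so the same latent correlation target applies whether the observed margin is continuous or discretized.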
Closed form GLM cumulants and GLMM fitting with a SQUAR-EM-LA2 algorithm
We find closed-form expressions for the standardized cumulants of generalized linear models. This reduces the complexity of their calculation from O(p^6) to O(p^2) operations, which allows efficient construction of second-order saddlepoint approximations to the pdf of sufficient statistics. We adapt the result to obtain a closed-form expression for the second-order Laplace approximation of a GLMM likelihood. Using this approximation, we develop a computationally highly efficient accelerated EM procedure, SQUAR-EM-LA2. The procedure is illustrated by fitting a GLMM to a well-known data set. Extensive simulations show the phenomenal performance of the approach. Matlab software is provided for implementing the proposed algorithm.
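For context, the standardized cumulants are exactly what enters the usual second-order saddlepoint correction; its textbook form (our notation, not quoted from the paper) is:

```latex
% Standard second-order saddlepoint correction (textbook form):
% the first-order density \hat{f}_1 is adjusted by a factor built from the
% standardized third and fourth cumulants evaluated at the saddlepoint \hat{s}.
\hat{f}_2(x) \;=\; \hat{f}_1(x)\,
  \Bigl\{\, 1 \;+\; \tfrac{1}{8}\,\hat{\rho}_4(\hat{s})
            \;-\; \tfrac{5}{24}\,\hat{\rho}_3^{\,2}(\hat{s}) \Bigr\},
\qquad
\hat{\rho}_j(\hat{s}) \;=\; \frac{K^{(j)}(\hat{s})}{\bigl\{K''(\hat{s})\bigr\}^{j/2}},
```

where K is the cumulant generating function of the sufficient statistic; closed-form cumulants make evaluating the correction cheap.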
Structured Functional Principal Component Analysis
Motivated by modern observational studies, we introduce a class of functional
models that expands nested and crossed designs. These models account for the
natural inheritance of correlation structure from sampling design in studies
where the fundamental sampling unit is a function or image. Inference is based
on functional quadratics and their relationship with the underlying covariance
structure of the latent processes. A computationally fast and scalable
estimation procedure is developed for ultra-high dimensional data. Methods are
illustrated in three examples: high-frequency accelerometer data for daily
activity, pitch linguistic data for phonetic analysis, and EEG data for
studying electrical brain activity during sleep.
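As an illustration of how design-induced covariance structure can be separated with simple functional quadratics, here is a hedged R sketch for a two-level nested design Y_ij(t) = mu(t) + Z_i(t) + W_ij(t); the simulated data, the method-of-moments estimator, and the balanced design are our own simplifications, not the paper's estimator for general nested and crossed designs.

```r
# Two-level nested sketch: separate between-subject and within-subject covariance.
set.seed(1)
I <- 40; J <- 3; G <- 100                 # subjects, visits per subject, grid size
tt <- seq(0, 1, length.out = G)
Z  <- outer(rnorm(I), sin(2 * pi * tt))   # subject-level latent process Z_i(t)
Y  <- Z[rep(1:I, each = J), ] + matrix(rnorm(I * J * G, sd = 0.3), I * J, G)
Yc <- sweep(Y, 2, colMeans(Y))            # remove the estimated mean function
id <- rep(1:I, each = J)
# Cross-products of distinct visits of the same subject estimate K_between:
pairs <- do.call(rbind, lapply(split(seq_len(I * J), id),
                               function(ix) t(combn(ix, 2))))
K_between <- crossprod(Yc[pairs[, 1], , drop = FALSE],
                       Yc[pairs[, 2], , drop = FALSE]) / nrow(pairs)
K_between <- (K_between + t(K_between)) / 2   # symmetrize
K_total  <- crossprod(Yc) / (I * J)           # total covariance
K_within <- K_total - K_between               # visit-level covariance
phi_between <- eigen(K_between, symmetric = TRUE)$vectors[, 1:2]  # level-1 FPCs
```

The same cross-product logic generalizes to crossed designs: each variance component is identified by averaging products of observations that share exactly the corresponding design factor.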
POPULATION VALUE DECOMPOSITION, A FRAMEWORK FOR THE ANALYSIS OF IMAGE POPULATIONS
Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most severe challenge is that data sets incorporating images recorded for hundreds or thousands of subjects at multiple visits are massive. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can seamlessly be incorporated into statistical modeling and lead to a new, transparent, and fast inferential framework. Our methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep, containing more than 85 billion observations on thousands of subjects at two visits.
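A hedged R sketch of a PVD-style two-stage reduction consistent with the description above: subject-level SVDs, population-level SVDs of the stacked singular vectors, and subject-specific coefficient matrices in the resulting population bases. The sizes, ranks, and exact construction are our own assumptions rather than the paper's algorithm.

```r
# PVD-style two-stage dimensionality reduction sketch (assumed sizes and ranks):
set.seed(1)
n <- 20; nr <- 64; nc <- 80; L <- 5       # subjects, image rows/cols, subject rank
Y <- lapply(seq_len(n), function(i) matrix(rnorm(nr * nc), nr, nc))
subj  <- lapply(Y, function(Yi) svd(Yi, nu = L, nv = L))      # subject-level SVDs
U_all <- do.call(cbind, lapply(subj, function(s) s$u))        # nr x (n * L)
V_all <- do.call(cbind, lapply(subj, function(s) s$v))        # nc x (n * L)
A <- 10; B <- 10                                              # population ranks
P <- svd(U_all, nu = A, nv = 0)$u                             # nr x A row basis
D <- svd(V_all, nu = B, nv = 0)$u                             # nc x B column basis
V_subj <- lapply(Y, function(Yi) crossprod(P, Yi) %*% D)      # A x B coefficients
Y_hat  <- lapply(V_subj, function(Vi) P %*% Vi %*% t(D))      # low-rank reconstructions
```

Subsequent statistical modeling can then operate on the small A x B coefficient matrices V_subj rather than on the full images.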
LONGITUDINAL HIGH-DIMENSIONAL DATA ANALYSIS
We develop a flexible framework for modeling high-dimensional functional and imaging data observed longitudinally. The approach decomposes the observed variability of high-dimensional observations measured at multiple visits into three additive components: a subject-specific functional random intercept that quantifies the cross-sectional variability, a subject-specific functional slope that quantifies the dynamic irreversible deformation over multiple visits, and a subject-visit-specific functional deviation that quantifies exchangeable or reversible visit-to-visit changes. The proposed method is very fast, scalable to studies including ultra-high dimensional data, and can easily be adapted to and executed on modest computing infrastructures. The method is applied to the longitudinal analysis of diffusion tensor imaging (DTI) data of the corpus callosum of multiple sclerosis (MS) subjects. The study includes 176 subjects observed at 466 visits. For each subject and visit, the study contains a registered DTI scan of the corpus callosum at roughly 30,000 voxels.
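One way to write the three-component decomposition described above as an explicit model equation (our notation and a plausible reading of the description, not quoted from the paper):

```latex
% Three-component longitudinal decomposition (our notation):
% T_{ij} is the time of visit j for subject i, and v indexes voxels.
Y_{ij}(v) \;=\; \eta\bigl(v, T_{ij}\bigr)
  \;+\; X^{(0)}_{i}(v)            % subject-specific functional random intercept
  \;+\; T_{ij}\, X^{(1)}_{i}(v)   % subject-specific functional random slope
  \;+\; W_{ij}(v)                 % subject-visit-specific functional deviation
```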