
    Fast Covariance Estimation for High-dimensional Functional Data

    For smoothing covariance functions, we propose two fast algorithms that scale linearly with the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension J × J with J > 500; the recently introduced sandwich smoother is an exception, but it is not adapted to smoothing covariance matrices of large dimension, such as J ≥ 10,000. Covariance matrices of order J = 10,000, and even J = 100,000, are becoming increasingly common, e.g., in 2- and 3-dimensional medical imaging and high-density wearable sensor data. We introduce two new algorithms that can handle very large covariance matrices: 1) FACE, a fast implementation of the sandwich smoother, and 2) SVDS, a two-step procedure that first applies singular value decomposition to the data matrix and then smooths the eigenvectors. Compared to existing techniques, these new algorithms are at least an order of magnitude faster in high dimensions and drastically reduce memory requirements. The new algorithms provide instantaneous (a few seconds) smoothing for matrices of dimension J = 10,000 and very fast (< 10 minutes) smoothing for J = 100,000. Although SVDS is simpler than FACE, we provide ready-to-use, scalable R software for FACE. When incorporated into the R package refund, FACE improves the speed of penalized functional regression by an order of magnitude, even for data of normal size (J < 500). We recommend that FACE be used in practice for the analysis of noisy and high-dimensional functional data. Comment: 35 pages, 4 figures.
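
    The two-step SVDS idea lends itself to a compact illustration. Below is a minimal R sketch, not the authors' FACE/SVDS implementation: the function name svds_sketch is hypothetical, smooth.spline stands in for the paper's smoother, and fully observed curves on a common grid are assumed.

    # Minimal sketch of the SVDS idea: SVD of the centered data matrix followed by
    # smoothing of the leading eigenvectors; the J x J covariance is never formed.
    svds_sketch <- function(Y, npc = 5) {
      # Y: n x J matrix, one row per function observed on a common grid of length J
      n <- nrow(Y); J <- ncol(Y)
      Yc <- sweep(Y, 2, colMeans(Y))            # center each grid point
      sv <- svd(Yc, nu = 0, nv = npc)           # right singular vectors span the eigenfunctions
      grid <- seq(0, 1, length.out = J)
      phi <- sapply(seq_len(npc), function(k)   # smooth each eigenvector
        predict(smooth.spline(grid, sv$v[, k]), grid)$y)
      evals <- sv$d[seq_len(npc)]^2 / (n - 1)   # eigenvalues of the sample covariance
      # a smoothed low-rank covariance is phi %*% diag(evals) %*% t(phi), if ever needed
      list(eigenfunctions = phi, eigenvalues = evals)
    }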

    Fast, Exact Bootstrap Principal Component Analysis for p>1 million

    Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap principal components, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap principal components are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely from the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram (EEG) recordings (p = 900, n = 392) and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the brain MRI dataset, our method allows standard errors for the first 3 principal components, based on 1,000 bootstrap samples, to be calculated on a standard laptop in 47 minutes, as opposed to approximately 4 days with standard methods. Comment: 25 pages, including 9 figures and a link to an R package. 2014-05-14 update: final formatting edits for journal submission; condensed figures.
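
    The subspace argument above translates directly into a short algorithm. The following R sketch is a hypothetical illustration (the function name fast_boot_pca is made up); it assumes p >> n with pre-centered columns, and skips the re-centering and sign alignment within bootstrap samples that a careful implementation would include.

    # One expensive SVD of the p x n data matrix; every bootstrap sample then only
    # requires an n x n SVD of the resampled low-dimensional coordinates.
    fast_boot_pca <- function(Y, B = 1000, npc = 3) {
      # Y: p x n matrix with p >> n, columns (subjects) assumed centered
      n <- ncol(Y)
      sv <- svd(Y)                       # Y = U D V'
      DVt <- diag(sv$d) %*% t(sv$v)      # n x n coordinates of the data in the U basis
      boot_coord <- array(NA_real_, dim = c(n, npc, B))
      for (b in seq_len(B)) {
        idx <- sample.int(n, n, replace = TRUE)           # resample the subjects
        boot_coord[, , b] <- svd(DVt[, idx], nu = npc, nv = 0)$u
      }
      # the p-dimensional bootstrap components are U %*% boot_coord[, k, b]; standard
      # errors can be computed from boot_coord and mapped through U only at the end
      list(U = sv$u, boot_coord = boot_coord)
    }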

    Covariance Estimation and Principal Component Analysis for Mixed-Type Functional Data with application to mHealth in Mood Disorders

    Mobile digital health (mHealth) studies often collect multiple within-day self-reported assessments of participants' behaviour and health. Indexed by time of day, these assessments can be treated as functional observations of continuous, truncated, ordinal, and binary types. We develop covariance estimation and principal component analysis for such mixed-type functional data. We propose a semiparametric Gaussian copula model in which a generalized latent non-paranormal process generates the observed mixed-type functional data and defines temporal dependence via a latent covariance. A smooth estimate of the latent covariance is constructed via a Kendall's tau bridging method that incorporates smoothness within the bridging step. The approach is extended with methods for handling both dense and sparse sampling designs and for calculating subject-specific latent representations of the observed data, latent principal components, and principal component scores. Importantly, the proposed framework handles all four mixed types in a unified way. Simulation studies show competitive performance of the proposed method under both dense and sparse sampling designs. The method is applied to data from 497 participants of the National Institute of Mental Health Family Study of the Mood Disorder Spectrum to characterize differences in within-day temporal patterns of mood across the major mood disorder subtypes, including Major Depressive Disorder and Bipolar Disorder Types 1 and 2.
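
    For continuous margins, the bridging step has a well-known closed form under the Gaussian copula: the latent correlation equals sin(pi * tau / 2), where tau is Kendall's tau. The R sketch below shows only this continuous bridge (the function name bridged_latent_corr is hypothetical); the paper's bridges for truncated, ordinal, and binary margins and its built-in smoothing are not reproduced.

    # Pairwise Kendall's tau across grid points, mapped to a latent Gaussian-copula correlation.
    bridged_latent_corr <- function(Y) {
      # Y: n x J matrix of continuous functional observations on a common grid
      tau <- cor(Y, method = "kendall", use = "pairwise.complete.obs")
      sin(pi * tau / 2)                  # bridge for continuous margins
    }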

    Closed-form GLM cumulants and GLMM fitting with a SQUAR-EM-LA2 algorithm

    We find closed-form expressions for the standardized cumulants of generalized linear models. This reduces the complexity of their calculation from O(p^6) to O(p^2) operations, which allows efficient construction of second-order saddlepoint approximations to the pdf of the sufficient statistics. We adapt the result to obtain a closed-form expression for the second-order Laplace approximation of a GLMM likelihood. Using this approximation, we develop a computationally highly efficient accelerated EM procedure, SQUAR-EM-LA2. The procedure is illustrated by fitting a GLMM to a well-known data set. Extensive simulations show the phenomenal performance of the approach. Matlab software is provided for implementing the proposed algorithm.
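
    The acceleration component can be sketched generically. The R code below shows one SQUAREM-style squared-extrapolation cycle around an arbitrary EM map em_step(theta); it is an illustrative sketch only, and the paper's second-order Laplace approximation of the GLMM likelihood (the "LA2" part) is not reproduced here.

    # One squared-extrapolation cycle followed by a stabilizing EM step.
    squarem_cycle <- function(theta, em_step) {
      theta1 <- em_step(theta)
      theta2 <- em_step(theta1)
      r <- theta1 - theta                          # first EM increment
      v <- (theta2 - theta1) - r                   # change in the increment
      alpha <- -sqrt(sum(r^2)) / sqrt(sum(v^2))    # extrapolation steplength
      theta_sq <- theta - 2 * alpha * r + alpha^2 * v
      em_step(theta_sq)                            # stabilize the extrapolated value
    }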

    Structured Functional Principal Component Analysis

    Motivated by modern observational studies, we introduce a class of functional models that expand nested and crossed designs. These models account for the natural inheritance of the correlation structure from the sampling design in studies where the fundamental sampling unit is a function or image. Inference is based on functional quadratics and their relationship with the underlying covariance structure of the latent processes. A computationally fast and scalable estimation procedure is developed for ultra-high-dimensional data. The methods are illustrated in three examples: high-frequency accelerometer data for daily activity, pitch linguistic data for phonetic analysis, and EEG data for studying electrical brain activity during sleep.
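
    As a concrete special case of how functional quadratics relate to the latent covariance structure, consider a two-level nested design; the LaTeX display below is an illustrative sketch in hypothetical notation, not the paper's general formulation.

    % Two-level nested functional model: subject-level process X_i plus visit-level process U_ij
    \[
      Y_{ij}(t) = \mu(t) + X_i(t) + U_{ij}(t), \qquad
      \operatorname{Cov}\{Y_{ij}(s), Y_{ik}(t)\} =
      \begin{cases}
        K_X(s,t) + K_U(s,t), & j = k,\\
        K_X(s,t),            & j \neq k,
      \end{cases}
    \]
    % so within- and between-visit cross-products (functional quadratics) give
    % method-of-moments estimators of K_X and K_U, whose eigenfunctions yield the
    % level-specific principal components.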

    POPULATION VALUE DECOMPOSITION, A FRAMEWORK FOR THE ANALYSIS OF IMAGE POPULATIONS

    Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most severe challenge is that data sets incorporating images recorded for hundreds or thousands of subjects at multiple visits are massive. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and fast inferential framework. Our methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep, containing more than 85 billion observations on thousands of subjects at two visits.
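
    A rough sense of the decomposition: each subject's image matrix Y_i is approximated as P V_i t(D), with population-level bases P and D shared across subjects and a small subject-specific core V_i. The R sketch below is a hypothetical construction (the names pvd_sketch, k_subj, and k_pop are made up, and plain svd() stands in for whatever partial decompositions a production implementation would use).

    pvd_sketch <- function(Ylist, k_subj = 5, k_pop = 3) {
      # Ylist: list of F x T image matrices, one per subject
      subj  <- lapply(Ylist, function(Y) svd(Y, nu = k_subj, nv = k_subj))
      U_all <- do.call(cbind, lapply(subj, `[[`, "u"))    # pooled left singular vectors
      V_all <- do.call(cbind, lapply(subj, `[[`, "v"))    # pooled right singular vectors
      P <- svd(U_all, nu = k_pop, nv = 0)$u               # population row basis (F x k_pop)
      D <- svd(V_all, nu = k_pop, nv = 0)$u               # population column basis (T x k_pop)
      V <- lapply(Ylist, function(Y) t(P) %*% Y %*% D)    # subject-specific k_pop x k_pop cores
      list(P = P, D = D, V = V)
    }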

    LONGITUDINAL HIGH-DIMENSIONAL DATA ANALYSIS

    We develop a flexible framework for modeling high-dimensional functional and imaging data observed longitudinally. The approach decomposes the observed variability of high-dimensional observations measured at multiple visits into three additive components: a subject-specific functional random intercept that quantifies the cross-sectional variability, a subject-specific functional slope that quantifies dynamic irreversible deformation over multiple visits, and a subject-visit-specific functional deviation that quantifies exchangeable or reversible visit-to-visit changes. The proposed method is very fast, scales to studies including ultra-high-dimensional data, and can easily be adapted to and executed on modest computing infrastructures. The method is applied to the longitudinal analysis of diffusion tensor imaging (DTI) data of the corpus callosum of multiple sclerosis (MS) subjects. The study includes 176 subjects observed at 466 visits. For each subject and visit, the study contains a registered DTI scan of the corpus callosum at roughly 30,000 voxels.
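
    The three additive components described above can be written compactly; the LaTeX display below is a sketch in hypothetical notation, with Y_ij(v) the image of subject i at visit j and voxel v, and T_ij the time of that visit.

    % Functional random intercept + functional slope + visit-specific deviation
    \[
      Y_{ij}(v) = \eta(v, T_{ij}) + X_{i,0}(v) + X_{i,1}(v)\, T_{ij} + W_{ij}(v),
    \]
    % where X_{i,0} captures cross-sectional (subject-level) variability, X_{i,1}
    % the irreversible subject-specific change over visits, and W_{ij} the
    % reversible, exchangeable visit-to-visit deviation.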