Time series estimation in a spiked signal regime
Working with missing or incomplete data is a universal problem across the sciences. In meteorology, temperature data streams can contain missing values due to sensor malfunctions. In geophysical remote sensing, missing data may be attributed to irregular global sampling by an orbiting spacecraft. In a collaborative filtering application such as the Netflix Challenge, data is incomplete because not every user can provide a recommendation on every item. Though we do not have access to complete data, it is still quite possible to forecast the weather and to recommend good movies on Netflix. The development of estimation algorithms that properly handle missing data makes data imputation and forecasting possible.
The design of any estimation algorithm depends on the assumptions one can make about a given set of data. This thesis addresses the problem of estimating a noisy, incomplete time series generated by a dynamical system with unknown state evolution. The technique presented is TSCC (Transformed Spiked Covariance Completion), a matrix completion algorithm for signal estimation that leverages the spiked signal model, an assumption that holds for many high-dimensional datasets. TSCC exploits this assumption to develop an estimator that is resilient to noise and accurately fills in missing data.
This thesis first addresses the specific estimation problem and the signal model that it follows. It then presents a survey of both standard and state-of-the-art techniques, together with an analysis of TSCC. These methods are used to estimate the state of a dynamical system from partial, noisy observations. Standard textbook techniques are unreliable for this problem because they cannot handle missing data or generalize across dynamical models; TSCC addresses these deficiencies. The thesis concludes with several numerical experiments on both synthetic and real data, which demonstrate that TSCC outperforms the other techniques by forming a time-lagged embedding and estimating the dynamical modes of the system.
TSCC has two advantages over other techniques: it does not require knowledge of the state dynamics, and it leverages the asymptotic behavior of noisy, low-rank matrices to perform imputation and denoising. TSCC assumes that a system can be represented by a few dynamical modes, which is analogous to a matrix having low rank. Overall, TSCC is a state estimation algorithm that operates on noisy and incomplete data without prior model assumptions. Numerical experiments show that TSCC improves on the currently accepted techniques that address the same estimation problem.
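The abstract does not specify TSCC's implementation, but the two ingredients it names — a time-lagged embedding and low-rank (spiked) structure — can be sketched generically. The following is a minimal illustration, not the thesis's actual algorithm: it builds a Hankel matrix from a noisy series with missing values, imputes missing entries with a crude mean fill, and truncates to the dominant singular "spikes"; the window size and rank are assumptions for the toy example.

```python
import numpy as np

def hankel_embed(x, window):
    """Stack time-lagged copies of a 1-D series into a Hankel matrix."""
    n = len(x) - window + 1
    return np.stack([x[i:i + window] for i in range(n)], axis=1)

def spike_truncate(X, rank):
    """Keep the top `rank` singular components (the 'spikes'); zero the rest."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[rank:] = 0.0
    return (U * s) @ Vt

# toy data: noisy sinusoid with 20% of samples missing
rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 400)
clean = np.sin(t)
noisy = clean + 0.3 * rng.standard_normal(t.size)
noisy[rng.random(t.size) < 0.2] = np.nan

X = hankel_embed(noisy, window=40)
fill = np.nanmean(X)                               # crude initial imputation
X_filled = np.where(np.isnan(X), fill, X)
X_hat = spike_truncate(X_filled, rank=2)           # a sinusoid has 2 dominant modes

# read the estimated series off the first row of the Hankel matrix
est = X_hat[0]
err_before = np.mean((X_filled[0] - clean[:X.shape[1]]) ** 2)
err_after = np.mean((est - clean[:X.shape[1]]) ** 2)
```

In this toy setting the rank-2 truncation both denoises the observed samples and replaces the mean-filled gaps with values consistent with the dominant modes, which is the qualitative behavior the abstract attributes to TSCC.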
Structural Variability from Noisy Tomographic Projections
In cryo-electron microscopy, the 3D electric potentials of an ensemble of
molecules are projected along arbitrary viewing directions to yield noisy 2D
images. The volume maps representing these potentials typically exhibit a great
deal of structural variability, which is described by their 3D covariance
matrix. Typically, this covariance matrix is approximately low-rank and can be
used to cluster the volumes or estimate the intrinsic geometry of the
conformation space. We formulate the estimation of this covariance matrix as a
linear inverse problem, yielding a consistent least-squares estimator. For
images of size N-by-N pixels, we propose an algorithm for calculating this
covariance estimator with computational complexity
O(nN^4 + sqrt(κ) N^6 log N), where the condition number κ
is empirically in the range 10–200. Its efficiency relies on the
observation that the normal equations are equivalent to a deconvolution problem
in 6D. This is then solved by the conjugate gradient method with an appropriate
circulant preconditioner. The result is the first computationally efficient
algorithm for consistent estimation of 3D covariance from noisy projections. It
also compares favorably in runtime with respect to previously proposed
non-consistent estimators. Motivated by the recent success of eigenvalue
shrinkage procedures for high-dimensional covariance matrices, we introduce a
shrinkage procedure that improves accuracy at lower signal-to-noise ratios. We
evaluate our methods on simulated datasets and achieve classification results
comparable to state-of-the-art methods in shorter running time. We also present
results on clustering volumes in an experimental dataset, illustrating the
power of the proposed algorithm for practical determination of structural
variability.
Comment: 52 pages, 11 figures
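The abstract's key computational point — that the normal equations reduce to a deconvolution solvable by conjugate gradient with a circulant preconditioner — can be illustrated in one dimension. This sketch is not the paper's 6D solver: the Gaussian kernel, the diagonal perturbation (standing in for the non-circulant part of the true operator), and all sizes are assumptions chosen to show why preconditioning by the circulant part makes CG converge in very few iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128

# circulant part of the operator: circular convolution with a Gaussian kernel
k = np.exp(-0.5 * (np.arange(n) - n // 2) ** 2 / 4.0)
k /= k.sum()
k_hat = np.abs(np.fft.fft(np.roll(k, -n // 2))) + 1e-3   # real, positive symbol

d = 1.0 + 0.1 * rng.random(n)        # mild non-circulant (diagonal) perturbation

def A(x):
    """Full operator: circulant convolution plus a small diagonal term (SPD)."""
    return np.real(np.fft.ifft(k_hat * np.fft.fft(x))) + 0.1 * d * x

def M_inv(r):
    """Circulant preconditioner: invert only the circulant part via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(r) / (k_hat + 0.1)))

def pcg(b, tol=1e-8, maxiter=100):
    """Preconditioned conjugate gradient for A x = b."""
    x = np.zeros_like(b)
    r = b - A(x)
    z = M_inv(r)
    p = z.copy()
    for it in range(maxiter):
        Ap = A(p)
        alpha = (r @ z) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return x, it + 1
        z_new = M_inv(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x, maxiter

x_true = rng.standard_normal(n)
b = A(x_true)
x_est, iters = pcg(b)
```

Because the preconditioned operator is a small perturbation of the identity, its condition number is close to 1 and CG converges in a handful of iterations, mirroring the efficiency argument in the abstract.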
The phenome-wide distribution of genetic variance
A general observation emerging from estimates of additive genetic variance in sets of functionally or developmentally related traits is that much of the genetic variance is restricted to few trait combinations as a consequence of genetic covariance among traits. While this biased distribution of genetic variance among functionally related traits is now well documented, how it translates to the broader phenome and therefore any trait combination under selection in a given environment is unknown. We show that 8,750 gene expression traits measured in adult male Drosophila serrata exhibit widespread genetic covariance among random sets of five traits, implying that pleiotropy is common. Ultimately, to understand the phenome-wide distribution of genetic variance, very large additive genetic variance-covariance matrices (G) must be estimated. We draw upon recent advances in matrix theory for completing high-dimensional matrices to estimate the 8,750-trait G and show that large numbers of gene expression traits genetically covary as a consequence of a single genetic factor. Using gene ontology term enrichment analysis, we show that the major axis of genetic variance among expression traits successfully identified genetic covariance among genes involved in multiple modes of transcriptional regulation. Our approach provides a practical empirical framework for the genetic analysis of high-dimensional phenome-wide trait sets and for the investigation of the extent of high-dimensional genetic constraint.
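The abstract's core computation — completing a high-dimensional G matrix from partially observed entries and then reading off its major axis of genetic variance — can be sketched with a generic hard-impute iteration. This is not the authors' estimator: the rank-1 ground truth, the 50% observation mask, and the iteration count are all assumptions for a small synthetic demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 60
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
G_true = np.outer(u, u) + 0.05 * np.eye(p)   # one dominant "genetic factor"

# observe only a symmetric random subset of entries, as when G is assembled
# from estimates over overlapping trait subsets
mask = rng.random((p, p)) < 0.5
mask = mask | mask.T
np.fill_diagonal(mask, True)
G_obs = np.where(mask, G_true, 0.0)

# hard-impute: alternate a rank-1 eigen-truncation with re-insertion of
# the observed entries
G_hat = G_obs.copy()
for _ in range(100):
    w, V = np.linalg.eigh(G_hat)
    low_rank = w[-1] * np.outer(V[:, -1], V[:, -1])   # leading eigen-component
    G_hat = np.where(mask, G_obs, low_rank)

# the recovered major axis should align with the true genetic factor
lead = np.linalg.eigh(G_hat)[1][:, -1]
alignment = abs(lead @ u)
```

The same logic scales conceptually to the 8,750-trait setting in the abstract: when a single factor dominates G, its leading eigenvector (the major axis of genetic variance) is recoverable even though many entries of G were never jointly observed.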
Nonparanormal Graph Quilting with Applications to Calcium Imaging
Probabilistic graphical models have become an important unsupervised learning
tool for detecting network structures for a variety of problems, including the
estimation of functional neuronal connectivity from two-photon calcium imaging
data. However, in the context of calcium imaging, technological limitations
only allow for partially overlapping layers of neurons in a brain region of
interest to be jointly recorded. In this case, graph estimation for the full
data requires inference for edge selection when many pairs of neurons have no
simultaneous observations. This leads to the Graph Quilting problem, which
seeks to estimate a graph in the presence of block-missingness in the empirical
covariance matrix. Solutions for the Graph Quilting problem have previously
been studied for Gaussian graphical models; however, neural activity data from
calcium imaging are often non-Gaussian, thereby requiring a more flexible
modeling approach. Thus, in our work, we study two approaches for nonparanormal
Graph Quilting based on the Gaussian copula graphical model, namely a maximum
likelihood procedure and a low-rank based framework. We provide theoretical
guarantees on edge recovery for the former approach under similar conditions to
those previously developed for the Gaussian setting, and we investigate the
empirical performance of both methods using simulations as well as real
calcium imaging data. Our approaches yield more scientifically meaningful
functional connectivity estimates compared to existing Gaussian graph quilting
methods for this calcium imaging data set.
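The nonparanormal ingredient behind both approaches — estimating latent Gaussian-copula correlations from ranks rather than raw values — can be shown in a few lines. This sketch is not either of the paper's Graph Quilting procedures; the bivariate setup, the marginal transforms, and the sample size are assumptions chosen to show that the rank-based estimate sin(π·τ/2) recovers the latent correlation where Pearson correlation does not.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(3)
n = 2000
rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=n)

# nonparanormal data: monotone marginal transforms of a latent Gaussian
# distort Pearson correlation but leave the ranks intact
x = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

tau, _ = kendalltau(x[:, 0], x[:, 1])
rho_np = np.sin(np.pi * tau / 2)          # rank-based latent correlation estimate
rho_pearson = np.corrcoef(x.T)[0, 1]      # attenuated by the transforms
```

In a full pipeline, a matrix of such rank-based correlations (computed on whichever variable pairs were jointly observed) would replace the empirical covariance as input to a sparse precision-matrix estimator such as the graphical lasso, which is the general structure the abstract describes for the maximum likelihood approach.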