Discovery of low-dimensional structure in high-dimensional inference problems
Many learning and inference problems involve high-dimensional data such as images, video or genomic data, which cannot be processed efficiently using conventional methods due to their dimensionality. However, high-dimensional data often exhibit an inherent low-dimensional structure; for instance, they can often be represented sparsely in some basis or domain. Discovering this underlying low-dimensional structure is important for developing more robust and efficient analysis and processing algorithms.
The first part of the dissertation investigates the statistical complexity of sparse recovery problems, including sparse linear and nonlinear regression models, feature selection and graph estimation. We present a framework that unifies sparse recovery problems and construct an analogy to channel coding in classical information theory. We perform an information-theoretic analysis to derive bounds on the number of samples required to reliably recover sparsity patterns, independent of any specific recovery algorithm. In particular, we show that sample complexity can be tightly characterized using a mutual information formula similar to channel coding results. Next, we derive major extensions to this framework, including dependent input variables and a lower bound for sequential adaptive recovery schemes, which helps determine whether adaptivity provides performance gains. We compute statistical complexity bounds for various sparse recovery problems, showing that our analysis improves upon existing bounds and leads to intuitive results for new applications.
In the second part, we investigate methods for improving the computational complexity of subgraph detection in graph-structured data, where we aim to discover anomalous patterns present in a connected subgraph of a given graph. This problem arises in many applications, such as detection of network intrusions, community detection, and detection of anomalous events in surveillance videos or disease outbreaks. Since optimization over connected subgraphs is a combinatorial and computationally difficult problem, we propose a convex relaxation that offers a principled approach to incorporating connectivity and conductance constraints on candidate subgraphs. We develop a novel nearly-linear-time algorithm to solve the relaxed problem, establish convergence and consistency guarantees, and demonstrate its feasibility and performance with experiments on real networks.
Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach
This paper develops theoretical results regarding noisy 1-bit compressed sensing and sparse binomial regression. We show that a single convex program gives an accurate estimate of the signal, or coefficient vector, for both of these models. We demonstrate that an s-sparse signal in R^n can be accurately estimated from m = O(s log(n/s)) single-bit measurements using a simple convex program. This remains true even if each measurement bit is flipped with probability nearly 1/2. Worst-case (adversarial) noise can also be accounted for, and uniform results that hold for all sparse inputs are derived as well. In the terminology of sparse logistic regression, we show that O(s log(n/s)) Bernoulli trials are sufficient to estimate a coefficient vector in R^n that is approximately s-sparse. Moreover, the same convex program works for virtually all generalized linear models, in which the link function may be unknown. To our knowledge, these are the first results that tie the theory of sparse logistic regression to 1-bit compressed sensing. Our results apply to general signal structures aside from sparsity; one only needs to know the size of the set K where signals reside. The size is given by the mean width of K, a computable quantity whose square serves as a robust extension of the dimension.
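The 1-bit measurement model described above is easy to simulate. The sketch below is a hypothetical illustration using a simple back-projection baseline, not the convex program analyzed in the paper: correlate the sign bits with the measurement vectors, keep the s largest entries, and renormalize.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m = 100, 5, 2000

# Ground-truth s-sparse, unit-norm signal in R^n.
x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x[support] = rng.standard_normal(s)
x /= np.linalg.norm(x)

# One-bit measurements: only the sign of each Gaussian projection survives.
A = rng.standard_normal((m, n))
y = np.sign(A @ x)

# Back-projection estimate: correlate the bits with the measurement
# vectors, keep the s entries largest in magnitude, and renormalize.
z = A.T @ y / m
x_hat = np.zeros(n)
top = np.argsort(np.abs(z))[-s:]
x_hat[top] = z[top]
x_hat /= np.linalg.norm(x_hat)

corr = abs(x_hat @ x)  # correlation with the true unit-norm signal
```

With m on the order of s log(n/s) (here generously over-sampled), the correlation with the true signal is typically high even though each measurement carries only one bit.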
Bayesian Spatio-Temporal Modeling for Forecasting, Trend Assessment and Spatial Trend Filtering
This work develops Bayesian spatio-temporal modeling techniques specifically aimed at studying several aspects of our motivating applications, including vector-borne disease incidence and air pollution levels. A key attribute of the proposed techniques is that they are scalable to extremely large data sets consisting of spatio-temporally oriented observations. The scalability of our modeling strategies is accomplished in two primary ways. First, through the introduction of carefully constructed latent random variables, we are able to develop Markov chain Monte Carlo (MCMC) sampling algorithms that consist primarily of Gibbs steps. This leads to the fast and easy updating of the model parameters from common distributions. Second, for the spatio-temporal aspects of the models, a novel sampling strategy for Gaussian Markov random fields (GMRFs) that can be easily implemented (in parallel) within MCMC sampling algorithms is used. The performance of the proposed modeling strategies is demonstrated through extensive numerical studies, and the strategies are further used to analyze vector-borne disease data measured on canines throughout the conterminous United States and PM 2.5 levels measured at weather stations throughout the Eastern United States.
In particular, we begin by developing a Poisson regression model that can be used to forecast the incidence of vector-borne disease throughout a large geographic area. The proposed model accounts for spatio-temporal dependence through a vector autoregression and is fit through a Metropolis-Hastings-based MCMC sampling algorithm. The model is used to forecast the prevalence of Lyme disease (Chapter 2) and Anaplasmosis (Chapter 3) in canines throughout the United States. As part of these studies, we also evaluate the significance of various climatic and socio-economic drivers of disease. We then present (Chapter 4) the development of the 'chromatic sampler' for GMRFs. The chromatic sampler is an MCMC sampling technique that exploits the Markov property of GMRFs to sample large groups of parameters in parallel. A greedy algorithm for finding such groups of parameters is presented. The methodology is found to be superior, in terms of computational effort, to both full-block and single-site updating. For assessing spatio-temporal trends, we develop (Chapter 5) a binomial regression model with spatially varying coefficients. This model uses Gaussian predictive processes to estimate spatially varying coefficients and a conditional autoregressive structure embedded in a vector autoregression to account for spatio-temporal dependence in the data. The methodology is capable of estimating both widespread regional and small-scale local trends. A data augmentation strategy is used to develop a Gibbs-based MCMC sampling routine. The approach is made computationally feasible by adopting the chromatic sampler for GMRFs to sample the spatio-temporal random effects. The model is applied to a dataset consisting of 16 million test results for antibodies to Borrelia burgdorferi and used to identify several areas of the United States experiencing increasing Lyme disease risk.
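The grouping step behind a chromatic sampler can be illustrated with a standard greedy graph coloring. The sketch below is a minimal illustration of the idea, not the dissertation's greedy algorithm, and assumes the GMRF neighborhood graph is given as an adjacency dictionary: nodes that share a color have no edge between them, so by the Markov property they are conditionally independent given the rest and can be Gibbs-updated in parallel.

```python
def greedy_coloring(adj):
    """Greedily color a GMRF neighborhood graph.

    adj maps each node to the set of its neighbors.  Each node receives
    the smallest color index not already used by a colored neighbor;
    same-colored nodes form one parallel Gibbs-update block.
    """
    colors = {}
    # Visiting high-degree nodes first tends to reduce the color count.
    for node in sorted(adj, key=lambda v: -len(adj[v])):
        used = {colors[nb] for nb in adj[node] if nb in colors}
        c = 0
        while c in used:
            c += 1
        colors[node] = c
    return colors

# A 4-cycle needs only two colors, so half the nodes update in parallel.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
colors = greedy_coloring(adj)
```

On a sparse lattice-like spatial graph, the number of colors, and hence the number of sequential sweeps per MCMC iteration, stays small even as the number of random effects grows.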
For nonparametric functional estimation, we develop (Chapter 6) a Bayesian multidimensional trend filter (BMTF). The BMTF is a flexible nonparametric estimator that extends traditional one-dimensional trend filtering methods to multiple dimensions. The methodology is computationally scalable to a large support space, and the expense of fitting the model is nearly independent of the number of observations. The methodology involves discretizing the support space and estimating a multidimensional step function over the discretized support. Two adaptive methods of discretization, which allow the data to determine the resolution of the resulting function, are presented. The BMTF is then used (Chapter 7) to allow for spatially varying coefficients within a quantile regression model. A data augmentation strategy is introduced which facilitates the development of a Gibbs-based MCMC sampling routine. This methodology is developed to study various meteorological drivers of high levels of PM 2.5, a particularly hazardous form of air pollution consisting of particles less than 2.5 micrometers in diameter.
Optimal Nested Test Plan for Combinatorial Quantitative Group Testing
We consider the quantitative group testing problem, where the objective is to identify defective items in a given population based on the results of tests performed on subsets of the population. Under the quantitative group testing model, the result of each test reveals the number of defective items in the tested group. The minimum number of tests achievable by nested test plans was established by Aigner and Schughart in 1985 within a minimax framework. The optimal nested test plan offering this performance, however, was not obtained. In this work, we establish the optimal nested test plan in closed form. This optimal nested test plan is also order optimal among all test plans as the population size approaches infinity. Using heavy-hitter detection as a case study, we show via simulation examples an orders-of-magnitude improvement of the group testing approach over two prevailing sampling-based approaches in detection accuracy and counter consumption. Other applications include anomaly detection and wideband spectrum sensing in cognitive radio systems.
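The nested structure can be illustrated with a simple binary-splitting scheme; this is a hedged sketch of the general idea, not the optimal closed-form plan established in the paper. Each counting test on one half of a group reveals how many defectives lie on each side, and subgroups known to contain zero defectives are never tested again.

```python
def count_test(group, defectives, counter):
    """Quantitative group testing oracle: one test on `group` reveals
    the number of defective items it contains."""
    counter[0] += 1
    return len(set(group) & defectives)

def find_defectives(items, d, defectives, counter):
    """Binary-splitting nested search.  `d` is the (known) number of
    defectives among `items`; halves with zero defectives are pruned."""
    if d == 0:
        return set()
    if len(items) == d:
        return set(items)          # every remaining item is defective
    mid = len(items) // 2
    left, right = items[:mid], items[mid:]
    d_left = count_test(left, defectives, counter)
    return (find_defectives(left, d_left, defectives, counter)
            | find_defectives(right, d - d_left, defectives, counter))

items, defectives, counter = list(range(32)), {3, 17, 18}, [0]
d_total = count_test(items, defectives, counter)   # one test on the whole set
found = find_defectives(items, d_total, defectives, counter)
```

For d defectives among n items this uses on the order of d log(n/d) counting tests, far fewer than testing the 32 items individually; the paper's contribution is the exact, closed-form optimal nesting rather than this generic halving.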
Nonadaptive lossy encoding of sparse signals
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 67-68). By Ruby J. Pai. At high rate, a sparse signal is optimally encoded through an adaptive strategy that finds and encodes the signal's representation in the sparsity-inducing basis. This thesis examines how closely the distortion-rate (D(R)) performance of a nonadaptive encoder, one that is not allowed to explicitly specify the sparsity pattern, can approach that of an adaptive encoder. Two methods are studied: first, optimizing the number of nonadaptive measurements that must be encoded, and second, using a binned quantization strategy. Both methods are applicable to a setting in which the decoder knows the sparsity basis and the sparsity level. Through small-problem-size simulations, it is shown that a considerable performance gain can be achieved and that the number of measurements controls a tradeoff between decoding complexity and achievable D(R).
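The binning idea can be sketched for a single scalar coefficient, under the assumption (mine, for illustration; the thesis applies binning in the measurement domain) that the decoder holds a coarse estimate of the value: the encoder transmits the quantizer bin index only modulo B, spending log2(B) bits instead of the full index, and the decoder resolves the ambiguity with its side information.

```python
def bin_encode(y, delta, B):
    """Quantize y with step delta, but transmit only the bin index
    modulo B (log2(B) bits instead of the full index)."""
    return round(y / delta) % B

def bin_decode(idx, y_coarse, delta, B):
    """Pick the bin congruent to idx (mod B) nearest the decoder-side
    estimate y_coarse; correct when |y - y_coarse| < (B/2 - 1/2)*delta."""
    base = round(y_coarse / delta)
    offset = (idx - base) % B
    if offset > B // 2:
        offset -= B                # wrap to the nearest congruent bin
    return (base + offset) * delta

y, delta, B = 3.37, 0.1, 8
idx = bin_encode(y, delta, B)            # 3 bits on the wire
y_hat = bin_decode(idx, 3.2, delta, B)   # decoder knows y only coarsely
```

The reconstruction error stays within half a quantization step, as with full-index encoding, yet the rate depends only on B, which is chosen to cover the decoder's residual uncertainty.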
Low Complexity Regularization of Linear Inverse Problems
Inverse problems and regularization theory are a central theme in contemporary signal processing, where the goal is to reconstruct an unknown signal from partial, indirect, and possibly noisy measurements of it. A now standard method for recovering the unknown signal is to solve a convex optimization problem that enforces some prior knowledge about its structure. This has proved efficient in many problems routinely encountered in imaging sciences, statistics and machine learning. This chapter delivers a review of recent advances in the field where the regularization prior promotes solutions conforming to some notion of simplicity/low-complexity. These priors encompass as popular examples sparsity and group sparsity (to capture the compressibility of natural signals and images), total variation and analysis sparsity (to promote piecewise regularity), and low-rank (as a natural extension of sparsity to matrix-valued data). Our aim is to provide a unified treatment of all these regularizations under a single umbrella, namely the theory of partial smoothness. This framework is very general and accommodates all the low-complexity regularizers just mentioned, as well as many others. Partial smoothness turns out to be the canonical way to encode low-dimensional models that can be linear spaces or more general smooth manifolds. This review is intended to serve as a one-stop shop toward understanding the theoretical properties of the so-regularized solutions. It covers a large spectrum, including: (i) recovery guarantees and stability to noise, both in terms of ℓ2-stability and model (manifold) identification; (ii) sensitivity analysis to perturbations of the parameters involved (in particular the observations), with applications to unbiased risk estimation; (iii) convergence properties of the forward-backward proximal splitting scheme, which is particularly well suited to solving the corresponding large-scale regularized optimization problem.
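The forward-backward proximal splitting scheme mentioned in point (iii) takes a particularly compact form for the ℓ1 (lasso) prior, where the backward step is soft-thresholding (ISTA). The sketch below is a generic illustration under standard assumptions, not code from the chapter.

```python
import numpy as np

def ista(A, b, lam, n_iters=2000):
    """Forward-backward splitting for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Each iteration takes a gradient (forward) step on the smooth data
    term, then applies the proximal (backward) step of the l1 term,
    which is soft-thresholding at level lam/L.
    """
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        g = A.T @ (A @ x - b)          # forward step: gradient of 0.5||Ax-b||^2
        x = x - g / L
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)  # backward step
    return x

# Recover a 3-sparse vector from 50 noisy random measurements.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [1.5, -2.0, 1.0]
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ista(A, b, lam=0.1)
```

Swapping the soft-thresholding line for the proximal operator of another partly smooth regularizer (group soft-thresholding, singular-value thresholding, and so on) yields the corresponding scheme for the other priors the chapter surveys.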