
    Discovery of low-dimensional structure in high-dimensional inference problems

    Many learning and inference problems involve high-dimensional data such as images, video, or genomic data, which cannot be processed efficiently with conventional methods because of their dimensionality. However, high-dimensional data often exhibit an inherent low-dimensional structure; for instance, they can often be represented sparsely in some basis or domain. Discovering this underlying low-dimensional structure is important for developing more robust and efficient analysis and processing algorithms. The first part of the dissertation investigates the statistical complexity of sparse recovery problems, including sparse linear and nonlinear regression models, feature selection, and graph estimation. We present a framework that unifies sparse recovery problems and construct an analogy to channel coding in classical information theory. We perform an information-theoretic analysis to derive bounds on the number of samples required to reliably recover sparsity patterns, independent of any specific recovery algorithm. In particular, we show that sample complexity can be tightly characterized using a mutual information formula similar to channel coding results. Next, we derive major extensions to this framework, including dependent input variables and a lower bound for sequential adaptive recovery schemes, which helps determine whether adaptivity provides performance gains. We compute statistical complexity bounds for various sparse recovery problems, showing that our analysis improves upon existing bounds and leads to intuitive results for new applications. In the second part, we investigate methods for improving the computational complexity of subgraph detection in graph-structured data, where we aim to discover anomalous patterns present in a connected subgraph of a given graph. This problem arises in many applications such as detection of network intrusions, community detection, and detection of anomalous events in surveillance videos or disease outbreaks. Since optimization over connected subgraphs is a combinatorial and computationally difficult problem, we propose a convex relaxation that offers a principled approach to incorporating connectivity and conductance constraints on candidate subgraphs. We develop a novel nearly-linear time algorithm to solve the relaxed problem, establish convergence and consistency guarantees, and demonstrate its feasibility and performance with experiments on real networks.
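
The mutual-information characterization of sample complexity mentioned in this abstract is, at its core, a Fano-type argument. As an illustrative sketch in hypothetical notation of our own (not taken from the dissertation): if the true support S of a k-sparse pattern is drawn uniformly from the binomial(p, k) candidates, and I-bar denotes the per-sample mutual information between one observation and S, then

```latex
% Illustrative Fano-type converse; notation is ours, not the dissertation's.
\[
  \mathbb{P}\bigl(\hat{S} \neq S\bigr)
  \;\ge\; 1 - \frac{n\,\bar{I} + \log 2}{\log \binom{p}{k}},
  \qquad\text{so reliable support recovery requires}\quad
  n \;\gtrsim\; \frac{\log \binom{p}{k}}{\bar{I}} .
\]
```

This is only the generic "log of the number of hypotheses divided by per-sample information" shape that channel-coding-style bounds take; the dissertation derives problem-specific versions for the various sparse recovery models it considers.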

    Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach

    This paper develops theoretical results regarding noisy 1-bit compressed sensing and sparse binomial regression. We show that a single convex program gives an accurate estimate of the signal, or coefficient vector, for both of these models. We demonstrate that an s-sparse signal in R^n can be accurately estimated from m = O(s log(n/s)) single-bit measurements using a simple convex program. This remains true even if each measurement bit is flipped with probability nearly 1/2. Worst-case (adversarial) noise can also be accounted for, and uniform results that hold for all sparse inputs are derived as well. In the terminology of sparse logistic regression, we show that O(s log(n/s)) Bernoulli trials are sufficient to estimate a coefficient vector in R^n which is approximately s-sparse. Moreover, the same convex program works for virtually all generalized linear models, in which the link function may be unknown. To our knowledge, these are the first results that tie together the theory of sparse logistic regression and 1-bit compressed sensing. Our results apply to general signal structures aside from sparsity; one only needs to know the size of the set K where signals reside. The size is given by the mean width of K, a computable quantity whose square serves as a robust extension of the dimension.
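
A minimal sketch of the kind of single convex program the abstract describes, written with cvxpy: maximize the correlation between the 1-bit measurements y in {-1, +1}^m and Ax over a convex set encoding approximate s-sparsity. The particular constraint set used here (an l1 ball of radius sqrt(s) intersected with the unit l2 ball) is our illustrative assumption; consult the paper for its exact formulation and guarantees.

```python
import numpy as np
import cvxpy as cp

def one_bit_estimate(A, y, s):
    """Sketch: estimate an approximately s-sparse direction from 1-bit measurements
    y_i = sign(<a_i, x> + noise) by maximizing <y, A x> over a convex sparsity set."""
    m, n = A.shape
    x = cp.Variable(n)
    objective = cp.Maximize(cp.sum(cp.multiply(y, A @ x)) / m)
    constraints = [
        cp.norm(x, 1) <= np.sqrt(s),  # promotes (approximate) s-sparsity
        cp.norm(x, 2) <= 1,           # fixes the scale lost to 1-bit quantization
    ]
    cp.Problem(objective, constraints).solve()
    return x.value
```

Note that only the direction of the signal can be recovered: single-bit measurements carry no information about its magnitude, which is why the estimate is constrained to the unit ball.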

    Bayesian Spatio-Temporal Modeling for Forecasting, Trend Assessment and Spatial Trend Filtering

    This work develops Bayesian spatio-temporal modeling techniques specifically aimed at studying several aspects of our motivating applications, including vector-borne disease incidence and air pollution levels. A key attribute of the proposed techniques is that they are scalable to extremely large data sets consisting of spatio-temporally oriented observations. The scalability of our modeling strategies is accomplished in two primary ways. First, through the introduction of carefully constructed latent random variables, we are able to develop Markov chain Monte Carlo (MCMC) sampling algorithms that consist primarily of Gibbs steps. This leads to fast and easy updating of the model parameters from common distributions. Second, for the spatio-temporal aspects of the models, we use a novel sampling strategy for Gaussian Markov random fields (GMRFs) that can be easily implemented (in parallel) within MCMC sampling algorithms. The performance of the proposed modeling strategies is demonstrated through extensive numerical studies, and the strategies are further used to analyze vector-borne disease data measured on canines throughout the conterminous United States and PM 2.5 levels measured at weather stations throughout the Eastern United States. In particular, we begin by developing a Poisson regression model that can be used to forecast the incidence of vector-borne disease throughout a large geographic area. The proposed model accounts for spatio-temporal dependence through a vector autoregression and is fit through a Metropolis-Hastings based MCMC sampling algorithm. The model is used to forecast the prevalence of Lyme disease (Chapter 2) and anaplasmosis (Chapter 3) in canines throughout the United States. As part of these studies we also evaluate the significance of various climatic and socio-economic drivers of disease. We then present (Chapter 4) the development of the 'chromatic sampler' for GMRFs. The chromatic sampler is an MCMC sampling technique that exploits the Markov property of GMRFs to sample large groups of parameters in parallel. A greedy algorithm for finding such groups of parameters is presented. The methodology is found to be superior, in terms of computational effort, to both full-block and single-site updating. For assessing spatio-temporal trends, we develop (Chapter 5) a binomial regression model with spatially varying coefficients. This model uses Gaussian predictive processes to estimate spatially varying coefficients and a conditional autoregressive structure embedded in a vector autoregression to account for spatio-temporal dependence in the data. The methodology is capable of estimating both widespread regional and small-scale local trends. A data augmentation strategy is used to develop a Gibbs based MCMC sampling routine. The approach is made computationally feasible by adopting the chromatic sampler for GMRFs to sample the spatio-temporal random effects. The model is applied to a dataset consisting of 16 million test results for antibodies to Borrelia burgdorferi and used to identify several areas of the United States experiencing increasing Lyme disease risk. For nonparametric functional estimation, we develop (Chapter 6) a Bayesian multidimensional trend filter (BMTF). The BMTF is a flexible nonparametric estimator that extends traditional one-dimensional trend filtering methods to multiple dimensions. The methodology is computationally scalable to a large support space, and the expense of fitting the model is nearly independent of the number of observations. The methodology involves discretizing the support space and estimating a multidimensional step function over the discretized support. Two adaptive methods of discretization, which allow the data to determine the resolution of the resulting function, are presented. The BMTF is then used (Chapter 7) to allow for spatially varying coefficients within a quantile regression model. A data augmentation strategy is introduced which facilitates the development of a Gibbs based MCMC sampling routine. This methodology is developed to study various meteorological drivers of high levels of PM 2.5, a particularly hazardous form of air pollution consisting of particles less than 2.5 micrometers in diameter.
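
The chromatic sampler described above rests on a coloring of the GMRF's Markov graph: nodes that share a color have no edge between them, so they are conditionally independent given all other nodes and can be Gibbs-updated simultaneously within one MCMC sweep. Below is a minimal Python sketch of a greedy coloring step and the resulting parallel-update groups; it is our own illustration, and the dissertation's greedy algorithm may differ in its ordering and details.

```python
def greedy_coloring(adj):
    """Greedily assign colors so that no two neighboring nodes share a color.

    adj: dict mapping each node to the set of its neighbors in the Markov graph.
    Returns a dict node -> color index.
    """
    colors = {}
    # A common heuristic: color high-degree nodes first.
    for node in sorted(adj, key=lambda v: len(adj[v]), reverse=True):
        used = {colors[nbr] for nbr in adj[node] if nbr in colors}
        c = 0
        while c in used:
            c += 1
        colors[node] = c
    return colors

def color_classes(colors):
    """Group nodes by color; each group can be sampled in parallel within a Gibbs sweep."""
    groups = {}
    for node, c in colors.items():
        groups.setdefault(c, []).append(node)
    return groups
```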

    Optimal Nested Test Plan for Combinatorial Quantitative Group Testing

    We consider the quantitative group testing problem, where the objective is to identify defective items in a given population based on the results of tests performed on subsets of the population. Under the quantitative group testing model, the result of each test reveals the number of defective items in the tested group. The minimum number of tests achievable by nested test plans was established by Aigner and Schughart in 1985 within a minimax framework; the optimal nested test plan offering this performance, however, was not obtained. In this work, we establish the optimal nested test plan in closed form. This optimal nested test plan is also order-optimal among all test plans as the population size approaches infinity. Using heavy-hitter detection as a case study, we show via simulation examples that the group testing approach achieves orders-of-magnitude improvements over two prevailing sampling-based approaches in detection accuracy and counter consumption. Other applications include anomaly detection and wideband spectrum sensing in cognitive radio systems.
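
For intuition about the quantitative group testing model, here is a small illustrative nested (prefix-based) scheme in Python: each test returns the number of defective items in a prefix of the remaining population, and the defectives are located one at a time by binary search. This is not the closed-form optimal plan established in the paper, only a sketch of the measurement model and of what nested tests look like.

```python
def find_defectives(n, test):
    """Locate all defectives among items 0..n-1 using quantitative prefix tests.

    test(group) must return the number of defective items in `group`
    (the quantitative group testing model described in the abstract).
    """
    defectives = []
    start = 0
    remaining = test(list(range(n)))          # one test on the whole population
    while remaining > 0:
        lo, hi = start, n - 1                 # binary-search for the first defective >= start
        while lo < hi:
            mid = (lo + hi) // 2
            if test(list(range(start, mid + 1))) >= 1:
                hi = mid
            else:
                lo = mid + 1
        defectives.append(lo)
        start = lo + 1
        remaining -= 1
    return defectives

# Hypothetical usage: two defective items at positions 3 and 7.
truth = {3, 7}
print(find_defectives(10, lambda group: len(set(group) & truth)))  # -> [3, 7]
```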

    Nonadaptive lossy encoding of sparse signals

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 67-68).
    At high rate, a sparse signal is optimally encoded through an adaptive strategy that finds and encodes the signal's representation in the sparsity-inducing basis. This thesis examines how closely the distortion-rate (D(R)) performance of a nonadaptive encoder, one that is not allowed to explicitly specify the sparsity pattern, can approach that of an adaptive encoder. Two methods are studied: first, optimizing the number of nonadaptive measurements that must be encoded, and second, using a binned quantization strategy. Both methods are applicable to a setting in which the decoder knows the sparsity basis and the sparsity level. Through small problem size simulations, it is shown that a considerable performance gain can be achieved and that the number of measurements controls a tradeoff between decoding complexity and achievable D(R).
    by Ruby J. Pai. M.Eng.
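
A brief numpy sketch of the nonadaptive measure-and-quantize pipeline the thesis studies: the encoder applies a fixed random measurement matrix (so it never identifies the sparsity pattern) and uniformly quantizes the measurements. The matrix choice, quantizer, and placeholder decoder below are our illustrative assumptions, not the thesis's scheme; in particular, the binned quantization strategy is not shown, and a real decoder would exploit the known sparsity basis and level (e.g., via l1 minimization).

```python
import numpy as np

def nonadaptive_encode(x, m, delta, seed=0):
    """Encode signal x with m fixed (signal-independent) random measurements,
    followed by uniform scalar quantization with step size delta."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, x.size)) / np.sqrt(m)
    indices = np.round(A @ x / delta).astype(int)   # rate is roughly m * (bits per index)
    return A, indices

def naive_decode(A, indices, delta):
    """Placeholder decoder: dequantize, then least-squares invert.
    A real decoder would use the known sparsity basis and sparsity level."""
    y_hat = indices * delta
    return np.linalg.lstsq(A, y_hat, rcond=None)[0]
```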

    Low Complexity Regularization of Linear Inverse Problems

    Inverse problems and regularization theory form a central theme in contemporary signal processing, where the goal is to reconstruct an unknown signal from partial, indirect, and possibly noisy measurements of it. A now standard method for recovering the unknown signal is to solve a convex optimization problem that enforces some prior knowledge about its structure. This has proved efficient in many problems routinely encountered in imaging sciences, statistics, and machine learning. This chapter delivers a review of recent advances in the field where the regularization prior promotes solutions conforming to some notion of simplicity/low complexity. These priors encompass, as popular examples, sparsity and group sparsity (to capture the compressibility of natural signals and images), total variation and analysis sparsity (to promote piecewise regularity), and low rank (as a natural extension of sparsity to matrix-valued data). Our aim is to provide a unified treatment of all these regularizations under a single umbrella, namely the theory of partial smoothness. This framework is very general and accommodates all the low-complexity regularizers just mentioned, as well as many others. Partial smoothness turns out to be the canonical way to encode low-dimensional models that can be linear spaces or more general smooth manifolds. This review is intended to serve as a one-stop shop toward understanding the theoretical properties of the so-regularized solutions. It covers a large spectrum, including: (i) recovery guarantees and stability to noise, both in terms of ℓ2-stability and model (manifold) identification; (ii) sensitivity analysis to perturbations of the parameters involved (in particular the observations), with applications to unbiased risk estimation; and (iii) convergence properties of the forward-backward proximal splitting scheme, which is particularly well suited to solving the corresponding large-scale regularized optimization problem.
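
As a concrete instance of the forward-backward proximal splitting scheme mentioned at the end of this abstract, here is a short sketch for the l1 (sparsity) regularizer, i.e. the lasso problem min_x 0.5*||Ax - y||^2 + lam*||x||_1. The chapter treats a much broader family of partly smooth regularizers; this toy implementation is ours, not the authors'.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward_l1(A, y, lam, n_iter=500):
    """Forward-backward splitting (ISTA) for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                        # Lipschitz constant of the smooth term's gradient
    step = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                         # forward (explicit gradient) step
        x = soft_threshold(x - step * grad, step * lam)  # backward (proximal) step
    return x
```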