3 research outputs found

    Sampling and Subspace Methods for Learning Sparse Group Structures in Computer Vision

    The unprecedented growth of data in volume and dimension has led to an increasing number of computationally demanding, data-driven decision-making methods in many disciplines, such as computer vision, genomics, and finance. Research on big data aims to understand and describe trends in massive volumes of high-dimensional data. High volume and high dimension are the determining factors in both the computational and the time complexity of algorithms. The challenge grows when the data are formed by the union of group-structures of different dimensions embedded in a high-dimensional ambient space. To address the problem of high volume, we propose a sampling method, referred to as the Sparse Withdrawal of Inliers in a First Trial (SWIFT), which determines the smallest sample size in one grab such that all group-structures are adequately represented and discovered with high probability. The key features of SWIFT are: (i) sparsity, which is independent of the population size; (ii) no prior knowledge of the distribution of the data or of the number of underlying group-structures; and (iii) robustness in the presence of an overwhelming number of outliers. We report a comprehensive study of the proposed sampling method in terms of accuracy, functionality, and effectiveness in reducing the computational cost in various applications of computer vision. In the second part of this dissertation, we study dimensionality reduction for multi-structural data. We propose a probabilistic subspace clustering method that unifies soft- and hard-clustering in a single framework. This is achieved by introducing a delayed association of uncertain points to subspaces of lower dimensions based on a confidence measure. Delayed association yields higher accuracy in clustering subspaces that have ambiguities, e.g., due to intersections and high levels of outliers/noise, and hence leads to a more accurate self-representation of the underlying subspaces. Altogether, this dissertation addresses the key theoretical and practical issues of size and dimension in big data analysis.
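    As a rough illustration of the kind of guarantee SWIFT targets, the sketch below estimates how many points must be drawn in a single grab so that every group-structure contributes at least k inliers with high probability. It is not the dissertation's derivation: the function min_sample_size, the sampling-with-replacement assumption, the union bound, and the example inlier fractions are simplifying assumptions made here for illustration. Note that the resulting sample size depends only on the inlier fractions, not on the population size, mirroring the sparsity property claimed above.

```python
from math import comb

def binom_tail_below(n: int, p: float, k: int) -> float:
    """P(X < k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))

def min_sample_size(group_fracs, k=8, delta=0.01, max_m=100_000):
    """Smallest sample size m such that, under sampling with replacement and a
    union bound, every group-structure receives at least k points with
    probability >= 1 - delta.  group_fracs are the (assumed known) inlier
    fractions of each structure; the rest of the population is outliers."""
    for m in range(k, max_m):
        # Union bound: P(some group gets < k points) <= sum of per-group tails.
        failure_prob = sum(binom_tail_below(m, p, k) for p in group_fracs)
        if failure_prob <= delta:
            return m
    raise ValueError("no sample size up to max_m satisfies the guarantee")

# Example: three structures holding 10%, 5%, and 3% of the data,
# with the remaining 82% of points being outliers.
print(min_sample_size([0.10, 0.05, 0.03], k=8, delta=0.01))
```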

    Learning Low-Dimensional Models for Heterogeneous Data

    Modern data analysis increasingly involves extracting insights, trends, and patterns from large and messy data collected from myriad heterogeneous sources. The scale and heterogeneity present exciting new opportunities for discovery, but also create a need for new statistical techniques and theory tailored to these settings. Traditional intuitions often no longer apply, e.g., when the number of variables measured is comparable to the number of samples obtained. A deeper theoretical understanding is needed to develop principled methods and guidelines for statistical data analysis. This dissertation studies the low-dimensional modeling of high-dimensional data in three heterogeneous settings. The first heterogeneity is in the quality of samples, and we consider the standard and ubiquitous low-dimensional modeling technique of Principal Component Analysis (PCA). We analyze how well PCA recovers underlying low-dimensional components from high-dimensional data when some samples are noisier than others (i.e., have heteroscedastic noise). Our analysis characterizes the penalty of heteroscedasticity for PCA, and we consider a weighted variant of PCA that explicitly accounts for heteroscedasticity by giving less weight to samples with more noise. We characterize the performance of weighted PCA for all choices of weights and derive optimal weights. The second heterogeneity is in the statistical properties of data, and we generalize the (increasingly) standard method of Canonical Polyadic (CP) tensor decomposition to allow for general statistical assumptions. Traditional CP tensor decomposition is most natural for data with all entries having Gaussian noise of homogeneous variance. Instead, the Generalized CP (GCP) tensor decomposition we propose allows for other statistical assumptions, and we demonstrate its flexibility on various datasets arising in social networks, neuroscience studies, and weather patterns. Fitting GCP with alternative statistical assumptions provides new ways to explore trends in the data and yields improved predictions, e.g., of social network and mouse neural data. The third heterogeneity is in the class of samples, and we consider learning a mixture of low-dimensional subspaces. This model supposes that each sample comes from one of several (unknown) low-dimensional subspaces that, taken together, form a union of subspaces (UoS). Samples from the same class come from the same subspace in the union. We consider an ensemble algorithm that clusters the samples, and analyze the approach to provide recovery guarantees. Finally, we propose a sequence of unions of subspaces (SUoS) model that systematically captures samples with heterogeneous complexity, and we describe some early ideas for learning and using SUoS models in patch-based image denoising.
    PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies.
    https://deepblue.lib.umich.edu/bitstream/2027.42/150043/1/dahong_1.pd
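    The weighted PCA described in the first part of this abstract can be sketched compactly: noisier samples are down-weighted before estimating the principal subspace. The snippet below uses the common inverse-variance heuristic w_i = 1/sigma_i^2 rather than the dissertation's optimal weights, and the function name weighted_pca and the synthetic example are illustrative assumptions, not the author's code.

```python
import numpy as np

def weighted_pca(X, noise_vars, n_components=2):
    """Weighted PCA sketch for heteroscedastic samples.

    X          : (n_samples, n_features) matrix, one sample per row.
    noise_vars : per-sample noise variances sigma_i^2.
    Uses the inverse-variance heuristic w_i = 1 / sigma_i^2 (an assumption
    here, not necessarily the optimal weighting) and returns the leading
    eigenvectors and eigenvalues of the weighted sample covariance.
    """
    X = np.asarray(X, dtype=float)
    w = 1.0 / np.asarray(noise_vars, dtype=float)    # down-weight noisy samples
    Xc = X - np.average(X, axis=0, weights=w)        # weighted centering
    C = (Xc * w[:, None]).T @ Xc / w.sum()           # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(C)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]

# Example: 200 samples of a 2-D subspace in 20 dimensions, where half the
# samples carry much more noise than the rest (heteroscedastic setting).
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((20, 2)))[0]    # orthonormal basis
coeffs = rng.standard_normal((200, 2))
sigmas = np.r_[np.full(100, 0.1), np.full(100, 2.0)]  # clean vs. noisy samples
X = coeffs @ U.T + sigmas[:, None] * rng.standard_normal((200, 20))
components, variances = weighted_pca(X, sigmas**2, n_components=2)
```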