6 research outputs found
Learning the Structure for Structured Sparsity
Structured sparsity has recently emerged in statistics, machine learning and
signal processing as a promising paradigm for learning in high-dimensional
settings. All existing methods for learning under the assumption of structured
sparsity rely on prior knowledge on how to weight (or how to penalize)
individual subsets of variables during the subset selection process, which is
not available in general. Inferring group weights from data is a key open
research problem in structured sparsity. In this paper, we propose a Bayesian
approach to the problem of group weight learning. We model the group weights as
hyperparameters of heavy-tailed priors on groups of variables and derive an
approximate inference scheme to infer these hyperparameters. We empirically
show that we are able to recover the model hyperparameters when the data are
generated from the model, and we demonstrate the utility of learning weights in
synthetic and real denoising problems.
Exclusive Group Lasso for Structured Variable Selection
A structured variable selection problem is considered in which the
covariates, divided into predefined groups, activate according to sparse
patterns with few nonzero entries per group. Capitalizing on the concept of
atomic norm, a composite norm can be properly designed to promote such
exclusive group sparsity patterns. The resulting norm lends itself to efficient
and flexible regularized optimization algorithms for support recovery, like the
proximal algorithm. Moreover, an active set algorithm is proposed that builds
the solution by successively including structure atoms into the estimated
support. It is also shown that such an algorithm can be tailored to match more
rigid structures than plain exclusive group sparsity. Asymptotic consistency
analysis (with both the number of parameters as well as the number of groups
growing with the observation size) establishes the effectiveness of the
proposed solution in terms of signed support recovery under conventional
assumptions. Finally, a set of numerical simulations further corroborates the
results.
Comment: 36 pages, 2 figures. Not submitted for publication. New licens
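As a rough illustration of exclusive group sparsity, the sketch below evaluates the commonly used exclusive lasso penalty, the sum over groups of the squared l1 norm of each group's coefficients. This is an assumption for illustration: the abstract does not state the exact form of its atomic-norm-based composite norm, and the function name and example vectors are hypothetical.

```python
def exclusive_lasso_penalty(coeffs, groups):
    """Sum over groups of (l1 norm of the group's coefficients) squared.

    coeffs: list of floats; groups: list of index lists (predefined groups).
    Illustrative penalty only, not necessarily the norm used in the paper.
    """
    return sum(sum(abs(coeffs[i]) for i in group) ** 2 for group in groups)


groups = [[0, 1], [2, 3]]

# One nonzero entry per group (the exclusive pattern).
one_per_group = [1.0, 0.0, 1.0, 0.0]
# Same l1 and l2 norms, but both nonzeros fall in the same group.
same_group = [1.0, 1.0, 0.0, 0.0]

penalty_exclusive = exclusive_lasso_penalty(one_per_group, groups)
penalty_clustered = exclusive_lasso_penalty(same_group, groups)
print(penalty_exclusive, penalty_clustered)
```

Under this penalty, a vector whose nonzeros are spread one per group costs less than a vector of equal magnitude whose nonzeros share a group, which is the pattern the abstract describes as few nonzero entries per group.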
Scaling Machine Learning Data Repair Systems for Sparse Datasets
Machine learning data repair systems (e.g. HoloClean) have achieved state-of-the-art performance for the data repair problem on many datasets. However, these systems face significant challenges with sparse datasets. In this work, the challenges presented by such datasets to machine learning data repair systems are investigated. Dataset-independent methods are presented to mitigate the effects of data sparseness. Finally, the methods are validated experimentally on a large, sparse real-world dataset, Census, showing that the problem size can be reduced by more than 70%, saving significant computational cost, while still achieving high-accuracy data repairs (94.5% accuracy).
Learning Low-Dimensional Models for Heterogeneous Data
Modern data analysis increasingly involves extracting insights, trends and patterns from large and messy data collected from myriad heterogeneous sources. The scale and heterogeneity present exciting new opportunities for discovery, but also create a need for new statistical techniques and theory tailored to these settings. Traditional intuitions often no longer apply, e.g., when the number of variables measured is comparable to the number of samples obtained. A deeper theoretical understanding is needed to develop principled methods and guidelines for statistical data analysis. This dissertation studies the low-dimensional modeling of high-dimensional data in three heterogeneous settings.
The first heterogeneity is in the quality of samples, and we consider the standard and ubiquitous low-dimensional modeling technique of Principal Component Analysis (PCA). We analyze how well PCA recovers underlying low-dimensional components from high-dimensional data when some samples are noisier than others (i.e., have heteroscedastic noise). Our analysis characterizes the penalty of heteroscedasticity for PCA, and we consider a weighted variant of PCA that explicitly accounts for heteroscedasticity by giving less weight to samples with more noise. We characterize the performance of weighted PCA for all choices of weights and derive optimal weights.
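A minimal sketch of the weighted PCA idea described above, assuming samples are rows of a matrix, no mean-centering is applied, and components are taken as the top eigenvectors of the weighted sample covariance. The weights here are arbitrary illustrative values, not the optimal weights derived in the dissertation, and the function name is hypothetical.

```python
import numpy as np


def weighted_pca(samples, weights, k):
    """Top-k eigenvectors of the weighted covariance sum_i w_i y_i y_i^T / sum_i w_i."""
    Y = np.asarray(samples, dtype=float)   # shape (n_samples, dim)
    w = np.asarray(weights, dtype=float)   # one nonnegative weight per sample
    cov = (Y * w[:, None]).T @ Y / w.sum()
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
    return eigvecs[:, order], eigvals[order]


# Toy data: two large samples along the first axis, two smaller along the second.
samples = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# Equal weights: the first axis dominates.
comp_equal, val_equal = weighted_pca(samples, [1.0, 1.0, 1.0, 1.0], k=1)

# Down-weighting the first two samples (as if they were noisier) flips the result.
comp_down, val_down = weighted_pca(samples, [0.1, 0.1, 1.0, 1.0], k=1)
```

With equal weights the leading component aligns with the first axis; giving the first two samples a small weight makes the second axis the leading component, which shows how sample weights change what PCA recovers.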
The second heterogeneity is in the statistical properties of data, and we generalize the (increasingly) standard method of Canonical Polyadic (CP) tensor decomposition to allow for general statistical assumptions. Traditional CP tensor decomposition is most natural for data with all entries having Gaussian noise of homogeneous variance. Instead, the Generalized CP (GCP) tensor decomposition we propose allows for other statistical assumptions, and we demonstrate its flexibility on various datasets arising in social networks, neuroscience studies and weather patterns. Fitting GCP with alternative statistical assumptions provides new ways to explore trends in the data and yields improved predictions, e.g., of social network and mouse neural data.
The third heterogeneity is in the class of samples, and we consider learning a mixture of low-dimensional subspaces. This model supposes that each sample comes from one of several (unknown) low-dimensional subspaces, which taken together form a union of subspaces (UoS). Samples from the same class come from the same subspace in the union. We consider an ensemble algorithm that clusters the samples, and analyze the approach to provide recovery guarantees. Finally, we propose a sequence of unions of subspaces (SUoS) model that systematically captures samples with heterogeneous complexity, and we describe some early ideas for learning and using SUoS models in patch-based image denoising.
PHD
Electrical Engineering: Systems
University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/150043/1/dahong_1.pd