Hadamard Wirtinger Flow for Sparse Phase Retrieval
We consider the problem of reconstructing an $n$-dimensional $k$-sparse
signal from a set of noiseless magnitude-only measurements. Formulating the
problem as an unregularized empirical risk minimization task, we study the
sample complexity performance of gradient descent with Hadamard
parametrization, which we call Hadamard Wirtinger flow (HWF). Provided
knowledge of the signal sparsity $k$, we prove that a single step of HWF is
able to recover the support from $k\,(x^*_{\max})^{-2}$ (modulo logarithmic term)
samples, where $x^*_{\max}$ is the largest component of the signal in magnitude.
This support recovery procedure can be used to initialize existing
reconstruction methods and yields algorithms with total runtime proportional to
the cost of reading the data and improved sample complexity, which is linear in
$k$ when the signal contains at least one large component. We numerically
investigate the performance of HWF at convergence and show that, while not
requiring any explicit form of regularization nor knowledge of $k$, HWF adapts
to the signal sparsity and reconstructs sparse signals with fewer measurements
than existing gradient-based methods.
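To make the Hadamard parametrization concrete, here is a minimal numerical sketch of the idea described above: the signal is written as an elementwise product x = u * v and plain gradient descent is run on an unregularized risk over squared (intensity) measurements. The problem sizes, initialization scale, and step size below are illustrative assumptions, not the schedule analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem setup: a k-sparse signal in R^n observed through noiseless
# intensity measurements y_i = (a_i^T x_star)^2 (illustrative sizes).
n, k, m = 100, 3, 300
x_star = np.zeros(n)
x_star[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

# Hadamard parametrization x = u * v and the unregularized empirical risk
# f(u, v) = (1/4m) * sum_i ((a_i^T (u*v))^2 - y_i)^2.
def grad(u, v):
    x = u * v
    r = (A @ x) ** 2 - y                 # residuals of the squared measurements
    g_x = A.T @ (r * (A @ x)) / m        # gradient of f with respect to x
    return g_x * v, g_x * u              # chain rule through the Hadamard product

# Plain gradient descent; initialization scale and step size are illustrative.
u = 0.1 * np.ones(n)
v = 0.1 * np.ones(n)
eta = 0.01
for _ in range(6000):
    g_u, g_v = grad(u, v)
    u -= eta * g_u
    v -= eta * g_v

x_hat = u * v
# Phase retrieval only determines the signal up to a global sign.
err = min(np.linalg.norm(x_hat - x_star), np.linalg.norm(x_hat + x_star))
print("relative error:", err / np.linalg.norm(x_star))
```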
Performance of $\ell_1$ Regularization for Sparse Convex Optimization
Despite widespread adoption in practice, guarantees for the LASSO and Group
LASSO are strikingly lacking in settings beyond statistical problems, and these
algorithms are usually considered to be a heuristic in the context of sparse
convex optimization on deterministic inputs. We give the first recovery
guarantees for the Group LASSO for sparse convex optimization with
vector-valued features. We show that if a sufficiently large Group LASSO
regularization is applied when minimizing a strictly convex function, then
the minimizer is a sparse vector supported on vector-valued features with the
largest $\ell_2$-norm of the gradient. Thus, repeating this procedure selects
the same set of features as the Orthogonal Matching Pursuit algorithm, which
admits recovery guarantees for any function with restricted strong
convexity and smoothness via weak submodularity arguments. This answers open
questions of Tibshirani et al. and Yasuda et al. Our result is the first to
theoretically explain the empirical success of the Group LASSO for convex
functions under general input instances assuming only restricted strong
convexity and smoothness. Our result also generalizes provable guarantees for
the Sequential Attention algorithm of Yasuda et al., a feature selection
algorithm inspired by the attention mechanism.
As an application of our result, we give new results for the column subset
selection problem, which is well-studied when the loss is the Frobenius norm or
other entrywise matrix losses. We give the first result for general loss
functions for this problem that requires only restricted strong convexity and
smoothness.
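The selection rule described in this abstract can be sketched directly: at each round, pick the group (vector-valued feature) whose block of the gradient has the largest l2-norm, then re-minimize over the selected groups, which is the Orthogonal Matching Pursuit style procedure the abstract says a large enough Group LASSO penalty reproduces. Least squares below stands in for a generic strictly convex loss; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression with vector-valued features: 30 groups of 4 coordinates,
# of which 3 groups are active.
n, num_groups, group_size, k = 150, 30, 4, 3
d = num_groups * group_size
groups = [list(range(g * group_size, (g + 1) * group_size)) for g in range(num_groups)]

X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
true_groups = sorted(rng.choice(num_groups, k, replace=False).tolist())
for g in true_groups:
    beta_true[groups[g]] = rng.standard_normal(group_size)
y = X @ beta_true

def grad(beta):
    # Gradient of the illustrative strictly convex loss f(beta) = 0.5 * ||X beta - y||^2.
    return X.T @ (X @ beta - y)

# Greedy (OMP-style) group selection: largest gradient-block norm, then re-fit.
support, beta = [], np.zeros(d)
for _ in range(k):
    g_vec = grad(beta)
    scores = [np.linalg.norm(g_vec[idx]) for idx in groups]
    for g in support:
        scores[g] = -np.inf                      # never re-select a chosen group
    support.append(int(np.argmax(scores)))
    cols = [j for g in support for j in groups[g]]
    beta = np.zeros(d)
    beta[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]

print("selected groups:", sorted(support))
print("true groups    :", true_groups)
```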
More is Less: Inducing Sparsity via Overparameterization
In deep learning it is common to overparameterize neural networks, that is,
to use more parameters than training samples. Quite surprisingly, training the
neural network via (stochastic) gradient descent leads to models that
generalize very well, while classical statistics would suggest overfitting. In
order to gain an understanding of this implicit bias phenomenon, we study the
special case of sparse recovery (compressed sensing) which is of interest on
its own. More precisely, in order to reconstruct a vector from underdetermined
linear measurements, we introduce a corresponding overparameterized square loss
functional, where the vector to be reconstructed is deeply factorized into
several vectors. We show that, if there exists an exact solution, vanilla
gradient flow for the overparameterized loss functional converges to a good
approximation of the solution of minimal $\ell_1$-norm. The latter is
well-known to promote sparse solutions. As a by-product, our results
significantly improve the sample complexity for compressed sensing via gradient
flow/descent on overparameterized models derived in previous works. The theory
accurately predicts the recovery rate in numerical experiments. Our proof
relies on analyzing a certain Bregman divergence of the flow. This bypasses the
obstacles caused by non-convexity and should be of independent interest.
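A minimal sketch of the overparameterized formulation described above, assuming one natural depth-3 factorization x = u*u*u - v*v*v (the specific depth, initialization scale, and step size are illustrative choices, not the paper's): vanilla gradient descent on the square loss from a small initialization tends to return a solution with small l1-norm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdetermined linear measurements y = A x_true of a sparse vector.
n, m, k = 100, 50, 5
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# Overparameterized square loss f(u, v) = 0.5 * ||A (u^3 - v^3) - y||^2 with the
# vector deeply factorized; vanilla gradient descent from a small initialization.
alpha, eta, T = 0.1, 0.01, 20000
u = alpha * np.ones(n)
v = alpha * np.ones(n)
for _ in range(T):
    x = u**3 - v**3
    g_x = A.T @ (A @ x - y)        # gradient of the square loss w.r.t. x
    u -= eta * 3.0 * u**2 * g_x    # chain rule through the cubic factorization
    v += eta * 3.0 * v**2 * g_x

x_hat = u**3 - v**3
print("l1 norms (recovered vs. true):", np.linalg.norm(x_hat, 1), np.linalg.norm(x_true, 1))
print("relative l2 error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```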
Gibbs Sampling using Anti-correlation Gaussian Data Augmentation, with Applications to L1-ball-type Models
L1-ball-type priors are a recent generalization of spike-and-slab priors.
By transforming a continuous precursor distribution to the L1-ball boundary,
they induce exact zeros with positive prior and posterior probabilities. With great
flexibility in choosing the precursor and threshold distributions, we can
easily specify models under structured sparsity, such as those with dependent
probability for zeros and smoothness among the non-zeros. Motivated to
significantly accelerate the posterior computation, we propose a new data
augmentation that leads to a fast block Gibbs sampling algorithm. The latent
variable, named ``anti-correlation Gaussian'', cancels out the quadratic
exponent term in the latent Gaussian distribution, making the parameters of
interest conditionally independent so that they can be updated in a block.
Compared to existing algorithms such as the No-U-Turn sampler, the new blocked
Gibbs sampler has a very low computing cost per iteration and shows rapid
mixing of Markov chains. We establish the geometric ergodicity guarantee of the
algorithm in linear models. Further, we show useful extensions of our algorithm
for posterior estimation of general latent Gaussian models, such as those
involving multivariate truncated Gaussian or latent Gaussian process.
Keywords: Blocked Gibbs sampler; Fast Mixing of Markov Chains; Latent Gaussian Models; Soft-thresholding
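The transformation described in the first sentences of this abstract, mapping a continuous precursor to the L1 ball so that exact zeros appear with positive probability, can be sketched with the standard Euclidean projection onto the l1-ball (a data-dependent soft-thresholding). The sampler itself is not reproduced here; the precursor and radius below are illustrative, and the projection map is an assumption about the precise form of the transformation.

```python
import numpy as np

def project_l1_ball(beta, radius):
    """Euclidean projection onto {x : ||x||_1 <= radius} via soft-thresholding."""
    if np.abs(beta).sum() <= radius:
        return beta.copy()
    u = np.sort(np.abs(beta))[::-1]                    # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / np.arange(1, len(u) + 1) > 0)[0][-1]
    tau = (css[rho] - radius) / (rho + 1)              # data-dependent threshold
    return np.sign(beta) * np.maximum(np.abs(beta) - tau, 0.0)

# One prior draw: a continuous Gaussian precursor mapped onto the l1 ball.
# The map produces exact zeros with positive probability.
rng = np.random.default_rng(3)
precursor = rng.standard_normal(10)
theta = project_l1_ball(precursor, radius=2.0)
print("precursor:", np.round(precursor, 2))
print("theta    :", np.round(theta, 2))
print("number of exact zeros:", int(np.sum(theta == 0.0)))
```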
Non-negative Least Squares via Overparametrization
In many applications, solutions of numerical problems are required to be
non-negative, e.g., when retrieving pixel intensity values or physical
densities of a substance. In this context, non-negative least squares (NNLS) is
a ubiquitous tool, e.g., when seeking sparse solutions of high-dimensional
statistical problems. Despite vast efforts since the seminal work of Lawson and
Hanson in the '70s, the non-negativity assumption is still an obstacle for the
theoretical analysis and scalability of many off-the-shelf solvers. In the
different context of deep neural networks, it has recently been observed that
training overparametrized models via gradient descent leads to surprising
generalization properties and the retrieval of regularized solutions. In this
paper, we prove that, by using an overparametrized formulation, NNLS solutions
can reliably be approximated via vanilla gradient flow. We furthermore
establish stability of the method against negative perturbations of the
ground-truth. Our simulations confirm that this allows the use of vanilla
gradient descent as a novel and scalable numerical solver for NNLS. From a
conceptual point of view, our work proposes a novel approach to trading
side-constraints in optimization problems against complexity of the
optimization landscape, one that does not build upon the concept of Lagrange
multipliers.
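A minimal sketch of the approach this abstract describes, assuming one natural overparametrized formulation (the quadratic reparametrization x = w*w, which is non-negative by construction): vanilla gradient descent replaces the constrained solver, and its output is compared against SciPy's classical NNLS routine. Step size, initialization, and sizes are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)

# Overdetermined system with a non-negative ground truth (several exact zeros).
m, n = 80, 40
x_true = np.maximum(rng.standard_normal(n), 0.0)
A = rng.standard_normal((m, n)) / np.sqrt(m)
b = A @ x_true + 0.01 * rng.standard_normal(m)

# Unconstrained gradient descent on f(w) = 0.5 * ||A (w*w) - b||^2.
w = 0.5 * np.ones(n)
eta, T = 0.02, 20000
for _ in range(T):
    r = A @ (w * w) - b
    w -= eta * 2.0 * w * (A.T @ r)     # chain rule through x = w*w

x_gd = w * w
x_ref, _ = nnls(A, b)                  # reference solution from scipy's NNLS solver
print("max abs difference to scipy NNLS:", np.max(np.abs(x_gd - x_ref)))
```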
Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data
A central goal in social science is to evaluate the causal effect of a
policy. One dominant approach is through panel data analysis in which the
behaviors of multiple units are observed over time. The information across time
and space motivates two general approaches: (i) horizontal regression (i.e.,
unconfoundedness), which exploits time series patterns, and (ii) vertical
regression (e.g., synthetic controls), which exploits cross-sectional patterns.
Conventional wisdom states that the two approaches are fundamentally different.
We establish this position to be partly false for estimation but generally true
for inference. In particular, we prove that both approaches yield identical
point estimates under several standard settings. For the same point estimate,
however, each approach quantifies uncertainty with respect to a distinct
estimand. In turn, the confidence interval developed for one estimand may have
incorrect coverage for another. This emphasizes that the source of randomness
that researchers assume has direct implications for the accuracy of inference.
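The estimation side of this claim can be illustrated on a toy panel. In the sketch below, both regressions are fit by minimum-norm least squares on the same control block, which is one illustrative instance of the "standard settings" the abstract refers to (the paper's exact conditions and its inference results are not reproduced). With this construction the two point estimates coincide algebraically, since the pseudoinverse of a transpose is the transpose of the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy panel: N units observed over T periods; unit N is treated in period T,
# so the counterfactual Y[N, T] must be imputed from the control block.
N, T, r = 20, 15, 3
U, V = rng.standard_normal((N, r)), rng.standard_normal((T, r))
Y = U @ V.T + 0.1 * rng.standard_normal((N, T))     # approximately low-rank panel

Y0 = Y[:-1, :-1]        # control units, pre-treatment periods
y_post = Y[:-1, -1]     # control units, treatment period
y_pre = Y[-1, :-1]      # treated unit, pre-treatment periods

# Horizontal ("unconfoundedness") regression: relate the treatment period to the
# pre-periods across control units, then apply the fit to the treated unit.
w_hz = np.linalg.pinv(Y0) @ y_post
est_hz = y_pre @ w_hz

# Vertical ("synthetic control") regression: weight control units to reproduce the
# treated unit's pre-period trajectory, then apply the weights in period T.
w_vt = np.linalg.pinv(Y0.T) @ y_pre
est_vt = y_post @ w_vt

print("horizontal estimate:", est_hz)
print("vertical estimate  :", est_vt)   # identical up to rounding in this construction
```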
Tensor Networks for Dimensionality Reduction and Large-Scale Optimizations. Part 2: Applications and Future Perspectives
Part 2 of this monograph builds on the introduction to tensor networks and
their operations presented in Part 1. It focuses on tensor network models for
super-compressed higher-order representation of data/parameters and related
cost functions, while providing an outline of their applications in machine
learning and data analytics. A particular emphasis is on the tensor train (TT)
and Hierarchical Tucker (HT) decompositions, and their physically meaningful
interpretations which reflect the scalability of the tensor network approach.
Through a graphical approach, we also elucidate how, by virtue of the
underlying low-rank tensor approximations and sophisticated contractions of
core tensors, tensor networks have the ability to perform distributed
computations on otherwise prohibitively large volumes of data/parameters,
thereby alleviating or even eliminating the curse of dimensionality. The
usefulness of this concept is illustrated over a number of applied areas,
including generalized regression and classification (support tensor machines,
canonical correlation analysis, higher order partial least squares),
generalized eigenvalue decomposition, Riemannian optimization, and in the
optimization of deep neural networks. Part 1 and Part 2 of this work can be
used either as stand-alone separate texts, or indeed as a conjoint
comprehensive review of the exciting field of low-rank tensor networks and
tensor decompositions.
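For intuition about the tensor train (TT) format emphasized above, here is a textbook-style TT-SVD sketch (not code from the monograph): a d-way tensor is split into a train of 3-way cores by sequential reshaping and truncated SVDs, and contracting the cores recovers the tensor.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    # TT-SVD: peel off one 3-way core per mode via a truncated SVD of the
    # suitably reshaped remainder; max_rank caps every TT rank.
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        mat = (np.diag(s[:r]) @ Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def tt_full(cores):
    # Contract the train back into a dense tensor (only sensible for small examples).
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([core.shape[1] for core in cores])

# A small tensor that is a sum of three rank-1 terms, so TT rank 3 is exact.
rng = np.random.default_rng(6)
X = sum(np.einsum('i,j,k,l->ijkl', *[rng.standard_normal(s) for s in (4, 5, 6, 7)])
        for _ in range(3))
cores = tt_svd(X, max_rank=3)
print("core shapes:", [c.shape for c in cores])
print("relative error:", np.linalg.norm(tt_full(cores) - X) / np.linalg.norm(X))
```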
Implicit regularization in AI meets generalized hardness of approximation in optimization -- Sharp results for diagonal linear networks
Understanding the implicit regularization imposed by neural network
architectures and gradient based optimization methods is a key challenge in
deep learning and AI. In this work we provide sharp results for the implicit
regularization imposed by the gradient flow of Diagonal Linear Networks (DLNs)
in the over-parameterized regression setting and, potentially surprisingly,
link this to the phenomenon of phase transitions in generalized hardness of
approximation (GHA). GHA generalizes the phenomenon of hardness of
approximation from computer science to, among others, continuous and robust
optimization. It is well-known that the $\ell^1$-norm of the gradient flow of
DLNs with tiny initialization converges to the objective function of basis
pursuit. We improve upon these results by showing that the gradient flow of
DLNs with tiny initialization approximates minimizers of the basis pursuit
optimization problem (as opposed to just the objective function), and we obtain
new and sharp convergence bounds w.r.t.\ the initialization size. Non-sharpness
of our results would imply that the GHA phenomenon would not occur for the
basis pursuit optimization problem -- which is a contradiction -- thus implying
sharpness. Moreover, we characterize which minimizer of the
basis pursuit problem is chosen by the gradient flow whenever the minimizer is
not unique. Interestingly, this depends on the depth of the DLN.
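A minimal sketch of the phenomenon this abstract studies, assuming a depth-2 diagonal linear network x = u*u - v*v (the paper considers general depth, and the initialization scale and step size here are illustrative): gradient descent from a tiny initialization is compared against a basis pursuit minimizer computed by linear programming.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)

# Underdetermined system y = A x_true with a sparse ground truth.
n, m, k = 60, 30, 4
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# Depth-2 diagonal linear network trained by plain gradient descent on
# 0.5 * ||A x - y||^2 from a tiny initialization alpha (gradient-flow surrogate).
alpha, eta, T = 1e-3, 0.05, 20000
u = alpha * np.ones(n)
v = alpha * np.ones(n)
for _ in range(T):
    g_x = A.T @ (A @ (u * u - v * v) - y)
    u -= eta * 2.0 * u * g_x
    v += eta * 2.0 * v * g_x
x_dln = u * u - v * v

# Basis pursuit minimizer: min ||x||_1 subject to A x = y, via x = p - q, p, q >= 0.
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None), method="highs")
x_bp = res.x[:n] - res.x[n:]

print("l1 norms (DLN vs. basis pursuit):", np.linalg.norm(x_dln, 1), np.linalg.norm(x_bp, 1))
print("max abs difference:", np.max(np.abs(x_dln - x_bp)))
```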