Mixtures, envelopes, and hierarchical duality
We develop a connection between mixture and envelope representations of
objective functions that arise frequently in statistics. We refer to this
connection using the term "hierarchical duality." Our results suggest an
interesting and previously under-exploited relationship between marginalization
and profiling, or equivalently between the Fenchel--Moreau theorem for convex
functions and the Bernstein--Widder theorem for Laplace transforms. We give
several different sets of conditions under which such a duality result obtains.
We then extend existing work on envelope representations in several ways,
including novel generalizations to variance-mean models and to multivariate
Gaussian location models. This turns out to provide an elegant missing-data
interpretation of the proximal gradient method, a widely used algorithm in
machine learning. We show several statistical applications in which the
proposed framework leads to easily implemented algorithms, including a robust
version of the fused lasso, nonlinear quantile regression via trend filtering,
and the binomial fused double Pareto model. Code for the examples is available
on GitHub at https://github.com/jgscott/hierduals
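As background for the proximal gradient method mentioned above, here is a minimal NumPy sketch of proximal gradient descent applied to an ordinary lasso problem; the problem data, step size, and iteration count are illustrative choices, and the paper's missing-data interpretation of the iteration is not reproduced here.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient descent."""
    b = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2               # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)                         # gradient of the smooth least-squares part
        b = soft_threshold(b - step * grad, step * lam)  # gradient step, then prox step
    return b

# illustrative synthetic usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta = np.zeros(20); beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.1 * rng.normal(size=100)
print(proximal_gradient_lasso(X, y, lam=1.0).round(2))
```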
Nonregular and Minimax Estimation of Individualized Thresholds in High Dimension with Binary Responses
Given a large number of covariates $Z$, we consider the estimation of a
high-dimensional parameter $\theta$ in an individualized linear threshold
$\theta^T Z$ for a continuous variable $X$, which minimizes the disagreement
between $\mathrm{sign}(X - \theta^T Z)$ and a binary response $Y$. While the
problem can be formulated in the M-estimation framework, minimizing the
corresponding empirical risk function is computationally intractable due to the
discontinuity of the sign function. Moreover, estimating $\theta$ even in the
fixed-dimensional setting is known to be a nonregular problem leading to
nonstandard asymptotic theory. To tackle the computational and theoretical
challenges in the estimation of the high-dimensional parameter $\theta$, we
propose an empirical risk minimization approach based on a regularized smoothed
loss function. The statistical and computational trade-off of the algorithm is
investigated. Statistically, we establish a finite sample error bound for
estimating $\theta$ in the $\ell_2$ norm, expressed in terms of the dimension
$d$ of $\theta$, the sparsity level $s$, the sample size $n$, and the
smoothness $\beta$ of the conditional density of $X$ given the response $Y$ and
the covariates $Z$. The convergence rate is nonstandard and slower than that in
classical Lasso problems. Furthermore, we prove that the resulting estimator is
minimax rate optimal up to a logarithmic factor. Lepski's method is developed
to achieve adaptation to the unknown sparsity $s$ and smoothness $\beta$.
Computationally, an efficient path-following algorithm is proposed to compute
the solution path. We show that this algorithm achieves a geometric rate of
convergence for computing the whole solution path. Finally, we evaluate the
finite sample performance of the proposed estimator in simulation studies and a
real data analysis.
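The abstract does not spell out the smoothed loss, so the sketch below only illustrates the general idea of replacing the discontinuous sign-disagreement loss with a smooth surrogate plus an $\ell_1$ penalty; the sigmoid smoother, bandwidth h, and synthetic data are assumptions made for illustration, not the paper's construction. Any standard first-order or proximal solver could then be applied to this surrogate.

```python
import numpy as np

def smoothed_risk(theta, Z, X, Y, h, lam):
    """l1-regularized smooth surrogate for the sign-disagreement risk.

    The hard indicator 1{X >= theta^T Z} is replaced by a sigmoid with
    bandwidth h, which makes the empirical risk differentiable in theta.
    """
    smooth_ind = 1.0 / (1.0 + np.exp(-(X - Z @ theta) / h))
    disagreement = Y * (1.0 - smooth_ind) + (1.0 - Y) * smooth_ind
    return disagreement.mean() + lam * np.abs(theta).sum()

# illustrative synthetic data: the surrogate risk is small at the data-generating theta
rng = np.random.default_rng(1)
n, d = 200, 10
Z = rng.normal(size=(n, d))
theta_true = np.zeros(d); theta_true[:2] = [1.0, -1.0]
X = Z @ theta_true + rng.normal(size=n)
Y = (X >= Z @ theta_true).astype(float)
print(smoothed_risk(theta_true, Z, X, Y, h=0.1, lam=0.05))
print(smoothed_risk(np.zeros(d), Z, X, Y, h=0.1, lam=0.05))
```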
Robust machine learning by median-of-means: theory and practice
We introduce new estimators for robust machine learning based on
median-of-means (MOM) estimators of the mean of real valued random variables.
These estimators achieve optimal rates of convergence under minimal assumptions
on the dataset. The dataset may also have been corrupted by outliers on which
no assumption is made. We also analyze these new estimators with standard
tools from robust statistics. In particular, we revisit the concept of
breakdown point. We modify the original definition by studying the number of
outliers that a dataset can contain without deteriorating the estimation
properties of a given estimator. This new notion of breakdown number, which
takes into account the statistical performance of the estimators, is
non-asymptotic in nature and adapted for machine learning purposes. We prove
that the breakdown number of our estimator is of the order of (number of
observations) * (rate of convergence). For instance, the breakdown number of our
estimators for the problem of estimating a $d$-dimensional vector with noise
variance $\sigma^2$ is $\sigma^2 d$, and it becomes $\sigma^2 s \log(d/s)$ when
this vector has only $s$ non-zero components. Beyond this breakdown point, we
prove that the rate of convergence achieved by our estimator is (number of
outliers) divided by (number of observations).
Besides these theoretical guarantees, the major improvement brought by these
new estimators is that they are easily computable in practice. In fact,
basically any algorithm used to approximate the standard Empirical Risk
Minimizer (or its regularized versions) has a robust version approximating our
estimators. As a proof of concept, we study many algorithms for the classical
LASSO estimator. A byproduct of the MOM algorithms is a measure of depth of
data that can be used to detect outliers. Comment: 48 pages, 6 figures.
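For context, here is a minimal NumPy sketch of the basic median-of-means estimator of a univariate mean, the building block behind these estimators; the block count and random partition are standard choices, not necessarily the paper's tuning.

```python
import numpy as np

def median_of_means(x, n_blocks, seed=0):
    """Median-of-means: randomly split the sample into blocks, average each block,
    and return the median of the block means."""
    x = np.asarray(x, dtype=float)
    perm = np.random.default_rng(seed).permutation(len(x))
    blocks = np.array_split(x[perm], n_blocks)
    return np.median([b.mean() for b in blocks])

# the estimate stays near 0 despite a few gross outliers, unlike the plain mean
rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=1000)
data[:10] = 1e6                                     # corrupted observations
print(np.mean(data), median_of_means(data, n_blocks=30))
```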
A Dual-Dimer Method for Training Physics-Constrained Neural Networks with Minimax Architecture
Data sparsity is a common issue in training machine learning tools such as
neural networks for engineering and scientific applications, where experiments
and simulations are expensive. Recently physics-constrained neural networks
(PCNNs) were developed to reduce the required amount of training data. However,
the weights of different losses from data and physical constraints are adjusted
empirically in PCNNs. In this paper, a new physics-constrained neural network
with the minimax architecture (PCNN-MM) is proposed so that the weights of
different losses can be adjusted systematically. Training the PCNN-MM amounts to
searching for high-order saddle points of the objective function. A novel
saddle point search algorithm called Dual-Dimer method is developed. It is
demonstrated that the Dual-Dimer method is computationally more efficient than
the gradient descent ascent method for nonconvex-nonconcave functions and
provides additional eigenvalue information to verify search results. A heat
transfer example also shows that the convergence of PCNN-MMs is faster than
that of traditional PCNNs. Comment: 34 pages, 5 figures, accepted by Neural Networks.
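The Dual-Dimer method itself is not reproduced here; the sketch below only shows the simultaneous gradient descent ascent baseline that the paper compares against, on a made-up convex-concave toy objective with a known saddle point.

```python
def gradient_descent_ascent(grad_x, grad_y, x0, y0, eta=0.05, n_iter=2000):
    """Simultaneous gradient descent in x and ascent in y on f(x, y)."""
    x, y = float(x0), float(y0)
    for _ in range(n_iter):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - eta * gx, y + eta * gy           # both players move at once
    return x, y

# toy convex-concave objective f(x, y) = 0.5*x^2 + x*y - 0.5*y^2, saddle at (0, 0)
grad_x = lambda x, y: x + y
grad_y = lambda x, y: x - y
print(gradient_descent_ascent(grad_x, grad_y, x0=1.0, y0=-1.0))
```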
An Extragradient-type Algorithm for Variational Inequality on Hadamard Manifolds
The aim of this paper is to present an extragradient method for the variational
inequality associated with a point-to-set vector field on Hadamard manifolds and
to study its convergence properties. In order to present our method, the concept
of $\varepsilon$-enlargement of maximal monotone vector fields is used, and its
lower semicontinuity is established in order to obtain convergence of the
method in this new context.
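As a point of reference, the classical Euclidean extragradient step (prediction followed by correction) looks as follows; this is the single-valued, flat-space special case, not the point-to-set Hadamard-manifold method developed in the paper.

```python
import numpy as np

def extragradient(F, x0, step=0.3, n_iter=500):
    """Euclidean extragradient iteration for the VI: find x* with <F(x*), x - x*> >= 0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x_pred = x - step * F(x)        # prediction (extrapolation) step
        x = x - step * F(x_pred)        # correction step evaluated at the predicted point
    return x

# monotone rotation field F(x) = A x; plain forward steps x - step*F(x) spiral outward,
# while the extragradient correction converges to the solution x* = 0
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(extragradient(lambda x: A @ x, x0=np.array([2.0, -1.0])))
```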
A survey of sparse representation: algorithms and applications
Sparse representation has attracted much attention from researchers in fields
of signal processing, image processing, computer vision and pattern
recognition. Sparse representation also has a good reputation in both
theoretical research and practical applications. Many different algorithms have
been proposed for sparse representation. The main purpose of this article is to
provide a comprehensive study and an updated review on sparse representation
and to supply guidance for researchers. The taxonomy of sparse representation
methods can be studied from various viewpoints. For example, in terms of
different norm minimizations used in sparsity constraints, the methods can be
roughly categorized into five groups: sparse representation with $\ell_0$-norm
minimization, sparse representation with $\ell_p$-norm ($0<p<1$) minimization,
sparse representation with $\ell_1$-norm minimization, sparse representation
with $\ell_{2,1}$-norm minimization, and sparse representation with
$\ell_2$-norm minimization. In this paper, a comprehensive overview of
sparse representation is provided. The available sparse representation
algorithms can also be empirically categorized into four groups: greedy
strategy approximation, constrained optimization, proximity algorithm-based
optimization, and homotopy algorithm-based sparse representation. The
rationales of different algorithms in each category are analyzed and a wide
range of sparse representation applications are summarized, which could
sufficiently reveal the potential nature of the sparse representation theory.
Specifically, an experimental comparative study of these sparse representation
algorithms is presented. The Matlab code used in this paper is available at:
http://www.yongxu.org/lunwen.html. Comment: Published in IEEE Access, Vol. 3, pp. 490-530, 2015.
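As one concrete representative of the "greedy strategy approximation" category discussed in the survey, here is a minimal orthogonal matching pursuit sketch in Python (the survey's own code is in Matlab); the dictionary size and sparsity level are illustrative.

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal matching pursuit: greedily pick k atoms of D to represent y."""
    residual, support = y.copy(), []
    for _ in range(k):
        corr = np.abs(D.T @ residual)               # correlation of every atom with the residual
        support.append(int(np.argmax(corr)))        # add the most correlated atom
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef         # refit on the current support
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

# recover a 3-sparse code under a random, normalized, overcomplete dictionary
rng = np.random.default_rng(3)
D = rng.normal(size=(64, 256)); D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(256); x_true[[5, 100, 200]] = [1.0, -2.0, 0.5]
y = D @ x_true
print(np.nonzero(omp(D, y, k=3))[0])
```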
Efficient regularization with wavelet sparsity constraints in PAT
In this paper we consider the reconstruction problem of photoacoustic
tomography (PAT) with a flat observation surface. We develop a direct
reconstruction method that employs regularization with wavelet sparsity
constraints. To that end, we derive a wavelet-vaguelette decomposition (WVD)
for the PAT forward operator and a corresponding explicit reconstruction
formula in the case of exact data. In the case of noisy data, we combine the
WVD reconstruction formula with soft-thresholding which yields a spatially
adaptive estimation method. We demonstrate that our method is statistically
optimal for white random noise if the unknown function is assumed to lie in any
Besov-ball. We present generalizations of this approach and, in particular, we
discuss the combination of vaguelette soft-thresholding with a TV prior. We
also provide an efficient implementation of the vaguelette transform that leads
to fast image reconstruction algorithms supported by numerical results. Comment: 25 pages, 6 figures.
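The soft-thresholding rule used on the (vaguelette) coefficients is simple to state; the sketch below applies it to a generic noisy coefficient array and omits the WVD of the PAT forward operator, which is the paper's actual contribution.

```python
import numpy as np

def soft_threshold(coeffs, tau):
    """Shrink coefficients toward zero by tau; coefficients below tau in magnitude vanish."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - tau, 0.0)

# denoise a sparse coefficient vector observed in white noise
rng = np.random.default_rng(7)
clean = np.zeros(100); clean[[10, 40, 75]] = [5.0, -3.0, 4.0]
noisy = clean + 0.3 * rng.normal(size=100)
denoised = soft_threshold(noisy, tau=1.0)
print(np.count_nonzero(noisy), np.count_nonzero(denoised))
```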
Optimal rates for zero-order convex optimization: the power of two function evaluations
We consider derivative-free algorithms for stochastic and non-stochastic
convex optimization problems that use only function values rather than
gradients. Focusing on non-asymptotic bounds on convergence rates, we show that
if pairs of function values are available, algorithms for $d$-dimensional
optimization that use gradient estimates based on random perturbations suffer a
factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic
gradient methods. We establish such results for both smooth and non-smooth
cases, sharpening previous analyses that suggested a worse dimension
dependence, and extend our results to the case of multiple ($m \ge 2$)
evaluations. We complement our algorithmic development with
information-theoretic lower bounds on the minimax convergence rate of such
problems, establishing the sharpness of our achievable results up to constant
(sometimes logarithmic) factors. Comment: 34 pages.
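A standard two-function-evaluation gradient estimate of the kind analyzed here can be sketched as follows; the Gaussian direction, central-difference form, and step sizes are common choices made for illustration, not necessarily the paper's exact scheme.

```python
import numpy as np

def two_point_gradient(f, x, delta, rng):
    """Gradient estimate built from two function values along a random Gaussian direction."""
    u = rng.normal(size=x.shape)
    return (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u

def zero_order_descent(f, x0, step=0.05, delta=1e-4, n_iter=3000, seed=0):
    """Stochastic gradient descent driven only by pairs of function evaluations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - step * two_point_gradient(f, x, delta, rng)
    return x

# minimize a smooth convex function without ever computing its gradient
f = lambda x: np.sum((x - 1.0) ** 2)
print(zero_order_descent(f, x0=np.zeros(5)).round(3))
```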
Learning from Comparisons and Choices
When tracking user-specific online activities, each user's preference is
revealed in the form of choices and comparisons. For example, a user's purchase
history is a record of her choices, i.e. which item was chosen among a subset
of offerings. A user's preferences can be observed either explicitly as in
movie ratings or implicitly as in viewing times of news articles. Given such
individualized ordinal data in the form of comparisons and choices, we address
the problem of collaboratively learning representations of the users and the
items. The learned features can be used to predict a user's preference of an
unseen item to be used in recommendation systems. This also allows one to
compute similarities among users and items to be used for categorization and
search. Motivated by the empirical successes of the MultiNomial Logit (MNL)
model in marketing and transportation, and also more recent successes in word
embedding and crowdsourced image embedding, we pose this problem as learning
the MNL model parameters that best explain the data. We propose a convex
relaxation for learning the MNL model, and show that it is minimax optimal up
to a logarithmic factor by comparing its performance to a fundamental lower
bound. This characterizes the minimax sample complexity of the problem, and
proves that the proposed estimator cannot be improved upon other than by a
logarithmic factor. Further, the analysis identifies how the accuracy depends
on the topology of sampling via the spectrum of the sampling graph. This
provides a guideline for designing surveys when one can choose which items are
to be compared. This is accompanied by numerical simulations on synthetic and
real data sets, confirming our theoretical predictions. Comment: 77 pages, 12 figures; added new experiments and references. arXiv
admin note: substantial text overlap with arXiv:1506.0794
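For readers unfamiliar with the MNL model, the sketch below shows how choice probabilities over an offered subset arise from latent user and item features; the embedding dimensions and the softmax parameterization are schematic, not the paper's convex-relaxation estimator.

```python
import numpy as np

def mnl_choice_probabilities(user_vec, item_vecs, offered):
    """MNL probability that the user picks each item in the offered subset."""
    scores = item_vecs[offered] @ user_vec          # latent utility of each offered item
    scores -= scores.max()                          # stabilize the softmax numerically
    weights = np.exp(scores)
    return weights / weights.sum()

# a 3-dimensional user embedding choosing among 4 offered items from a 10-item catalog
rng = np.random.default_rng(5)
user = rng.normal(size=3)
items = rng.normal(size=(10, 3))
print(mnl_choice_probabilities(user, items, offered=[0, 3, 7, 9]).round(3))
```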
Solving Non-Convex Non-Differentiable Min-Max Games using Proximal Gradient Method
Min-max saddle point games appear in a wide range of applications in machine
learning and signal processing. Despite their wide applicability, theoretical
studies are mostly limited to the special convex-concave structure. While some
recent works generalized these results to special smooth non-convex cases, our
understanding of non-smooth scenarios is still limited. In this work, we study
a special form of non-smooth min-max games in which the objective function is
(strongly) convex with respect to one player's decision variable. We
show that a simple multi-step proximal gradient descent-ascent algorithm
converges to an $\varepsilon$-first-order Nash equilibrium of the min-max game with
the number of gradient evaluations being polynomial in $1/\varepsilon$. We will
also show that our notion of stationarity is stronger than existing ones in the
literature. Finally, we evaluate the performance of the proposed algorithm
through an adversarial attack on a LASSO estimator.
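A schematic of the multi-step proximal gradient descent-ascent idea: several ascent steps on the strongly concave player per single proximal descent step on the nonsmooth player; the toy objective, step sizes, and inner iteration count are illustrative assumptions, not the paper's tuned algorithm.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.| (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def multi_step_prox_gda(grad_x, grad_y, lam, x0, y0,
                        eta_x=0.05, eta_y=0.2, inner=10, n_iter=300):
    """Solve min_x [ max_y f(x, y) ] + lam*|x| via inner ascent in y and an outer prox step in x."""
    x, y = float(x0), float(y0)
    for _ in range(n_iter):
        for _ in range(inner):                      # approximately solve the inner maximization
            y = y + eta_y * grad_y(x, y)
        x = soft_threshold(x - eta_x * grad_x(x, y), eta_x * lam)  # prox handles the l1 term
    return x, y

# toy objective f(x, y) = x*y - 0.5*y^2, strongly concave in y; the regularized minimizer is x* = 0
grad_x = lambda x, y: y
grad_y = lambda x, y: x - y
print(multi_step_prox_gda(grad_x, grad_y, lam=0.1, x0=2.0, y0=0.0))
```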