
    Mixtures, envelopes, and hierarchical duality

    We develop a connection between mixture and envelope representations of objective functions that arise frequently in statistics. We refer to this connection using the term "hierarchical duality." Our results suggest an interesting and previously under-exploited relationship between marginalization and profiling, or equivalently between the Fenchel--Moreau theorem for convex functions and the Bernstein--Widder theorem for Laplace transforms. We give several different sets of conditions under which such a duality result obtains. We then extend existing work on envelope representations in several ways, including novel generalizations to variance-mean models and to multivariate Gaussian location models. This turns out to provide an elegant missing-data interpretation of the proximal gradient method, a widely used algorithm in machine learning. We show several statistical applications in which the proposed framework leads to easily implemented algorithms, including a robust version of the fused lasso, nonlinear quantile regression via trend filtering, and the binomial fused double Pareto model. Code for the examples is available on GitHub at https://github.com/jgscott/hierduals
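
    As a point of reference for the proximal gradient method discussed above, here is a minimal sketch for a lasso-type objective; the missing-data/envelope derivation from the paper is not reproduced, and the function and variable names are illustrative only.

```python
# Minimal proximal gradient sketch for min_x 0.5*||A x - b||^2 + lam*||x||_1.
# Illustrative only; the step size uses the Lipschitz constant of the smooth part.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, n_iter=500):
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / L, L = largest eigenvalue of A^T A
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                # gradient of the smooth quadratic term
        x = soft_threshold(x - step * grad, step * lam)
    return x
```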

    Nonregular and Minimax Estimation of Individualized Thresholds in High Dimension with Binary Responses

    Given a large number of covariates $Z$, we consider the estimation of a high-dimensional parameter $\theta$ in an individualized linear threshold $\theta^T Z$ for a continuous variable $X$, which minimizes the disagreement between $\mathrm{sign}(X-\theta^T Z)$ and a binary response $Y$. While the problem can be formulated within the M-estimation framework, minimizing the corresponding empirical risk function is computationally intractable due to the discontinuity of the sign function. Moreover, estimating $\theta$ even in the fixed-dimensional setting is known to be a nonregular problem leading to nonstandard asymptotic theory. To tackle the computational and theoretical challenges in the estimation of the high-dimensional parameter $\theta$, we propose an empirical risk minimization approach based on a regularized smoothed loss function. The statistical and computational trade-off of the algorithm is investigated. Statistically, we show that the finite-sample error bound for estimating $\theta$ in the $\ell_2$ norm is $(s\log d/n)^{\beta/(2\beta+1)}$, where $d$ is the dimension of $\theta$, $s$ is the sparsity level, $n$ is the sample size and $\beta$ is the smoothness of the conditional density of $X$ given the response $Y$ and the covariates $Z$. The convergence rate is nonstandard and slower than that in classical Lasso problems. Furthermore, we prove that the resulting estimator is minimax rate optimal up to a logarithmic factor. Lepski's method is developed to achieve adaptation to the unknown sparsity $s$ and smoothness $\beta$. Computationally, an efficient path-following algorithm is proposed to compute the solution path. We show that this algorithm achieves a geometric rate of convergence for computing the whole path. Finally, we evaluate the finite-sample performance of the proposed estimator in simulation studies and a real data analysis.
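
    To make the role of the smoothed loss concrete, the following is a hypothetical sketch of a regularized smooth surrogate for the 0-1 disagreement loss; the bandwidth h, the logistic smoother, and the penalty weight lam are illustrative choices, not the paper's exact construction.

```python
# Smooth surrogate for the 0-1 loss 1{ y * (x - z^T theta) < 0 } plus an l1 penalty.
# Y is assumed to take values in {-1, +1}; all tuning constants are placeholders.
import numpy as np

def smoothed_risk(theta, Z, X, Y, h=0.1, lam=0.01):
    margins = Y * (X - Z @ theta)                   # positive margin = sign agreement
    surrogate = 1.0 / (1.0 + np.exp(margins / h))   # smooth stand-in for 1{margin < 0}
    return surrogate.mean() + lam * np.abs(theta).sum()
```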

    Robust machine learning by median-of-means: theory and practice

    We introduce new estimators for robust machine learning based on median-of-means (MOM) estimators of the mean of real-valued random variables. These estimators achieve optimal rates of convergence under minimal assumptions on the dataset. The dataset may also have been corrupted by outliers on which no assumption is granted. We also analyze these new estimators with standard tools from robust statistics. In particular, we revisit the concept of breakdown point. We modify the original definition by studying the number of outliers that a dataset can contain without deteriorating the estimation properties of a given estimator. This new notion of breakdown number, which takes into account the statistical performance of the estimators, is non-asymptotic in nature and adapted for machine learning purposes. We prove that the breakdown number of our estimator is of the order of (number of observations) * (rate of convergence). For instance, the breakdown number of our estimators for the problem of estimating a $d$-dimensional vector with noise variance $\sigma^2$ is $\sigma^2 d$, and it becomes $\sigma^2 s \log(d/s)$ when this vector has only $s$ non-zero components. Beyond this breakdown point, we prove that the rate of convergence achieved by our estimator is (number of outliers) divided by (number of observations). Besides these theoretical guarantees, the major improvement brought by these new estimators is that they are easily computable in practice. In fact, basically any algorithm used to approximate the standard Empirical Risk Minimizer (or its regularized versions) has a robust version approximating our estimators. As a proof of concept, we study many algorithms for the classical LASSO estimator. A byproduct of the MOM algorithms is a measure of depth of data that can be used to detect outliers. Comment: 48 pages, 6 figures.
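
    The basic building block, a median-of-means estimate of a real mean, is easy to state; the sketch below is illustrative, and the number of blocks K is a tuning parameter rather than a prescription from the paper.

```python
# Median-of-means: split the sample into K blocks, average within each block,
# and return the median of the block averages. Robust to a few outlying blocks.
import numpy as np

def median_of_means(x, K=10, seed=0):
    x = np.random.default_rng(seed).permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(x, K)
    return np.median([b.mean() for b in blocks])
```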

    A Dual-Dimer Method for Training Physics-Constrained Neural Networks with Minimax Architecture

    Data sparsity is a common issue in training machine learning tools such as neural networks for engineering and scientific applications, where experiments and simulations are expensive. Recently, physics-constrained neural networks (PCNNs) were developed to reduce the required amount of training data. However, the weights of the different losses from data and physical constraints are adjusted empirically in PCNNs. In this paper, a new physics-constrained neural network with a minimax architecture (PCNN-MM) is proposed so that the weights of the different losses can be adjusted systematically. Training the PCNN-MM amounts to searching for high-order saddle points of the objective function. A novel saddle-point search algorithm called the Dual-Dimer method is developed. It is demonstrated that the Dual-Dimer method is computationally more efficient than the gradient descent-ascent method for nonconvex-nonconcave functions and provides additional eigenvalue information to verify search results. A heat transfer example also shows that the convergence of PCNN-MMs is faster than that of traditional PCNNs. Comment: 34 pages, 5 figures; accepted by Neural Networks.
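
    The Dual-Dimer method itself is not reproduced here; for orientation, the following is a sketch of the gradient descent-ascent baseline mentioned in the abstract, with generic gradient callbacks and illustrative step sizes.

```python
# Plain gradient descent-ascent for a minimax objective f(x, y):
# x takes descent steps, y takes ascent steps. Step size and iteration count
# are placeholders; this is the baseline, not the Dual-Dimer saddle-point search.
import numpy as np

def gradient_descent_ascent(grad_x, grad_y, x0, y0, lr=1e-2, n_iter=1000):
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = x - lr * gx     # descent on the minimizing player
        y = y + lr * gy     # ascent on the maximizing player
    return x, y
```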

    An Extragradient-type Algorithm for Variational Inequality on Hadamard Manifolds

    The aim of this paper is to present an extragradient method for a variational inequality associated with a point-to-set vector field on Hadamard manifolds and to study its convergence properties. To present our method, the concept of $\epsilon$-enlargement of maximal monotone vector fields is used, and its lower semicontinuity is established in order to obtain the convergence of the method in this new context.
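
    For readers unfamiliar with the extragradient idea, here is the classical Euclidean (Korpelevich-type) iteration for a single-valued monotone operator F; the paper's setting replaces the straight-line updates with exponential maps on a Hadamard manifold and works with $\epsilon$-enlargements of point-to-set fields, neither of which is shown.

```python
# Euclidean extragradient sketch for the variational inequality <F(x*), x - x*> >= 0:
# a predictor step at x, then a corrector step using F evaluated at the predictor.
# Unconstrained for simplicity; a projection onto the feasible set is omitted.
import numpy as np

def extragradient(F, x0, step=0.1, n_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        y = x - step * F(x)     # predictor (extrapolation) step
        x = x - step * F(y)     # corrector step at the predicted point
    return x
```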

    A survey of sparse representation: algorithms and applications

    Sparse representation has attracted much attention from researchers in the fields of signal processing, image processing, computer vision and pattern recognition. Sparse representation also has a good reputation in both theoretical research and practical applications. Many different algorithms have been proposed for sparse representation. The main purpose of this article is to provide a comprehensive study and an updated review of sparse representation and to supply guidance for researchers. The taxonomy of sparse representation methods can be studied from various viewpoints. For example, in terms of the different norm minimizations used in sparsity constraints, the methods can be roughly categorized into five groups: sparse representation with $l_0$-norm minimization, sparse representation with $l_p$-norm ($0 < p < 1$) minimization, sparse representation with $l_1$-norm minimization, sparse representation with $l_{2,1}$-norm minimization, and sparse representation with $l_2$-norm minimization. In this paper, a comprehensive overview of sparse representation is provided. The available sparse representation algorithms can also be empirically categorized into four groups: greedy strategy approximation, constrained optimization, proximity algorithm-based optimization, and homotopy algorithm-based sparse representation. The rationales of the different algorithms in each category are analyzed and a wide range of sparse representation applications are summarized, which could sufficiently reveal the potential nature of sparse representation theory. Specifically, an experimentally comparative study of these sparse representation algorithms is presented. The Matlab code used in this paper is available at: http://www.yongxu.org/lunwen.html. Comment: Published in IEEE Access, Vol. 3, pp. 490-530, 2015.
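
    As one concrete representative of the "greedy strategy approximation" group in this taxonomy, the following is an illustrative orthogonal matching pursuit sketch (not code from the survey or from the linked Matlab package).

```python
# Orthogonal matching pursuit: greedily pick the dictionary atom most correlated
# with the current residual, then refit the coefficients by least squares on the
# selected support. D has unit-norm columns (atoms); k is the target sparsity.
import numpy as np

def omp(D, y, k):
    support, x = [], np.zeros(D.shape[1])
    residual = y.astype(float).copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))     # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x
```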

    Efficient regularization with wavelet sparsity constraints in PAT

    In this paper we consider the reconstruction problem of photoacoustic tomography (PAT) with a flat observation surface. We develop a direct reconstruction method that employs regularization with wavelet sparsity constraints. To that end, we derive a wavelet-vaguelette decomposition (WVD) for the PAT forward operator and a corresponding explicit reconstruction formula in the case of exact data. In the case of noisy data, we combine the WVD reconstruction formula with soft-thresholding, which yields a spatially adaptive estimation method. We demonstrate that our method is statistically optimal for white random noise if the unknown function is assumed to lie in any Besov ball. We present generalizations of this approach and, in particular, discuss the combination of vaguelette soft-thresholding with a TV prior. We also provide an efficient implementation of the vaguelette transform that leads to fast image reconstruction algorithms supported by numerical results. Comment: 25 pages, 6 figures.
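
    The generic "transform, shrink, invert" pattern behind wavelet soft-thresholding can be sketched in 1-D with PyWavelets as below; the paper's estimator uses a wavelet-vaguelette decomposition of the PAT forward operator rather than a plain wavelet transform, and the wavelet and threshold here are arbitrary illustrative choices.

```python
# Toy 1-D wavelet soft-thresholding denoiser: decompose, shrink the detail
# coefficients, reconstruct. Illustrates the shrinkage step only, not the WVD.
import pywt

def wavelet_denoise(signal, wavelet="db4", thresh=0.1):
    coeffs = pywt.wavedec(signal, wavelet)                  # analysis
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft")    # soft-threshold detail coefficients
                  for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)                    # synthesis
```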

    Optimal rates for zero-order convex optimization: the power of two function evaluations

    We consider derivative-free algorithms for stochastic and non-stochastic convex optimization problems that use only function values rather than gradients. Focusing on non-asymptotic bounds on convergence rates, we show that if pairs of function values are available, algorithms for $d$-dimensional optimization that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic gradient methods. We establish such results for both smooth and non-smooth cases, sharpening previous analyses that suggested a worse dimension dependence, and extend our results to the case of multiple ($m \ge 2$) evaluations. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, establishing the sharpness of our achievable results up to constant (sometimes logarithmic) factors. Comment: 34 pages.
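
    The two-evaluation gradient estimate at the heart of such algorithms can be sketched as follows; the Gaussian direction and smoothing parameter h are standard illustrative choices rather than the paper's exact construction.

```python
# Two-point (two function evaluations) gradient estimate: perturb x along a random
# direction u and rescale the difference quotient by u. This estimates the gradient
# of a smoothed version of f and can be plugged into a stochastic gradient loop.
import numpy as np

def two_point_gradient(f, x, h=1e-4, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    return (f(x + h * u) - f(x)) / h * u
```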

    Learning from Comparisons and Choices

    When tracking user-specific online activities, each user's preference is revealed in the form of choices and comparisons. For example, a user's purchase history is a record of her choices, i.e., which item was chosen from a subset of offerings. A user's preferences can be observed either explicitly, as in movie ratings, or implicitly, as in viewing times of news articles. Given such individualized ordinal data in the form of comparisons and choices, we address the problem of collaboratively learning representations of the users and the items. The learned features can be used to predict a user's preference for an unseen item, for use in recommendation systems. This also allows one to compute similarities among users and items, to be used for categorization and search. Motivated by the empirical successes of the MultiNomial Logit (MNL) model in marketing and transportation, and also more recent successes in word embedding and crowdsourced image embedding, we pose this problem as learning the MNL model parameters that best explain the data. We propose a convex relaxation for learning the MNL model, and show that it is minimax optimal up to a logarithmic factor by comparing its performance to a fundamental lower bound. This characterizes the minimax sample complexity of the problem, and proves that the proposed estimator cannot be improved upon other than by a logarithmic factor. Further, the analysis identifies how the accuracy depends on the topology of sampling via the spectrum of the sampling graph. This provides a guideline for designing surveys when one can choose which items are to be compared. This is accompanied by numerical simulations on synthetic and real data sets, confirming our theoretical predictions. Comment: 77 pages, 12 figures; added new experiments and references. arXiv admin note: substantial text overlap with arXiv:1506.0794.
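
    For concreteness, the MNL choice probability underlying the model is a softmax over the offered subset, as in the sketch below; how the item scores are parameterized and regularized (the paper's convex relaxation) is not shown, and the function names are illustrative.

```python
# MNL choice model: P(choose item i from offered set S) is proportional to
# exp(score_i) over the offered items. Scores are generic item utilities here.
import numpy as np

def mnl_choice_prob(scores, offered, chosen):
    offered = list(offered)
    s = np.asarray(scores, dtype=float)[offered]
    p = np.exp(s - s.max())            # numerically stabilized softmax
    p /= p.sum()
    return p[offered.index(chosen)]
```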

    Solving Non-Convex Non-Differentiable Min-Max Games using Proximal Gradient Method

    Min-max saddle point games appear in a wide range of applications in machine learning and signal processing. Despite their wide applicability, theoretical studies are mostly limited to the special convex-concave structure. While some recent works generalized these results to special smooth non-convex cases, our understanding of non-smooth scenarios is still limited. In this work, we study a special form of non-smooth min-max games in which the objective function is (strongly) convex with respect to one of the players' decision variables. We show that a simple multi-step proximal gradient descent-ascent algorithm converges to an $\epsilon$-first-order Nash equilibrium of the min-max game with a number of gradient evaluations that is polynomial in $1/\epsilon$. We also show that our notion of stationarity is stronger than existing ones in the literature. Finally, we evaluate the performance of the proposed algorithm through an adversarial attack on a LASSO estimator.
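
    A schematic of a multi-step proximal gradient descent-ascent loop reads as follows; the inner-iteration count, step sizes, and proximal operator are placeholders, and this is a sketch of the general pattern rather than the paper's exact algorithm.

```python
# Multi-step (proximal) gradient descent-ascent: several ascent steps on the inner
# player per proximal descent step on the outer player. All constants are placeholders.
import numpy as np

def multistep_prox_gda(grad_x, grad_y, prox_x, x0, y0,
                       lr_x=1e-2, lr_y=1e-1, inner=10, n_iter=200):
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(n_iter):
        for _ in range(inner):                         # approximately solve the inner max
            y = y + lr_y * grad_y(x, y)
        x = prox_x(x - lr_x * grad_x(x, y), lr_x)      # proximal descent step, e.g. soft-thresholding
    return x, y
```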