    Optimization with Sparsity-Inducing Penalties

    Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection, but numerous extensions have since emerged, such as structured sparsity and kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate non-smooth norms. The goal of this paper is to present, from a general perspective, optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted ℓ2-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and provide an extensive set of experiments comparing the various algorithms from a computational point of view.
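
    To make the survey's starting point concrete, here is a minimal sketch of a proximal method for the ℓ1 penalty (the lasso): ISTA alternates a gradient step on the smooth least-squares term with soft-thresholding, the proximal operator of the ℓ1 norm. Function names, the fixed step size, and the iteration count are illustrative choices, not taken from the paper.

        import numpy as np

        def soft_threshold(v, t):
            # Proximal operator of t * ||.||_1: shrinks each coordinate toward zero.
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        def ista(X, y, lam, n_iter=500):
            # Proximal gradient (ISTA) for min_w 0.5*||Xw - y||^2 + lam*||w||_1.
            step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
            w = np.zeros(X.shape[1])
            for _ in range(n_iter):
                grad = X.T @ (X @ w - y)             # gradient of the smooth part
                w = soft_threshold(w - step * grad, step * lam)
            return w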

    Robust Methods for High-Dimensional Linear Learning

    We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting, where the number of features d may exceed the sample size n. We employ, in a generic learning setting, two algorithms depending on whether the considered loss function is gradient-Lipschitz or not. Then, we instantiate our framework on several applications, including vanilla sparse, group-sparse, and low-rank matrix recovery. This leads, for each application, to efficient and robust learning algorithms that reach near-optimal estimation rates under heavy-tailed distributions and in the presence of outliers. For vanilla s-sparsity, we are able to reach the s log(d)/n rate under heavy tails and η-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source Python library called linlearn, by means of which we carry out numerical experiments that confirm our theoretical findings, together with a comparison to other recent approaches proposed in the literature.
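
    As a rough illustration of the kind of robustness at play (not linlearn's actual API, nor the paper's exact estimator), one classical route to heavy-tail and corruption robustness replaces the empirical-mean gradient with a median-of-means estimate, so a few corrupted blocks cannot drag the update arbitrarily far:

        import numpy as np

        def mom_gradient(X, y, w, n_blocks=11):
            # Median-of-means gradient for least squares: split the samples into
            # blocks, average the gradient within each block, then take the
            # coordinate-wise median across blocks.
            idx = np.random.permutation(len(y))
            block_grads = []
            for block in np.array_split(idx, n_blocks):
                r = X[block] @ w - y[block]
                block_grads.append(X[block].T @ r / len(block))
            return np.median(block_grads, axis=0)

        def robust_fit(X, y, step=0.01, n_iter=300):
            # Plain gradient descent driven by the robust gradient estimate.
            w = np.zeros(X.shape[1])
            for _ in range(n_iter):
                w -= step * mom_gradient(X, y, w)
            return w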

    Local learning by partitioning

    In many machine learning applications, data are assumed to be locally simple: examples near each other have similar characteristics, such as class labels or regression responses. Our goal is to exploit this assumption to construct locally simple yet globally complex systems that improve performance or reduce the cost of common machine learning tasks. To this end, we address three main problems: discovering and separating local non-linear structure in high-dimensional data, learning low-complexity local systems to improve performance of risk-based learning tasks, and exploiting local similarity to reduce the test-time cost of learning algorithms. First, we develop a structure-based similarity metric, where low-dimensional non-linear structure is captured by solving a non-linear, low-rank representation problem. We show that this problem can be kernelized, has a closed-form solution, naturally separates independent manifolds, and is robust to noise. Experimental results indicate that incorporating this structural similarity in well-studied problems such as clustering, anomaly detection, and classification improves performance. Next, we address the problem of local learning, where a partitioning function divides the feature space into regions to which independent functions are applied. We focus on local linear classification using linear partitioning and local decision functions. Under an alternating minimization scheme, learning the partitioning functions can be reduced to solving a weighted supervised learning problem. We then present a novel reformulation that yields a globally convex surrogate, allowing for efficient, joint training of the partitioning functions and local classifiers. We then examine the problem of learning under test-time budgets, where acquiring sensors (features) for each example at test time has a cost. Our goal is to partition the space into regions, with only a small subset of sensors needed in each region, reducing the average number of sensors required per example. Starting with a cascade structure and expanding to binary trees, we formulate this problem as an empirical risk minimization and construct an upper-bounding surrogate that allows sequential decision functions to be trained jointly by solving a linear program. Finally, we present preliminary work extending the notion of test-time budgets to the problem of adaptive privacy.
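
    A toy version of the alternating minimization scheme the abstract describes: fit one linear (least-squares) model per region, then reassign each example to the region whose local model fits it best. All names, the ridge term, and the use of squared loss on ±1 labels are illustrative assumptions, not the thesis's formulation.

        import numpy as np

        def local_linear_fit(X, y, k=3, n_rounds=10, seed=0):
            # Alternate between (1) fitting a local linear model per region and
            # (2) reassigning examples to the region with the smallest local loss.
            rng = np.random.default_rng(seed)
            assign = rng.integers(k, size=len(y))       # random initial partition
            Xb = np.hstack([X, np.ones((len(y), 1))])   # append a bias column
            W = np.zeros((k, Xb.shape[1]))
            for _ in range(n_rounds):
                for j in range(k):                      # step 1: local fits
                    mask = assign == j
                    if mask.any():
                        A = Xb[mask]
                        W[j] = np.linalg.solve(A.T @ A + 1e-3 * np.eye(A.shape[1]),
                                               A.T @ y[mask])
                losses = (Xb @ W.T - y[:, None]) ** 2   # step 2: reassignment
                assign = losses.argmin(axis=1)
            return W, assign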

    Theory and Algorithms for Hypothesis Transfer Learning

    The design and analysis of machine learning algorithms typically considers the problem of learning on a single task, and the nature of learning in such a scenario is well explored. On the other hand, the tasks faced by machine learning systems very often arrive sequentially, and it is therefore reasonable to ask whether a better approach can be taken than retraining such systems from scratch given newly available data. Indeed, by analogy with human learning, a novel skill can be acquired more easily whenever the learner has relevant past experience. In response to this observation, the machine learning community has turned its attention to a form of learning known as transfer learning: learning a novel task by leveraging auxiliary information extracted from previous tasks. Tangible progress has been made in both the theory and practice of transfer learning; however, many questions remain to be addressed. In this thesis we focus on an efficient type of transfer learning, known as Hypothesis Transfer Learning (HTL), where auxiliary information is retained in the form of previously induced hypotheses. This is in contrast to the large body of work where one transfers from the data associated with previously encountered tasks. In particular, we theoretically investigate the conditions under which HTL guarantees improved generalization on a novel task given relevant auxiliary (source) hypotheses. We investigate HTL theoretically by considering three scenarios: HTL through regularized least squares with biased regularization, through convex empirical risk minimization, and through stochastic optimization, which also touches on the theory of non-convex transfer learning problems. In addition, we demonstrate the benefits of HTL empirically by proposing two algorithms tailored for real-life situations, with application to visual learning problems: learning a new class in a multi-class classification setting by transferring from known classes, and an efficient greedy HTL algorithm for learning with a large number of source hypotheses. From a theoretical point of view, this thesis consistently identifies the key quantitative characteristics of the relatedness between novel and previous tasks, and makes them explicit in generalization bounds. These findings corroborate many previous works in the transfer learning literature and provide a theoretical basis for the design and analysis of new HTL algorithms.
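
    The first scenario above, regularized least squares with biased regularization, admits a closed form: the regularizer pulls the new weight vector toward the source hypothesis instead of toward zero. A minimal sketch under that standard formulation (variable names are illustrative):

        import numpy as np

        def biased_ridge(X, y, w_src, lam=1.0):
            # Solves min_w ||Xw - y||^2 + lam * ||w - w_src||^2.
            # Setting the gradient to zero gives (X^T X + lam I) w = X^T y + lam w_src.
            d = X.shape[1]
            return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_src)

    As lam grows the solution stays close to the source hypothesis w_src; as lam shrinks it reverts to plain least squares on the target data.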

    Characterizing model uncertainty in ensemble learning


    Constrained Learning And Inference

    Data and learning have become core components of the information processing and autonomous systems on which we increasingly rely to select job applicants, analyze medical data, and drive cars. As these systems become ubiquitous, so does the need to curtail their behavior. Left untethered, they are susceptible to tampering (adversarial examples) and prone to prejudiced and unsafe actions. Currently, the behavior of these systems is tailored by leveraging domain expert knowledge, either to construct models that embed the desired properties or to tune the training objective so as to promote them. While effective, these solutions are often targeted to specific behaviors, contexts, and sometimes even problem instances, and are typically not transferable across models and applications. What is more, the growing scale and complexity of modern information processing and autonomous systems renders this manual behavior tuning infeasible. Already today, explainability, interpretability, and transparency combined with human judgment are no longer enough to design systems that perform according to specifications.

    The present thesis addresses these issues by leveraging constrained statistical optimization. More specifically, it develops the theoretical underpinnings of constrained learning and constrained inference to provide tools for solving statistical problems under requirements. Starting with the task of learning under requirements, it develops a generalization theory of constrained learning akin to the existing unconstrained one. By formalizing the concept of probably approximately correct constrained (PACC) learning, it shows that constrained learning is as hard as its unconstrained counterpart, and it establishes the constrained analog of empirical risk minimization (ERM) as a PACC learner. To overcome the challenges involved in solving such non-convex constrained optimization problems, it derives a dual learning rule that enables constrained learning tasks to be tackled through unconstrained learning problems only. It therefore concludes that if we can deal with classical, unconstrained learning tasks, then we can deal with learning tasks with requirements.

    The second part of this thesis addresses the issue of constrained inference: in particular, performing inference using sparse nonlinear function models, quadratic objectives under combinatorial constraints, and risk constraints. Such models arise in nonlinear line spectrum estimation, functional data analysis, sensor selection, actuator scheduling, experimental design, and risk-aware estimation. Although inference problems assume that models and distributions are known, each of these constraints poses serious challenges that hinder its use in practice. Sparse nonlinear functional models lead to infinite-dimensional, non-convex optimization programs that cannot be discretized without leading to combinatorial, often NP-hard, problems. Rather than using surrogates and relaxations, this work relies on duality to show that, despite their apparent complexity, these models can be fit efficiently, i.e., in polynomial time. And while quadratic objectives are typically tractable (often even in closed form), they lead to non-submodular optimization problems when subject to cardinality or matroid constraints. While submodular functions are sometimes used as surrogates, this work instead shows that quadratic functions are close to submodular and can also be optimized near-optimally.
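
    A schematic sketch of the dual learning rule described above: the constrained problem min over theta of L0(theta) subject to L1(theta) <= c is tackled by alternating an unconstrained (Lagrangian) descent step in theta with a projected ascent step in the multiplier. The losses, gradients, and step sizes below are placeholders, not the thesis's algorithm.

        import numpy as np

        def dual_constrained_fit(grad_L0, grad_L1, L1, c, theta0,
                                 lr=0.01, dual_lr=0.1, n_iter=1000):
            # Primal descent / dual ascent on the Lagrangian L0(theta) + lam*(L1(theta) - c):
            # each primal step is an ordinary unconstrained learning update.
            theta, lam = theta0.copy(), 0.0
            for _ in range(n_iter):
                theta -= lr * (grad_L0(theta) + lam * grad_L1(theta))
                lam = max(0.0, lam + dual_lr * (L1(theta) - c))   # project onto lam >= 0
            return theta, lam
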
    The last chapter of this thesis is dedicated to problems involving risk constraints, in particular, estimation under a bounded predictive mean squared error variance. Despite being non-convex, such problems are equivalent to a quadratically constrained quadratic program from which a closed-form estimator can be extracted. These results are used throughout this thesis to tackle problems in signal processing, machine learning, and control, such as fair learning, robust learning, nonlinear line spectrum estimation, actuator scheduling, experimental design, and risk-aware estimation. Yet they are applicable well beyond these illustrations, e.g., to safe reinforcement learning, sensor selection, multiresolution kernel estimation, and wireless resource allocation, to name a few.
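
    As an illustration of the second part's claim that quadratic objectives under cardinality constraints can be optimized greedily and near-optimally, here is a toy sensor-selection sketch using the classical A-optimality criterion; the criterion, the regularizer eps, and all names are assumptions made for illustration, not the thesis's construction.

        import numpy as np

        def greedy_sensor_selection(A, k, eps=1e-3):
            # Greedily pick k rows (sensors) of A to minimize the A-optimality
            # criterion trace((A_S^T A_S + eps*I)^{-1}) -- a quadratic-type
            # objective under a cardinality constraint.
            d = A.shape[1]
            S = []
            for _ in range(k):
                best, best_val = None, np.inf
                for i in range(len(A)):
                    if i in S:
                        continue
                    M = A[S + [i]]
                    val = np.trace(np.linalg.inv(M.T @ M + eps * np.eye(d)))
                    if val < best_val:
                        best, best_val = i, val
                S.append(best)
            return S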

    Subspace Representations and Learning for Visual Recognition

    Pervasive and affordable sensor and storage technology enables the acquisition of an ever-rising amount of visual data. The ability to extract semantic information by interpreting, indexing, and searching visual data is impacting domains such as surveillance, robotics, intelligence, human-computer interaction, navigation, healthcare, and several others. This further stimulates the investigation of automated extraction techniques that are more efficient and more robust against the many sources of noise affecting the already complex visual data carrying the semantic information of interest. We address the problem by designing novel visual data representations, based on learning data subspace decompositions that are invariant to noise while remaining informative for the task at hand. We use this guiding principle to tackle several visual recognition problems, including detection and recognition of human interactions from surveillance video, face recognition in unconstrained environments, and domain generalization for object recognition.

    By interpreting visual data with a simple additive noise model, we consider the subspaces spanned by the model portion (model subspace) and the noise portion (variation subspace). We observe that decomposing the variation subspace against the model subspace gives rise to the so-called parity subspace. Decomposing the model subspace against the variation subspace instead gives rise to what we name the invariant subspace. We extend the use of kernel techniques to the parity subspace. This enables modeling the highly non-linear temporal trajectories describing human behavior, and performing detection and recognition of human interactions. In addition, we introduce supervised low-rank matrix decomposition techniques for learning the invariant subspace for two other tasks: we learn invariant representations for face recognition from grossly corrupted images, and we learn object recognition classifiers that are invariant to the so-called domain bias.

    Extensive experiments using the benchmark datasets publicly available for each of the three tasks show that learning representations based on subspace decompositions invariant to the sources of noise leads to results comparable to or better than the state of the art.
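
    A minimal sketch of the additive split the abstract builds on: a data matrix is decomposed into a low-rank model component and a residual variation component via truncated SVD. This illustrates only the generic model/variation decomposition, not the authors' parity or invariant subspace constructions; all names are illustrative.

        import numpy as np

        def subspace_split(D, r):
            # Split data matrix D (features x samples) into a rank-r model part
            # spanned by the top-r left singular vectors, plus residual variation.
            U, s, Vt = np.linalg.svd(D, full_matrices=False)
            U_model = U[:, :r]                  # basis of the model subspace
            model = U_model @ (U_model.T @ D)   # projection onto the model subspace
            variation = D - model               # residual / variation component
            return U_model, model, variation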