
    Applications of empirical processes in learning theory: algorithmic stability and generalization bounds

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2006. Includes bibliographical references (p. 141-148). This thesis studies two key properties of learning algorithms: their generalization ability and their stability with respect to perturbations. To analyze these properties, we focus on concentration inequalities and tools from empirical process theory. We obtain theoretical results and demonstrate their applications to machine learning. First, we show how various notions of stability upper- and lower-bound the bias and variance of several estimators of the expected performance for general learning algorithms. A weak stability condition is shown to be equivalent to consistency of empirical risk minimization. The second part of the thesis derives tight performance guarantees for greedy error minimization methods, a family of computationally tractable algorithms. In particular, we derive risk bounds for a greedy mixture density estimation procedure. We prove that, unlike what is suggested in the literature, the number of terms in the mixture does not act as a bias-variance trade-off for the performance. The third part of this thesis provides a solution to an open problem regarding the stability of Empirical Risk Minimization (ERM), an algorithm of central importance in learning theory. By studying the suprema of the empirical process, we prove that ERM over Donsker classes of functions is stable in the L1 norm. Hence, as the number of samples grows, it becomes less and less likely that a perturbation of o(√n) samples will result in a very different empirical minimizer. Asymptotic rates of this stability are proved under metric entropy assumptions on the function class. Through the use of a ratio limit inequality, we also prove stability of expected errors of empirical minimizers. Next, we investigate applications of the stability result. In particular, we focus on procedures that optimize an objective function, such as k-means and other clustering methods. We demonstrate that stability of clustering, just like stability of ERM, is closely related to the geometry of the class and the underlying measure. Furthermore, our result on stability of ERM delineates a phase transition between stability and instability of clustering methods. In the last chapter, we prove a generalization of the bounded-difference concentration inequality for almost-everywhere smooth functions. This result can be utilized to analyze algorithms which are almost always stable. Next, we prove a phase transition in the concentration of almost-everywhere smooth functions. Finally, a tight concentration of empirical errors of empirical minimizers is shown under an assumption on the underlying space. By Alexander Rakhlin. Ph.D.
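
    A schematic restatement of the L1-stability claim, in notation assumed here rather than quoted from the thesis: write S for an i.i.d. sample of size n, S' for a sample obtained from S by changing o(√n) of its points, and f̂_S, f̂_{S'} for corresponding empirical risk minimizers over a class F.

    \[
    \mathcal{F} \text{ is } P\text{-Donsker} \;\Longrightarrow\; \bigl\| \hat{f}_{S} - \hat{f}_{S'} \bigr\|_{L_1(P)} \xrightarrow{\;P\;} 0 \quad \text{as } n \to \infty.
    \]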

    Unsupervised learning in high-dimensional space

    Thesis (Ph.D.)--Boston University. In machine learning, the problem of unsupervised learning is that of trying to explain key features and find hidden structures in unlabeled data. In this thesis we focus on three unsupervised learning scenarios: graph-based clustering with imbalanced data, point-wise anomaly detection, and anomalous cluster detection on graphs. In the first part we study spectral clustering, a popular graph-based clustering technique. We investigate the reason why spectral clustering performs badly on imbalanced and proximal data. We then propose the partition constrained minimum cut (PCut) framework, based on a novel parametric graph construction method, which is shown to adapt to different degrees of imbalance in the data. We analyze the limit cut behavior of our approach, and demonstrate the significant performance improvement through clustering and semi-supervised learning experiments on imbalanced data. [TRUNCATED]
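
    For context, the following is a minimal sketch of the standard spectral clustering baseline that the abstract refers to (k-nearest-neighbor graph, normalized Laplacian, k-means on the leading eigenvectors); it is not the PCut construction proposed in the thesis, and the data, parameters, and library calls are illustrative assumptions.

    import numpy as np
    from sklearn.neighbors import kneighbors_graph
    from sklearn.cluster import KMeans
    from scipy.sparse.csgraph import laplacian

    def spectral_clustering(X, n_clusters=2, n_neighbors=10):
        # Symmetrized k-NN connectivity graph.
        A = kneighbors_graph(X, n_neighbors=n_neighbors, mode='connectivity')
        A = 0.5 * (A + A.T)
        # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.
        L = laplacian(A, normed=True)
        # Eigenvectors of the smallest eigenvalues span the cluster-indicator space.
        vals, vecs = np.linalg.eigh(L.toarray())
        embedding = vecs[:, :n_clusters]
        # Row-normalize and cluster the spectral embedding.
        norms = np.linalg.norm(embedding, axis=1, keepdims=True)
        embedding = embedding / np.maximum(norms, 1e-12)
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)

    # Illustrative imbalanced two-cluster data (assumed, not from the thesis).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(4, 1, (25, 2))])
    labels = spectral_clustering(X, n_clusters=2)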

    Statistical properties of Kernel Principal Component Analysis

    We study the properties of the eigenvalues of Gram matrices in a non-asymptotic setting. Using local Rademacher averages, we provide data-dependent and tight bounds for their convergence towards eigenvalues of the corresponding kernel operator. We perform these computations in a functional analytic framework which allows us to deal implicitly with reproducing kernel Hilbert spaces of infinite dimension. This can have applications to various kernel algorithms, such as Support Vector Machines (SVM). We focus on Kernel Principal Component Analysis (KPCA) and, using such techniques, we obtain sharp excess risk bounds for the reconstruction error. In these bounds, the dependence on the decay of the spectrum and on the closeness of successive eigenvalues is made explicit.
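
    To make the empirical quantities in the abstract concrete, here is a small numerical sketch (assumed setup, not taken from the paper) of the objects involved: the eigenvalues of the centered, normalized Gram matrix of an RBF kernel, which estimate the eigenvalues of the kernel operator, and the empirical reconstruction error of KPCA with d components.

    import numpy as np

    def rbf_gram(X, gamma=1.0):
        # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
        return np.exp(-gamma * d2)

    def kpca_spectrum_and_error(X, d, gamma=1.0):
        n = X.shape[0]
        K = rbf_gram(X, gamma)
        # Center the kernel in feature space.
        H = np.eye(n) - np.ones((n, n)) / n
        Kc = H @ K @ H
        # Eigenvalues of Kc / n estimate the eigenvalues of the kernel operator.
        eigvals = np.linalg.eigvalsh(Kc)[::-1] / n
        # Empirical reconstruction error of KPCA with d components:
        # the spectral mass beyond the first d eigenvalues.
        recon_error = eigvals[d:].sum()
        return eigvals, recon_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                      # illustrative data
    spectrum, err = kpca_spectrum_and_error(X, d=10, gamma=0.5)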

    In Search of Non-Gaussian Components of a High-Dimensional Distribution

    Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the "non-Gaussian subspace" within a very general semi-parametric framework. Our proposed method, called NGCA (Non-Gaussian Component Analysis), is essentially based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector which belongs to the low-dimensional non-Gaussian target subspace up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method. Keywords: non-Gaussian components, dimension reduction.
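
    A minimal sketch of the procedure described above, under assumptions of my own: whitened data, tanh test functions in random directions with closed-form gradients, and the Stein-identity-based vectors collected for a final PCA step. Parameter choices and the returned coordinate convention (whitened coordinates) are illustrative, not the paper's exact estimator.

    import numpy as np

    def ngca(X, n_components, n_functions=200, seed=0):
        rng = np.random.default_rng(seed)
        # Whiten the data (zero mean, identity covariance).
        Xc = X - X.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        evals, evecs = np.linalg.eigh(cov)
        W = evecs @ np.diag(evals**-0.5) @ evecs.T
        Z = Xc @ W
        n, dim = Z.shape
        betas = []
        for _ in range(n_functions):
            w = rng.normal(size=dim)
            w /= np.linalg.norm(w)
            t = Z @ w
            h = np.tanh(t)                            # h(x) = tanh(w.x)
            grad = (1 - np.tanh(t)**2)[:, None] * w   # grad h(x) = (1 - tanh^2(w.x)) w
            # By Stein's identity E[X h(X)] = E[grad h(X)] for Gaussian directions,
            # so the empirical difference points (approximately) into the
            # non-Gaussian subspace.
            beta = (Z * h[:, None]).mean(axis=0) - grad.mean(axis=0)
            betas.append(beta)
        B = np.array(betas)
        # PCA on the collected vectors: top right-singular vectors estimate the subspace.
        _, _, Vt = np.linalg.svd(B, full_matrices=False)
        return Vt[:n_components]   # rows span the estimated subspace, in whitened coordinates

    # Illustrative data: one uniform (non-Gaussian) coordinate plus four Gaussian ones.
    data_rng = np.random.default_rng(1)
    X = np.column_stack([data_rng.uniform(-1, 1, 2000), data_rng.normal(size=(2000, 4))])
    V = ngca(X, n_components=1)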

    A learning method for the approximation of discontinuous functions for stochastic simulations

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 79-83). Surrogate models for computational simulations are inexpensive input-output approximations that allow expensive analyses, such as the forward propagation of uncertainty and Bayesian statistical inference, to be performed efficiently. When a simulation output does not depend smoothly on its inputs, however, most existing surrogate construction methodologies yield large errors and slow convergence rates. This thesis develops a new methodology for approximating simulation outputs that depend discontinuously on input parameters. Our approach focuses on piecewise smooth outputs and involves two stages: first, efficient detection and localization of discontinuities in high-dimensional parameter spaces using polynomial annihilation, support vector machine classification, and uncertainty sampling; second, approximation of the output on each region using Gaussian process regression. The discontinuity detection methodology is illustrated on examples of up to 11 dimensions, including algebraic models and ODE systems, demonstrating improved scaling and efficiency over other methods found in the literature. Finally, the complete surrogate construction approach is demonstrated on two physical models exhibiting canonical discontinuities: shock formation in Burgers' equation and autoignition in hydrogen-oxygen combustion. By Alex Arkady Gorodetsky. S.M.
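
    The two-stage idea (classify the smooth regions, then regress per region) can be sketched as follows. This is a simplified illustration with an assumed test model and region labels obtained by thresholding the output; it omits the polynomial annihilation and uncertainty sampling that the thesis uses for discontinuity detection.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Illustrative piecewise-smooth model with a jump across x0 + x1 = 0 (assumed example).
    def model(X):
        return np.where(X[:, 0] + X[:, 1] > 0, np.sin(X[:, 0]) + 3.0, np.cos(X[:, 1]))

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(300, 2))
    y = model(X)

    # Stage 1: classify the two smooth regions. Here the labels come from thresholding
    # the output itself; in the thesis they come from discontinuity detection.
    region = (y > y.mean()).astype(int)
    clf = SVC(kernel='rbf', C=10.0).fit(X, region)

    # Stage 2: fit one Gaussian process surrogate per region.
    gps = {}
    for r in (0, 1):
        mask = region == r
        gps[r] = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X[mask], y[mask])

    def surrogate(X_new):
        # Route each query point to the GP of its predicted region.
        r = clf.predict(X_new)
        out = np.empty(len(X_new))
        for label, gp in gps.items():
            m = r == label
            if m.any():
                out[m] = gp.predict(X_new[m])
        return out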

    Data-driven optimization and analytics for operations management applications

    Thesis: Ph.D., Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2013. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 163-166). In this thesis, we study data-driven decision making in operations management contexts, with a focus on both theoretical and practical aspects. The first part of the thesis analyzes the well-known newsvendor model, but under the assumption that, even though demand is stochastic, its probability distribution is not part of the input. Instead, the only information available is a set of independent samples drawn from the demand distribution. We analyze the well-known sample average approximation (SAA) approach, and obtain new tight analytical bounds on the accuracy of the SAA solution. Unlike previous work, these bounds match the empirical performance of SAA observed in extensive computational experiments. Our analysis reveals that a distribution's weighted mean spread (WMS) impacts SAA accuracy. Furthermore, we are able to derive a distribution-parameter-free bound on SAA accuracy for log-concave distributions through an innovative optimization-based analysis which minimizes WMS over the distribution family. In the second part of the thesis, we use spread information to introduce new families of demand distributions under the minimax regret framework. We propose order policies that require only a distribution's mean and spread information. These policies have several attractive properties. First, they take the form of simple closed-form expressions. Second, we can quantify an upper bound on the resulting regret. Third, under an environment of high profit margins, they are provably near-optimal under mild technical assumptions on the failure rate of the demand distribution. And finally, the information that they require is easy to estimate with data. We show in extensive numerical simulations that when profit margins are high, even if the information in our policy is estimated from (sometimes few) samples, the policies often manage to capture at least 99% of the optimal expected profit. The third part of the thesis describes both applied and analytical work in collaboration with a large multi-state gas utility. We address a major operational resource allocation problem in which some of the jobs are scheduled and known in advance, and some are unpredictable and have to be addressed as they appear. We employ a novel decomposition approach that solves the problem in two phases. The first is a job scheduling phase, where regular jobs are scheduled over a time horizon. The second is a crew assignment phase, which assigns jobs to maintenance crews under a stochastic number of future emergencies. We propose heuristics for both phases using linear programming relaxation and list scheduling. Using our models, we develop a decision support tool for the utility which is currently being piloted in one of the company's sites. Based on the utility's data, we project that the tool will result in a 55% reduction in overtime hours. By Joline Ann Villaranda Uichanco. Ph.D.
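
    For the first part, the SAA solution of the classical newsvendor problem reduces to an empirical quantile of the demand samples at the critical ratio; the sketch below shows that reduction with illustrative cost parameters and demand data, and does not reproduce the thesis's accuracy bounds.

    import numpy as np

    def saa_newsvendor(demand_samples, underage_cost, overage_cost):
        # Critical ratio of the newsvendor problem.
        ratio = underage_cost / (underage_cost + overage_cost)
        # SAA order quantity = empirical quantile of demand at the critical ratio.
        return np.quantile(demand_samples, ratio)

    rng = np.random.default_rng(0)
    samples = rng.lognormal(mean=3.0, sigma=0.5, size=50)   # illustrative demand samples
    q_saa = saa_newsvendor(samples, underage_cost=4.0, overage_cost=1.0)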

    Learning Non-Parametric and High-Dimensional Distributions via Information-Theoretic Methods

    Learning the distributions that govern the generation of data and estimating related functionals are the foundations of many classical statistical problems. In the following dissertation we investigate such topics when either the hypothesized model is non-parametric or the number of free parameters in the model grows along with the sample size. In particular, we study the above scenarios for the following class of problems, with the goal of obtaining minimax rate-optimal methods for learning the target distributions when the sample size is finite. Our techniques are based on information-theoretic divergences and related mutual-information based methods. (i) Estimation in compound decision and empirical Bayes settings: To estimate the data-generating distribution, one often takes the following two-step approach. In the first step the statistician estimates the distribution of the parameters, either the empirical distribution or the postulated prior, and then in the second step plugs in the estimate to approximate the target of interest. In the literature, the estimation of the empirical distribution is known as the compound decision problem and the estimation of the prior is known as the problem of empirical Bayes. In our work we use the method of minimum-distance estimation for approximating these distributions. Considering certain discrete data setups, we show that the minimum-distance based method provides theoretically and practically sound choices for estimation. The computational and algorithmic aspects of the estimators are also analyzed. (ii) Prediction with Markov chains: Given observations from an unknown Markov chain, we study the problem of predicting the next entry in the trajectory. Existing analyses for such a dependent setup usually center around concentration inequalities that use various extraneous conditions on the mixing properties. This makes it difficult to achieve results independent of such restrictions. We introduce information-theoretic techniques to bypass such issues and obtain fundamental limits for the related minimax problems. We also analyze conditions on the mixing properties that produce a parametric rate of prediction errors.
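
    For item (ii), a standard plug-in baseline for predicting the next entry of a trajectory is the smoothed empirical transition matrix; the sketch below shows that baseline only, with illustrative data, and is not the information-theoretic method analyzed in the dissertation.

    import numpy as np

    def predict_next_state(trajectory, n_states, alpha=1.0):
        # Add-alpha smoothed empirical transition counts from a single trajectory.
        counts = np.full((n_states, n_states), alpha)
        for s, s_next in zip(trajectory[:-1], trajectory[1:]):
            counts[s, s_next] += 1
        probs = counts / counts.sum(axis=1, keepdims=True)
        # Predict the most likely successor of the last observed state.
        return int(np.argmax(probs[trajectory[-1]]))

    trajectory = [0, 1, 1, 2, 0, 1, 2, 2, 0]   # illustrative observations
    next_state = predict_next_state(trajectory, n_states=3)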

    Risk and robust optimization

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 203-213). This thesis develops and explores the connections between risk theory and robust optimization. Specifically, we show that there is a one-to-one correspondence between a class of risk measures known as coherent risk measures and uncertainty sets in robust optimization. An important consequence of this is that one may construct uncertainty sets, which are the critical primitives of robust optimization, using decision-maker risk preferences. In addition, we show some results on the geometry of such uncertainty sets. We also consider a more general class of risk measures known as convex risk measures, and show that these risk measures lead to a more flexible approach to robust optimization. In particular, these models allow one to specify not only the values of the uncertain parameters for which feasibility should be ensured, but also the degree of feasibility. We show that traditional robust optimization models are a special case of this framework. As a result, this framework implies a family of probability guarantees on infeasibility at different levels, as opposed to standard robust approaches, which generally imply a single guarantee. Furthermore, we illustrate the performance of these risk measures on a real-world portfolio optimization application and show promising results: our methodology can, in some cases, yield significant improvements in downside risk protection at little or no expense in expected performance over traditional methods. While we develop this framework for the case of linear optimization under uncertainty, we show how to extend the results to optimization over more general cones. Moreover, our methodology is scenario-based, and we prove a new rate of convergence result for a specific class of convex risk measures. Finally, we consider a multi-stage problem under uncertainty, specifically optimization of quadratic functions over uncertain linear systems. Although the theory of risk measures is still undeveloped with respect to dynamic optimization problems, we show that a set-based model of uncertainty yields a tractable approach to this problem in the presence of constraints. Moreover, we are able to derive a near-closed-form solution for this approach and prove new probability guarantees on its resulting performance. By David Benjamin Brown. Ph.D.
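
    The canonical coherent risk measure used in scenario-based portfolio settings like the one mentioned above is Conditional Value-at-Risk (CVaR). Below is a minimal sketch of computing empirical CVaR of portfolio losses from return scenarios; the data, weights, and confidence level are assumptions, and the thesis's mapping from a risk measure to its uncertainty set is not reproduced here.

    import numpy as np

    def cvar(losses, alpha=0.95):
        # Conditional Value-at-Risk: expected loss in the worst (1 - alpha) tail.
        var = np.quantile(losses, alpha)
        return losses[losses >= var].mean()

    rng = np.random.default_rng(0)
    scenario_returns = rng.normal(0.01, 0.05, size=(1000, 3))   # illustrative return scenarios
    weights = np.array([0.5, 0.3, 0.2])                         # illustrative portfolio weights
    portfolio_losses = -scenario_returns @ weights
    risk = cvar(portfolio_losses, alpha=0.95)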