
    Predictability, Stability, and Computability of Locally Learnt SVMs

    We examine the principles of predictability, stability, and computability in the field of support vector machines. Support vector machines (SVMs), well known in machine learning, play a successful role in classification and regression in many areas of science. Over the past three decades, much research has been conducted on the statistical and computational properties of support vector machines and related kernel methods. On the one hand, consistency (predictability) and robustness (stability) of the method are of interest. On the other hand, from an applied point of view, there is interest in a method that can deal with many observations and many features (computability). Since SVMs require a lot of computing power and storage capacity, various approaches for processing large data sets have been proposed. One of them is called regionalization. It divides the space of explanatory variables into possibly overlapping regions in a data-driven way and defines the prediction function by combining locally learnt support vector machines. Regionalization has a further advantage: if the generating distribution has different characteristics in different regions of the input space, learning only one "global" SVM may lead to an imprecise estimate, whereas locally trained predictors can overcome this problem. It is possible to show that a locally learnt predictor is consistent and robust under assumptions that can be checked by the user of this method.
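
    As an illustration only, the following is a minimal sketch of the regionalization idea, assuming a simple non-overlapping, data-driven partition via k-means (the paper allows more general, possibly overlapping regions and a different way of combining the local predictors). The region count and kernel settings are placeholder choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

class RegionalizedSVM:
    """Illustrative sketch: partition the input space and train one SVM per region.

    NOT the paper's exact construction; it only shows the locally-learnt-SVM idea.
    The sketch assumes every region contains observations from at least two classes.
    """

    def __init__(self, n_regions=5, **svm_params):
        self.n_regions = n_regions
        self.svm_params = svm_params

    def fit(self, X, y):
        # Data-driven partition of the input space (here: k-means clusters).
        self.partition_ = KMeans(n_clusters=self.n_regions, n_init=10).fit(X)
        regions = self.partition_.labels_
        # One local SVM per region, trained only on that region's observations.
        self.local_svms_ = {}
        for r in range(self.n_regions):
            mask = regions == r
            self.local_svms_[r] = SVC(**self.svm_params).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # Route each query point to the SVM of its region.
        regions = self.partition_.predict(X)
        y_pred = np.empty(len(X), dtype=object)
        for r in np.unique(regions):
            mask = regions == r
            y_pred[mask] = self.local_svms_[r].predict(X[mask])
        return y_pred
```

    A call such as RegionalizedSVM(n_regions=5, kernel="rbf").fit(X_train, y_train) would then predict each test point with the SVM of its region; both the region count and the RBF kernel are placeholder assumptions.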

    Gini Covariance Matrix and its Affine Equivariant Version

    Gini's mean difference (GMD) and its derivatives, such as the Gini index, have been widely used as alternative measures of variability for over a century in many research fields, especially in finance, economics, and social welfare. In this dissertation, we generalize the univariate GMD to the multivariate case and propose a new covariance matrix, called the Gini covariance matrix (GCM). The extension is natural, being based on the covariance representation of GMD together with the notion of the multivariate spatial rank function. In order to gain the affine equivariance property for GCM, we utilize the transformation-retransformation (TR) technique and obtain a TR version of GCM that turns out to be a symmetrized M-functional. Indeed, both GCMs are symmetrized approaches based on the difference of two independent variables without reference to a location, hence avoiding an arbitrary definition of location for non-symmetric distributions. We study the properties of both GCMs. They possess the so-called independence property, which is highly important, for example, in independent component analysis. Influence functions of the two GCMs are derived to assess their robustness. They are found to be more robust than the regular covariance matrix but less robust than the Tyler and Dümbgen M-functionals. Under elliptical distributions, the relationship between the scatter parameter and the two GCMs is obtained. With this relationship, principal component analysis (PCA) based on GCM is possible. Estimation of the two GCMs is presented, and we study the asymptotic behavior of the estimators: √n-consistency and asymptotic normality are established. The asymptotic relative efficiency (ARE) of the TR-GCM estimator with respect to the sample covariance matrix is compared to that of the Tyler and Dümbgen M-estimators. With little loss of efficiency (< 2%) in the normal case, it gains high efficiency for heavy-tailed distributions. Finite-sample behavior of the Gini estimators is explored under various models using two criteria. As a by-product, a closely related scatter Kotz functional and its estimator are also studied. The proposed Gini covariance balances well between efficiency and robustness. In applications, we implement the Gini-based PCA on two real data sets from the UCI machine learning repository. Relying on graphical and numerical summaries, Gini-based PCA demonstrates competitive performance.
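
    To fix ideas, here is a minimal sketch of the sample (non affine-equivariant) Gini covariance matrix, assuming the covariance representation mentioned above, Sigma_g = E[(X1 - X2) s(X1 - X2)^T] with spatial sign s(u) = u/||u||, which reduces to Gini's mean difference in one dimension. The affine equivariant TR version (transform, compute, retransform) is not shown.

```python
import numpy as np

def gini_covariance(X):
    """Sample Gini covariance matrix (non affine-equivariant version).

    Assumes the representation  Sigma_g = E[(X1 - X2) s(X1 - X2)^T],
    where s(u) = u / ||u|| is the spatial sign and X1, X2 are i.i.d. copies.
    In one dimension this averages |x_i - x_j|, i.e. Gini's mean difference.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    G = np.zeros((d, d))
    for i in range(n):
        diff = X[i] - X                          # differences to all other points
        norms = np.linalg.norm(diff, axis=1)
        keep = norms > 0                         # skip zero differences (self / exact duplicates)
        signs = diff[keep] / norms[keep, None]   # spatial signs of the differences
        G += diff[keep].T @ signs
    return G / (n * (n - 1))                     # average over ordered pairs i != j
```

    Gini-based PCA, as used in the applications above, would then diagonalize this matrix (or its TR version) in place of the ordinary sample covariance.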

    Trimmed Density Ratio Estimation

    Density ratio estimation is a vital tool in both the machine learning and statistics communities. However, due to the unbounded nature of the density ratio, the estimation procedure can be vulnerable to corrupted data points, which often push the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a convex formulation, and the global optimum can be obtained via subgradient descent. We analyze the parameter estimation error of this estimator under high-dimensional settings. Experiments are conducted to verify the effectiveness of the estimator. (Comment: Made minor revisions; restructured the introductory section.)
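
    The paper's convex trimmed estimator is not reproduced here, but a rough sketch of the trimming idea, using the common probabilistic-classification route to density ratio estimation (not the authors' formulation), may help: points assigned extreme estimated ratios are exactly the ones that push an untrimmed estimator toward infinity, and discarding the most extreme fraction limits their influence. The trim fraction is a placeholder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_trimmed(X_num, X_den, trim_fraction=0.05):
    """Rough illustration of density ratio estimation with trimming.

    Uses the classification route (NOT the paper's convex trimmed estimator):
    a classifier separating numerator from denominator samples gives
    r(x) = p(x)/q(x) via Bayes' rule; the numerator points with the most
    extreme estimated ratios are trimmed and the model is refit.
    """
    def fit_and_score(Xn, Xd):
        X = np.vstack([Xn, Xd])
        y = np.r_[np.ones(len(Xn)), np.zeros(len(Xd))]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        proba = clf.predict_proba(Xn)[:, 1]
        # r(x) = (n_den / n_num) * P(num | x) / P(den | x)
        return (len(Xd) / len(Xn)) * proba / np.clip(1 - proba, 1e-12, None)

    ratios = fit_and_score(X_num, X_den)
    cutoff = np.quantile(ratios, 1 - trim_fraction)
    keep = ratios <= cutoff                  # drop the largest estimated ratios
    return fit_and_score(X_num[keep], X_den), keep
```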

    Distributed Adaptive Huber Regression

    Distributed data naturally arise in scenarios involving multiple sources of observations, each stored at a different location. Directly pooling all the data together is often prohibited due to limited bandwidth and storage, or due to privacy protocols. This paper introduces a new robust distributed algorithm for fitting linear regressions when data are subject to heavy-tailed and/or asymmetric errors with finite second moments. The algorithm only communicates gradient information at each iteration and therefore is communication-efficient. Statistically, the resulting estimator achieves the centralized nonasymptotic error bound as if all the data were pooled together and came from a distribution with sub-Gaussian tails. Under a finite (2+δ)-th moment condition, we derive a Berry-Esseen bound for the distributed estimator, based on which we construct robust confidence intervals. Numerical studies further confirm that, compared with extant distributed methods, the proposed methods achieve near-optimal accuracy with low variability and better coverage with tighter confidence width. (Comment: 29 pages.)
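
    A minimal, single-machine simulation of the gradient-communication pattern may clarify the setup. It assumes a fixed Huber robustification parameter tau and plain averaged-gradient descent with roughly standardized features; the paper's actual algorithm tunes the parameter adaptively and carries the stated statistical guarantees, neither of which is reproduced here.

```python
import numpy as np

def huber_gradient(X, y, beta, tau):
    """Per-machine gradient of the Huber loss for linear regression."""
    r = y - X @ beta                           # residuals on local data
    psi = np.clip(r, -tau, tau)                # Huber score: clipped residual
    return -(X.T @ psi) / len(y)

def distributed_huber_regression(machines, tau=1.345, lr=0.1, n_iter=200):
    """Sketch: each machine sends only its local gradient; the center averages
    them and takes a step. `machines` is a list of (X_k, y_k) pairs.
    tau is a fixed placeholder here (the paper chooses it adaptively).
    """
    d = machines[0][0].shape[1]
    beta = np.zeros(d)
    for _ in range(n_iter):
        grads = [huber_gradient(Xk, yk, beta, tau) for Xk, yk in machines]
        beta -= lr * np.mean(grads, axis=0)    # one round of communication
    return beta
```

    In a real deployment each (X_k, y_k) lives on a different machine, and only the d-dimensional gradient crosses the network in each round.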

    Sever: A Robust Meta-Algorithm for Stochastic Optimization

    In high dimensions, most machine learning methods are brittle to even a small fraction of structured outliers. To address this, we introduce a new meta-algorithm that can take in a base learner such as least squares or stochastic gradient descent, and harden the learner to be resistant to outliers. Our method, Sever, possesses strong theoretical guarantees yet is also highly scalable -- beyond running the base learner itself, it only requires computing the top singular vector of a certain n × d matrix. We apply Sever to a drug design dataset and a spam classification dataset, and find that in both cases it has substantially greater robustness than several baselines. On the spam dataset, with 1% corruptions, we achieved 7.4% test error, compared to 13.4%-20.5% for the baselines, and 3% error on the uncorrupted dataset. Similarly, on the drug design dataset, with 10% corruptions, we achieved 1.42 test mean-squared error, compared to 1.51-2.33 for the baselines, and 1.23 error on the uncorrupted dataset. (Comment: To appear in ICML 2019.)
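
    For intuition, here is a rough sketch of the filtering step the abstract alludes to, assuming a least-squares base learner and a simple "remove the top-scoring fraction" rule (the published algorithm uses a more careful randomized removal and applies to general base learners): fit, form the centered per-point gradients, project them onto their top singular direction, and drop the points with the largest squared projections before refitting.

```python
import numpy as np

def sever_least_squares(X, y, n_rounds=4, trim_fraction=0.02):
    """Sketch of a Sever-style outlier filter around a least-squares learner.

    Each round: fit on the currently kept points, build the n x d matrix of
    centered per-point gradients, take its top right singular vector, score
    every point by its squared projection on that direction, and remove the
    highest-scoring fraction. Illustration only, not the published algorithm.
    """
    keep = np.ones(len(y), dtype=bool)
    for _ in range(n_rounds):
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        # Per-point gradients of the squared loss: g_i = (x_i^T beta - y_i) x_i
        residuals = X[keep] @ beta - y[keep]
        grads = residuals[:, None] * X[keep]
        centered = grads - grads.mean(axis=0)
        # Top right singular vector of the centered n x d gradient matrix.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        scores = (centered @ vt[0]) ** 2
        # Drop the most extreme fraction of the currently kept points.
        threshold = np.quantile(scores, 1 - trim_fraction)
        kept_idx = np.flatnonzero(keep)
        keep[kept_idx[scores > threshold]] = False
    return beta, keep
```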