
Distributed Inference for Linear Support Vector Machine

Author: Chen Xi
Liu Weidong
Wang Xiaozhou
Yang Zhuoyi
Publication date: 20/09/2019

The growing size of modern data brings many new challenges to existing statistical inference methodologies and theories, and calls for the development of distributed inferential approaches. This paper studies distributed inference for the linear support vector machine (SVM) in the binary classification task. Despite a vast literature on SVM, much less is known about the inferential properties of SVM, especially in a distributed setting. In this paper, we propose a multi-round distributed linear-type (MDL) estimator for conducting inference for linear SVM. The proposed estimator is computationally efficient. In particular, it only requires an initial SVM estimator and then successively refines the estimator by solving simple weighted least squares problems. Theoretically, we establish the Bahadur representation of the estimator. Based on the representation, the asymptotic normality is further derived, which shows that the MDL estimator achieves the optimal statistical efficiency, i.e., the same efficiency as the classical linear SVM applied to the entire data set in a single machine setup. Moreover, our asymptotic result avoids the condition on the number of machines or data batches, which is commonly assumed in the distributed estimation literature, and allows the case of diverging dimension. We provide simulation studies to demonstrate the performance of the proposed MDL estimator. Comment: 50 pages, 11 figures

arXiv.org e-Print Archive

Simultaneous Inference for Pairwise Graphical Models with Generalized Score Matching

Author: Gupta Varun
Kolar Mladen
Yu Ming
Publication date: 10/05/2020

Probabilistic graphical models provide a flexible yet parsimonious framework for modeling dependencies among nodes in networks. There is a vast literature on parameter estimation and consistent model selection for graphical models. However, in many applications, scientists are also interested in quantifying the uncertainty associated with the estimated parameters and selected models, which the current literature has not addressed thoroughly. In this paper, we propose a novel estimator for statistical inference on edge parameters in pairwise graphical models based on the generalized Hyvärinen scoring rule. The Hyvärinen scoring rule is especially useful in cases where the normalizing constant cannot be obtained efficiently in closed form, which is a common problem for graphical models, including Ising models and truncated Gaussian graphical models. Our estimator allows us to perform statistical inference for general graphical models, whereas existing works mostly focus on statistical inference for Gaussian graphical models, where finding the normalizing constant is computationally tractable. Under mild conditions that are typically assumed in the literature for consistent estimation, we prove that our proposed estimator is $\sqrt{n}$-consistent and asymptotically normal, which allows us to construct confidence intervals and build hypothesis tests for edge parameters. Moreover, we show how our proposed method can be applied to test hypotheses that involve a large number of model parameters simultaneously. We illustrate the validity of our estimator through extensive simulation studies on a diverse collection of data-generating processes.

arXiv.org e-Print Archive

Quantile Regression Under Memory Constraint

Author: Chen Xi
Liu Weidong
Zhang Yichen
Publication date: 18/10/2018

This paper studies the inference problem in quantile regression (QR) for a large sample size $n$ but under a limited memory constraint, where the memory can only store a small batch of data of size $m$. A natural method is the naïve divide-and-conquer approach, which splits data into batches of size $m$, computes the local QR estimator for each batch, and then aggregates the estimators via averaging. However, this method only works when $n=o(m^2)$ and is computationally expensive. This paper proposes a computationally efficient method, which only requires an initial QR estimator on a small batch of data and then successively refines the estimator via multiple rounds of aggregations. Theoretically, as long as $n$ grows polynomially in $m$, we establish the asymptotic normality for the obtained estimator and show that our estimator with only a few rounds of aggregations achieves the same efficiency as the QR estimator computed on all the data. Moreover, our result allows the case that the dimensionality $p$ goes to infinity. The proposed method can also be applied to address the QR problem under a distributed computing environment (e.g., in a large-scale sensor network) or for real-time streaming data.

arXiv.org e-Print Archive
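
The naïve divide-and-conquer baseline described in the entry above (split into batches of size m, fit a local QR estimator per batch, average) can be sketched as follows. This is a minimal illustration, not the paper's proposed multi-round method, whose details the listing does not give. The function names `qr_irls` and `naive_dc_qr` are hypothetical, and iteratively reweighted least squares is swapped in here as an approximate stand-in for an exact QR solver.

```python
import numpy as np


def qr_irls(X, y, tau=0.5, n_iter=100, eps=1e-6):
    """Approximate quantile regression fit (hypothetical helper).

    Assumption: IRLS is used as a simple stand-in for an exact QR solver.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares starting point
    for _ in range(n_iter):
        r = y - X @ beta
        # The check loss equals w(r) * r^2 with w = (tau or 1 - tau) / |r|;
        # eps guards against division by zero for near-interpolated points.
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), eps)
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta


def naive_dc_qr(X, y, m, tau=0.5):
    """Average local QR estimates over consecutive full batches of size m."""
    n = X.shape[0]
    starts = range(0, n - m + 1, m)  # a trailing partial batch is dropped
    local = [qr_irls(X[s:s + m], y[s:s + m], tau) for s in starts]
    return np.mean(local, axis=0)
```

Only one batch needs to be in memory at a time, which is the point of the memory constraint; the abstract notes that this baseline is only valid when $n=o(m^2)$.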

A dynamic linear model for heteroscedastic LDA under class imbalance

Author: Brusey James
Gaura Elena
Gyamfi Sarfo
Hunt Andrew
Publication venue: Elsevier BV
Publication date: 28/05/2019

The spatial distribution in infinite dimensional spaces and related quantiles and depths

Author: Chakraborty Anirvan
Chaudhuri Probal
Publication venue: Institute of Mathematical Statistics
Publication date: 03/07/2014

The spatial distribution has been widely used to develop various nonparametric procedures for finite dimensional multivariate data. In this paper, we investigate the concept of spatial distribution for data in infinite dimensional Banach spaces. Many technical difficulties are encountered in such spaces that are primarily due to the noncompactness of the closed unit ball. In this work, we prove some Glivenko-Cantelli and Donsker-type results for the empirical spatial distribution process in infinite dimensional spaces. The spatial quantiles in such spaces can be obtained by inverting the spatial distribution function. A Bahadur-type asymptotic linear representation and the associated weak convergence results for the sample spatial quantiles in infinite dimensional spaces are derived. A study of the asymptotic efficiency of the sample spatial median relative to the sample mean is carried out for some standard probability distributions in function spaces. The spatial distribution can be used to define the spatial depth in infinite dimensional Banach spaces, and we study the asymptotic properties of the empirical spatial depth in such spaces. We also demonstrate the spatial quantiles and the spatial depth using some real and simulated functional data. Comment: Published at http://dx.doi.org/10.1214/14-AOS1226 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

arXiv.org e-Print Archive
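
As a rough illustration of the objects named in the entry above, the sketch below computes an empirical spatial distribution and a sample spatial median. It assumes curves discretized on a common grid and the Euclidean norm, which is a simplification of the Banach-space setting in the abstract. The function names are hypothetical, and the Weiszfeld iteration is a standard choice made here, not something the listing specifies.

```python
import numpy as np


def empirical_spatial_distribution(x, sample):
    """Average of unit vectors (x - X_i) / ||x - X_i|| over the sample.

    Assumption: each row of `sample` is one discretized observation and
    the norm is Euclidean.
    """
    diffs = x - sample                      # shape (n, d)
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > 0                        # sign of the zero vector is taken as 0
    signs = np.zeros_like(diffs, dtype=float)
    signs[keep] = diffs[keep] / norms[keep, None]
    return signs.mean(axis=0)


def spatial_median(sample, n_iter=200, eps=1e-10):
    """Sample spatial median via Weiszfeld's fixed-point iteration."""
    x = sample.mean(axis=0)                 # start from the sample mean
    for _ in range(n_iter):
        norms = np.maximum(np.linalg.norm(sample - x, axis=1), eps)
        w = 1.0 / norms
        x = (w[:, None] * sample).sum(axis=0) / w.sum()
    return x
```

The spatial median is the point where the empirical spatial distribution is (approximately) zero, which matches the idea of obtaining spatial quantiles by inverting the spatial distribution function.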

Noncrossing Ordinal Classification

Author: Qiao Xingye
Publication date: 21/12/2015

Ordinal data are often seen in real applications. Regular multicategory classification methods are not designed for this data type, and a more suitable treatment is needed. We consider a framework of ordinal classification which pools the results from binary classifiers together. An inherent difficulty of this framework is that the class prediction can be ambiguous due to boundary crossing. To fix this issue, we propose a noncrossing ordinal classification method which materializes the framework by imposing noncrossing constraints. An asymptotic study of the proposed method is conducted. We show by simulated and data examples that the proposed method can improve the classification performance for ordinal data without the ambiguity caused by boundary crossings. Comment: 32 pages, 9 figures. Accepted for publication in Statistics and Its Interface

arXiv.org e-Print Archive
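
The ambiguity mentioned in the entry above can be made concrete with a small sketch. It assumes one common pooling convention, which the listing does not confirm: binary classifier k answers "is the class greater than k?", and the pooled prediction is one plus the number of positive votes. The function name `pool_binary_decisions` is hypothetical.

```python
import numpy as np


def pool_binary_decisions(scores):
    """Pool K-1 binary decisions into an ordinal class in 1..K.

    scores has shape (n, K-1); scores[i, j] > 0 means the (j+1)-th
    classifier votes "class > j+1" for point i (assumed convention).
    Returns the pooled prediction and a flag for ambiguous points.
    """
    votes = scores > 0
    pred = 1 + votes.sum(axis=1)
    # Consistent votes look like [True, ..., True, False, ..., False].
    # A False followed by a True means the boundaries crossed for that
    # point, so the pooled prediction is ambiguous.
    ambiguous = np.any(~votes[:, :-1] & votes[:, 1:], axis=1)
    return pred, ambiguous
```

For example, a point with votes "not above 1" but "above 2" is flagged as ambiguous. The noncrossing constraints in the abstract are meant to rule such points out; how they are imposed is not described in the listing, so the sketch only detects the problem.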

From the Support Vector Machine to the Bounded Constraint Machine

Author: Seo Young Park
Yufeng Liu

The Support Vector Machine (SVM) has been successfully applied to classification problems in many different fields. It was originally proposed using the idea of searching for the maximum separation hyperplane. In this article, in contrast to the criterion of maximum separation, we explore alternative searching criteria which result in a new method, the Bounded Constraint Machine (BCM). Properties and performance of the BCM are explored. To connect the BCM with the SVM, we investigate the Balancing Support Vector Machine (BSVM), which can be viewed as a bridge from the SVM to the BCM. The BCM is shown to be an extreme case of the BSVM. Theoretical properties such as Fisher consistency and asymptotic distributions for coefficients are derived, and the entire solution path of the BSVM is developed. Our numerical results demonstrate how the BSVM and the BCM perform compared to the SVM.

CiteSeerX


Adaptive Huber Regression.

Author: Fan Jianqing
Sun Qiang
Zhou Wen-Xin
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for an optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded (1 + δ)-th moment for any δ > 0. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when δ ≥ 1, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime 0 < δ < 1, and the transition is smooth and optimal. In addition, we extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive.

eScholarship - University of California
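
A minimal sketch of Huber regression with an explicit robustification parameter, related to the entry above. The listing says this parameter should adapt to the sample size, dimension and moments but gives no formula, so `tau` is left as a caller-supplied value here and any particular choice is an assumption. The function names and the IRLS solver are illustrative, not the authors' implementation.

```python
import numpy as np


def huber_loss(r, tau):
    """Huber loss: quadratic for |r| <= tau, linear beyond that."""
    a = np.abs(r)
    return np.where(a <= tau, 0.5 * r**2, tau * a - 0.5 * tau**2)


def huber_regression(X, y, tau, n_iter=100, eps=1e-12):
    """Fit linear regression under the Huber loss by IRLS.

    tau is the robustification parameter; how it should scale with the
    data is not specified in the listing, so the caller must choose it.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares start
    for _ in range(n_iter):
        r = y - X @ beta
        # Weight 1 inside the quadratic zone, tau / |r| outside it.
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), eps))
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta
```

A small `tau` downweights large residuals more aggressively (more robustness, more bias), while a large `tau` approaches ordinary least squares; this is the tradeoff the abstract refers to.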

Logistic regression models to predict solvent accessible residues using sequence- and homology-based qualitative and quantitative descriptors applied to a domain-complete X-ray structure learning set

Author: Bhogal Guneet
Chung Edwin
Gottlieb Andrea
Kamath Thejas
Kantardjieff Katherine
Lustig Brooke
Nedunuri Amulya
Nepal Reecha
Poelman Thomas
Spencer Joanna
Publication venue: SJSU ScholarWorks
Publication date: 10/11/2015

A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models with various qualitative descriptors that include amino acid type and quantitative descriptors that include 20- and six-term sequence entropy have been built and validated. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating buried or accessible solvent, with binary classifications obtained from the RSA values. The fitted models determine binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods using the standard RSA threshold criteria of 20% and 25% as solvent accessible. When an additional non-homology descriptor describing Lobanov–Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved, with 25% threshold accuracies of 76.12% and 74.45% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.

SJSU ScholarWorks
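
The workflow in the entry above (dichotomize RSA at a threshold, then fit a logistic regression on the binary response) can be sketched as below. The descriptor matrix `X` is a placeholder for whatever features are available; the entry's actual descriptors and software are not reproduced. The function names are hypothetical, and treating values at or above the threshold as accessible is an assumption about the boundary case.

```python
import numpy as np


def dichotomize_rsa(rsa_percent, threshold=25.0):
    """1.0 = solvent accessible, 0.0 = buried.

    Assumption: values equal to the threshold count as accessible.
    """
    return (np.asarray(rsa_percent) >= threshold).astype(float)


def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Logistic regression by Newton's method with a tiny ridge term
    (added here only for numerical stability)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p) - ridge * beta
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def predict_accessible(X, beta):
    """Predict accessible when the fitted probability exceeds 0.5."""
    return (X @ beta > 0).astype(float)
```

Switching `threshold` between 20.0 and 25.0 reproduces the two labeling criteria mentioned in the abstract; accuracy is then the fraction of residues whose predicted label matches the dichotomized RSA.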

Quantile regression approach to conditional mode estimation

Author: Hara Satoshi
Kato Kengo
Ota Hirofumi
Publication date: 29/07/2019

In this paper, we consider estimation of the conditional mode of an outcome variable given regressors. To this end, we propose and analyze a computationally scalable estimator derived from a linear quantile regression model and develop asymptotic distributional theory for the estimator. Specifically, we find that the pointwise limiting distribution is a scale transformation of Chernoff's distribution despite the presence of regressors. In addition, we consider analytical and subsampling-based confidence intervals for the proposed estimator. We also conduct Monte Carlo simulations to assess the finite sample performance of the proposed estimator together with the analytical and subsampling confidence intervals. Finally, we apply the proposed estimator to predicting the net hourly electrical energy output using Combined Cycle Power Plant Data. Comment: This paper supersedes "On estimation of conditional modes using multiple quantile regressions" (Hirofumi Ohta and Satoshi Hara, arXiv:1712.08754)

arXiv.org e-Print Archive