Distributed Inference for Linear Support Vector Machine
The growing size of modern data brings many new challenges to existing
statistical inference methodologies and theories, and calls for the development
of distributed inferential approaches. This paper studies distributed inference
for linear support vector machine (SVM) for the binary classification task.
Despite a vast literature on SVM, much less is known about the inferential
properties of SVM, especially in a distributed setting. In this paper, we
propose a multi-round distributed linear-type (MDL) estimator for conducting
inference for linear SVM. The proposed estimator is computationally efficient.
In particular, it only requires an initial SVM estimator and then successively
refines the estimator by solving simple weighted least squares problems.
Theoretically, we establish the Bahadur representation of the estimator. Based
on the representation, the asymptotic normality is further derived, which shows
that the MDL estimator achieves the optimal statistical efficiency, i.e., the
same efficiency as the classical linear SVM applied to the entire data set in
a single machine setup. Moreover, our asymptotic result avoids the condition on
the number of machines or data batches, which is commonly assumed in
distributed estimation literature, and allows the case of diverging dimension.
We provide simulation studies to demonstrate the performance of the proposed
MDL estimator.
Comment: 50 pages, 11 figures
Simultaneous Inference for Pairwise Graphical Models with Generalized Score Matching
Probabilistic graphical models provide a flexible yet parsimonious framework
for modeling dependencies among nodes in networks. There is a vast literature
on parameter estimation and consistent model selection for graphical models.
However, in many applications, scientists are also interested in
quantifying the uncertainty associated with the estimated parameters and
selected models, which the current literature has not addressed thoroughly. In this
paper, we propose a novel estimator for statistical inference on edge
parameters in pairwise graphical models based on the generalized Hyvärinen
scoring rule. The Hyvärinen scoring rule is especially useful in cases where the
normalizing constant cannot be obtained efficiently in a closed form, which is
a common problem for graphical models, including Ising models and truncated
Gaussian graphical models. Our estimator allows us to perform statistical
inference for general graphical models, whereas existing works mostly focus
on statistical inference for Gaussian graphical models, where finding the
normalizing constant is computationally tractable. Under mild conditions that
are typically assumed in the literature for consistent estimation, we prove
that our proposed estimator is $\sqrt{n}$-consistent and asymptotically normal,
which allows us to construct confidence intervals and build hypothesis tests
for edge parameters. Moreover, we show how our proposed method can be applied
to test hypotheses that involve a large number of model parameters
simultaneously. We illustrate the validity of our estimator through extensive
simulation studies on a diverse collection of data-generating processes.
Quantile Regression Under Memory Constraint
This paper studies the inference problem in quantile regression (QR) for a
large sample size $n$ but under a limited memory constraint, where the memory
can only store a small batch of data of size $m$. A natural method is the
naïve divide-and-conquer approach, which splits data into batches of size
$m$, computes the local QR estimator for each batch, and then aggregates the
estimators via averaging. However, this method only works when $n = o(m^2)$ and
is computationally expensive. This paper proposes a computationally efficient
method, which only requires an initial QR estimator on a small batch of data
and then successively refines the estimator via multiple rounds of
aggregations. Theoretically, as long as $n$ grows polynomially in $m$, we
establish the asymptotic normality for the obtained estimator and show that our
estimator with only a few rounds of aggregations achieves the same efficiency
as the QR estimator computed on all the data. Moreover, our result allows the
case that the dimensionality $p$ goes to infinity. The proposed method can also
be applied to address the QR problem in a distributed computing environment
(e.g., in a large-scale sensor network) or for real-time streaming data.
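The naive divide-and-conquer baseline described in this abstract can be sketched as follows. This is only an illustration under a strong simplifying assumption: the model is intercept-only, so each local QR estimator reduces to a sample quantile. The function names, the batch size and the simulated data are hypothetical, and the sketch does not implement the multi-round refinement the paper proposes.

```python
import math
import random
import statistics


def sample_quantile(values, tau):
    """Empirical tau-quantile, i.e. a minimizer of the check loss with intercept only."""
    ordered = sorted(values)
    # Smallest order statistic whose empirical CDF reaches tau.
    index = max(0, min(len(ordered) - 1, math.ceil(tau * len(ordered)) - 1))
    return ordered[index]


def naive_divide_and_conquer(values, batch_size, tau):
    """Split into consecutive batches, estimate locally, then average the estimates."""
    batches = [values[i:i + batch_size] for i in range(0, len(values), batch_size)]
    local_estimates = [sample_quantile(batch, tau) for batch in batches]
    return statistics.fmean(local_estimates)


# Simulated data (an assumption for illustration): 20000 standard normal draws.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]

dc_estimate = naive_divide_and_conquer(data, batch_size=500, tau=0.5)
full_estimate = sample_quantile(data, 0.5)
```

Only one batch of 500 observations needs to be held in memory at a time, which is the point of the approach; `full_estimate` is computed here purely for comparison.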
The spatial distribution in infinite dimensional spaces and related quantiles and depths
The spatial distribution has been widely used to develop various
nonparametric procedures for finite dimensional multivariate data. In this
paper, we investigate the concept of spatial distribution for data in infinite
dimensional Banach spaces. Many technical difficulties are encountered in such
spaces that are primarily due to the noncompactness of the closed unit ball. In
this work, we prove some Glivenko-Cantelli and Donsker-type results for the
empirical spatial distribution process in infinite dimensional spaces. The
spatial quantiles in such spaces can be obtained by inverting the spatial
distribution function. A Bahadur-type asymptotic linear representation and the
associated weak convergence results for the sample spatial quantiles in
infinite dimensional spaces are derived. A study of the asymptotic efficiency
of the sample spatial median relative to the sample mean is carried out for
some standard probability distributions in function spaces. The spatial
distribution can be used to define the spatial depth in infinite dimensional
Banach spaces, and we study the asymptotic properties of the empirical spatial
depth in such spaces. We also demonstrate the spatial quantiles and the spatial
depth using some real and simulated functional data.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) at
http://dx.doi.org/10.1214/14-AOS1226 by the Institute of Mathematical
Statistics (http://www.imstat.org)
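For intuition about the sample spatial median mentioned in this abstract, here is a minimal finite-dimensional sketch. It assumes the data are vectors (for example, curves discretized on a common grid) and uses the Weiszfeld iteration, which is a standard technique chosen here for illustration and is not named in the abstract. The function name and the toy data are hypothetical.

```python
import math


def spatial_median(points, iterations=200, eps=1e-12):
    """Approximate the minimizer of the summed Euclidean distances to the points."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    estimate = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iterations):
        # Weiszfeld step: reweight each point by the inverse of its distance.
        weights = [1.0 / max(math.dist(p, estimate), eps) for p in points]
        total = sum(weights)
        estimate = [
            sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(dim)
        ]
    return estimate


# Toy data (an assumption): four corners of the unit square plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
median_point = spatial_median(points)
mean_point = [sum(p[j] for p in points) / len(points) for j in range(2)]
```

In this toy example the distant point pulls the mean far outside the unit square, while the spatial median stays inside it.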
Noncrossing Ordinal Classification
Ordinal data are often seen in real applications. Regular multicategory
classification methods are not designed for this data type and a more proper
treatment is needed. We consider a framework of ordinal classification which
pools the results from binary classifiers together. An inherent difficulty of
this framework is that the class prediction can be ambiguous due to boundary
crossing. To fix this issue, we propose a noncrossing ordinal classification
method which materializes the framework by imposing noncrossing constraints. An
asymptotic study of the proposed method is conducted. We show by simulated and
real data examples that the proposed method can improve the classification
performance for ordinal data without the ambiguity caused by boundary
crossings.
Comment: 32 pages, 9 figures. Accepted for Publication in Statistics and Its
Interface
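The ambiguity caused by boundary crossing can be illustrated with a small sketch. It assumes a common form of the pooling framework in which the k-th binary classifier answers "is the label above k?" through a linear score; the weights and intercepts are made-up values. The parallel boundaries used for the noncrossing case are just one simple way to avoid crossings, since the abstract does not spell out the paper's constraints.

```python
def binary_votes(x, classifiers):
    """Evaluate each 'is the label above k?' classifier at x (True means above k)."""
    votes = []
    for weights, intercept in classifiers:
        score = sum(w * v for w, v in zip(weights, x)) + intercept
        votes.append(score > 0)
    return votes


def pooled_prediction(votes):
    """Return a class in 1..K when the votes are consistent, otherwise None."""
    # Consistent votes look like True, ..., True, False, ..., False.
    for earlier, later in zip(votes, votes[1:]):
        if later and not earlier:
            return None  # "not above k" but "above k + 1": ambiguous
    return 1 + sum(votes)


# Hypothetical linear classifiers for K = 3 classes, given as (weights, intercept).
crossing = [((1.0, 0.0), 0.0), ((1.0, 1.0), -1.0)]  # boundaries intersect
parallel = [((1.0, 0.0), 0.0), ((1.0, 0.0), -1.0)]  # boundaries never cross

x = (-1.0, 3.0)
ambiguous = pooled_prediction(binary_votes(x, crossing))
resolved = pooled_prediction(binary_votes(x, parallel))
```

At the point `x`, the crossing boundaries give contradictory votes, so no class can be assigned, whereas the parallel boundaries always yield a well-defined class.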
From the Support Vector Machine to the Bounded Constraint Machine
The Support Vector Machine (SVM) has been successfully applied for classification problems in many different fields. It was originally proposed using the idea of searching for the maximum separation hyperplane. In this article, in contrast to the criterion of maximum separation, we explore alternative searching criteria which result in the new method, the Bounded Constraint Machine (BCM). Properties and performance of the BCM are explored. To connect the BCM with the SVM, we investigate the Balancing Support Vector Machine (BSVM), which can be viewed as a bridge from the SVM to the BCM. The BCM is shown to be an extreme case of the BSVM. Theoretical properties such as Fisher consistency and asymptotic distributions for coefficients are derived, and the entire solution path of the BSVM is developed. Our numerical results demonstrate how the BSVM and the BCM work compared to the SVM.
Adaptive Huber Regression
Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded (1 + δ)-th moment for any δ > 0. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when δ ≥ 1, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime 0 < δ < 1 and the transition is smooth and optimal. In addition, we extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive.
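The idea of a robustification parameter that adapts to the sample size can be sketched in the simplest setting, estimating a location parameter under the Huber loss. This is an illustration under stated assumptions, not the paper's procedure: the rule `scale * sqrt(n / log n)` in `adaptive_tau` is a hypothetical choice used only to make the parameter grow with n, the scale is a MAD-based guess, and a single gross outlier stands in for heavy-tailedness.

```python
import math
import random
import statistics


def huber_location(values, tau, iterations=100):
    """Minimize the Huber loss over a location parameter by reweighted averaging."""
    estimate = statistics.median(values)
    for _ in range(iterations):
        # Residuals within tau keep weight 1; larger ones are down-weighted.
        weights = [
            1.0 if abs(v - estimate) <= tau else tau / abs(v - estimate)
            for v in values
        ]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


def adaptive_tau(values):
    """Hypothetical rule: robust scale times sqrt(n / log n), growing with n."""
    n = len(values)
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    scale = 1.4826 * mad  # MAD rescaled to match the standard deviation under normality
    return scale * math.sqrt(n / math.log(n))


# Simulated data (an assumption): 200 standard normal draws plus one gross outlier.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [1e6]

robust_estimate = huber_location(data, adaptive_tau(data))
plain_mean = statistics.fmean(data)
```

The sample mean is dragged into the thousands by the outlier, while the Huber estimate stays close to the center of the bulk of the data.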
Logistic regression models to predict solvent accessible residues using sequence- and homology-based qualitative and quantitative descriptors applied to a domain-complete X-ray structure learning set
A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models with various qualitative descriptors that include amino acid type and quantitative descriptors that include 20- and six-term sequence entropy have been built and validated. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating buried or accessible solvent, with binary classifications obtained from the RSA values. The fitted models determine binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods using the standard RSA threshold criteria 20 and 25% as solvent accessible. When an additional non-homology descriptor describing Lobanov–Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved with 25% threshold accuracies of 76.12 and 74.45% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.
Quantile regression approach to conditional mode estimation
In this paper, we consider estimation of the conditional mode of an outcome
variable given regressors. To this end, we propose and analyze a
computationally scalable estimator derived from a linear quantile regression
model and develop asymptotic distributional theory for the estimator.
Specifically, we find that the pointwise limiting distribution is a scale
transformation of Chernoff's distribution despite the presence of regressors.
In addition, we consider analytical and subsampling-based confidence intervals
for the proposed estimator. We also conduct Monte Carlo simulations to assess
the finite sample performance of the proposed estimator together with the
analytical and subsampling confidence intervals. Finally, we apply the proposed
estimator to predicting the net hourly electrical energy output using Combined
Cycle Power Plant Data.
Comment: This paper supersedes "On estimation of conditional modes using
multiple quantile regressions" (Hirofumi Ohta and Satoshi Hara,
arXiv:1712.08754)