A Statistical Perspective on Algorithmic Leveraging
One popular method for dealing with large-scale data sets is sampling. For
example, by using the empirical statistical leverage scores as an importance
sampling distribution, the method of algorithmic leveraging samples and
rescales rows/columns of data matrices to reduce the data size before
performing computations on the subproblem. This method has been successful in
improving computational efficiency of algorithms for matrix problems such as
least-squares approximation, least absolute deviations approximation, and
low-rank matrix approximation. Existing work has focused on algorithmic issues
such as worst-case running times and numerical issues associated with providing
high-quality implementations, but none of it addresses statistical aspects of
this method.
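For concreteness, here is a minimal sketch of leverage-based row sampling for least squares, assuming exact leverage scores computed from a thin QR factorization; the function names and the sample size r are illustrative, not taken from the paper:

```python
import numpy as np

def leverage_scores(X):
    # Leverage scores are the squared row norms of an orthonormal basis
    # for the column space of X, here from a thin QR factorization.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def leveraged_least_squares(X, y, r, seed=None):
    # Sample r rows with probability proportional to leverage, rescale
    # them for unbiasedness, and solve the smaller least-squares problem.
    rng = np.random.default_rng(seed)
    p = leverage_scores(X)
    p = p / p.sum()
    idx = rng.choice(X.shape[0], size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])            # importance-sampling rescaling
    beta, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)
    return beta
```

Uniform sampling corresponds to replacing p with the constant vector 1/n, which is exactly the alternative that the statistical comparison below concerns.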
In this paper, we provide a simple yet effective framework to evaluate the
statistical properties of algorithmic leveraging in the context of estimating
parameters in a linear regression model with a fixed number of predictors. We
show that from the statistical perspective of bias and variance, neither
leverage-based sampling nor uniform sampling dominates the other. This result
is particularly striking given that, from the algorithmic perspective of
worst-case analysis, leverage-based sampling is known to provide uniformly
superior guarantees compared with uniform sampling. Based on these
theoretical results, we propose and analyze
two new leveraging algorithms. A detailed empirical evaluation of existing
leverage-based methods as well as these two new methods is carried out on both
synthetic and real data sets. The empirical results indicate that our theory is
a good predictor of practical performance of existing and new leverage-based
algorithms and that the new algorithms achieve improved performance.
Improved model identification for nonlinear systems using a random subsampling and multifold modelling (RSMM) approach
In nonlinear system identification, the available observed data are conventionally partitioned into two parts: the training data that are used for model identification and the test data that are used for model performance testing. This sort of ‘hold-out’ or ‘split-sample’ data partitioning
method is convenient and the associated model identification procedure is in general easy to implement. The resultant model obtained from such a once-partitioned single training dataset, however, may occasionally lack robustness and generalisation to represent future unseen data, because the performance of the identified model may be highly dependent on how the data partition is made. To
overcome the drawback of the hold-out data partitioning method, this study presents a new random subsampling and multifold modelling (RSMM) approach to produce less biased or preferably unbiased models. The basic idea and the associated procedure are as follows. Firstly, generate K training datasets (and also K validation datasets), using a K-fold random subsampling method. Secondly, detect
significant model terms and identify a common model structure that fits all the K datasets using a newly proposed common model selection approach, called the multiple orthogonal search algorithm. Finally, estimate and refine the model parameters for the identified common-structured model using a multifold parameter estimation method. The proposed method can produce robust models with better generalisation performance.
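As a rough illustration of the first and third steps, here is a sketch assuming a linear-in-parameters model, a hypothetical 70/30 split ratio, and plain averaging of per-fold estimates; the multiple orthogonal search algorithm for selecting a common model structure is not reproduced:

```python
import numpy as np

def random_subsampling(n, K, train_frac=0.7, seed=None):
    # Step 1: generate K independent random train/validation partitions,
    # instead of a single hold-out split.
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(K):
        perm = rng.permutation(n)
        cut = int(train_frac * n)
        splits.append((perm[:cut], perm[cut:]))
    return splits

def multifold_estimate(X, y, K=10, seed=None):
    # Step 3 (simplified): fit the common model structure on each training
    # subset and combine the K parameter vectors, here by averaging.
    thetas = []
    for train, _valid in random_subsampling(len(y), K, seed=seed):
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        thetas.append(theta)
    return np.mean(thetas, axis=0)
```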
Random Projections For Large-Scale Regression
Fitting linear regression models can be computationally very expensive in
large-scale data analysis tasks if the sample size and the number of variables
are both large. Random projections are extensively used as a dimension
reduction tool in machine learning and statistics. We discuss the applications
of random projections to linear regression problems, developed to decrease
computational costs, and give an overview of theoretical guarantees on the
generalization error. It can be shown that combining random projections with
least squares regression leads to recovery similar to that of ridge
regression and principal component regression. We also discuss possible
improvements when averaging over multiple random projections, an approach that
lends itself easily to parallel implementation.
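A rough sketch of the scheme discussed above, assuming compressed least squares with a Gaussian projection of the p predictors down to a user-chosen dimension d, and simple averaging over m independent projections:

```python
import numpy as np

def compressed_ls(X, y, d, seed=None):
    # Project the p predictors down to d << p with a Gaussian random matrix,
    # solve least squares in the compressed space, and map the coefficients
    # back to the original p-dimensional space.
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)
    gamma, *_ = np.linalg.lstsq(X @ Phi, y, rcond=None)
    return Phi @ gamma

def averaged_compressed_ls(X, y, d, m, seed=None):
    # Average the estimator over m independent projections.
    rng = np.random.default_rng(seed)
    seeds = rng.integers(0, 2**32, size=m)
    return np.mean([compressed_ls(X, y, d, s) for s in seeds], axis=0)
```

Because each projection is drawn independently, the m fits in averaged_compressed_ls can run in parallel, which is the parallel implementation noted above.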
Less is More: Nyström Computational Regularization
We study Nyström-type subsampling approaches to large-scale kernel methods,
and prove learning bounds in the statistical learning setting, where random
sampling and high-probability estimates are considered. In particular, we prove
that these approaches can achieve optimal learning bounds, provided the
subsampling level is suitably chosen. These results suggest a simple
incremental variant of Nyström Kernel Regularized Least Squares, where the
subsampling level implements a form of computational regularization, in the
sense that it controls at the same time regularization and computations.
Extensive experimental analysis shows that the considered approach achieves
state-of-the-art performance on benchmark large-scale datasets.
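A minimal sketch of plain Nyström kernel regularized least squares with uniform subsampling, assuming a Gaussian kernel; the subsampling level m plays the regularization role described above, and the incremental variant is not shown:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def nystrom_krls(X, y, m, lam, sigma=1.0, seed=None):
    # Uniformly subsample m landmark points; m is the subsampling level that
    # acts as computational regularization alongside the penalty lam.
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(X.shape[0], size=m, replace=False)]
    Knm = gaussian_kernel(X, landmarks, sigma)            # n x m
    Kmm = gaussian_kernel(landmarks, landmarks, sigma)    # m x m
    # Solve the Nystrom-regularized system for the m coefficients
    # (well posed for lam > 0).
    alpha = np.linalg.solve(Knm.T @ Knm + X.shape[0] * lam * Kmm, Knm.T @ y)
    return lambda Xnew: gaussian_kernel(Xnew, landmarks, sigma) @ alpha
```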
Random projections for Bayesian regression
This article deals with random projections applied as a data reduction
technique for Bayesian regression analysis. We show sufficient conditions under
which the entire $d$-dimensional distribution is approximately preserved under
random projections by reducing the number of data points from $n$ to
$k \in O(\operatorname{poly}(d/\varepsilon))$ in the case $n \gg d$. Under mild
assumptions, we prove that evaluating a Gaussian likelihood function based on
the projected data instead of the original data yields a
$(1+O(\varepsilon))$-approximation in terms of the $\ell_2$-Wasserstein
distance. Our main result shows that the posterior distribution of Bayesian
linear regression is approximated up to a small error depending on only an
$\varepsilon$-fraction of its defining parameters. This holds when using
arbitrary Gaussian priors or the degenerate case of uniform distributions over
$\mathbb{R}^d$ for $\beta$. Our empirical evaluations involve different
simulated settings of Bayesian linear regression. Our experiments underline
that the proposed method is able to recover the regression model up to small
error while considerably reducing the total running time.
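A rough sketch of the idea under simplifying assumptions: a plain Gaussian sketching matrix rather than the specific embeddings analysed in the paper, and a conjugate normal prior so that the sketched posterior has closed form:

```python
import numpy as np

def sketched_gaussian_posterior(X, y, k, sigma2=1.0, tau2=1.0, seed=None):
    # Reduce the n data points to k sketched rows, then form the usual
    # conjugate Gaussian posterior from the sketch instead of the full data.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = rng.standard_normal((k, n)) / np.sqrt(k)   # Gaussian sketching matrix
    SX, Sy = S @ X, S @ y
    # Posterior of the coefficients under the prior N(0, tau2 * I)
    # and observation noise variance sigma2.
    cov = np.linalg.inv(SX.T @ SX / sigma2 + np.eye(d) / tau2)
    mean = cov @ SX.T @ Sy / sigma2
    return mean, cov
```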