Maximin effects in inhomogeneous large-scale data
Large-scale data are often characterized by some degree of inhomogeneity as
data are either recorded in different time regimes or taken from multiple
sources. We look at regression models and the effect of randomly changing
coefficients, where the change is either smoothly in time or some other
dimension or even without any such structure. Fitting varying-coefficient
models or mixture models can be appropriate solutions but are computationally
very demanding and often return more information than necessary. If we just ask
for a model estimator that shows good predictive properties for all regimes of
the data, then we are aiming for a simple linear model that is reliable for all
possible subsets of the data. We propose the concept of "maximin effects" and a
suitable estimator and look at its prediction accuracy from a theoretical point
of view in a mixture model with known or unknown group structure. Under certain
circumstances the estimator can be computed orders of magnitudes faster than
standard penalized regression estimators, making computations on large-scale
data feasible. Empirical examples complement the novel methodology and theory.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1325 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
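The maximin idea above can be illustrated with a small sketch: on toy two-regime data (all settings below are hypothetical, not from the paper), score candidate coefficient vectors by their worst-case explained variance across groups, V_g(beta) = 2 beta'X_g'y_g/n - beta'(X_g'X_g/n)beta, and keep the best. Here the candidate set is simply the per-group OLS fits plus the zero vector; the paper's actual estimator optimizes over all of R^p.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inhomogeneous data: two regimes whose regression coefficients differ
# but share a common component (hypothetical values for illustration).
n, p = 200, 3
betas_true = [np.array([2.0, 1.0, 0.0]), np.array([2.0, -0.5, 0.0])]
groups = []
for b in betas_true:
    X = rng.normal(size=(n, p))
    y = X @ b + 0.1 * rng.normal(size=n)
    groups.append((X, y))

def explained_variance(beta, X, y):
    """V_g(beta) = 2 beta'X'y/n - beta'(X'X/n)beta, the maximin criterion."""
    m = len(y)
    return 2 * beta @ (X.T @ y) / m - beta @ (X.T @ X / m) @ beta

# Candidate estimators: per-group OLS fits plus the zero vector.
candidates = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in groups]
candidates.append(np.zeros(p))

# Maximin choice: candidate maximizing the worst-case explained variance.
scores = [min(explained_variance(b, X, y) for X, y in groups)
          for b in candidates]
beta_maximin = candidates[int(np.argmax(scores))]
print(beta_maximin)
```

Because the groups share a strong common signal, a nonzero candidate wins here; with completely conflicting regimes, the conservative zero vector can be the maximin choice.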
Statistical paleoclimate reconstructions via Markov random fields
Understanding centennial scale climate variability requires data sets that
are accurate, long, continuous and of broad spatial coverage. Since
instrumental measurements are generally only available after 1850, temperature
fields must be reconstructed using paleoclimate archives, known as proxies.
Various climate field reconstructions (CFR) methods have been proposed to
relate past temperature to such proxy networks. In this work, we propose a new
CFR method, called GraphEM, based on Gaussian Markov random fields embedded
within an EM algorithm. Gaussian Markov random fields provide a natural and
flexible framework for modeling high-dimensional spatial fields. At the same
time, they provide the parameter reduction necessary for obtaining precise and
well-conditioned estimates of the covariance structure, even in the
sample-starved setting common in paleoclimate applications. In this paper, we
propose and compare the performance of different methods to estimate the
graphical structure of climate fields, and demonstrate how the GraphEM
algorithm can be used to reconstruct past climate variations. The performance
of GraphEM is compared to the widely used CFR method RegEM with regularization
via truncated total least squares, using synthetic data. Our results show that
GraphEM can yield significant improvements, with uniform gains over space, and
far better risk properties. We demonstrate that the spatial structure of
temperature fields can be well estimated by graphs where each neighbor is only
connected to a few geographically close neighbors, and that the increase in
performance is directly related to recovering the underlying sparsity in the
covariance of the spatial field. Our work demonstrates how significant
improvements can be made in climate reconstruction methods by better modeling
the covariance structure of the climate field.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS794 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
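A rough sketch of the EM-based reconstruction idea (not GraphEM itself: it regularizes the covariance through a graphical sparsity structure, whereas this toy uses a simple ridge, and the E-step below is simplified to conditional-mean imputation) on a hypothetical "field" with early missing columns:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "climate field": 5 correlated series; two columns unobserved in the
# first 60 of 100 time steps (standing in for the pre-instrumental era).
T, d = 100, 5
L = rng.normal(size=(d, d))
cov_true = L @ L.T + np.eye(d)
X = rng.multivariate_normal(np.zeros(d), cov_true, size=T)
mask = np.zeros((T, d), dtype=bool)   # True = missing
mask[:60, :2] = True

def em_impute(X, mask, ridge=0.1, iters=50):
    """EM-style imputation for a Gaussian field with missing values.
    The ridge term regularizes the covariance, standing in for the
    graphical (sparsity) regularization used by GraphEM."""
    Z = np.where(mask, 0.0, X)
    for _ in range(iters):
        mu = Z.mean(axis=0)
        C = np.cov(Z, rowvar=False) + ridge * np.eye(Z.shape[1])
        for t in range(Z.shape[0]):
            m = mask[t]
            if not m.any():
                continue
            o = ~m
            # E-step (simplified): conditional mean of missing entries
            # given the observed entries at time t.
            Coo_inv = np.linalg.inv(C[np.ix_(o, o)])
            Z[t, m] = mu[m] + C[np.ix_(m, o)] @ Coo_inv @ (Z[t, o] - mu[o])
    return Z

Z = em_impute(X, mask)
err = np.sqrt(np.mean((Z[mask] - X[mask]) ** 2))
base = np.sqrt(np.mean(X[mask] ** 2))   # error of imputing the (zero) mean
print(err, base)
```

Exploiting cross-series correlation through the covariance is what lets the conditional-mean fill-in beat a plain climatological-mean fill-in, which is the mechanism the abstract highlights.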
Machine Learning and Materials Informatics: Recent Applications and Prospects
Propelled partly by the Materials Genome Initiative, and partly by the
algorithmic developments and the resounding successes of data-driven efforts in
other domains, informatics strategies are beginning to take shape within
materials science. These approaches lead to surrogate machine learning models
that enable rapid predictions based purely on past data rather than by direct
experimentation or by computations/simulations in which fundamental equations
are explicitly solved. Data-centric informatics methods are becoming useful to
determine material properties that are hard to measure or compute using
traditional methods--due to the cost, time or effort involved--but for which
reliable data either already exists or can be generated for at least a subset
of the critical cases. Predictions are typically interpolative, involving
fingerprinting a material numerically first, and then following a mapping
(established via a learning algorithm) between the fingerprint and the property
of interest. Fingerprints may be of many types and scales, as dictated by the
application domain and needs. Predictions may also be extrapolative--extending
into new materials spaces--provided prediction uncertainties are properly taken
into account. This article attempts to provide an overview of some of the
recent successful data-driven "materials informatics" strategies undertaken in
the last decade, and identifies some challenges the community is facing and
those that should be overcome in the near future.
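The fingerprint-then-map workflow described above can be sketched minimally. Everything below is an illustrative assumption rather than material from the article: a four-descriptor numeric fingerprint, a synthetic "property", and kernel ridge regression as one common choice of learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical materials, each fingerprinted by 4 numeric descriptors
# (e.g. composition fractions, a structural parameter).
def true_property(F):
    return np.sin(F[:, 0]) + F[:, 1] ** 2 - 0.5 * F[:, 2] * F[:, 3]

F_train = rng.uniform(-1, 1, size=(200, 4))   # fingerprints of known materials
y_train = true_property(F_train) + 0.01 * rng.normal(size=200)

def krr_fit_predict(F, y, F_new, gamma=1.0, lam=1e-3):
    """Kernel ridge regression: a learned fingerprint -> property mapping."""
    def K(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)            # Gaussian (RBF) kernel
    alpha = np.linalg.solve(K(F, F) + lam * np.eye(len(y)), y)
    return K(F_new, F) @ alpha

F_new = rng.uniform(-1, 1, size=(50, 4))      # interpolative query points
pred = krr_fit_predict(F_train, y_train, F_new)
rmse = np.sqrt(np.mean((pred - true_property(F_new)) ** 2))
print(rmse)
```

The queries here lie inside the training domain, matching the "predictions are typically interpolative" caveat; extrapolation would require uncertainty estimates on top of this.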
Precise Performance Analysis of the Box-Elastic Net under Matrix Uncertainties
In this letter, we consider the problem of recovering an unknown sparse
signal from noisy linear measurements, using an enhanced version of the popular
Elastic-Net (EN) method. We modify the EN by adding a box-constraint, and we
call it the Box-Elastic Net (Box-EN). We assume independent identically
distributed (iid) real Gaussian measurement matrix with additive Gaussian
noise. In many practical situations, the measurement matrix is not perfectly
known, and so we only have a noisy estimate of it. In this work, we precisely
characterize the mean squared error and the probability of support recovery of
the Box-Elastic Net in the high-dimensional asymptotic regime. Numerical
simulations validate the theoretical predictions derived in the paper and also
show that the boxed variant outperforms the standard EN.
Comment: arXiv admin note: text overlap with arXiv:1808.0430
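A minimal sketch of a box-constrained elastic net, assuming a proximal-gradient solver and illustrative problem sizes; the letter's contribution is the asymptotic analysis, and it does not prescribe this particular solver.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sparse bounded signal, iid Gaussian measurements with additive noise
# (the setting described in the abstract; values here are illustrative).
n, m, k = 100, 60, 8
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = 1.0      # sparse, entries in [0, 1]
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ x0 + 0.05 * rng.normal(size=m)

def box_elastic_net(A, y, lam1=0.01, lam2=0.01, lo=0.0, hi=1.0, iters=2000):
    """Proximal gradient for
       min 0.5||y - Ax||^2 + lam1*||x||_1 + 0.5*lam2*||x||^2,  lo <= x <= hi.
    For a separable penalty, the box-constrained prox is the unconstrained
    prox (soft-threshold, then ridge shrink) followed by clipping."""
    x = np.zeros(A.shape[1])
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam2)
    for _ in range(iters):
        g = A.T @ (A @ x - y)
        z = x - step * g
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0)  # soft-threshold
        z = z / (1 + step * lam2)                                # ridge shrink
        x = np.clip(z, lo, hi)                                   # box projection
    return x

x_hat = box_elastic_net(A, y)
mse = np.mean((x_hat - x0) ** 2)
print(mse)
```

The clip step is what distinguishes Box-EN from plain EN: prior knowledge that the signal lives in [lo, hi] is enforced exactly rather than only encouraged by the penalties.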
Anti-Sampling-Distortion Compressive Wideband Spectrum Sensing for Cognitive Radio
An excessively high sampling rate is the bottleneck of wideband spectrum
sensing for cognitive radio in mobile communication. Compressed sensing (CS)
is introduced to ease the sampling burden. Standard sparse signal recovery in
CS does not consider the distortion introduced by the analogue-to-information
converter (AIC). To mitigate the performance degradation caused by the
mismatch in the least-squares distortionless constraint, which ignores the
AIC distortion, we model the sampling distortion as a bounded additive noise
and deduce an anti-sampling-distortion constraint (ASDC). We then combine the
ℓ1-norm sparsity constraint with the ASDC to obtain a novel sparse signal
recovery operator that is robust to sampling distortion. Numerical simulations
demonstrate that the proposed method outperforms standard sparse wideband
spectrum sensing in accuracy, denoising ability, etc.
Comment: 24 pages, 4 figures, 1 table; accepted by International Journal of
Mobile Communication
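The ASDC operator itself is not given in the abstract, so the sketch below only shows the surrounding setup under stated assumptions: a nominal AIC matrix that deviates from the true one, standard ℓ1 recovery via ISTA, and the distortion budgeted as extra bounded noise through a larger regularization weight. All sizes and levels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sparse wideband spectrum, compressive measurements through a distorted AIC:
# the effective measurement operator deviates from the nominal one.
n, m, k = 128, 48, 5
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.uniform(1, 2, k)
Phi = rng.normal(size=(m, n)) / np.sqrt(m)                    # nominal AIC
Phi_true = Phi + 0.05 * rng.normal(size=(m, n)) / np.sqrt(m)  # with distortion
y = Phi_true @ x0 + 0.01 * rng.normal(size=m)

def ista(Phi, y, lam, iters=3000):
    """ISTA for min 0.5||y - Phi x||^2 + lam*||x||_1."""
    x = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2
    for _ in range(iters):
        z = x - step * Phi.T @ (Phi @ x - y)
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)
    return x

# Ignoring the distortion (tiny lam) vs budgeting for it as extra noise.
x_naive = ista(Phi, y, lam=0.001)
x_robust = ista(Phi, y, lam=0.02)
print(np.mean((x_naive - x0) ** 2), np.mean((x_robust - x0) ** 2))
```

Treating the model mismatch as bounded additive noise, as the ASDC does, typically calls for a larger noise budget in the recovery program than the measurement noise alone would suggest.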
Statistical inference for high dimensional regression via Constrained Lasso
In this paper, we propose a new method for estimation and constructing
confidence intervals for low-dimensional components in a high-dimensional
model. The proposed estimator, called Constrained Lasso (CLasso) estimator, is
obtained by simultaneously solving two estimating equations: one imposing a
zero-bias constraint for the low-dimensional parameter and the other forming an
ℓ1-penalized procedure for the high-dimensional nuisance parameter. By
carefully choosing the zero-bias constraint, the resulting estimator of the low
dimensional parameter is shown to admit an asymptotically normal limit
attaining the Cramér-Rao lower bound in a semiparametric sense. We propose
a tuning-free iterative algorithm for implementing the CLasso. We show that
when the algorithm is initialized at the Lasso estimator, the de-sparsified
estimator proposed in van de Geer et al. [Ann. Statist. 42 (2014) 1166-1202]
is asymptotically equivalent to the first iterate of the algorithm.
We analyse the asymptotic properties of the CLasso estimator and show the
globally linear convergence of the algorithm. We also demonstrate encouraging
empirical performance of the CLasso through numerical studies.
Precise Error Analysis of the LASSO under Correlated Designs
In this paper, we consider the problem of recovering a sparse signal from
noisy linear measurements using the so called LASSO formulation. We assume a
correlated Gaussian design matrix with additive Gaussian noise. We precisely
analyze the high dimensional asymptotic performance of the LASSO under
correlated design matrices using the Convex Gaussian Min-max Theorem (CGMT). We
define appropriate performance measures such as the mean-square error (MSE),
probability of support recovery, element error rate (EER) and cosine
similarity. Numerical simulations are presented to validate the derived
theoretical results.
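The four performance measures named in the abstract can be computed empirically for a single estimate; the definitions below (support via a magnitude threshold, element error rate as the fraction of mismatched support indicators) are common conventions and illustrative rather than quoted from the paper.

```python
import numpy as np

def performance_measures(x_hat, x0, tol=1e-3):
    """MSE, exact support recovery, element error rate (EER), and cosine
    similarity of an estimate x_hat against the ground truth x0."""
    mse = np.mean((x_hat - x0) ** 2)
    supp_hat = np.abs(x_hat) > tol
    supp_0 = np.abs(x0) > tol
    support_recovery = np.array_equal(supp_hat, supp_0)  # exact support match
    eer = np.mean(supp_hat != supp_0)                    # per-element errors
    cosine = x_hat @ x0 / (np.linalg.norm(x_hat) * np.linalg.norm(x0))
    return mse, support_recovery, eer, cosine

x0 = np.array([0.0, 1.0, 0.0, -2.0])
x_hat = np.array([0.0, 0.9, 0.0, -1.8])
mse, sr, eer, cos = performance_measures(x_hat, x0)
print(mse, sr, eer, cos)
```

In the paper these quantities are characterized analytically in the high-dimensional limit; simulations like the above are what validate those predictions.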
Optimum GSSK Transmission in Massive MIMO Systems Using the Box-LASSO Decoder
We propose in this work to employ the Box-LASSO, a variation of the popular
LASSO method, as a low-complexity decoder in a massive multiple-input
multiple-output (MIMO) wireless communication system. The Box-LASSO is mainly
useful for detecting simultaneously structured signals such as signals that are
known to be sparse and bounded. One modulation technique that generates
essentially sparse and bounded constellation points is the so-called
generalized space-shift keying (GSSK) modulation. In this direction, we derive
high dimensional sharp characterizations of various performance measures of the
Box-LASSO such as the mean square error, probability of support recovery, and
the element error rate, under independent and identically distributed (i.i.d.)
Gaussian channels that are not perfectly known. In particular, the analytical
characterizations can be used to demonstrate performance improvements of the
Box-LASSO as compared to the widely used standard LASSO. Then, we can use these
measures to optimally tune the involved hyper-parameters of Box-LASSO such as
the regularization parameter. In addition, we derive optimum power allocation
and training duration schemes in a training-based massive MIMO system. Monte
Carlo simulations are used to validate these premises and to show the sharpness
of the derived analytical results.
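The "simultaneously sparse and bounded" structure of GSSK that the Box-LASSO exploits is easy to make concrete: each codeword activates exactly k of n_t transmit antennas, so every transmitted vector is 0/1-valued with k ones. The sizes below are illustrative.

```python
import numpy as np
from itertools import combinations

def gssk_codebook(n_t, k):
    """GSSK: information is carried only by WHICH k of the n_t transmit
    antennas are active; each codeword is a 0/1 vector with exactly k ones,
    hence sparse and bounded -- the structure the Box-LASSO exploits."""
    return np.array([[1.0 if i in idx else 0.0 for i in range(n_t)]
                     for idx in combinations(range(n_t), k)])

C = gssk_codebook(n_t=6, k=2)
print(C.shape)                               # C(6,2) = 15 codewords, length 6
bits_per_symbol = np.floor(np.log2(len(C)))  # bits conveyed per channel use
print(bits_per_symbol)
```

Because every entry is known to lie in [0, 1], the box constraint is tight for this constellation, which is why the boxed variant can outperform the standard LASSO here.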
Fast and General Model Selection using Data Depth and Resampling
We present a technique using data depth functions and resampling to perform
best subset variable selection for a wide range of statistical models. We do
this by assigning a score, called an e-value, to a candidate model, and using a
fast bootstrap method to approximate sample versions of these e-values. Under
general conditions, e-values can separate statistical models that adequately
explain properties of the data from those that do not. This results in a fast
algorithm that fits only a single model and evaluates p + 1 models, p being
the number of predictors under consideration, as opposed to the traditional
requirement of fitting and evaluating 2^p models. We illustrate in
simulation experiments that our proposed method typically performs better than
an array of currently used methods for variable selection in linear models and
fixed effect selection in linear mixed models. As a real data application, we
use our procedure to elicit climatic drivers of Indian summer monsoon
precipitation.
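A loose caricature of the fit-once-evaluate-few idea (not the authors' e-value construction): fit only the full model, bootstrap its coefficient estimates, and score each drop-one candidate by how plausibly its zeroed coefficient vector sits inside the bootstrap cloud, using a simple Mahalanobis-style depth. With p = 5 this touches p + 1 = 6 models instead of 2^5 = 32. The threshold and depth function below are ad hoc choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: only the first two of five predictors matter.
n, p = 300, 5
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

# Fit the FULL model once, then bootstrap its coefficient estimates
# by resampling residuals.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
B = 500
boot = np.empty((B, p))
for b in range(B):
    yb = X @ beta_hat + rng.choice(resid, size=n, replace=True)
    boot[b] = np.linalg.lstsq(X, yb, rcond=None)[0]

# Depth of a point within the bootstrap cloud: deep (near 1) means the
# candidate coefficient vector is plausible; shallow means implausible.
cov_inv = np.linalg.inv(np.cov(boot, rowvar=False))
def depth(point):
    d = point - boot.mean(axis=0)
    return 1.0 / (1.0 + d @ cov_inv @ d)

scores = []
for j in range(p):
    cand = beta_hat.copy()
    cand[j] = 0.0                 # the model without predictor j
    scores.append(depth(cand))

# Keep predictor j if the model WITHOUT it falls far outside the cloud.
selected = [j for j in range(p) if scores[j] < 0.01]
print(selected)
```

The key cost structure survives even in this caricature: one model fit plus cheap resampled evaluations, linear rather than exponential in p.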
Bayesian Regression with Undirected Network Predictors with an Application to Brain Connectome Data
This article proposes a Bayesian approach to regression with a continuous
scalar response and an undirected network predictor. Undirected network
predictors are often expressed in terms of symmetric adjacency matrices, with
rows and columns of the matrix representing the nodes, and zero entries
signifying no association between two corresponding nodes. Network predictor
matrices are typically vectorized prior to any analysis, thus failing to
account for the important structural information in the network. This results
in poor inferential and predictive performance in presence of small sample
sizes. We propose a novel class of network shrinkage priors for the coefficient
corresponding to the undirected network predictor. The proposed framework is
devised to detect both nodes and edges in the network predictive of the
response. Our framework is implemented using an efficient Markov Chain Monte
Carlo algorithm. Empirical results in simulation studies illustrate strikingly
superior inferential and predictive gains of the proposed framework in
comparison with the ordinary high dimensional Bayesian shrinkage priors and
penalized optimization schemes. We apply our method to a brain connectome
dataset that contains information on brain networks along with a measure of
creativity for multiple individuals. Here, interest lies in building a
regression model of the creativity measure on the network predictor to identify
important regions and connections in the brain strongly associated with
creativity. To the best of our knowledge, our approach is the first principled
Bayesian method that is able to detect scientifically interpretable regions and
connections in the brain actively impacting the continuous response
(creativity) in the presence of a small sample size.
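The "vectorize first" baseline the abstract criticizes is simple to make concrete: a symmetric adjacency matrix is flattened to its upper-triangular edge weights, which discards the information about which edges share a node. The toy 5-node network below is hypothetical.

```python
import numpy as np

# An undirected network predictor as a symmetric adjacency matrix (5 nodes).
A = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

# Vectorization: keep only the upper triangle (by symmetry). The resulting
# feature vector no longer records which entries share a node, i.e. the
# structural information the proposed shrinkage priors are built to retain.
iu = np.triu_indices_from(A, k=1)
edge_features = A[iu]              # length V*(V-1)/2 = 10
edge_labels = list(zip(*iu))       # the edge (i, j) behind each feature
print(edge_features, edge_labels)
```

A network-aware prior, by contrast, ties together the coefficients of edges incident to a common node, which is how the proposed framework can flag whole nodes, not just individual edges, as predictive.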