
    Maximin effects in inhomogeneous large-scale data

    Large-scale data are often characterized by some degree of inhomogeneity, as data are either recorded in different time regimes or taken from multiple sources. We look at regression models and the effect of randomly changing coefficients, where the change occurs either smoothly in time or along some other dimension, or even without any such structure. Fitting varying-coefficient models or mixture models can be an appropriate solution, but such fits are computationally very demanding and often return more information than necessary. If we just ask for a model estimator with good predictive properties across all regimes of the data, then we are aiming for a simple linear model that is reliable for all possible subsets of the data. We propose the concept of "maximin effects" and a suitable estimator, and study its prediction accuracy from a theoretical point of view in a mixture model with known or unknown group structure. Under certain circumstances the estimator can be computed orders of magnitude faster than standard penalized regression estimators, making computations on large-scale data feasible. Empirical examples complement the novel methodology and theory.
    Comment: Published at http://dx.doi.org/10.1214/15-AOS1325 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
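    When the group structure is known, the maximin estimator admits a convenient characterization: it is the point in the convex hull of the group-wise coefficient estimates with the smallest $\Sigma$-weighted norm. A minimal Python sketch of this idea follows; the function name, the pooled Gram-matrix estimate of $\Sigma$, and the simplex solver are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def maximin_estimator(X_groups, y_groups):
    """Sketch of a maximin-type estimator for a known group structure:
    fit OLS within each group, then return the point in the convex hull
    of the group estimates with the smallest Sigma-weighted norm."""
    B = np.column_stack([np.linalg.lstsq(X, y, rcond=None)[0]
                         for X, y in zip(X_groups, y_groups)])   # p x G
    X_all = np.vstack(X_groups)
    Sigma = X_all.T @ X_all / X_all.shape[0]                     # pooled Gram
    G = B.shape[1]

    def sigma_norm(w):                     # ||B w||_Sigma^2 over the simplex
        beta = B @ w
        return beta @ Sigma @ beta

    res = minimize(sigma_norm, np.full(G, 1.0 / G),
                   bounds=[(0.0, 1.0)] * G,
                   constraints=({'type': 'eq',
                                 'fun': lambda w: w.sum() - 1.0},))
    return B @ res.x
```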

    Statistical paleoclimate reconstructions via Markov random fields

    Understanding centennial scale climate variability requires data sets that are accurate, long, continuous and of broad spatial coverage. Since instrumental measurements are generally only available after 1850, temperature fields must be reconstructed using paleoclimate archives, known as proxies. Various climate field reconstruction (CFR) methods have been proposed to relate past temperature to such proxy networks. In this work, we propose a new CFR method, called GraphEM, based on Gaussian Markov random fields embedded within an EM algorithm. Gaussian Markov random fields provide a natural and flexible framework for modeling high-dimensional spatial fields. At the same time, they provide the parameter reduction necessary for obtaining precise and well-conditioned estimates of the covariance structure, even in the sample-starved setting common in paleoclimate applications. In this paper, we propose and compare the performance of different methods to estimate the graphical structure of climate fields, and demonstrate how the GraphEM algorithm can be used to reconstruct past climate variations. The performance of GraphEM is compared to the widely used CFR method RegEM with regularization via truncated total least squares, using synthetic data. Our results show that GraphEM can yield significant improvements, with uniform gains over space, and far better risk properties. We demonstrate that the spatial structure of temperature fields can be well estimated by graphs where each node is only connected to a few geographically close neighbors, and that the increase in performance is directly related to recovering the underlying sparsity in the covariance of the spatial field. Our work demonstrates how significant improvements can be made in climate reconstruction methods by better modeling the covariance structure of the climate field.
    Comment: Published at http://dx.doi.org/10.1214/14-AOAS794 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
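    An EM-style sketch in the spirit of GraphEM (not the authors' code): the E-step imputes missing field values by their Gaussian conditional mean, and the M-step re-estimates a sparse covariance with the graphical lasso. For brevity the conditional-covariance correction of a full EM is omitted, and every column is assumed to have at least one observation.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def graph_em_impute(Z, alpha=0.1, n_iter=20):
    """Z: (time x grid) array with NaN where no proxy information exists."""
    X = np.where(np.isnan(Z), np.nanmean(Z, axis=0), Z)  # init: column means
    for _ in range(n_iter):
        gl = GraphicalLasso(alpha=alpha).fit(X)          # M-step: sparse Sigma
        mu, Sigma = X.mean(axis=0), gl.covariance_
        for i in range(Z.shape[0]):
            m = np.isnan(Z[i])                           # missing entries
            if not m.any():
                continue
            S_oo = Sigma[np.ix_(~m, ~m)]
            S_mo = Sigma[np.ix_(m, ~m)]
            # E-step: conditional mean given the observed entries
            X[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, Z[i, ~m] - mu[~m])
    return X
```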

    Machine Learning and Materials Informatics: Recent Applications and Prospects

    Propelled partly by the Materials Genome Initiative, and partly by algorithmic developments and the resounding successes of data-driven efforts in other domains, informatics strategies are beginning to take shape within materials science. These approaches lead to surrogate machine learning models that enable rapid predictions based purely on past data rather than by direct experimentation or by computations/simulations in which fundamental equations are explicitly solved. Data-centric informatics methods are becoming useful for determining material properties that are hard to measure or compute using traditional methods--due to the cost, time or effort involved--but for which reliable data either already exist or can be generated for at least a subset of the critical cases. Predictions are typically interpolative, involving fingerprinting a material numerically first, and then following a mapping (established via a learning algorithm) between the fingerprint and the property of interest. Fingerprints may be of many types and scales, as dictated by the application domain and needs. Predictions may also be extrapolative--extending into new materials spaces--provided prediction uncertainties are properly taken into account. This article provides an overview of some of the successful data-driven "materials informatics" strategies undertaken in the last decade, and identifies challenges the community is facing and those that should be overcome in the near future.
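    The interpolative fingerprint-to-property pipeline is simple to make concrete. The sketch below uses purely synthetic fingerprints and a kernel ridge surrogate; every quantity in it is an illustrative assumption, not a materials dataset.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: each row is a numerical "fingerprint" of a material
# (e.g. composition fractions or structural descriptors); prop is the
# property to be learned.
rng = np.random.default_rng(0)
fingerprints = rng.random((200, 12))
prop = fingerprints @ rng.random(12) + 0.05 * rng.standard_normal(200)

# Surrogate model: the learned mapping from fingerprint to property.
model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=1.0)
print("CV R^2:", cross_val_score(model, fingerprints, prop, cv=5).mean())
```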

    Precise Performance Analysis of the Box-Elastic Net under Matrix Uncertainties

    In this letter, we consider the problem of recovering an unknown sparse signal from noisy linear measurements, using an enhanced version of the popular Elastic-Net (EN) method. We modify the EN by adding a box constraint, and we call the result the Box-Elastic Net (Box-EN). We assume an independent identically distributed (i.i.d.) real Gaussian measurement matrix with additive Gaussian noise. In many practical situations the measurement matrix is not perfectly known, and so we only have a noisy estimate of it. In this work, we precisely characterize the mean squared error and the probability of support recovery of the Box-Elastic Net in the high-dimensional asymptotic regime. Numerical simulations validate the theoretical predictions derived in the paper and also show that the boxed variant outperforms the standard EN.
    Comment: arXiv admin note: text overlap with arXiv:1808.0430
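    The letter analyzes the estimator rather than prescribing a solver, but one simple way to compute a Box-EN solution is projected proximal gradient descent, sketched below; the step size, iteration count, and [lo, hi] box are assumptions.

```python
import numpy as np

def box_elastic_net(A, y, lam1=0.1, lam2=0.1, lo=0.0, hi=1.0, n_iter=500):
    """Projected proximal gradient for
        min_x 0.5*||y - A x||^2 + lam1*||x||_1 + 0.5*lam2*||x||^2
        s.t. lo <= x_i <= hi."""
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam2)  # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) + lam2 * x          # gradient of smooth part
        z = x - step * grad
        # prox of l1 + box indicator: soft-threshold, then clip to the box
        x = np.clip(np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0),
                    lo, hi)
    return x
```

    The prox step is valid because the prox of the separable sum of the $\ell_1$ penalty and the box indicator is soft-thresholding followed by clipping, coordinate by coordinate.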

    Anti-Sampling-Distortion Compressive Wideband Spectrum Sensing for Cognitive Radio

    An excessively high sampling rate is the bottleneck for wideband spectrum sensing for cognitive radio in mobile communication. Compressed sensing (CS) is introduced to relieve the sampling burden. Standard sparse signal recovery in CS does not consider the distortion introduced by the analogue-to-information converter (AIC). To mitigate the performance degradation caused by the mismatch in the least-squares distortionless constraint, which ignores the AIC distortion, we model the sampling distortion acting on the sparse signal as bounded additive noise and derive an anti-sampling-distortion constraint (ASDC). We then combine the $\ell_1$-norm-based sparsity constraint with the ASDC to obtain a novel sparse signal recovery operator that is robust to sampling distortion. Numerical simulations demonstrate that the proposed method outperforms standard sparse wideband spectrum sensing in accuracy, denoising ability, etc.
    Comment: 24 pages, 4 figures, 1 table; accepted by International Journal of Mobile Communication
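    A standard convex way to encode a bounded sampling distortion is to budget for its worst case inside the residual constraint: if the AIC error matrix has spectral norm at most delta, the worst-case residual is bounded by ||y - Ax||_2 + delta*||x||_2 by the triangle inequality. The cvxpy sketch below shows such a robust second-order-cone formulation in the spirit of the ASDC; eps and delta are assumed bounds, and this is not the paper's exact operator.

```python
import cvxpy as cp

def asdc_recover(A, y, eps=0.1, delta=0.05):
    """Robust sparse recovery sketch: the distortion budget delta*||x||
    enters the residual constraint explicitly."""
    x = cp.Variable(A.shape[1])
    prob = cp.Problem(cp.Minimize(cp.norm(x, 1)),
                      [cp.norm(y - A @ x, 2) + delta * cp.norm(x, 2) <= eps])
    prob.solve()
    return x.value
```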

    Statistical inference for high dimensional regression via Constrained Lasso

    In this paper, we propose a new method for estimating and constructing confidence intervals for low-dimensional components in a high-dimensional model. The proposed estimator, called the Constrained Lasso (CLasso) estimator, is obtained by simultaneously solving two estimating equations: one imposing a zero-bias constraint for the low-dimensional parameter, and the other forming an $\ell_1$-penalized procedure for the high-dimensional nuisance parameter. By carefully choosing the zero-bias constraint, the resulting estimator of the low-dimensional parameter is shown to admit an asymptotically normal limit attaining the Cramér-Rao lower bound in a semiparametric sense. We propose a tuning-free iterative algorithm for implementing the CLasso. We show that when the algorithm is initialized at the Lasso estimator, the de-sparsified estimator proposed in van de Geer et al. [Ann. Statist. 42 (2014) 1166-1202] is asymptotically equivalent to the first iterate of the algorithm. We analyse the asymptotic properties of the CLasso estimator and show the globally linear convergence of the algorithm. We also demonstrate encouraging empirical performance of the CLasso through numerical studies.
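    The zero-bias idea can be illustrated for a single target coordinate: an $\ell_1$-regularized projection decorrelates the target column from the nuisance columns, after which the target coefficient solves a single estimating equation. The sketch below is a simplified stand-in close in spirit to de-sparsified constructions, not the paper's exact pair of equations.

```python
import numpy as np
from sklearn.linear_model import Lasso

def classo_first_coord(X, y, lam=0.1):
    """Estimate beta_1 in y = X @ beta + e via a decorrelated direction w:
    x1 minus its l1-regularized projection onto the nuisance columns,
    then solve the estimating equation w @ (y - x1 * b) = 0 for b."""
    x1, Z = X[:, 0], X[:, 1:]
    g = Lasso(alpha=lam).fit(Z, x1).coef_
    w = x1 - Z @ g                       # decorrelated direction
    return float(w @ y / (w @ x1))
```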

    Precise Error Analysis of the LASSO under Correlated Designs

    In this paper, we consider the problem of recovering a sparse signal from noisy linear measurements using the so-called LASSO formulation. We assume a correlated Gaussian design matrix with additive Gaussian noise. We precisely analyze the high-dimensional asymptotic performance of the LASSO under correlated design matrices using the Convex Gaussian Min-max Theorem (CGMT). We define appropriate performance measures such as the mean-square error (MSE), probability of support recovery, element error rate (EER) and cosine similarity. Numerical simulations are presented to validate the derived theoretical results.
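    Each of these performance measures is straightforward to estimate by Monte Carlo, which is how sharp asymptotic predictions are typically validated. In the sketch below, the dimensions, the AR(1)-style correlation, the noise level, and the regularization are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, k, trials, rho = 100, 200, 10, 50, 0.4
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
L = np.linalg.cholesky(Sigma)                     # correlated design factor

mse, supp, cosim = [], [], []
for _ in range(trials):
    beta = np.zeros(p)
    beta[rng.choice(p, k, replace=False)] = 1.0   # k-sparse signal
    X = rng.standard_normal((n, p)) @ L.T
    y = X @ beta + 0.5 * rng.standard_normal(n)
    b = Lasso(alpha=0.1).fit(X, y).coef_
    mse.append(np.mean((b - beta) ** 2))
    supp.append(np.all((np.abs(b) > 1e-3) == (beta != 0)))  # exact support
    cosim.append(b @ beta /
                 (np.linalg.norm(b) * np.linalg.norm(beta) + 1e-12))

print(np.mean(mse), np.mean(supp), np.mean(cosim))
```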

    Optimum GSSK Transmission in Massive MIMO Systems Using the Box-LASSO Decoder

    We propose in this work to employ the Box-LASSO, a variation of the popular LASSO method, as a low-complexity decoder in a massive multiple-input multiple-output (MIMO) wireless communication system. The Box-LASSO is mainly useful for detecting simultaneously structured signals, such as signals that are known to be both sparse and bounded. One modulation technique that generates essentially sparse and bounded constellation points is so-called generalized space-shift keying (GSSK) modulation. To this end, we derive sharp high-dimensional characterizations of various performance measures of the Box-LASSO, such as the mean square error, probability of support recovery, and element error rate, under independent and identically distributed (i.i.d.) Gaussian channels that are not perfectly known. In particular, the analytical characterizations can be used to demonstrate performance improvements of the Box-LASSO over the widely used standard LASSO. These measures can then be used to optimally tune the hyper-parameters of the Box-LASSO, such as the regularization parameter. In addition, we derive optimum power allocation and training duration schemes in a training-based massive MIMO system. Monte Carlo simulations are used to validate these premises and to show the sharpness of the derived analytical results.
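    As a toy illustration of Box-LASSO decoding for GSSK (all sizes, the noise level, and the regularization below are assumptions): since the transmitted vector is known to be 0/1-valued with exactly k active antennas, one can run a box-constrained LASSO and declare the k largest coefficients active.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 64, 128, 4                              # receive dim, antennas, active
H = rng.standard_normal((n, p)) / np.sqrt(n)      # i.i.d. Gaussian channel
x_true = np.zeros(p)
x_true[rng.choice(p, k, replace=False)] = 1.0     # GSSK: k antennas "on"
y = H @ x_true + 0.05 * rng.standard_normal(n)

# Box-LASSO via projected ISTA: soft-threshold, then clip to [0, 1].
lam = 0.02
step = 1.0 / np.linalg.norm(H, 2) ** 2
x = np.zeros(p)
for _ in range(500):
    z = x - step * (H.T @ (H @ x - y))
    x = np.clip(np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0), 0.0, 1.0)

detected = np.sort(np.argsort(x)[-k:])            # k largest coefficients
print(np.array_equal(detected, np.flatnonzero(x_true)))
```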

    Fast and General Model Selection using Data Depth and Resampling

    We present a technique using data depth functions and resampling to perform best subset variable selection for a wide range of statistical models. We do this by assigning a score, called an $e$-value, to a candidate model, and use a fast bootstrap method to approximate sample versions of these $e$-values. Under general conditions, $e$-values can separate statistical models that adequately explain properties of the data from those that do not. This results in a fast algorithm that fits only a single model and evaluates $p+1$ models, $p$ being the number of predictors under consideration, as opposed to the traditional requirement of fitting and evaluating $2^{p}$ models. We illustrate in simulation experiments that our proposed method typically performs better than an array of currently used methods for variable selection in linear models and fixed effect selection in linear mixed models. As a real data application, we use our procedure to elicit climatic drivers of Indian summer monsoon precipitation.
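    A stylized sketch of the single-fit workflow follows; the depth function and the acceptance rule below are simplified stand-ins, not the authors' exact $e$-value construction, and $n > p$ is assumed so that the full OLS fit exists.

```python
import numpy as np

def mahalanobis_depth(pts, cloud):
    """Toy depth: 1 / (1 + squared Mahalanobis distance to the cloud mean)."""
    mu = cloud.mean(axis=0)
    P = np.linalg.pinv(np.cov(cloud.T))
    d = np.einsum('ij,jk,ik->i', pts - mu, P, pts - mu)
    return 1.0 / (1.0 + d)

def evalue_select(X, y, B=200, seed=3):
    """Bootstrap the full-model OLS fit once, then score the p reduced
    models obtained by zeroing one coefficient each: p + 1 evaluations
    in total, with no refitting."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.integers(0, n, (B, n))                 # bootstrap row indices
    boot = np.stack([np.linalg.lstsq(X[i], y[i], rcond=None)[0] for i in idx])
    full = mahalanobis_depth(boot, boot).mean()      # score of the full model
    keep = []
    for j in range(p):
        reduced = boot.copy()
        reduced[:, j] = 0.0                          # drop predictor j
        if mahalanobis_depth(reduced, boot).mean() < full:
            keep.append(j)                           # depth drops: j matters
    return keep
```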

    Bayesian Regression with Undirected Network Predictors with an Application to Brain Connectome Data

    This article proposes a Bayesian approach to regression with a continuous scalar response and an undirected network predictor. Undirected network predictors are often expressed in terms of symmetric adjacency matrices, with rows and columns of the matrix representing the nodes, and zero entries signifying no association between two corresponding nodes. Network predictor matrices are typically vectorized prior to any analysis, thus failing to account for the important structural information in the network, which results in poor inferential and predictive performance when sample sizes are small. We propose a novel class of network shrinkage priors for the coefficient corresponding to the undirected network predictor. The proposed framework is devised to detect both nodes and edges in the network that are predictive of the response. Our framework is implemented using an efficient Markov chain Monte Carlo algorithm. Empirical results in simulation studies illustrate strikingly superior inferential and predictive gains of the proposed framework in comparison with ordinary high-dimensional Bayesian shrinkage priors and penalized optimization schemes. We apply our method to a brain connectome dataset that contains information on brain networks along with a measure of creativity for multiple individuals. Here, interest lies in building a regression model of the creativity measure on the network predictor to identify important regions and connections in the brain strongly associated with creativity. To the best of our knowledge, our approach is the first principled Bayesian method able to detect scientifically interpretable regions and connections in the brain actively impacting a continuous response (creativity) in the presence of a small sample size.
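    The structural idea can be sketched without the full MCMC: take the upper triangle of each subject's adjacency matrix as edge predictors, and share shrinkage across edges through node-level scales, so that all edges touching an unimportant node are shrunk together. In the MAP-style ridge sketch below, the node scales are fixed by hand for illustration, whereas the actual model would learn them; everything here is an assumption, not the authors' prior.

```python
import numpy as np

rng = np.random.default_rng(4)
V, n = 10, 80                                  # brain regions, subjects
iu = np.triu_indices(V, k=1)                   # indices of the unique edges
A = rng.standard_normal((n, V, V))
A = (A + A.transpose(0, 2, 1)) / 2             # symmetric adjacency per subject
X = A[:, iu[0], iu[1]]                         # n x (V choose 2) edge predictors
y = X[:, 0] - X[:, 3] + 0.1 * rng.standard_normal(n)

s = np.ones(V)                                 # node-level scales: fixed here,
s[:3] = 3.0                                    # learned in the full model
w = 1.0 / (s[iu[0]] * s[iu[1]])                # shared per-edge penalty weights
# edges between high-scale (relevant) nodes receive less shrinkage
beta = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)
print(np.round(beta[:5], 2))
```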
    • …