
    High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity

    Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependence, as well. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able both to analyze the statistical error associated with any global optimum and, more surprisingly, to prove that a simple algorithm based on projected gradient descent will converge in polynomial time to a small neighborhood of the set of all global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing close agreement with the predicted scalings. Comment: Published at http://dx.doi.org/10.1214/12-AOS1018 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
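
To make the computational claim concrete, here is a minimal sketch of projected gradient descent applied to a (possibly nonconvex) corrected least-squares objective of the kind this line of work studies. The surrogate names `Gamma_hat`/`gamma_hat`, the additive-noise correction mentioned in the comments, and all step sizes and radii are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball {x : ||x||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gradient(Gamma_hat, gamma_hat, radius, step, n_iter=500):
    """Minimize 0.5 * b' Gamma_hat b - gamma_hat' b over the l1 ball of given radius.
    With noisy/missing covariates, natural surrogates such as
    Gamma_hat = Z'Z/n - Sigma_w and gamma_hat = Z'y/n (additive noise) can be
    indefinite, so this objective is generally nonconvex; each iteration is still cheap."""
    beta = np.zeros_like(gamma_hat)
    for _ in range(n_iter):
        grad = Gamma_hat @ beta - gamma_hat
        beta = project_l1_ball(beta - step * grad, radius)
    return beta
```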

    MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package

    Generalized linear mixed models provide a flexible framework for modeling a range of data, although with non-Gaussian response variables the likelihood cannot be obtained in closed form. Markov chain Monte Carlo methods solve this problem by sampling from a series of simpler conditional distributions that can be evaluated. The R package MCMCglmm implements such an algorithm for a range of model fitting problems. More than one response variable can be analyzed simultaneously, and these variables are allowed to follow Gaussian, Poisson, multi(bi)nomial, exponential, zero-inflated and censored distributions. A range of variance structures are permitted for the random effects, including interactions with categorical or continuous variables (i.e., random regression), and more complicated variance structures that arise through shared ancestry, either through a pedigree or through a phylogeny. Missing values are permitted in the response variable(s) and data can be known up to some level of measurement error as in meta-analysis. All simulation is done in C/C++ using the CSparse library for sparse linear systems.
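
The "series of simpler conditional distributions" is the key mechanism. Below is a minimal Gibbs-sampler sketch for a Gaussian random-intercept model with conjugate updates; it is not MCMCglmm's actual sampler (the package handles non-Gaussian responses, multivariate priors, and pedigree/phylogeny structures), and the function name, priors, and parameterization are assumptions made purely for illustration.

```python
import numpy as np

def gibbs_random_intercept(y, group, n_iter=2000, a=0.001, b=0.001, seed=0):
    """Toy Gibbs sampler for y_i = mu + u[group_i] + e_i, e_i ~ N(0, sig_e2),
    u_j ~ N(0, sig_u2); `group` holds integer codes 0..J-1."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    n, J = len(y), int(group.max()) + 1
    mu, sig_u2, sig_e2 = y.mean(), 1.0, 1.0
    u = np.zeros(J)
    draws = []
    for _ in range(n_iter):
        # Conditional for the intercept: Gaussian around the mean residual.
        mu = rng.normal((y - u[group]).mean(), np.sqrt(sig_e2 / n))
        # Conditional for each random effect: Gaussian with conjugate precision.
        for j in range(J):
            yj = y[group == j] - mu
            prec = len(yj) / sig_e2 + 1.0 / sig_u2
            u[j] = rng.normal((yj.sum() / sig_e2) / prec, np.sqrt(1.0 / prec))
        # Conditionals for the variance components: inverse-gamma, drawn via 1/Gamma.
        sig_u2 = 1.0 / rng.gamma(a + J / 2.0, 1.0 / (b + 0.5 * np.sum(u ** 2)))
        resid = y - mu - u[group]
        sig_e2 = 1.0 / rng.gamma(a + n / 2.0, 1.0 / (b + 0.5 * np.sum(resid ** 2)))
        draws.append((mu, sig_u2, sig_e2))
    return np.array(draws)
```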

    Efficient hybrid modeling and sorption model discovery for non-linear advection-diffusion-sorption systems: A systematic scientific machine learning approach

    This study presents a systematic machine learning approach for creating efficient hybrid models and discovering sorption uptake models in non-linear advection-diffusion-sorption systems. It demonstrates an effective method to train these complex systems using gradient-based optimizers, adjoint sensitivity analysis, and JIT-compiled vector Jacobian products, combined with spatial discretization and adaptive integrators. Sparse and symbolic regression were employed to identify missing functions in the artificial neural network. The robustness of the proposed method was tested on an in-silico data set of noisy breakthrough curve observations of fixed-bed adsorption, resulting in a well-fitted hybrid model. The study successfully reconstructed sorption uptake kinetics using sparse and symbolic regression, and accurately predicted breakthrough curves using the identified polynomials, highlighting the potential of the proposed framework for discovering sorption kinetic law structures. Comment: Preprint paper to be submitted to an Elsevier journal.
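
As an illustration of the sparse-regression step, the sketch below applies sequentially thresholded least squares to a polynomial candidate library to recover an uptake-rate law from synthetic data. The library terms, the assumed linear-driving-force "true" law used to generate the toy data, and the threshold are assumptions for illustration, not the paper's actual library or settings.

```python
import numpy as np

def stlsq(Theta, target, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: sparse xi with Theta @ xi ~= target."""
    xi = np.linalg.lstsq(Theta, target, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        if (~small).any():
            xi[~small] = np.linalg.lstsq(Theta[:, ~small], target, rcond=None)[0]
    return xi

# Synthetic uptake data from an assumed linear-driving-force law dq/dt = k * (K*c - q).
rng = np.random.default_rng(1)
c = rng.uniform(0.0, 1.0, 400)                        # fluid-phase concentration
q = rng.uniform(0.0, 0.5, 400)                        # adsorbed amount
dqdt = 2.0 * (0.8 * c - q) + 0.01 * rng.normal(size=400)

# Candidate polynomial library in c and q.
Theta = np.column_stack([np.ones_like(c), c, q, c * q, c ** 2, q ** 2])
names = ["1", "c", "q", "c*q", "c^2", "q^2"]
xi = stlsq(Theta, dqdt)
print({n: round(x, 3) for n, x in zip(names, xi) if x != 0.0})  # approx {'c': 1.6, 'q': -2.0}
```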

    A Max-Product EM Algorithm for Reconstructing Markov-tree Sparse Signals from Compressive Samples

    We propose a Bayesian expectation-maximization (EM) algorithm for reconstructing Markov-tree sparse signals via belief propagation. The measurements follow an underdetermined linear model where the regression-coefficient vector is the sum of an unknown approximately sparse signal and a zero-mean white Gaussian noise with an unknown variance. The signal is composed of large- and small-magnitude components identified by binary state variables whose probabilistic dependence structure is described by a Markov tree. Gaussian priors are assigned to the signal coefficients given their state variables, and the Jeffreys' noninformative prior is assigned to the noise variance. Our signal reconstruction scheme is based on an EM iteration that aims at maximizing the posterior distribution of the signal and its state variables given the noise variance. We construct the missing data for the EM iteration so that the complete-data posterior distribution corresponds to a hidden Markov tree (HMT) probabilistic graphical model that contains no loops, and we implement its maximization (M) step via a max-product algorithm. This EM algorithm estimates the vector of state variables and iteratively solves a linear system of equations to obtain the corresponding signal estimate. We select the noise variance so that the corresponding estimated signal and state variables obtained upon convergence of the EM iteration have the largest marginal posterior distribution. We compare the proposed and existing state-of-the-art reconstruction methods via signal and image reconstruction experiments. Comment: To appear in IEEE Transactions on Signal Processing.
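
The overall structure (alternate a binary state update with a state-dependent linear solve for the signal) can be sketched as below. This simplification treats the state variables as independent rather than Markov-tree structured, so the max-product/HMT machinery of the paper is deliberately omitted; all names, priors, and default values are assumptions for illustration only.

```python
import numpy as np

def em_sparse_recovery(Phi, y, sigma2=0.01, var_big=1.0, var_small=1e-4,
                       p_big=0.1, n_iter=50):
    """Schematic EM-style loop: MAP update of binary states, then a
    state-dependent ridge solve for the signal (independent states, no Markov tree)."""
    s = Phi.T @ y                      # crude initialization of the signal estimate
    for _ in range(n_iter):
        # State update: per-coefficient log-odds of the "large" Gaussian prior
        # versus the "small" one, including the Bernoulli prior on the state.
        log_odds = (0.5 * s ** 2 * (1.0 / var_small - 1.0 / var_big)
                    + 0.5 * np.log(var_small / var_big)
                    + np.log(p_big / (1.0 - p_big)))
        q = log_odds > 0.0
        # Signal update: solve the regularized normal equations
        # (Phi' Phi / sigma2 + diag(1 / prior_var)) s = Phi' y / sigma2.
        prior_var = np.where(q, var_big, var_small)
        A = Phi.T @ Phi / sigma2 + np.diag(1.0 / prior_var)
        s = np.linalg.solve(A, Phi.T @ y / sigma2)
    return s, q
```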

    Influence of rainfall observation network on model calibration and application

    The objective of this study is to investigate the influence of the spatial resolution of the rainfall input on model calibration and application. The analysis is carried out by varying the distribution of the raingauge network. The semi-distributed HBV model is calibrated with precipitation interpolated from the available observed rainfall of the different raingauge networks. An automatic calibration method based on the combinatorial optimization algorithm simulated annealing is applied. Aggregated Nash-Sutcliffe coefficients at different temporal scales are adopted as the objective function to estimate the model parameters. The performance of the hydrological model is analyzed as a function of the raingauge density. The calibrated model is validated using the same precipitation used for the calibration as well as interpolated precipitation based on networks of reduced and increased raingauge density. The effect of missing rainfall data is investigated by using a multiple linear regression approach for filling the missing values. The model, calibrated with the complete set of observed data, is then run in the validation period using the above-described precipitation fields. The simulated hydrographs obtained in the three sets of experiments are analyzed by comparing the computed Nash-Sutcliffe coefficient and several goodness-of-fit indexes. The results show that a model using different raingauge networks might need recalibration of the model parameters: a model calibrated on sparse information might perform well on dense information, while a model calibrated on dense information fails on sparse information. Also, the model calibrated with the complete set of observed precipitation, and run with incomplete observed data supplemented by values estimated using multiple linear regression at the locations treated as missing measurements, performs well. A meso-scale catchment located in the south-west of Germany has been selected for this study.

    Influence of rainfall observation network on model calibration and application

    The objective of this study is to investigate the influence of the spatial resolution of the rainfall input on model calibration and application. The analysis is carried out by varying the distribution of the raingauge network. A meso-scale catchment located in southwest Germany has been selected for this study. First, the semi-distributed HBV model is calibrated with precipitation interpolated from the available observed rainfall of the different raingauge networks. An automatic calibration method based on the combinatorial optimization algorithm simulated annealing is applied. The performance of the hydrological model is analyzed as a function of the raingauge density. Secondly, the calibrated model is validated using interpolated precipitation from the same raingauge density used for the calibration as well as interpolated precipitation based on networks of reduced and increased raingauge density. Lastly, the effect of missing rainfall data is investigated by using a multiple linear regression approach for filling in the missing measurements. The model, calibrated with the complete set of observed data, is then run in the validation period using the above-described precipitation fields. The simulated hydrographs obtained in these three sets of experiments are analyzed by comparing the computed Nash-Sutcliffe coefficient and several goodness-of-fit indexes. The results show that a model using different raingauge networks might need re-calibration of the model parameters: specifically, a model calibrated on relatively sparse precipitation information might perform well on dense precipitation information, while a model calibrated on dense precipitation information fails on sparse precipitation information. Also, the model calibrated with the complete set of observed precipitation, and run with incomplete observed data supplemented by values estimated using multiple linear regression at the locations treated as missing measurements, performs well.
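
Both versions of this abstract rest on two small computational pieces: the Nash-Sutcliffe coefficient used (in aggregated form) as the calibration objective, and multiple linear regression for filling missing raingauge records from neighbouring stations. A plain, unaggregated sketch of both is given below; the function names and the non-negativity clipping are illustrative assumptions, not taken from the study.

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE(simulation) / SSE(mean of observations)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def fill_missing_by_regression(target, neighbours):
    """Fill NaNs in one raingauge series by multiple linear regression on
    neighbouring gauges, fitted on time steps where every series is observed."""
    target = np.asarray(target, float)
    X = np.column_stack([np.ones(len(target))] +
                        [np.asarray(nb, float) for nb in neighbours])
    complete = ~np.isnan(target) & ~np.isnan(X).any(axis=1)
    coef, *_ = np.linalg.lstsq(X[complete], target[complete], rcond=None)
    filled = target.copy()
    fillable = np.isnan(target) & ~np.isnan(X).any(axis=1)
    # Clip to zero because precipitation cannot be negative (an added assumption).
    filled[fillable] = np.maximum(X[fillable] @ coef, 0.0)
    return filled
```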

    Node harvest

    When choosing a suitable technique for regression and classification with multivariate predictor variables, one is often faced with a tradeoff between interpretability and high predictive accuracy. To give a classical example, classification and regression trees are easy to understand and interpret, while tree ensembles like Random Forests usually provide more accurate predictions. Yet tree ensembles are also more difficult to analyze than single trees and are often criticized, perhaps unfairly, as 'black box' predictors. Node harvest tries to reconcile the two aims of interpretability and predictive accuracy by combining positive aspects of trees and tree ensembles. Results are very sparse and interpretable, and predictive accuracy is extremely competitive, especially for low signal-to-noise data. The procedure is simple: an initial set of a few thousand nodes is generated randomly. If a new observation falls into just a single node, its prediction is the mean response of all training observations within this node, identical to a tree-like prediction. A new observation typically falls into several nodes, and its prediction is then the weighted average of the mean responses across all these nodes. The only role of node harvest is to 'pick' the right nodes from the initial large ensemble of nodes by choosing node weights, which in the proposed algorithm amounts to a quadratic programming problem with linear inequality constraints. The solution is sparse in the sense that only very few nodes are selected with a nonzero weight; this sparsity is not explicitly enforced. Maybe surprisingly, it is not necessary to select a tuning parameter for optimal predictive accuracy. Node harvest can handle mixed data and missing values and is shown to be simple to interpret and competitive in predictive accuracy on a variety of data sets. Comment: Published at http://dx.doi.org/10.1214/10-AOAS367 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
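
The fitting step described in the abstract (choose nonnegative node weights by quadratic programming) can be sketched as a constrained least-squares problem: given a node membership matrix and the per-node mean responses, minimize the squared training error subject to the weights of the nodes containing each observation summing to one. The sketch below uses SciPy's general SLSQP solver rather than a dedicated QP solver, and the function names, the always-feasible "root node" remark, and the normalization used in prediction are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_node_weights(I, node_means, y):
    """Constrained least squares: minimize ||y - (I * node_means) @ w||^2
    subject to w >= 0 and (I @ w) == 1 for every training observation.
    I is the n x K {0,1} membership matrix; node_means holds the K node means.
    Including a node that contains every observation keeps the constraints feasible."""
    M = I * node_means                                   # per-node contributions
    objective = lambda w: float(np.sum((y - M @ w) ** 2))
    grad = lambda w: -2.0 * M.T @ (y - M @ w)
    cons = {"type": "eq", "fun": lambda w: I @ w - 1.0, "jac": lambda w: I}
    K = I.shape[1]
    res = minimize(objective, np.full(K, 1.0 / K), jac=grad,
                   bounds=[(0.0, None)] * K, constraints=[cons], method="SLSQP")
    return res.x

def predict(I_new, node_means, w):
    """Weighted average of the mean responses of all weighted nodes an
    observation falls into."""
    return (I_new @ (w * node_means)) / (I_new @ w)
```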