High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity
Although the standard formulations of prediction problems involve
fully-observed and noiseless data drawn in an i.i.d. manner, many applications
involve noisy and/or missing data, possibly involving dependence, as well. We
study these issues in the context of high-dimensional sparse linear regression,
and propose novel estimators for the cases of noisy, missing and/or dependent
data. Many standard approaches to noisy or missing data, such as those using
the EM algorithm, lead to optimization problems that are inherently nonconvex,
and it is difficult to establish theoretical guarantees on practical
algorithms. While our approach also involves optimizing nonconvex programs, we
are able to both analyze the statistical error associated with any global
optimum, and more surprisingly, to prove that a simple algorithm based on
projected gradient descent will converge in polynomial time to a small
neighborhood of the set of all global minimizers. On the statistical side, we
provide nonasymptotic bounds that hold with high probability for the cases of
noisy, missing and/or dependent data. On the computational side, we prove that
under the same types of conditions required for statistical consistency, the
projected gradient descent algorithm is guaranteed to converge at a geometric
rate to a near-global minimizer. We illustrate these theoretical predictions
with simulations, showing close agreement with the predicted scalings.
Comment: Published at http://dx.doi.org/10.1214/12-AOS1018 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
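The projected gradient descent step mentioned in this abstract can be illustrated with a minimal sketch. It assumes an additive-noise setting in which the covariates are observed as Z = X + W with a known noise covariance `sigma_w`, a surrogate quadratic objective built from Z'Z/n - sigma_w and Z'y/n, and an l1-ball constraint of radius `radius`. These modelling choices and all function names are assumptions made for illustration; the abstract itself does not spell out the estimator.

```python
import math


def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius}, radius > 0."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    # Sort magnitudes in decreasing order and find the soft-threshold level.
    magnitudes = sorted((abs(x) for x in v), reverse=True)
    running_sum, theta = 0.0, 0.0
    for k, m in enumerate(magnitudes, start=1):
        running_sum += m
        candidate = (running_sum - radius) / k
        if m - candidate > 0:
            theta = candidate
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


def corrected_surrogates(Z, y, sigma_w):
    """Assumed surrogates: Gamma = Z'Z/n - sigma_w and gamma = Z'y/n."""
    n, p = len(Z), len(Z[0])
    gamma_mat = [
        [sum(Z[k][i] * Z[k][j] for k in range(n)) / n - sigma_w[i][j]
         for j in range(p)]
        for i in range(p)
    ]
    gamma_vec = [sum(Z[k][i] * y[k] for k in range(n)) / n for i in range(p)]
    return gamma_mat, gamma_vec


def projected_gradient_descent(gamma_mat, gamma_vec, radius, step, n_iter=500):
    """Minimise 0.5 * b'Gamma b - gamma'b subject to ||b||_1 <= radius."""
    p = len(gamma_vec)
    beta = [0.0] * p
    for _ in range(n_iter):
        # Gradient of the quadratic surrogate: Gamma b - gamma.
        grad = [
            sum(gamma_mat[i][j] * beta[j] for j in range(p)) - gamma_vec[i]
            for i in range(p)
        ]
        beta = project_l1_ball(
            [b - step * g for b, g in zip(beta, grad)], radius
        )
    return beta
```

When the number of covariates exceeds the sample size, the corrected matrix can have negative eigenvalues, which is one way the nonconvexity discussed in the abstract can arise; the sketch does not check for this and uses a fixed step size chosen by the caller.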
MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package
Generalized linear mixed models provide a flexible framework for modeling a range of data, although with non-Gaussian response variables the likelihood cannot be obtained in closed form. Markov chain Monte Carlo methods solve this problem by sampling from a series of simpler conditional distributions that can be evaluated. The R package MCMCglmm implements such an algorithm for a range of model fitting problems. More than one response variable can be analyzed simultaneously, and these variables are allowed to follow Gaussian, Poisson, multi(bi)nomial, exponential, zero-inflated and censored distributions. A range of variance structures are permitted for the random effects, including interactions with categorical or continuous variables (i.e., random regression), and more complicated variance structures that arise through shared ancestry, either through a pedigree or through a phylogeny. Missing values are permitted in the response variable(s) and data can be known up to some level of measurement error as in meta-analysis. All simulation is done in C/C++ using the CSparse library for sparse linear systems.
Efficient hybrid modeling and sorption model discovery for non-linear advection-diffusion-sorption systems: A systematic scientific machine learning approach
This study presents a systematic machine learning approach for creating
efficient hybrid models and discovering sorption uptake models in non-linear
advection-diffusion-sorption systems. It demonstrates an effective method to
train these complex systems using gradient-based optimizers, adjoint sensitivity
analysis, and JIT-compiled vector-Jacobian products, combined with spatial
discretization and adaptive integrators. Sparse and symbolic regression were
employed to identify missing functions in the artificial neural network. The
robustness of the proposed method was tested on an in-silico data set of noisy
breakthrough curve observations of fixed-bed adsorption, resulting in a
well-fitted hybrid model. The study successfully reconstructed sorption uptake
kinetics using sparse and symbolic regression, and accurately predicted
breakthrough curves using identified polynomials, highlighting the potential of
the proposed framework for discovering sorption kinetic law structures.
Comment: Preprint paper to be submitted soon in Elsevier Journa
A Max-Product EM Algorithm for Reconstructing Markov-tree Sparse Signals from Compressive Samples
We propose a Bayesian expectation-maximization (EM) algorithm for
reconstructing Markov-tree sparse signals via belief propagation. The
measurements follow an underdetermined linear model where the
regression-coefficient vector is the sum of an unknown approximately sparse
signal and a zero-mean white Gaussian noise with an unknown variance. The
signal is composed of large- and small-magnitude components identified by
binary state variables whose probabilistic dependence structure is described by
a Markov tree. Gaussian priors are assigned to the signal coefficients given
their state variables and the Jeffreys' noninformative prior is assigned to the
noise variance. Our signal reconstruction scheme is based on an EM iteration
that aims at maximizing the posterior distribution of the signal and its state
variables given the noise variance. We construct the missing data for the EM
iteration so that the complete-data posterior distribution corresponds to a
hidden Markov tree (HMT) probabilistic graphical model that contains no loops
and implement its maximization (M) step via a max-product algorithm. This EM
algorithm estimates the vector of state variables as well as solves iteratively
a linear system of equations to obtain the corresponding signal estimate. We
select the noise variance so that the corresponding estimated signal and state
variables obtained upon convergence of the EM iteration have the largest
marginal posterior distribution. We compare the proposed and existing
state-of-the-art reconstruction methods via signal and image reconstruction
experiments.
Comment: To appear in IEEE Transactions on Signal Processin
Influence of rainfall observation network on model calibration and application
The objective of this study is to investigate the influence of the spatial resolution of the rainfall input on model calibration and application. The analysis is carried out by varying the distribution of the raingauge network. The semi-distributed HBV model is calibrated with precipitation interpolated from the available observed rainfall of the different raingauge networks. An automatic calibration method based on the combinatorial optimization algorithm simulated annealing is applied. Aggregated Nash-Sutcliffe coefficients at different temporal scales are adopted as the objective function to estimate the model parameters. The performance of the hydrological model is analyzed as a function of the raingauge density. The calibrated model is validated using the same precipitation used for the calibration as well as interpolated precipitation based on networks of reduced and increased raingauge density. The effect of missing rainfall data is investigated by using a multiple linear regression approach for filling the missing values. The model, calibrated with the complete set of observed data, is then run in the validation period using the above described precipitation field. The simulated hydrographs obtained in the three sets of experiments are analyzed through comparisons of the computed Nash-Sutcliffe coefficient and several goodness-of-fit indexes. The results show that a model using different raingauge networks might need recalibration of its parameters: a model calibrated on sparse information might perform well on dense information, while a model calibrated on dense information fails on sparse information. Also, the model calibrated with the complete set of observed precipitation and run with incomplete observed data, supplemented at the locations treated as missing by values estimated using multiple linear regression, performs well.
A meso-scale catchment located in the south-west of Germany has been selected for this study.
Influence of rainfall observation network on model calibration and application
The objective of this study is to investigate the influence of the spatial resolution of the rainfall input on model calibration and application. The analysis is carried out by varying the distribution of the raingauge network. A meso-scale catchment located in southwest Germany has been selected for this study. First, the semi-distributed HBV model is calibrated with precipitation interpolated from the available observed rainfall of the different raingauge networks. An automatic calibration method based on the combinatorial optimization algorithm simulated annealing is applied. The performance of the hydrological model is analyzed as a function of the raingauge density. Secondly, the calibrated model is validated using interpolated precipitation from the same raingauge density used for the calibration as well as interpolated precipitation based on networks of reduced and increased raingauge density. Lastly, the effect of missing rainfall data is investigated by using a multiple linear regression approach for filling in the missing measurements. The model, calibrated with the complete set of observed data, is then run in the validation period using the above described precipitation field. The simulated hydrographs obtained in these three sets of experiments are analyzed through comparisons of the computed Nash-Sutcliffe coefficient and several goodness-of-fit indexes. The results show that a model using different raingauge networks might need recalibration of its parameters. Specifically, a model calibrated on relatively sparse precipitation information might perform well on dense precipitation information, while a model calibrated on dense precipitation information fails on sparse precipitation information.
Also, the model calibrated with the complete set of observed precipitation and run with incomplete observed data, supplemented at the locations treated as missing by values estimated using multiple linear regression, performs well.
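The Nash-Sutcliffe coefficient used in this study to compare simulated and observed hydrographs can be sketched as follows. The formula is the standard one, 1 minus the ratio of the squared simulation error to the variance of the observations around their mean. The `aggregate` helper, which averages over non-overlapping windows to mimic evaluation at a coarser temporal scale, is an assumption for illustration; the abstract does not say how the aggregated coefficients are formed.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((o - s)^2) / sum((o - mean(o))^2)."""
    if not observed or len(observed) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    denominator = sum((o - mean_obs) ** 2 for o in observed)
    if denominator == 0:
        raise ValueError("observed series has zero variance")
    numerator = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    return 1.0 - numerator / denominator


def aggregate(series, window):
    """Hypothetical helper: means over non-overlapping windows of given length.

    Trailing values that do not fill a complete window are dropped.
    """
    return [
        sum(series[i:i + window]) / window
        for i in range(0, len(series) - window + 1, window)
    ]
```

A value of 1 indicates a perfect match, and a value of 0 means the simulation is no better than predicting the mean of the observations. To evaluate at a coarser scale under the assumption above, one would call `nash_sutcliffe(aggregate(obs, w), aggregate(sim, w))`.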
Node harvest
When choosing a suitable technique for regression and classification with
multivariate predictor variables, one is often faced with a tradeoff between
interpretability and high predictive accuracy. To give a classical example,
classification and regression trees are easy to understand and interpret. Tree
ensembles like Random Forests usually provide more accurate predictions. Yet
tree ensembles are also more difficult to analyze than single trees and are
often criticized, perhaps unfairly, as `black box' predictors. Node harvest is
trying to reconcile the two aims of interpretability and predictive accuracy by
combining positive aspects of trees and tree ensembles. Results are very sparse
and interpretable and predictive accuracy is extremely competitive, especially
for low signal-to-noise data. The procedure is simple: an initial set of a few
thousand nodes is generated randomly. If a new observation falls into just a
single node, its prediction is the mean response of all training observations
within this node, identical to a tree-like prediction. A new observation
typically falls into several nodes and its prediction is then the weighted average of
the mean responses across all these nodes. The only role of node harvest is to
`pick' the right nodes from the initial large ensemble of nodes by choosing
node weights, which amounts in the proposed algorithm to a quadratic
programming problem with linear inequality constraints. The solution is sparse
in the sense that only very few nodes are selected with a nonzero weight. This
sparsity is not explicitly enforced. Maybe surprisingly, it is not necessary to
select a tuning parameter for optimal predictive accuracy. Node harvest can
handle mixed data and missing values and is shown to be simple to interpret and
competitive in predictive accuracy on a variety of data sets.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS367 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
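The prediction rule described in this abstract (a weighted average of node means over the nodes an observation falls into) can be sketched as below. This covers only the prediction step with node weights taken as given; the quadratic program that selects the weights is not shown. Representing a node as a dictionary of half-open intervals per feature index, and all function names, are assumptions for illustration.

```python
def in_node(x, node):
    """True if x lies in the box; node maps feature index -> (low, high)."""
    return all(low <= x[j] < high for j, (low, high) in node.items())


def node_mean(node, X_train, y_train):
    """Mean response of the training observations that fall into the node."""
    members = [y for x, y in zip(X_train, y_train) if in_node(x, node)]
    if not members:
        raise ValueError("node contains no training observations")
    return sum(members) / len(members)


def predict(x, nodes, means, weights):
    """Weighted average of node means over nodes with nonzero weight containing x."""
    active = [
        (w, m)
        for node, m, w in zip(nodes, means, weights)
        if w > 0 and in_node(x, node)
    ]
    total_weight = sum(w for w, _ in active)
    if total_weight == 0:
        raise ValueError("observation falls into no selected node")
    return sum(w * m for w, m in active) / total_weight
```

If an observation falls into a single selected node, the weighted average reduces to that node's mean, which matches the tree-like prediction mentioned in the abstract.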