18 research outputs found

    A Note on Cross-Validation for Lasso Under Measurement Errors

    No full text
    Variants of the Lasso, or ℓ1-penalized regression, have been proposed to accommodate measurement errors in the covariates. Theoretical guarantees for these estimates have been established for some oracle values of the regularization parameters, which are not known in practice. Data-driven tuning such as cross-validation has not been studied when covariates contain measurement errors. We demonstrate that in the presence of error-in-covariates, even when using a Lasso variant that adjusts for measurement error, application of naive leave-one-out cross-validation to select the tuning parameter can be problematic. We provide an example where such a practice leads to estimation inconsistency. We also prove that a simple correction to the cross-validation procedure restores consistency. We also study the risk consistency of the two cross-validation procedures and offer guidelines on the choice of cross-validation based on the measurement error distributions of the training and the prediction data. The theoretical findings are validated using simulated data. Supplementary materials for this article are available online.

    Time comparison (in minutes) of PredLMM for varying knot (sub-sample) sizes with Bolt-REML for Simulation Study (2.2) with 100k individuals.

    No full text

    Comparison of PredLMM with GREML (sub) in Simulation Study (1).

    No full text
    Box-plots of the estimates are shown for varying sub-sample sizes (knot-sizes) in four different cases. (TIF)

    Time comparison of different methods in seconds for Simulation Study 1 with 5k (8k SNPs) and 8k (13k SNPs) individuals.

    No full text

    The figure shows the empirical RMSE of different methods from Simulation Study 1.

    No full text
    Each of the four sub-plots corresponds to one of four different cases. For every case, 100 replications were considered. GREML (sub) had a very high RMSE compared to PredLMM, and the latter had an RMSE close to that of the full GREML-based methods.

    Pictorial formulation of .

    No full text
    We look at the full GRM A and its blocks that are used in computing . For the sake of simplicity in representation, we assume that the first r of the total of N individuals are in the set of knots . (TIF)

    Comparison of PredLMM with GREML (sub) in Simulation Study (2.2).

    No full text
    Box-plots of the estimates are shown for varying sub-sample sizes (knot-sizes) in three different cases. (TIF)

    The figure shows the time taken by PredLMM with knot sizes of 2000, 4000, 8000, and 16,000 and by Bolt-REML for a single simulation with 100,000 individuals and 566,000 SNPs (from Simulation Study 2).

    No full text

    The figure shows the bar-plot of the heritability estimates by PredLMM and GREML (sub) with two sub-sample (knot) sizes and by Bolt-REML (pseudo) for seven different real traits.

    No full text

    Random Forests for Spatially Dependent Data

    No full text
    Spatial linear mixed models, consisting of a linear covariate effect and a Gaussian process (GP) distributed spatial random effect, are widely used for analyses of geospatial data. We consider the setting where the covariate effect is nonlinear. Random forests (RF) are popular for estimating nonlinear functions, but applications of RF for spatial data have often ignored the spatial correlation. We show that this impacts the performance of RF adversely. We propose RF-GLS, a novel and well-principled extension of RF, for estimating nonlinear covariate effects in spatial mixed models where the spatial correlation is modeled using GP. RF-GLS extends RF in the same way generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) to accommodate dependence in linear models. RF becomes a special case of RF-GLS, and is substantially outperformed by RF-GLS for both estimation and prediction across extensive numerical experiments with spatially correlated data. RF-GLS can be used for functional estimation in other types of dependent data like time series. We prove consistency of RF-GLS for β-mixing dependent error processes that include the popular spatial Matérn GP. As a byproduct, we also establish, to our knowledge, the first consistency result for RF under dependence. We establish results of independent importance, including a general consistency result of GLS optimizers of data-driven function classes, and a uniform law of large numbers under β-mixing dependence with weaker assumptions. These new tools can be potentially useful for asymptotic analysis of other GLS-style estimators in nonparametric regression with dependent data.