Dominance and G×E interaction effects improve genomic prediction and genetic gain in intermediate wheatgrass (Thinopyrum intermedium)
Genomic selection (GS) based recurrent selection methods were developed to accelerate the domestication of intermediate wheatgrass [IWG, Thinopyrum intermedium (Host) Barkworth & D.R. Dewey]. A subset of the breeding population phenotyped at multiple environments is used to train GS models and then predict trait values of the breeding population. In this study, we implemented several GS models that investigated the use of additive and dominance effects and G×E interaction effects to understand how they affected trait predictions in intermediate wheatgrass. We evaluated 451 genotypes from the University of Minnesota IWG breeding program for nine agronomic and domestication traits at two Minnesota locations during 2017-2018. Genet-mean based heritabilities for these traits ranged from 0.34 to 0.77. Using fourfold cross validation, we observed the highest predictive abilities (correlation of 0.67) in models that considered G×E effects. When G×E effects were fitted in GS models, trait predictions improved by 18%, 15%, 20%, and 23% for yield, spike weight, spike length, and free threshing, respectively. Genomic selection models with dominance effects showed only modest increases of up to 3% and were trait-dependent. Cross-environment predictions were better for high heritability traits such as spike length, shatter resistance, free threshing, grain weight, and seed length than traits with low heritability and large environmental variance such as spike weight, grain yield, and seed width. Our results confirm that GS can accelerate IWG domestication by increasing genetic gain per breeding cycle and assist in selection of genotypes with promise of better performance in diverse environments.
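The GS workflow the abstract describes — train a marker-based model on a phenotyped subset, then predict the rest, scoring predictive ability as the correlation between predicted and observed values under fourfold cross validation — can be sketched with a simple ridge (GBLUP-like) predictor. The marker matrix, effect sizes, and dimensions below are simulated stand-ins, not the IWG data or the study's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated marker matrix (genotypes x markers) and additive effects;
# hypothetical stand-ins for real genotyping data.
n_geno, n_markers = 200, 500
X = rng.choice([-1.0, 0.0, 1.0], size=(n_geno, n_markers))
beta = rng.normal(0, 0.1, n_markers)
y = X @ beta + rng.normal(0, 1.0, n_geno)  # phenotype = genetic value + noise

def ridge_fit_predict(X_tr, y_tr, X_te, lam=10.0):
    """Ridge regression (a marker-based BLUP analogue): solve (X'X + lam*I) b = X'y."""
    p = X_tr.shape[1]
    b = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ y_tr)
    return X_te @ b

# Fourfold cross validation: predictive ability = correlation between
# predicted and observed phenotypes in the held-out fold.
folds = np.array_split(rng.permutation(n_geno), 4)
abilities = []
for fold in folds:
    mask = np.ones(n_geno, dtype=bool)
    mask[fold] = False  # held-out genotypes
    pred = ridge_fit_predict(X[mask], y[mask], X[~mask])
    abilities.append(np.corrcoef(pred, y[~mask])[0, 1])

print(round(float(np.mean(abilities)), 3))
```

Extending this sketch to dominance or G×E terms would mean adding further design matrices (dominance coding, environment-specific marker effects) to the same ridge system.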
From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation
In statistical prediction, classical approaches for model selection and model
evaluation based on covariance penalties are still widely used. Most of the
literature on this topic is based on what we call the "Fixed-X" assumption,
where covariate values are assumed to be nonrandom. By contrast, it is often
more reasonable to take a "Random-X" view, where the covariate values are
independently drawn for both training and prediction. To study the
applicability of covariance penalties in this setting, we propose a
decomposition of Random-X prediction error in which the randomness in the
covariates contributes to both the bias and variance components. This
decomposition is general, but we concentrate on the fundamental case of least
squares regression. We prove that in this setting the move from Fixed-X to
Random-X prediction results in an increase in both bias and variance. When the
covariates are normally distributed and the linear model is unbiased, all terms
in this decomposition are explicitly computable, which yields an extension of
Mallows' Cp that we call RCp. RCp also holds asymptotically for certain
classes of nonnormal covariates. When the noise variance is unknown, plugging
in the usual unbiased estimate leads to an approach that we call $\hat{RCp}$,
which is closely related to Sp (Tukey 1967) and GCV (Craven and Wahba 1978).
For excess bias, we propose an estimate based on the "shortcut formula" for
ordinary cross-validation (OCV), resulting in an approach we call $RCp^+$.
Theoretical arguments and numerical simulations suggest that $RCp^+$ is
typically superior to OCV, though the difference is small. We further examine
the Random-X error of other popular estimators. The surprising result we get
for ridge regression is that, in the heavily-regularized regime, Random-X
variance is smaller than Fixed-X variance, which can lead to smaller overall
Random-X error.
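The "shortcut formula" for ordinary cross-validation that the abstract mentions is the classical leave-one-out identity for least squares: the i-th held-out residual equals the full-fit residual divided by 1 - h_ii, where h_ii is the i-th hat-matrix diagonal. The sketch below verifies that identity against brute-force refitting on simulated data; it is not the paper's RCp-style estimators, just the OCV building block they rest on.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated linear regression data (a stand-in design, not from the paper).
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X'; full-fit residuals and leverages.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
h = np.diag(H)

# OCV via the shortcut formula: leave-one-out residual = e_i / (1 - h_ii).
ocv_shortcut = np.mean((resid / (1.0 - h)) ** 2)

# Brute-force leave-one-out refitting for comparison.
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
ocv_naive = np.mean(errs)

# The shortcut reproduces brute-force OCV exactly (up to floating point).
print(np.isclose(ocv_shortcut, ocv_naive))
```

The shortcut makes OCV an O(1)-per-point quantity once the hat diagonal is available, which is why it is a natural basis for the excess-bias estimate described above.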
A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation
Approximate Bayesian computation (ABC) methods make use of comparisons
between simulated and observed summary statistics to overcome the problem of
computationally intractable likelihood functions. As the practical
implementation of ABC requires computations based on vectors of summary
statistics, rather than full data sets, a central question is how to derive
low-dimensional summary statistics from the observed data with minimal loss of
information. In this article we provide a comprehensive review and comparison
of the performance of the principal methods of dimension reduction proposed in
the ABC literature. The methods are split into three nonmutually exclusive
classes consisting of best subset selection methods, projection techniques and
regularization. In addition, we introduce two new methods of dimension
reduction. The first is a best subset selection method based on Akaike and
Bayesian information criteria, and the second uses ridge regression as a
regularization procedure. We illustrate the performance of these dimension
reduction techniques through the analysis of three challenging models and data
sets.
Comment: Published at http://dx.doi.org/10.1214/12-STS406 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).
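The ABC mechanism the abstract builds on — accept prior draws whose simulated summary statistics fall close to the observed ones — can be shown in a few lines. The example below is a minimal rejection-ABC sketch for the mean of a normal distribution, using the sample mean as the low-dimensional summary; the model, prior, and tolerance are illustrative choices, not any of the review's benchmark problems.

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed data from a normal with unknown mean; the sample mean is a
# low-dimensional (here sufficient) summary statistic of the full data set.
theta_true = 2.0
obs = rng.normal(theta_true, 1.0, size=100)
s_obs = obs.mean()

def abc_rejection(n_draws=20000, eps=0.05):
    """ABC rejection: keep prior draws whose simulated summary lands within eps of s_obs."""
    theta = rng.uniform(-5, 5, n_draws)  # draws from a flat prior
    # Simulate a data set per draw and reduce each to its summary statistic.
    sims = rng.normal(theta, 1.0, size=(100, n_draws)).mean(axis=0)
    return theta[np.abs(sims - s_obs) < eps]

accepted = abc_rejection()
print(len(accepted), round(float(accepted.mean()), 2))
```

The dimension-reduction question the review addresses arises exactly here: with many candidate summaries the acceptance region becomes vanishingly small, so subset selection, projection, or regularization is used to compress the summary vector before the distance comparison.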
Model Selection Techniques for Kernel-Based Regression Analysis Using Information Complexity Measure and Genetic Algorithms
In statistical modeling, an overparameterized model leads to poor generalization on unseen data points. This issue requires a model selection technique that appropriately chooses the form, the parameters of the proposed model and the independent variables retained for the modeling. Model selection is particularly important for linear and nonlinear statistical models, which can be easily overfitted.
Recently, support vector machines (SVMs), also known as kernel-based methods, have drawn much attention as the next generation of nonlinear modeling techniques. The model selection issues for SVMs include the selection of the kernel, the corresponding parameters and the optimal subset of independent variables. In the current literature, k-fold cross-validation is the widely utilized model selection method for SVMs by the machine learning researchers. However, cross-validation is computationally intensive since one has to fit the model k times.
This dissertation introduces the use of a model selection criterion based on information complexity (ICOMP) measure for kernel-based regression analysis and its applications. ICOMP penalizes both the lack-of-fit and the complexity of the model to choose the optimal model with good generalization properties. ICOMP provides a simple index for each model and does not require any validation data. It is computationally efficient and it has been successfully applied to various linear model selection problems. In this dissertation, we introduce ICOMP to the nonlinear kernel-based modeling areas. Specifically, this dissertation proposes ICOMP and its various forms in the area of kernel ridge regression; kernel partial least squares regression; kernel principal component analysis; kernel principal component regression; relevance vector regression; relevance vector logistic regression and classification problems. The model selection tasks achieved by our proposed criterion include choosing the form of the kernel function, the parameters of the kernel function, the ridge parameter, the number of latent variables, the number of principal components and the optimal subset of input variables in a simultaneous fashion for intelligent data mining.
The performance of the proposed model selection method is tested on simulation benchmark data sets as well as real data sets. The predictive performance of the proposed model selection criteria is comparable to, and at times better than, cross-validation, which is far more costly to compute.
This dissertation combines the Genetic Algorithm with ICOMP in variable subsetting, which significantly decreases the computational time as compared to the exhaustive search of all possible subsets. The GA procedure is shown to be robust and performs well in our repeated simulation examples.
Therefore, this dissertation provides researchers an alternative computationally efficient model selection approach for data analysis using kernel methods.
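The model at the center of this dissertation, kernel ridge regression, can be sketched compactly. The example below fits an RBF-kernel ridge model to simulated nonlinear data; the kernel bandwidth `gamma` and the `ridge` penalty are exactly the kind of hyperparameters the dissertation selects with ICOMP and genetic algorithms instead of k-fold cross-validation. ICOMP itself is not reproduced here, and the data and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, gamma, ridge):
    """Kernel ridge regression: dual coefficients alpha = (K + ridge*I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + ridge * np.eye(len(X)), y)

def krr_predict(X_tr, alpha, X_te, gamma):
    return rbf_kernel(X_te, X_tr, gamma) @ alpha

# Nonlinear toy data: y = sin(x) + noise.
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 80)

alpha = krr_fit(X, y, gamma=1.0, ridge=0.1)
pred = krr_predict(X, alpha, X, gamma=1.0)
print(round(float(np.mean((pred - y) ** 2)), 4))  # in-sample MSE
```

A criterion-based selector such as ICOMP would score many (gamma, ridge, variable-subset) candidates from a single fit each, rather than refitting k times per candidate as cross-validation does, which is the computational saving the dissertation emphasizes.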
Quantifying Epistemic Uncertainty in Deep Learning
Uncertainty quantification is at the core of the reliability and robustness
of machine learning. In this paper, we provide a theoretical framework to
dissect the uncertainty, especially the epistemic component, in deep learning
into procedural variability (from the training procedure) and data variability
(from the training data), which is the first such attempt in the literature to
our best knowledge. We then propose two approaches to estimate these
uncertainties, one based on influence function and one on batching. We
demonstrate how our approaches overcome the computational difficulties in
applying classical statistical methods. Experimental evaluations on multiple
problem settings corroborate our theory and illustrate how our framework and
estimation can provide direct guidance on modeling and data collection effort
to improve deep learning performance.
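The split of epistemic uncertainty into procedural variability (from the training procedure) and data variability (from the training data) can be illustrated by retraining a small stochastically trained model under two regimes: same data with different seeds, and bootstrap-resampled data with a fixed seed. This is a loose illustration of the retraining/batching idea, not the paper's influence-function or batching estimators; the model and data are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data; the "model" is linear regression trained by SGD, whose output
# depends on both the training data and the procedure's random seed.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, 200)
x_test = np.array([1.0, 1.0, 1.0])

def sgd_fit(X, y, seed, epochs=30, lr=0.01):
    """SGD with random init and shuffling: the source of procedural variability."""
    r = np.random.default_rng(seed)
    w = r.normal(size=X.shape[1])
    for _ in range(epochs):
        for i in r.permutation(len(y)):
            w += lr * (y[i] - X[i] @ w) * X[i]
    return w

# Procedural variability: same data, different training seeds.
preds_seed = [x_test @ sgd_fit(X, y, seed=s) for s in range(20)]
var_proc = float(np.var(preds_seed))

# Data variability: bootstrap resamples of the data, fixed training seed.
preds_boot = []
for _ in range(20):
    idx = rng.integers(0, len(y), len(y))
    preds_boot.append(x_test @ sgd_fit(X[idx], y[idx], seed=0))
var_data = float(np.var(preds_boot))

print(var_proc, var_data)
```

Comparing the two variance estimates at a test point indicates whether more training runs (procedural) or more data (data variability) would do more to reduce epistemic uncertainty, which is the kind of guidance the abstract describes.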
- …