
    A two step model for linear prediction, with connections to PLS

    In the thesis, we consider prediction of a univariate response variable, especially when the explanatory variables are almost collinear. A two-step approach is proposed. The first step summarizes the information in the explanatory variables via a bilinear model with a Krylov-structured design matrix. The second step is the prediction step, where a conditional predictor is applied. The two-step approach gives new insight into partial least squares regression (PLS). Explicit maximum likelihood estimators of the variances and mean for the explanatory variables are derived. It is shown that the mean square error of the predictor in the two-step model is always smaller than that in PLS. Moreover, the two-step model has been extended to handle grouped data. A real data set is analyzed to illustrate the performance of the two-step approach and to compare it with other regularized methods.
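    The Krylov connection can be illustrated with a short numerical sketch: on centred data, k-component PLS prediction amounts to least squares restricted to the Krylov subspace spanned by s, Cs, ..., C^(k-1)s, where s = X'y and C = X'X. The data below are synthetic and the choice k = 3 is arbitrary; this is a minimal illustration of the Krylov-structured design matrix, not the thesis's full two-step model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nearly collinear explanatory variables -- the setting of the thesis.
n, p, k = 200, 8, 3
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # induce near-collinearity
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
X -= X.mean(axis=0)
y -= y.mean()

# Krylov-structured design: columns s, Cs, ..., C^(k-1)s.
s = X.T @ y
C = X.T @ X
K = np.column_stack([np.linalg.matrix_power(C, j) @ s for j in range(k)])

# Step 1: summarize X through the k-dimensional Krylov subspace.
# Step 2: predict by ordinary least squares on that summary.
alpha, *_ = np.linalg.lstsq(X @ K, y, rcond=None)
beta_krylov = K @ alpha   # implied p-dimensional coefficient vector
```

With k = p the fit coincides (up to numerics) with ordinary least squares; small k regularizes, which is what makes the approach stable under near-collinearity.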

    Evaluation of direct and indirect methods for modelling the joint distribution of tree diameter and height data with the bivariate Johnson’s SBB function to forest stands

    Aim of study: Both direct and indirect methods, based on conditional maximum likelihood (CML) and on moments, for fitting Johnson's SBB were evaluated. To date, Johnson's SBB has been fitted either by the indirect (two-stage) method, using well-known procedures for the marginal diameter and height distributions, or by direct methods, where all parameters are estimated at once. Applying the bivariate Johnson's SBB to predict height and improve volume estimation requires a suitable fitting method.
    Area of study: E. globulus, P. pinaster and P. radiata stands in northwest Spain.
    Material and methods: The data set comprised 308, 184 and 96 permanent sample plots (PSPs) from the aforementioned species, respectively. The suitability of each method was evaluated based on height and volume prediction. Indices including the coefficient of determination (R2), root mean square error (RMSE), model efficiency (MEF), Bayesian information criterion (BIC) and Hannan-Quinn criterion (HQC) were used to assess the model predictions. Significant differences between observed and predicted tree heights and volumes were tested with a paired-sample t-test at the 5% level for each plot, by species.
    Main results: The indirect method by CML was the most suitable for height and volume prediction in all three species. The R2 and RMSE for height prediction ranged from 0.994 to 0.820 and from 1.454 to 1.676, respectively. Observed and predicted heights differed significantly in only 0.32% of plots. The direct method performed worst, especially for height prediction in E. globulus.
    Research highlights: The indirect (two-stage) method, especially by conditional maximum likelihood, was the most suitable for fitting the bivariate Johnson's SBB distribution.
    Keywords: conditional maximum likelihood; moments; two-stage method; direct method; tree volume
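    The per-plot significance check from the methods section can be sketched as follows. The heights are simulated and the two-sided 5% critical value for 29 degrees of freedom is hard-coded from a t-table, so this only illustrates the procedure, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed vs. predicted tree heights: 50 plots of 30 trees.
obs = [rng.normal(15.0, 3.0, size=30) for _ in range(50)]
pred = [h + rng.normal(0.0, 0.5, size=30) for h in obs]

def paired_t(a, b):
    """Paired-sample t statistic for two equal-length samples."""
    d = a - b
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# Two-sided 5% critical value for df = 29 (from a t-table).
T_CRIT = 2.045
n_sig = sum(abs(paired_t(a, b)) > T_CRIT for a, b in zip(obs, pred))
share = 100.0 * n_sig / len(obs)   # % of plots with a significant difference
```

Since the simulated predictions err by mean-zero noise only, the significant share stays near the 5% false-positive rate; the study's 0.32% for the CML indirect method is well below even that.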

    Similarity-Based Models of Word Cooccurrence Probabilities

    In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks: language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error. We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency, to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task. Comment: 26 pages, 5 figures
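    The core estimate for an unseen bigram can be sketched in a few lines: a similarity-weighted average of the conditional probabilities of distributionally similar words. The toy corpus and the similarity scores below are invented for illustration; the paper's models derive similarities from distributional statistics and plug this estimate into a back-off scheme.

```python
from collections import Counter

# Toy corpus of verb-noun pairs; real models use large corpora.
corpus = "eat peach eat apple grab peach grab apple".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def mle(w1, w2):
    """Maximum-likelihood conditional bigram probability P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def similarity_estimate(w1, w2, similar):
    """P(w2 | w1) for an unseen history w1, as a similarity-weighted
    average over w1's distributional neighbours."""
    total = sum(similar.values())
    return sum(sim * mle(v, w2) for v, sim in similar.items()) / total

# 'chew' never occurs in the corpus, but similar words do
# (the similarity scores here are made up for illustration).
p = similarity_estimate("chew", "peach", {"eat": 0.8, "grab": 0.5})
```

Because both neighbours assign "peach" probability 1/2 after them, the weighted average is 1/2 as well, whereas the MLE for the unseen history would be undefined (zero counts).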

    Analysis of factorial experiments using mixed-effects models: options for estimation, prediction and inference

    In linear mixed-effects modelling of experiments, estimation of variance components, prediction of random effects, and computation of the denominator degrees of freedom associated with inference on fixed effects are important elements of the analysis. This thesis investigates alternatives to the likelihood-based procedures for the analysis of factorial experiments with normally distributed observations. Consistent methods, such as the maximum likelihood method, can be disadvantageous when only small samples are available. Moreover, the algorithms used in linear mixed-effects models can be computationally demanding on large datasets. In this thesis, Henderson's method 3, a non-iterative variance component estimation method, was considered for estimating the variance components in a two-way mixed linear model with three variance components. The variance component estimator corresponding to one of the random effects was improved by perturbing the standard unbiased estimator. The improved variance component estimator performed better in terms of mean square error. In an application to a quantitative trait loci (QTL) study, the modified estimator was compared with the restricted maximum likelihood (REML) estimator on data from a European wild boar × domestic pig intercross. The modified estimator was shown to approximate the results obtained from REML very closely. For balanced and unbalanced data in two-way models with and without interaction, generalized prediction intervals for the random effects were derived. The coverage probabilities of the proposed intervals were compared with those based on the REML method and the approximate methods of Satterthwaite (1946) and Kenward and Roger (1997). The coverage of the proposed intervals was closer to the chosen nominal level than that of prediction intervals based on the REML method.
    With a focus on Type I error, the implications of the available options in the MIXED procedure of SAS and the lmer function of R for inference on the fixed effects were examined. With the default settings of SAS, the frequency of Type I errors was higher than with R. The Type I error rate in SAS was close to the nominal value when negative estimates of the variance components were allowed. Both software packages occasionally produced inaccurate results.
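    A heavily simplified analogue of the non-iterative estimation idea, using a balanced one-way random-effects model in place of the thesis's two-way, three-component model: the ANOVA (method-of-moments) estimators are closed-form, and the random-effect component can come out negative in small samples, which is the issue both the perturbed estimator and the negative-estimate software option relate to. All numbers below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Balanced one-way random-effects model: y_ij = mu + a_i + e_ij.
a_groups, n_per = 30, 10
sigma_a, sigma_e = 2.0, 1.0                      # true standard deviations
a = rng.normal(0.0, sigma_a, size=a_groups)
y = a[:, None] + rng.normal(0.0, sigma_e, size=(a_groups, n_per))

group_means = y.mean(axis=1)
grand_mean = y.mean()

# Between- and within-group mean squares.
msa = n_per * ((group_means - grand_mean) ** 2).sum() / (a_groups - 1)
mse = ((y - group_means[:, None]) ** 2).sum() / (a_groups * (n_per - 1))

# Non-iterative (ANOVA / method-of-moments) variance component estimators:
# E[MSE] = sigma_e^2 and E[MSA] = sigma_e^2 + n_per * sigma_a^2.
sigma_e_hat = mse
sigma_a_hat = (msa - mse) / n_per                # can be negative!
```

No iteration or likelihood maximization is needed, which is why such estimators remain attractive for very large datasets, at the cost of occasionally inadmissible (negative) estimates.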

    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing

    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as the C_p criterion, information criteria, and the leave-one-out cross-validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and the replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them, in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and the corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed using the variance of the estimates. Further, we demonstrate how to approach the LOOCV error from the information criteria by utilizing the expression provided by GAMP. Comment: 73 pages, 13 figures, accepted in Journal of Physics

    Refined instrumental variable estimation: maximum likelihood optimization of a unified Box–Jenkins model

    For many years, various methods for the identification and estimation of parameters in linear, discrete-time transfer functions have been available and implemented in widely available toolboxes for Matlab™. This paper considers a unified Refined Instrumental Variable (RIV) approach to the estimation of discrete- and continuous-time transfer functions characterized by a unified operator that can be interpreted in terms of backward-shift, derivative or delta operators. The estimation is based on the formulation of a pseudo-linear regression relationship, involving optimal prefilters, that is derived from an appropriately unified Box–Jenkins transfer function model. The paper shows that, contrary to apparently widely held beliefs, the iterative RIV algorithm provides a reliable solution to the maximum likelihood optimization equations for this class of Box–Jenkins transfer function models, so that its en bloc or recursive parameter estimates are optimal in maximum likelihood, prediction error minimization and instrumental variable terms.
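    The core step of each RIV iteration is an instrumental-variable estimate in which the instruments are built from the noise-free output of an auxiliary model. The sketch below illustrates why this removes the bias that coloured noise induces in least squares; for simplicity the auxiliary model uses the true parameters (RIV iterates towards them from an initial estimate) and the optimal prefiltering is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)

# First-order system y_t = a*y_{t-1} + b*u_{t-1} + e_t with coloured
# noise e_t: least squares is then biased, the IV estimate is not.
a_true, b_true, n = 0.7, 1.5, 5000
u = rng.normal(size=n)
w = rng.normal(size=n)
e = w + 0.8 * np.roll(w, 1)                       # MA(1) coloured noise
y = np.zeros(n)
for t in range(1, n):
    y[t] = a_true * y[t - 1] + b_true * u[t - 1] + e[t]

phi = np.column_stack([y[:-1], u[:-1]])           # regressors
target = y[1:]

# Instruments: noise-free auxiliary-model output x, driven by u only,
# so the instruments are uncorrelated with the noise by construction.
x = np.zeros(n)
for t in range(1, n):
    x[t] = a_true * x[t - 1] + b_true * u[t - 1]
z = np.column_stack([x[:-1], u[:-1]])

# Basic IV estimator: theta = (Z' Phi)^{-1} Z' y.
theta_iv = np.linalg.solve(z.T @ phi, z.T @ target)
theta_ols, *_ = np.linalg.lstsq(phi, target, rcond=None)
```

In the full RIV algorithm the auxiliary model is re-estimated and both data and instruments are prefiltered at each iteration, which is what yields the maximum likelihood optimality discussed in the paper.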