
    Estimation and Regularization Techniques for Regression Models with Multidimensional Prediction Functions

    Get PDF
    Boosting is one of the most important methods for fitting regression models and building prediction rules from high-dimensional data. A notable feature of boosting is its built-in mechanism for shrinking coefficient estimates and performing variable selection. This regularization mechanism makes boosting a suitable method for analyzing data characterized by small sample sizes and large numbers of predictors. We extend the existing methodology by developing a boosting method for prediction functions with multiple components. Such multidimensional functions occur in many types of statistical models, for example in count data models and in models involving outcome variables with a mixture distribution. As will be demonstrated, the new algorithm is suitable for both the estimation of the prediction function and the regularization of the estimates. In addition, nuisance parameters can be estimated simultaneously with the prediction function.
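
    As a rough illustration of the idea, the following sketch implements componentwise gradient boosting for a two-dimensional prediction function, here the mean and the log standard deviation of a Gaussian outcome. This is a minimal sketch under assumed choices (simple-linear base learners, a cyclic update over the two components, toy data), not the authors' exact algorithm; stopping after a finite number of iterations is what supplies the built-in shrinkage and variable selection described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: sparse effects on both the mean and the (log) scale.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(scale=np.exp(0.5 * X[:, 1]))

def best_single_var_fit(X, u):
    """Componentwise base learner: simple linear regression of u on
    each single predictor; return the best-fitting column and slope."""
    best = (None, 0.0, np.inf)
    for j in range(X.shape[1]):
        xj = X[:, j]
        beta = xj @ u / (xj @ xj)
        rss = np.sum((u - beta * xj) ** 2)
        if rss < best[2]:
            best = (j, beta, rss)
    return best[0], best[1]

nu, n_iter = 0.1, 300
f_mu = np.zeros(n)           # additive predictor for the mean
f_s = np.zeros(n)            # additive predictor for log(sigma)
coef_mu, coef_s = np.zeros(p), np.zeros(p)

for _ in range(n_iter):
    sigma2 = np.exp(2.0 * f_s)
    # Negative gradients of the Gaussian NLL w.r.t. each component.
    u_mu = (y - f_mu) / sigma2
    u_s = (y - f_mu) ** 2 / sigma2 - 1.0
    # Cycle over the two components, one componentwise update each.
    j, b = best_single_var_fit(X, u_mu)
    f_mu += nu * b * X[:, j]
    coef_mu[j] += nu * b
    j, b = best_single_var_fit(X, u_s)
    f_s += nu * b * X[:, j]
    coef_s[j] += nu * b

# Early stopping induces the shrinkage and variable selection:
# predictors that are never picked keep an exact zero coefficient.
print("mean coefficients: ", np.round(coef_mu, 2))
print("scale coefficients:", np.round(coef_s, 2))
```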

    A Simulation Study to Evaluate Bayesian LASSO's Performance in Zero-Inflated Poisson (ZIP) Models

    Get PDF
    When modelling count data, many applications exhibit an excessive number of zeros. My thesis concentrates on variable selection in zero-inflated Poisson (ZIP) models. This work is motivated by Brown et al. (2015), who accounted for the excess zeros and the site-specific random effects in their data, and used the Bayesian LASSO method for variable selection in their post-fire tree recruitment study in interior Alaska, USA and northern Yukon, Canada. However, that study did not carry out systematic simulation studies to evaluate the Bayesian LASSO's performance under different scenarios. My thesis therefore conducts a series of simulation studies to evaluate the Bayesian LASSO's performance under different settings of three simulation factors: the number of subjects (N), the number of repeated measurements (R), and the true values of the regression coefficients in the ZIP models. For each setting of these factors, the Bayesian LASSO's performance is evaluated using three indicators: sensitivity, specificity, and the exact fit rate. For applied practitioners, my thesis offers a useful example of the circumstances under which the Bayesian LASSO can be expected to perform well in ZIP models. The simulation results show that the Bayesian LASSO's performance is jointly affected by all three simulation factors, and that the method is more reliable when the true coefficients are not close to zero. My thesis also has limitations. Primarily, time constraints made it impossible to consider every factor that could affect the simulation results, and penalty forms other than the L1 penalty are left for future research. Moreover, the current variable selection method handles only fixed effects; variable selection for random effects in ZIP models is a direction for future work.
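
    A minimal sketch of the simulation design described above, assuming a logistic model for the structural zeros and a log-linear Poisson model for the counts. The Bayesian LASSO Gibbs sampler itself is omitted; the hard-coded selected set stands in for its output, and names such as simulate_zip and selection_metrics are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_zip(n_subjects, n_repeats, beta, gamma):
    """Simulate ZIP responses: a logistic model for the structural-zero
    probability and a log-linear Poisson model for the counts."""
    n = n_subjects * n_repeats
    X = rng.normal(size=(n, len(beta)))
    p_zero = 1.0 / (1.0 + np.exp(-(X @ gamma)))  # zero-inflation part
    lam = np.exp(X @ beta)                       # Poisson part
    structural_zero = rng.random(n) < p_zero
    y = np.where(structural_zero, 0, rng.poisson(lam))
    return X, y

def selection_metrics(selected, true_nonzero, p):
    """Sensitivity, specificity and exact fit rate for one replicate."""
    selected, truth = set(selected), set(true_nonzero)
    tp = len(selected & truth)
    tn = len(set(range(p)) - selected - truth)
    sens = tp / len(truth)
    spec = tn / (p - len(truth))
    exact = float(selected == truth)
    return sens, spec, exact

# Example with N = 50 subjects, R = 4 repeated measurements, and true
# coefficients far from zero (the favourable scenario noted above).
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
gamma = np.array([0.8, 0.0, 0.0, 0.0, 0.0])
X, y = simulate_zip(50, 4, beta, gamma)

# A hypothetical selection, standing in for the Gibbs sampler output.
selected = [0, 1]
print(selection_metrics(selected, np.flatnonzero(beta), len(beta)))
```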

    Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models

    Full text link
    Mixtures-of-Experts (MoE) are conditional mixture models that have demonstrated their effectiveness in modeling heterogeneity in data in many statistical learning approaches for prediction, including regression and classification, as well as for clustering. However, their estimation in high-dimensional problems remains challenging. We consider the problem of parameter estimation and feature selection in MoE models with different generalized linear expert models, and propose a regularized maximum likelihood estimation approach that efficiently encourages sparse solutions for heterogeneous data with high-dimensional predictors. The developed proximal-Newton EM algorithm uses proximal Newton-type procedures to update the model parameters by monotonically maximizing the objective function, and allows efficient estimation and feature selection to be performed. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data, compared to the main state-of-the-art competitors.
    Comment: arXiv admin note: text overlap with arXiv:1810.1216
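
    For intuition, here is a minimal EM sketch for a two-expert mixture with lasso-penalized Gaussian experts and an L1-penalized logistic gate. It substitutes off-the-shelf coordinate-descent solvers from scikit-learn for the paper's proximal-Newton updates, so it conveys the structure of the algorithm rather than its exact procedure; the toy data and penalty levels are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(2)

# Toy heterogeneous regression data: two latent groups whose membership
# depends on x[0], each with its own sparse linear model.
n, p = 400, 8
X = rng.normal(size=(n, p))
z = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))  # gating
y = np.where(z, 3.0 * X[:, 1], -3.0 * X[:, 2]) + rng.normal(scale=0.5, size=n)

experts = [Lasso(alpha=0.05), Lasso(alpha=0.05)]
gate = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
resp = rng.random(n)                 # responsibilities for expert 1
sigma = np.ones(2)

for _ in range(30):
    # M-step: weighted lasso fit per expert; the gate is fit by
    # stacking the data twice, weighted by the responsibilities.
    r = np.column_stack([1.0 - resp, resp])
    for k, m in enumerate(experts):
        m.fit(X, y, sample_weight=r[:, k])
        res = y - m.predict(X)
        sigma[k] = np.sqrt(np.average(res**2, weights=r[:, k]))
    gate.fit(np.vstack([X, X]), np.repeat([0, 1], n),
             sample_weight=r.T.ravel())
    # E-step: posterior probability of each expert per observation.
    pi = gate.predict_proba(X)
    dens = np.column_stack([
        norm.pdf(y, loc=m.predict(X), scale=sigma[k])
        for k, m in enumerate(experts)])
    joint = pi * dens + 1e-300       # guard against underflow
    resp = joint[:, 1] / joint.sum(axis=1)

print("expert coefficients (sparse):")
for m in experts:
    print(np.round(m.coef_, 2))
```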

    Bayesian Conditional Auto-Regressive LASSO Models to Learn Sparse Biological Networks with Predictors

    Full text link
    Microbiome data analyses require statistical models that can simultaneously decode microbes' reactions to the environment and interactions among microbes. While a multiresponse linear regression model seems like a straightforward solution, we argue that treating it as a graphical model is flawed: the regression coefficient matrix does not represent an adjacency matrix, so it does not encode the conditional dependence structure between response and predictor nodes. This observation is especially important in biological settings where we have prior knowledge of edges from specific experimental interventions, knowledge that can only be properly encoded under a conditional dependence model. Here, we propose a chain graph model with two sets of nodes (predictors and responses) whose solution yields a graph whose edges do represent conditional dependence and thus agrees with the experimenter's intuition about the average behavior of nodes under treatment. The solution to our model is sparse via the Bayesian LASSO and is also guaranteed to be the sparse solution to a Conditional Auto-Regressive (CAR) model. In addition, we propose an adaptive extension so that different amounts of shrinkage can be applied to different edges, incorporating edge-specific prior knowledge. Our model is computationally inexpensive through an efficient Gibbs sampling algorithm and can account for binary, count, and compositional responses via an appropriate hierarchical structure. Finally, we apply our model to human gut and soil microbial composition datasets.
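
    A small numerical illustration of the central argument, assuming a jointly Gaussian chain graph with made-up precision values: sparsity in the joint precision matrix (the true conditional dependence structure) does not carry over to the induced multiresponse regression coefficient matrix.

```python
import numpy as np

# Joint precision matrix over (x1, x2, y1, y2, y3); zeros in Omega are
# absent edges (conditional independences). Values are illustrative.
Omega = np.array([
    [ 2.0,  0.0, -0.8,  0.0,  0.0],   # x1 -- y1 only
    [ 0.0,  2.0,  0.0, -0.8,  0.0],   # x2 -- y2 only
    [-0.8,  0.0,  2.0, -0.5,  0.0],   # y1 -- y2
    [ 0.0, -0.8, -0.5,  2.0, -0.5],   # y2 -- y3
    [ 0.0,  0.0,  0.0, -0.5,  2.0],
])

O_yx = Omega[2:, :2]   # response-predictor block
O_yy = Omega[2:, 2:]   # response-response block

# Induced multiresponse regression coefficients: B = -O_yy^{-1} O_yx.
B = -np.linalg.solve(O_yy, O_yx)
print(np.round(B, 3))
# B is dense: every predictor gets a nonzero coefficient for every
# response, even though each x is conditionally linked to a single y,
# which is why B cannot be read as an adjacency matrix.
```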

    Inference for feature selection using the Lasso with high-dimensional data

    Full text link
    Penalized regression models such as the Lasso have proved useful for variable selection in many fields, especially in high-dimensional settings where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference for the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso, introducing a procedure that computes inference and p-values for features chosen by the Lasso. The method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute p-values of the expected magnitude on simulated data using a multitude of scenarios that involve various effect strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method provides a powerful way to make inference for feature selection even for small samples and when the number of predictors is several orders of magnitude larger than the number of observations. The algorithm is implemented in the MESS package in R and is freely available.
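
    To convey the flavor of such a test, the sketch below computes randomization p-values for lasso-selected features by permuting the response and refitting. This generic permutation scheme is assumed for illustration; it is not the paper's exact rephrased null hypothesis or the MESS implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)

# High-dimensional toy data: n << p, two truly active predictors.
n, p = 50, 500
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

def lasso_coef(X, y, alpha=0.2):
    return Lasso(alpha=alpha).fit(X, y).coef_

beta_obs = lasso_coef(X, y)
selected = np.flatnonzero(beta_obs)

# Randomization p-value for each selected feature: permute the
# response, refit the Lasso, and compare |beta_j| to its permutation
# distribution. The reference distribution is generated rather than
# assumed, so small samples pose no special difficulty.
n_perm = 200
null_betas = np.empty((n_perm, len(selected)))
for b in range(n_perm):
    null_betas[b] = lasso_coef(X, rng.permutation(y))[selected]

pvals = (1 + np.sum(np.abs(null_betas) >= np.abs(beta_obs[selected]),
                    axis=0)) / (n_perm + 1)
for j, pv in zip(selected, pvals):
    print(f"feature {j}: beta = {beta_obs[j]:+.3f}, p = {pv:.3f}")
```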