Estimation and Regularization Techniques for Regression Models with Multidimensional Prediction Functions
Boosting is one of the most important methods for fitting
regression models and building prediction rules from
high-dimensional data. A notable feature of boosting is its built-in
mechanism for shrinking coefficient estimates and selecting variables.
This regularization mechanism
makes boosting a suitable method for analyzing data characterized by
small sample sizes and large numbers of predictors. We extend the
existing methodology by developing a boosting method for prediction
functions with multiple components. Such multidimensional functions
occur in many types of statistical models, for example in count data
models and in models involving outcome variables with a mixture
distribution. As will be demonstrated, the new algorithm is suitable
for both the estimation of the prediction function and
regularization of the estimates. In addition, nuisance parameters
can be estimated simultaneously with the prediction function.
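The core boosting idea described above can be sketched for a single-component prediction function with component-wise L2 boosting (a minimal illustration under assumed tuning values, not the paper's multidimensional algorithm):

```python
import numpy as np

def componentwise_l2_boost(X, y, n_steps=200, nu=0.1):
    # At each step, fit every predictor to the current residual by
    # simple least squares, then update only the best-fitting
    # coefficient by a shrunken step nu; early stopping provides the
    # built-in regularization and variable selection.
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_steps):
        coefs = X.T @ resid / col_ss                          # univariate LS fits
        sse = np.sum((resid[:, None] - X * coefs) ** 2, axis=0)
        j = np.argmin(sse)                                    # best-fitting component
        beta[j] += nu * coefs[j]                              # shrunken update
        resid -= nu * coefs[j] * X[:, j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=100)
beta = componentwise_l2_boost(X, y)
# only the two informative predictors receive large coefficients
```

Because each step moves a single coefficient by a fraction `nu` of its least-squares fit, stopping early leaves most coefficients exactly at zero, which is the shrinkage-plus-selection behavior the abstract refers to.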
A Simulation Study to Evaluate Bayesian LASSO's Performance in Zero-Inflated Poisson (ZIP) Models
When modelling count data, many applications produce data with excessive zeros. My thesis concentrates on variable selection in zero-inflated Poisson (ZIP) models. This work is motivated by Brown et al. (2015), who accounted for the excessive number of zeros in their data structure and for site-specific random effects, and used the Bayesian LASSO method for variable selection in their post-fire tree recruitment study in interior Alaska, USA and north Yukon, Canada. However, that study did not carry out systematic simulation studies to evaluate the Bayesian LASSO's performance under different scenarios. Therefore, my thesis conducts a series of simulation studies to evaluate the Bayesian LASSO's performance under different settings of several simulation factors.
My thesis considers three simulation factors: the number of subjects (N), the number of repeated measurements (R) and the true values of the regression coefficients in the ZIP models. Under different settings of these three factors, the Bayesian LASSO's performance is evaluated using three indicators: sensitivity, specificity and the exact fit rate. For applied practitioners, my thesis serves as a useful example demonstrating under what circumstances one can expect the Bayesian LASSO to perform well in ZIP models. The simulation results show that the Bayesian LASSO's performance is jointly affected by all three simulation factors, and that this method of variable selection is more reliable when the true coefficients are not close to zero.
My thesis also has some limitations. Primarily, owing to time constraints, it was not possible to consider all the factors that can potentially affect the simulation results, and penalty forms other than the L1 penalty are left for future researchers to explore. Moreover, the current variable selection method covers only fixed effects; variable selection for the mixed effects in ZIP models is a direction for future work.
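The three evaluation indicators named above can be computed from a single selection result as follows (a minimal sketch; the function name and toy values are illustrative, not taken from the thesis):

```python
import numpy as np

def selection_metrics(true_beta, selected):
    # Sensitivity: fraction of truly nonzero coefficients that were selected.
    # Specificity: fraction of truly zero coefficients that were left out.
    # Exact fit: the selected set equals the true support exactly.
    truth = np.asarray(true_beta) != 0
    sel = np.asarray(selected, dtype=bool)
    sensitivity = sel[truth].mean()
    specificity = (~sel[~truth]).mean()
    exact_fit = bool(np.array_equal(sel, truth))
    return sensitivity, specificity, exact_fit

# toy example: two truly nonzero coefficients, one false positive
sens, spec, exact = selection_metrics(
    [1.5, 0.0, -2.0, 0.0, 0.0],
    [True, False, True, True, False],
)
# sens = 1.0, spec = 2/3, exact = False
```

The exact fit rate reported in a simulation study is then simply the average of the `exact_fit` flags over all replications.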
Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models
Mixtures-of-Experts (MoE) are conditional mixture models that have proven
effective for modeling heterogeneity in data in many statistical learning
approaches for prediction, including regression and classification, as
well as for clustering. Their estimation in high-dimensional problems,
however, remains challenging. We consider the problem of parameter estimation and
feature selection in MoE models with different generalized linear experts
models, and propose a regularized maximum likelihood estimation that
efficiently encourages sparse solutions for heterogeneous data with
high-dimensional predictors. The developed proximal-Newton EM algorithm
includes proximal Newton-type procedures that update the model parameters
by monotonically increasing the objective function, enabling efficient
estimation and feature selection. An experimental study shows the good
performance of the algorithms in terms of recovering the actual sparse
solutions, parameter estimation, and clustering of heterogeneous regression
data, compared to the main state-of-the-art competitors.
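The basic building block of such proximal procedures is the soft-thresholding operator, the proximal map of the L1 penalty. A simplified sketch follows (proximal gradient rather than proximal Newton, applied to a plain lasso objective rather than the MoE likelihood; all names and tuning values are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrink toward zero, clip at zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, n_iter=500):
    # Minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1 by alternating
    # a gradient step on the smooth part with a soft-threshold step.
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / Lipschitz constant
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)
beta = proximal_gradient_lasso(X, y, lam=10.0)
# the estimate is sparse: only columns 0 and 2 remain active
```

A proximal Newton method replaces the fixed gradient step with a step scaled by (an approximation of) the Hessian, which is what makes the EM updates in the abstract both monotone and fast.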
Bayesian Conditional Auto-Regressive LASSO Models to Learn Sparse Biological Networks with Predictors
Microbiome data analyses require statistical models that can simultaneously
decode microbes' reaction to the environment and interactions among microbes.
While a multiresponse linear regression model seems like a straightforward
solution, we argue that treating it as a graphical model is flawed given that
the regression coefficient matrix does not encode the conditional dependence
structure between response and predictor nodes because it does not represent
the adjacency matrix. This observation is especially important in biological
settings when we have prior knowledge on the edges from specific experimental
interventions that can only be properly encoded under a conditional dependence
model. Here, we propose a chain graph model with two sets of nodes (predictors
and responses) whose solution yields a graph with edges that indeed represent
conditional dependence and thus, agrees with the experimenter's intuition on
average behavior of nodes under treatment. The solution to our model is sparse
via Bayesian LASSO and is also guaranteed to be the sparse solution to a
Conditional Auto-Regressive (CAR) model. In addition, we propose an adaptive
extension so that different shrinkage can be applied to different edges to
incorporate edge-specific prior knowledge. Our model is computationally
inexpensive through an efficient Gibbs sampling algorithm and can account for
binary, counting and compositional responses via appropriate hierarchical
structure. Finally, we apply our model to human gut and soil microbial
composition datasets.
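The distinction the authors draw can be illustrated on a toy Gaussian chain (a generic illustration, not the paper's CAR model): two variables may be strongly correlated marginally yet conditionally independent, and only the precision (inverse covariance) matrix, not a regression coefficient matrix, reveals this conditional dependence structure.

```python
import numpy as np

# Chain x -> y -> z: x and z are marginally correlated, but
# conditionally independent given y, so their precision entry is ~0.
rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
data = np.column_stack([x, y, z])

marginal_xz = np.corrcoef(x, z)[0, 1]              # clearly nonzero (~0.45)
omega = np.linalg.inv(np.cov(data, rowvar=False))  # precision matrix
conditional_xz = omega[0, 2]                       # approximately zero
```

In graphical-model terms, the zeros of `omega` are the missing edges of the conditional dependence graph, which is exactly the object the chain graph model in the abstract is designed to recover.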
Inference for feature selection using the Lasso with high-dimensional data
Penalized regression models such as the Lasso have proved useful for variable
selection in many fields - especially for situations with high-dimensional data
where the number of predictors far exceeds the number of observations. These
methods identify and rank variables of importance but do not generally provide
any inference of the selected variables. Thus, the variables selected might be
the "most important" but need not be significant. We propose a significance
test for the selection found by the Lasso: a procedure that computes
p-values for the features it chooses. This method
rephrases the null hypothesis and uses a randomization approach which ensures
that the error rate is controlled even for small samples. We demonstrate the
ability of the algorithm to compute p-values of the expected magnitude with
simulated data using a multitude of scenarios that involve various effect
strengths and correlations between predictors. The algorithm is also applied to
a prostate cancer dataset that has been analyzed in recent papers on the
subject. The proposed method is found to provide a powerful way to make
inference for feature selection even for small samples and when the number of
predictors is several orders of magnitude larger than the number of
observations. The algorithm is implemented in the MESS package in R and is
freely available.
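The randomization idea can be sketched generically as follows (this is not the MESS implementation; the function name, test statistic, and toy data are illustrative): permuting the response calibrates a selection-aware statistic, here the maximum absolute predictor-response correlation, so the error rate is controlled even in small samples.

```python
import numpy as np

def permutation_pvalue(X, y, n_perm=500, seed=0):
    # Test statistic: the largest absolute correlation between y and any
    # predictor, which accounts for selecting the best of p predictors.
    def max_abs_corr(yy):
        yc = yy - yy.mean()
        Xc = X - X.mean(axis=0)
        r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        return np.abs(r).max()

    rng = np.random.default_rng(seed)
    observed = max_abs_corr(y)
    null = [max_abs_corr(rng.permutation(y)) for _ in range(n_perm)]
    # add-one correction keeps the p-value valid (never exactly zero)
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=50)
p_signal = permutation_pvalue(X, y)   # small: a real association is present
```

Because the null distribution is built by permutation rather than asymptotic theory, the calibration holds at any sample size, which mirrors the small-sample guarantee claimed in the abstract.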
- …