Bayesian Adaptive Selection of Variables for Function-on-Scalar Regression Models
In the field of functional data analysis, we developed a new Bayesian
method for variable selection in function-on-scalar regression (FOSR).
Our approach uses latent variables to enable adaptive selection: it
determines both how many variables and which ones should be selected for
a function-on-scalar regression model. Simulation studies demonstrate the
proposed method's main properties, such as its accuracy in estimating the
coefficients and its high capacity to select variables correctly.
Furthermore, we conducted comparative studies with the main competing
methods, namely the BGLSS method as well as the group LASSO, the group
MCP, and the group SCAD. For real data applications, we used a COVID-19
dataset and socioeconomic data from Brazil. In short, the proposed
Bayesian variable selection model is extremely competitive, showing
significant predictive and selective quality.
Bayesian group Lasso regression for left-censored data
In this paper, a new approach for model selection in left-censored regression is presented. Specifically, we propose a new Bayesian group Lasso for variable selection and coefficient estimation with left-censored data (BGLRLC). A new hierarchical Bayesian formulation of the group Lasso is introduced, which motivates a new Gibbs sampler for drawing the parameters from their posteriors. The performance of the proposed approach is examined through simulation studies and a real data analysis. Results show that the proposed approach performs well in comparison to other existing methods.
Semi-parametric Bayesian variable selection for gene-environment interactions
Many complex diseases are known to be affected by the interactions between
genetic variants and environmental exposures beyond the main genetic and
environmental effects. Study of gene-environment (GE) interactions is
important for elucidating the disease etiology. Existing Bayesian methods for
GE interaction studies are challenged by the high-dimensional nature of
the study and the complexity of environmental influences. Many studies have
shown the advantages of penalization methods in detecting GE
interactions in "large p, small n" settings. However, Bayesian variable
selection, which can provide fresh insight into GE studies, has not been
widely examined. We propose a novel and powerful semi-parametric Bayesian
variable selection model that can investigate linear and nonlinear GE
interactions simultaneously. Furthermore, the proposed method can conduct
structural identification by distinguishing nonlinear interactions from
the main-effects-only case within the Bayesian framework. Spike and slab priors are
incorporated on both individual and group levels to identify the sparse main
and interaction effects. The proposed method conducts Bayesian variable
selection more efficiently than existing methods. Simulation shows that the
proposed model outperforms competing alternatives in terms of both
identification and prediction. The proposed Bayesian method leads to the
identification of main and interaction effects with important implications in a
high-throughput profiling study with high-dimensional SNP data.
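The spike-and-slab mechanism described above can be illustrated with a minimal Gibbs sampler for individual-level selection in a linear model. This is only a sketch under assumed settings: the simulated data, the slab variance `tau2`, the inclusion prior `pi`, and the function name are illustrative choices, not the paper's actual semi-parametric hierarchical model.

```python
import numpy as np

def spike_slab_gibbs(X, y, n_iter=1000, sigma2=1.0, tau2=1.0, pi=0.5, seed=0):
    """Gibbs sampler for a point-mass spike-and-slab prior:
    beta_j ~ gamma_j * N(0, tau2) + (1 - gamma_j) * delta_0."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(p)
    incl = np.zeros(p)
    xtx = (X ** 2).sum(axis=0)
    for it in range(n_iter):
        for j in range(p):
            # partial residual with coefficient j removed
            r = y - X @ beta + X[:, j] * beta[j]
            v = xtx[j] / sigma2 + 1.0 / tau2      # posterior precision
            m = (X[:, j] @ r) / (sigma2 * v)      # posterior mean
            # log Bayes factor for gamma_j = 1 vs gamma_j = 0
            log_bf = 0.5 * (np.log(1.0 / (tau2 * v)) + m * m * v)
            z = np.log(pi / (1.0 - pi)) + log_bf
            prob = 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))
            if rng.random() < prob:
                gamma[j] = 1.0
                beta[j] = m + rng.standard_normal() / np.sqrt(v)
            else:
                gamma[j] = 0.0
                beta[j] = 0.0
        if it >= n_iter // 2:                     # discard burn-in half
            incl += gamma
    return incl / (n_iter - n_iter // 2)          # posterior inclusion probs

# illustrative simulated data: two true signals among six predictors
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(200)
pip = spike_slab_gibbs(X, y)  # high for the two true signals, low otherwise
```

The posterior inclusion probabilities separate the two active predictors from the four noise predictors; the group-level extension in the paper places a second indicator on whole blocks of coefficients.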
Highly efficient Bayesian joint inversion for receiver-based data and its application to lithospheric structure beneath the southern Korean Peninsula
With the deployment of extensive seismic arrays, systematic and efficient parameter and uncertainty estimation is of increasing importance and can provide reliable, regional models for crustal and upper-mantle structure. We present an efficient Bayesian method for the joint inversion of surface-wave dispersion and receiver-function data that combines trans-dimensional (trans-D) model selection in an optimization phase with subsequent rigorous parameter uncertainty estimation. Parameter and uncertainty estimation depend strongly on the chosen parametrization, such that meaningful regional comparison requires quantitative model selection that can be carried out efficiently at several sites. While significant progress has been made for model selection (e.g. trans-D inference) at individual sites, the lack of efficiency can prohibit application to large data volumes or cause questionable results due to lack of convergence. Studies that address large numbers of data sets have mostly ignored model selection in favour of more efficient or simpler estimation techniques (i.e. focusing on uncertainty estimation but employing ad hoc model choices). Our approach consists of a two-phase inversion that combines trans-D optimization to select the most probable parametrization with subsequent Bayesian sampling for uncertainty estimation given that parametrization. The trans-D optimization is implemented here by replacing the likelihood function with the Bayesian information criterion (BIC). The BIC provides constraints on model complexity that facilitate the search for an optimal parametrization. Parallel tempering (PT) is applied as an optimization algorithm. After optimization, the optimal model choice is identified by the minimum BIC value from all PT chains. Uncertainty estimation is then carried out in fixed dimension. Data errors are estimated as part of the inference problem by a combination of empirical and hierarchical estimation.
Data covariance matrices are estimated from data residuals (the difference between prediction and observation) and periodically updated. In addition, a scaling factor for the covariance matrix magnitude is estimated as part of the inversion. The inversion is applied to both simulated and observed data that consist of phase- and group-velocity dispersion curves (Rayleigh wave) and receiver functions. The simulation results show that model complexity and important features are well estimated by the fixed-dimensional posterior probability density. Observed data for stations in different tectonic regions of the southern Korean Peninsula are considered. The results are consistent with published results, but important features are better constrained than in previous regularized inversions and are more consistent across the stations. For example, resolution of crustal and Moho interfaces, and absolute values and gradients of velocities in the lower crust and upper mantle, are better constrained.
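The role of the BIC as a stand-in for the likelihood in the optimization phase can be sketched with a toy model-selection problem. The polynomial-regression setting below is purely an illustrative assumption (it has nothing to do with seismic data); it shows how BIC = k ln(n) - 2 ln(L_hat) penalizes superfluous parameters while heavily punishing underfitting.

```python
import numpy as np

def gaussian_bic(y, y_hat, k):
    """BIC = k*ln(n) - 2*ln(L_hat) for a Gaussian model whose error
    variance is profiled out at its MLE; k counts the fitted parameters."""
    n = len(y)
    sigma2_mle = np.mean((y - y_hat) ** 2)
    log_lik = -0.5 * n * (np.log(2.0 * np.pi * sigma2_mle) + 1.0)
    return k * np.log(n) - 2.0 * log_lik

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 1000)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.standard_normal(1000)  # true degree 2

bic = {}
for degree in range(1, 5):
    coef = np.polyfit(x, y, degree)
    bic[degree] = gaussian_bic(y, np.polyval(coef, x), degree + 1)
# the underfit line pays a huge likelihood cost; extra cubic/quartic
# terms buy almost no likelihood but pay k*ln(n) in penalty
```

In the paper's scheme, a trans-D optimizer driven by this criterion (rather than the raw likelihood) searches over parametrizations of different dimension.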
Model selection techniques for sparse weight-based principal component analysis
Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so-called multiblock data (e.g., multiomics studies). A popular analysis framework is sparse principal component analysis (PCA) of the concatenated data. The sparseness in the component weights of these models is usually induced by penalties. A crucial factor in the use of such penalized methods is a proper tuning of the regularization parameters used to give more or less weight to the penalties. In this paper, we examine several model selection procedures to tune these regularization parameters for sparse PCA. The model selection procedures include cross-validation, the Bayesian information criterion (BIC), the index of sparseness, and the convex hull procedure. Furthermore, to account for the multiblock structure, we present a sparse PCA algorithm with a group least absolute shrinkage and selection operator (LASSO) penalty added, in order to either select or cancel out blocks of data in an automated way. The tuning of the group LASSO parameter is also studied for the proposed model selection procedures. We conclude that when the component weights are to be interpreted, cross-validation with the one-standard-error rule is preferred; alternatively, if the interest lies in obtaining component scores using a very limited set of variables, the convex hull, BIC, and index of sparseness are all suitable.
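The block-wise "select or cancel out" behaviour of a group LASSO penalty comes from its proximal operator, which either zeroes an entire block of component weights or shrinks it radially. A minimal sketch (the block values and penalty level are made-up illustrations, not part of the paper's algorithm):

```python
import numpy as np

def group_soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_2: zero the whole block if its
    Euclidean norm is below lam, otherwise shrink it toward the origin."""
    norm = np.linalg.norm(w)
    if norm <= lam:
        return np.zeros_like(w)
    return (1.0 - lam / norm) * w

weak_block = group_soft_threshold(np.array([0.1, -0.2]), lam=0.5)   # -> all zeros
strong_block = group_soft_threshold(np.array([3.0, 4.0]), lam=0.5)  # shrunk, kept
```

Applied block-by-block inside an alternating PCA update, this operator is what removes an entire data block from the solution once its weight norm falls below the regularization level, which is why tuning that level matters so much.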
High-dimensional variable selection for genomics data, from both frequentist and Bayesian perspectives
Doctor of Philosophy, Department of Statistics, Cen Wu
Variable selection is one of the most popular tools for analyzing high-dimensional genomic data. It has been developed to accommodate complex data structures and lead to structured sparse identification of important genomics features. We focus on the network and interaction structures that commonly exist in genomic data, and develop novel variable selection methods from both frequentist and Bayesian perspectives.
Network-based regularization has achieved success in variable selections for high-dimensional cancer genomic data, due to its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions, and are contaminated by outliers, network-constrained regularization that does not take the robustness into account leads to false identifications of network structure and biased estimation of patients’ survival. In the first project, we develop a novel robust network-based variable selection method under the accelerated failure time (AFT) model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Promising findings are made in two case studies of lung cancer datasets with high dimensional gene expression measurements.
Gene-environment (G×E) interactions are important for the elucidation of disease etiology beyond the main genetic and environmental effects. In the second project, a novel and powerful semi-parametric Bayesian variable selection model has been proposed to investigate linear and nonlinear G×E interactions simultaneously. It can further conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. The proposed method conducts Bayesian variable selection more efficiently and accurately than alternatives. Simulation shows that the proposed model outperforms competing alternatives in terms of both identification and prediction. In the case study, the proposed Bayesian method leads to the identification of effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
In the last project, a robust Bayesian variable selection method has been developed for G×E interaction studies. The proposed robust Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. Spike and slab priors are incorporated on both individual and group levels to identify the sparse main and interaction effects. Extensive simulation studies and analysis of both the diabetes data with SNP measurements from the Nurses’ Health Study and TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.
To facilitate reproducible research and fast computation, we have developed open-source R packages for each project, which provide highly efficient C++ implementations of all the proposed and alternative approaches. The R packages regnet and spinBayes, associated with the first and second projects respectively, are available on CRAN. For the third project, the R package robin is available from GitHub and will be submitted to CRAN soon.
Statistical methods in detecting differential expressed genes, analyzing insertion tolerance for genes and group selection for survival data
The thesis is composed of three independent projects: (i) analyzing transposon-sequencing data to infer the functions of genes on bacteria growth (chapter 2), (ii) developing a semi-parametric Bayesian method for differential gene expression analysis with RNA-sequencing data (chapter 3), and (iii) solving the group selection problem for survival data (chapter 4). All projects
are motivated by statistical challenges raised in biological research.
The first project is motivated by the need to develop statistical models to accommodate transposon insertion sequencing (Tn-Seq) data, which consist of sequence reads around each transposon insertion site.
The detection of transposon insertion at a given site indicates that the disruption of genomic sequence at this site does not cause essential function loss and the bacteria can still grow.
Hence, such measurements have been used to infer the functions of each gene on bacteria growth. We propose a zero-inflated Poisson regression method for analyzing the Tn-Seq count data, and derive an Expectation-Maximization (EM) algorithm to obtain parameter estimates. We also propose a multiple testing procedure that categorizes genes into one of three states (hypo-tolerant, tolerant, and hyper-tolerant) while controlling the false discovery rate. Simulation studies show our method provides good
estimation of model parameters and inference on gene functions.
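The EM updates for a zero-inflated Poisson have a simple closed form: the E-step computes each zero count's probability of being a structural zero, and the M-step re-estimates the mixing proportion and Poisson mean from those weights. A minimal sketch on simulated data follows; the parameter values are illustrative, and this omits the regression covariates and the FDR-controlled three-state testing of the actual method.

```python
import numpy as np

def zip_em(y, n_iter=200):
    """EM for a zero-inflated Poisson mixture: with probability p an
    observation is a structural zero, otherwise it is Poisson(lam)."""
    p, lam = 0.5, max(y.mean(), 0.1)  # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each observed zero is structural
        z = np.where(y == 0, p / (p + (1.0 - p) * np.exp(-lam)), 0.0)
        # M-step: closed-form updates of the mixture weight and Poisson mean
        p = z.mean()
        lam = ((1.0 - z) * y).sum() / (1.0 - z).sum()
    return p, lam

# illustrative simulation: 30% structural zeros, Poisson mean 4
rng = np.random.default_rng(0)
n = 5000
structural = rng.random(n) < 0.3
y = np.where(structural, 0, rng.poisson(4.0, n))
p_hat, lam_hat = zip_em(y)  # recovers roughly (0.3, 4.0)
```

The mixture separates well here because a Poisson(4) rarely produces zeros; identifiability degrades as the Poisson mean shrinks toward zero.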
In the second project, we model the count data from RNA-sequencing experiments for each gene using a Poisson-Gamma hierarchical model, or equivalently, a negative binomial (NB) model. We derive a fully semi-parametric Bayesian approach with a Dirichlet process as the prior for the fold changes between two treatment means. An inference strategy using a Gibbs sampling algorithm is developed for differential expression analysis. We evaluate our method with several simulation studies, and the results demonstrate that our method outperforms other methods, including popularly applied ones such as edgeR and DESeq.
In the third project, we develop a new semi-parametric Bayesian method to address the group variable selection problem and study the dependence of survival outcomes on the grouped predictors using the Cox proportional hazards model. We use indicators for groups to induce sparseness and obtain the posterior inclusion probability for each group. Bayes factors are used to evaluate whether the groups should be selected or not. We compare our method with a frequentist method (HPCox) in several simulation studies and show that our method performs better than the HPCox method.
In summary, this dissertation tackles several statistical problems raised in biological research, including high-dimensional genomic data analysis and survival analysis. All proposed methods are evaluated with simulation studies and show satisfactory performance. We also apply the proposed methods to real data analysis.
EVALUATION OF BAYESIAN SPATIAL CONDITIONAL AUTOREGRESSIVE MODELS FOR MODELING CORONA VIRUS DISEASE (COVID-19) DEATHS IN INDONESIA
The first Covid-19 case in Indonesia occurred on 2 March 2020. By 30 September 2022, Indonesia had recorded 158,173 Covid-19 deaths. Several studies have modelled Covid-19 cases; however, research modelling the number of Covid-19 deaths using the Bayesian Spatial Conditional Autoregressive (CAR) model is still rare. The Bayesian spatial CAR model has high flexibility in relative risk (RR) modelling: CAR models can include various types of spatial effects as well as covariates. RR represents the ratio of the risk of the outcome (Covid-19) in the exposed group to that in the population average (the unexposed group). This study aims to evaluate the BYM, Leroux, and Localised models with five hyperpriors, to obtain the best model for estimating the RR of Covid-19 deaths in Indonesia, and to create RR maps. This study used aggregate data on Covid-19 deaths (2 March 2020 - 30 September 2022). Data on the total population and population density of each province in 2021 were also used. The best model is selected based on the lowest Watanabe-Akaike Information Criterion (WAIC) and Deviance Information Criterion (DIC) values and on the Modified Moran's I (MMI) of the residuals. The results showed that the CAR BYM model with covariates and with an Inverse-Gamma IG(0.5; 0.0005) prior distribution had the lowest DIC and WAIC. However, as the BYM model does not converge, it cannot be used to determine the RR of Covid-19 deaths in Indonesia. Of the other three models that converge, the Bayesian CAR Leroux model without covariates and with IG(0.5; 0.0005) has the lowest DIC (393.76) and WAIC (400.12), and its MMI value (-0.26) is close to zero. Therefore, the Bayesian CAR Leroux model without covariates and with IG(0.5; 0.0005) is preferred. The provinces with the highest RR (2.76) and the lowest RR (0.22) are Yogyakarta and Papua, respectively.
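The DIC used for model comparison above is computed from posterior draws as DIC = Dbar + pD, where Dbar is the posterior mean deviance and pD = Dbar - D(theta_bar) is the effective number of parameters. A minimal sketch for a one-parameter Gaussian mean, as an illustrative stand-in for the CAR models (not the study's actual computation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 1.0
y = rng.normal(2.0, sigma, n)

# posterior of the mean under a flat prior: N(ybar, sigma^2 / n)
draws = rng.normal(y.mean(), sigma / np.sqrt(n), 4000)

def deviance(theta):
    """-2 * log-likelihood of the N(theta, sigma^2) model for y."""
    return n * np.log(2.0 * np.pi * sigma ** 2) + ((y - theta) ** 2).sum() / sigma ** 2

d_samples = np.array([deviance(t) for t in draws])
d_bar = d_samples.mean()                # posterior mean deviance
p_d = d_bar - deviance(draws.mean())    # effective parameters (~1 here)
dic = d_bar + p_d                       # lower DIC = preferred model
```

With one free parameter, pD comes out close to 1; for spatial CAR models pD reflects how much the random effects are effectively constrained by the prior.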
BAYESIAN MODELING OF CENSORED DATA WITH APPLICATION TO META-ANALYSIS OF IMMUNOTHERAPY TRIALS
My dissertation builds on a systematic review of 125 clinical trials reporting on treatment-related adverse events (AEs) associated with PD-1/PD-L1 inhibitors published from 2010 to 2018. The motivating dataset contained the following study-level components extracted from each publication: trial name, number of treated patients, selected immunotherapy drug, dosing schedule, cancer type, number of AEs within each category, and the pre-specified criteria for AE reporting. The number of AEs were reported based upon all-grade (Grade 1-5) and Grade 3 or higher (Grade 3-5) severity. My overall objective was to increase our understanding of the toxicity profiles of the five most common cancer immunotherapy drugs, and to evaluate AE incidence across subgroups in a meta-analysis setting. However, for assessing drug safety in clinical trials, a common challenge is that many published clinical studies do not report rare AEs. In particular, if the number of AEs observed is lower than a pre-specified cutoff value, these events may not always be reported in the publication (i.e., they are censored). My doctoral dissertation research, thus, proposes an innovative statistical methodology for effectively handling censored rare AEs in the context of meta-analysis of immunotherapy trials. First, by deriving exact inference and robust estimates for the missing not at random data, we proposed a Bayesian multilevel regression model in the coarsened data framework to accommodate censored rare event data. We also demonstrated that if the censored information was ignored, the incidence probability of AEs would be overestimated. Second, to select the best Bayesian censored data model among a set of candidate models in the presence of complicated or high-dimensional features, we proposed an alternative strategy to implement Bayesian model selection for censored data analysis in Just Another Gibbs Sampling (JAGS).
To generate deviance samples from a Bayesian model using JAGS, if censoring occurs, an existing function incorrectly calculates the value of the deviance function because of the “wrong focus”, i.e., an incorrect likelihood computed on the basis of the model specification in JAGS. Therefore, we proposed a strategy to calculate the true value of the deviance function in JAGS alongside the sampling. The alternative strategy can be generalized to model other types of data and be applied to many other disciplines. Third, we developed a sparse Bayesian selection model with prior specifications for the meta-analysis of censored rare AEs, to perform selection of pairwise interactions between various study-level factors. Because the toxicity profiles of immunotherapy drugs may not be explained comprehensively by the main effects of study-level factors, we identified the high-risk group by considering two-way interactions that impact the outcome of interest. Through simulation studies, we demonstrated that the proposed interaction selection method outperforms others in prediction accuracy and interaction identification in the presence of missing outcome data. Lastly, we also applied the proposed method to our real-world motivating dataset. In sum, my dissertation work makes significant and innovative contributions to the fields of applied statistics and cancer research.
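The core censoring idea, that a study reporting only "fewer than c events" contributes P(Y < c) to the likelihood rather than a point probability, can be sketched with a simple one-parameter binomial model. The trial counts and cutoff below are made-up, and this is far simpler than the multilevel coarsened-data model actually proposed; it does, however, reproduce the overestimation effect described above.

```python
import math

def loglik(theta, trials):
    """Binomial log-likelihood; each trial is (n, y, cutoff), with
    cutoff=None when the AE count y is reported exactly, and otherwise
    only y < cutoff is known (y is then unused)."""
    ll = 0.0
    for n, y, cutoff in trials:
        if cutoff is None:  # fully observed count: log pmf
            ll += (math.log(math.comb(n, y)) + y * math.log(theta)
                   + (n - y) * math.log(1.0 - theta))
        else:               # censored: add log P(Y < cutoff)
            ll += math.log(sum(math.comb(n, k) * theta ** k
                               * (1.0 - theta) ** (n - k)
                               for k in range(cutoff)))
    return ll

observed = [(100, 10, None), (100, 12, None)]
censored = [(100, None, 3), (100, None, 3)]  # each reported only "< 3 events"

grid = [i / 1000.0 for i in range(1, 500)]
mle_full = max(grid, key=lambda t: loglik(t, observed + censored))
mle_drop = max(grid, key=lambda t: loglik(t, observed))  # drops censored studies
# dropping the censored (low-count) studies overestimates the AE incidence
```

Because the censored studies are precisely the ones with few events, discarding them biases the incidence estimate upward, which is the phenomenon the coarsened-data model corrects.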