
    Bayesian Adaptive Selection of Variables for Function-on-Scalar Regression Models

    Within the field of functional data analysis, we develop a new Bayesian method for variable selection in function-on-scalar regression (FOSR). Our approach uses latent variables, making the selection adaptive: it determines both how many covariates and which ones should enter the FOSR model. Simulation studies demonstrate the proposed method's main properties, such as its accuracy in estimating the coefficients and its high capacity to select variables correctly. Furthermore, we conducted comparative studies against the main competing methods, namely BGLSS, the group LASSO, the group MCP, and the group SCAD, and applied the method to a COVID-19 dataset and socioeconomic data from Brazil. In short, the proposed Bayesian variable selection model is highly competitive, showing strong predictive and selective performance.
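
    The abstract does not give the prior specification; a minimal latent-indicator sketch of the idea, under assumed notation (the binary inclusion indicators \gamma_j are not taken from the paper), could be written as

        y_i(t) = \sum_{j=1}^{p} \gamma_j \, \beta_j(t) \, x_{ij} + \varepsilon_i(t), \qquad \gamma_j \sim \mathrm{Bernoulli}(\theta), \quad j = 1, \dots, p,

    so that the posterior distribution of (\gamma_1, \dots, \gamma_p) determines both how many and which scalar covariates enter the model.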

    Bayesian group Lasso regression for left-censored data

    In this paper, a new approach to model selection in left-censored regression is presented. Specifically, we propose a new Bayesian group Lasso for variable selection and coefficient estimation with left-censored data (BGLRLC). A new hierarchical Bayesian formulation of the group Lasso is introduced, which motivates a new Gibbs sampler for drawing the parameters from their posteriors. The performance of the proposed approach is examined through simulation studies and a real data analysis. Results show that the proposed approach performs well in comparison with existing methods.
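
    A standard Bayesian group-Lasso hierarchy with Tobit-style left censoring, given here only as an illustrative sketch (the paper's exact BGLRLC specification may differ), is

        y_i = \max(c, \, y_i^*), \qquad y_i^* = x_i^{\top}\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2),

        \beta_g \mid \tau_g^2 \sim N_{m_g}(0, \, \sigma^2 \tau_g^2 I_{m_g}), \qquad \tau_g^2 \sim \mathrm{Gamma}\!\left(\tfrac{m_g + 1}{2}, \, \tfrac{\lambda^2}{2}\right),

    where c is the detection limit and m_g the size of group g. Marginalizing over \tau_g^2 yields the group-Lasso prior, and the latent y_i^* for censored observations can be imputed within the Gibbs sampler.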

    Semi-parametric Bayesian variable selection for gene-environment interactions

    Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. The study of gene-environment (G×E) interactions is important for elucidating disease etiology. Existing Bayesian methods for G×E interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. Many studies have shown the advantages of penalization methods in detecting G×E interactions in "large p, small n" settings. However, Bayesian variable selection, which can provide fresh insight into G×E studies, has not been widely examined. We propose a novel and powerful semi-parametric Bayesian variable selection model that can investigate linear and nonlinear G×E interactions simultaneously. Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both the individual and group levels to identify the sparse main and interaction effects. The proposed method conducts Bayesian variable selection more efficiently than existing methods. Simulations show that the proposed model outperforms competing alternatives in terms of both identification and prediction. The proposed Bayesian method leads to the identification of main and interaction effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
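
    The bi-level spike-and-slab idea can be sketched as follows, with hypothetical notation not taken from the paper; a point mass \delta_0 at zero serves as the spike at both levels:

        \beta_g \sim (1 - \pi_1)\,\delta_0 + \pi_1 \, N_{m_g}(0, \, \sigma^2 \Sigma_g) \quad \text{(group level)},

        \beta_{gj} \sim (1 - \pi_2)\,\delta_0 + \pi_2 \, N(0, \, \sigma^2) \quad \text{(individual level)},

    so that a whole group of interaction terms can be dropped at once, or only some coefficients within a retained group.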

    Highly efficient Bayesian joint inversion for receiver-based data and its application to lithospheric structure beneath the southern Korean Peninsula

    With the deployment of extensive seismic arrays, systematic and efficient parameter and uncertainty estimation is of increasing importance and can provide reliable, regional models for crustal and upper-mantle structure. We present an efficient Bayesian method for the joint inversion of surface-wave dispersion and receiver-function data that combines trans-dimensional (trans-D) model selection in an optimization phase with subsequent rigorous parameter uncertainty estimation. Parameter and uncertainty estimation depend strongly on the chosen parametrization, such that meaningful regional comparison requires quantitative model selection that can be carried out efficiently at several sites. While significant progress has been made for model selection (e.g. trans-D inference) at individual sites, the lack of efficiency can prohibit application to large data volumes or cause questionable results due to lack of convergence. Studies that address large numbers of data sets have mostly ignored model selection in favour of more efficient or simpler estimation techniques (i.e. focusing on uncertainty estimation but employing ad hoc model choices). Our approach consists of a two-phase inversion that combines trans-D optimization to select the most probable parametrization with subsequent Bayesian sampling for uncertainty estimation given that parametrization. The trans-D optimization is implemented here by replacing the likelihood function with the Bayesian information criterion (BIC). The BIC provides constraints on model complexity that facilitate the search for an optimal parametrization. Parallel tempering (PT) is applied as the optimization algorithm. After optimization, the optimal model choice is identified by the minimum BIC value across all PT chains. Uncertainty estimation is then carried out in fixed dimension. Data errors are estimated as part of the inference problem by a combination of empirical and hierarchical estimation: data covariance matrices are estimated from data residuals (the difference between prediction and observation) and periodically updated, and a scaling factor for the covariance-matrix magnitude is estimated as part of the inversion. The inversion is applied to both simulated and observed data consisting of Rayleigh-wave phase- and group-velocity dispersion curves and receiver functions. The simulation results show that model complexity and important features are well estimated by the fixed-dimensional posterior probability density. Observed data for stations in different tectonic regions of the southern Korean Peninsula are considered. The results are consistent with published results, but important features are better constrained than in previous regularized inversions and are more consistent across stations. For example, the resolution of crustal and Moho interfaces, and the absolute values and gradients of velocities in the lower crust and upper mantle, are better constrained.
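
    The BIC mentioned above takes its standard form; writing it out makes the trade-off explicit:

        \mathrm{BIC} = k \ln N - 2 \ln \hat{L},

    where k is the number of model parameters, N the number of data points, and \hat{L} the maximized likelihood. Minimizing the BIC across parallel-tempering chains therefore rewards data fit while penalizing model complexity, which is what makes it a usable surrogate for the likelihood in the trans-D optimization phase.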

    Model selection techniques for sparse weight-based principal component analysis

    Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so-called multiblock data (e.g., multiomics studies). A popular analysis framework is sparse principal component analysis (PCA) of the concatenated data. The sparseness in the component weights of these models is usually induced by penalties. A crucial factor in the use of such penalized methods is proper tuning of the regularization parameters that give more or less weight to the penalties. In this paper, we examine several model selection procedures to tune these regularization parameters for sparse PCA. The model selection procedures include cross-validation, the Bayesian information criterion (BIC), the index of sparseness, and the convex hull procedure. Furthermore, to account for the multiblock structure, we present a sparse PCA algorithm with a group least absolute shrinkage and selection operator (LASSO) penalty added to it, to either select or cancel out blocks of data in an automated way. The tuning of the group LASSO parameter is also studied for the proposed model selection procedures. We conclude that when the component weights are to be interpreted, cross-validation with the one standard error rule is preferred; alternatively, if the interest lies in obtaining component scores using a very limited set of variables, the convex hull, BIC, and index of sparseness are all suitable.
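
    As a concrete illustration of the one standard error rule recommended above, the following sketch (hypothetical function and variable names, not code from the paper) picks the most regularized solution whose cross-validation error is within one standard error of the minimum:

        import numpy as np

        def one_se_rule(lambdas, cv_errors, cv_se):
            # One-standard-error rule: among all penalty values whose CV error
            # falls within one SE of the minimum, return the largest penalty
            # (assuming a larger lambda yields sparser component weights).
            lambdas = np.asarray(lambdas)
            cv_errors = np.asarray(cv_errors)
            cv_se = np.asarray(cv_se)
            best = np.argmin(cv_errors)
            threshold = cv_errors[best] + cv_se[best]
            eligible = cv_errors <= threshold
            return lambdas[eligible].max()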

    High-dimensional variable selection for genomics data, from both frequentist and Bayesian perspectives

    Doctor of Philosophy, Department of Statistics, Cen Wu

    Variable selection is one of the most popular tools for analyzing high-dimensional genomic data. It has been developed to accommodate complex data structures and to lead to structured sparse identification of important genomic features. We focus on the network and interaction structures that commonly exist in genomic data, and develop novel variable selection methods from both frequentist and Bayesian perspectives. Network-based regularization has achieved success in variable selection for high-dimensional cancer genomic data due to its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions and are contaminated by outliers, network-constrained regularization that does not take robustness into account leads to false identification of network structure and biased estimation of patients’ survival. In the first project, we develop a novel robust network-based variable selection method under the accelerated failure time (AFT) model. Extensive simulation studies show the advantage of the proposed method over alternative methods, and promising findings are made in two case studies of lung cancer datasets with high-dimensional gene expression measurements. Gene-environment (G×E) interactions are important for the elucidation of disease etiology beyond the main genetic and environmental effects. In the second project, a novel and powerful semi-parametric Bayesian variable selection model is proposed to investigate linear and nonlinear G×E interactions simultaneously. It can further conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. The proposed method conducts Bayesian variable selection more efficiently and accurately than alternatives; simulations show that it outperforms competing alternatives in terms of both identification and prediction. In the case study, the proposed Bayesian method leads to the identification of effects with important implications in a high-throughput profiling study with high-dimensional SNP data. In the last project, a robust Bayesian variable selection method is developed for G×E interaction studies. It can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. Spike-and-slab priors are incorporated on both the individual and group levels to identify the sparse main and interaction effects. Extensive simulation studies and analyses of both the diabetes data with SNP measurements from the Nurses’ Health Study and the TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives. To facilitate reproducible research and fast computation, we have developed open-source R packages for each project, providing highly efficient C++ implementations of all the proposed and alternative approaches. The R packages regnet and spinBayes, associated with the first and second projects respectively, are available on CRAN. For the third project, the R package robin is available from GitHub and will be submitted to CRAN soon.
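
    For the first project, a common robust formulation of the penalized AFT model (an assumption on my part; the abstract does not state the loss) replaces least squares with a least-absolute-deviation criterion,

        \log T_i = x_i^{\top}\beta + \varepsilon_i, \qquad \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} w_i \left| \log t_i - x_i^{\top}\beta \right| + \mathrm{pen}_{\text{network}}(\beta),

    where the w_i are Kaplan-Meier weights accounting for censoring and the network penalty encourages smoothness of coefficients over the gene network.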

    Statistical methods in detecting differential expressed genes, analyzing insertion tolerance for genes and group selection for survival data

    The thesis is composed of three independent projects: (i) analyzing transposon-sequencing data to infer the functions of genes in bacterial growth (chapter 2), (ii) developing a semi-parametric Bayesian method for differential gene expression analysis with RNA-sequencing data (chapter 3), and (iii) solving the group selection problem for survival data (chapter 4). All projects are motivated by statistical challenges raised in biological research. The first project is motivated by the need for statistical models that accommodate transposon insertion sequencing (Tn-Seq) data, which consist of sequence reads around each transposon insertion site. The detection of a transposon insertion at a given site indicates that the disruption of the genomic sequence at this site does not cause essential function loss and the bacteria can still grow. Hence, such measurements have been used to infer the function of each gene in bacterial growth. We propose a zero-inflated Poisson regression method for analyzing the Tn-Seq count data, and derive an Expectation-Maximization (EM) algorithm to obtain parameter estimates. We also propose a multiple testing procedure that categorizes genes into each of three states, hypo-tolerant, tolerant, and hyper-tolerant, while controlling the false discovery rate. Simulation studies show that our method provides good estimation of model parameters and inference on gene functions. In the second project, we model the count data from RNA-sequencing experiments for each gene using a Poisson-Gamma hierarchical model, or equivalently, a negative binomial (NB) model. We derive a fully semi-parametric Bayesian approach with a Dirichlet process as the prior for the fold changes between two treatment means, and develop a Gibbs sampling inference strategy for differential expression analysis. We evaluate our method with several simulation studies, and the results demonstrate that it outperforms other methods, including the widely applied edgeR and DESeq. In the third project, we develop a new semi-parametric Bayesian method to address the group variable selection problem and study the dependence of survival outcomes on grouped predictors using the Cox proportional hazards model. We use group indicators to induce sparseness and obtain the posterior inclusion probability for each group; Bayes factors are used to evaluate whether groups should be selected or not. We compare our method with a frequentist method (HPCox) in several simulation studies and show that our method performs better. In summary, this dissertation tackles several statistical problems raised in biological research, including high-dimensional genomic data analysis and survival analysis. All proposed methods are evaluated with simulation studies, show satisfactory performance, and are applied to real data analyses.
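
    The zero-inflated Poisson model of the first project mixes a point mass at zero with a Poisson count. As a minimal sketch (intercept-only, with no regression covariates, unlike the proposed method), the EM algorithm reduces to two closed-form updates:

        import numpy as np

        def zip_em(y, n_iter=500, tol=1e-10):
            # Intercept-only zero-inflated Poisson fitted by EM.
            # pi: probability of a structural zero; lam: Poisson mean.
            y = np.asarray(y, dtype=float)
            pi, lam = 0.5, max(y.mean(), 1e-6)
            for _ in range(n_iter):
                # E-step: posterior probability that an observed zero is structural
                w = np.where(y == 0, pi / (pi + (1 - pi) * np.exp(-lam)), 0.0)
                # M-step: update the mixing weight and the Poisson mean
                pi_new = w.mean()
                lam_new = ((1 - w) * y).sum() / (1 - w).sum()
                converged = abs(pi_new - pi) + abs(lam_new - lam) < tol
                pi, lam = pi_new, lam_new
                if converged:
                    break
            return pi, lam

    The regression version replaces the scalar pi and lam with covariate-dependent parameters, but the E-step responsibility w has the same form.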

    EVALUATION OF BAYESIAN SPATIAL CONDITIONAL AUTOREGRESSIVE MODELS FOR MODELLING CORONA VIRUS DISEASE (COVID-19) DEATH CASES IN INDONESIA

    Covid-19 cases in Indonesia occurred for the first time on 2 March 2020. By 30 September 2022, Indonesia had recorded 158,173 Covid-19 deaths. Several studies have modelled Covid-19 case counts; however, research modelling the number of Covid-19 deaths using Bayesian Spatial Conditional Autoregressive (CAR) models is still rare. The Bayesian spatial CAR model offers high flexibility in relative risk (RR) modelling: CAR models can include various types of spatial effects as well as covariates. RR represents the ratio of the risk of the outcome (Covid-19 death) in the exposed group to the population average (the unexposed group). This study aims to evaluate the BYM, Leroux, and Localised models with five hyperpriors, to obtain the best model for estimating the RR of Covid-19 deaths in Indonesia, and to create RR maps. The study used aggregate data on Covid-19 deaths (2 March 2020 - 30 September 2022), together with the total population and population density of each province in 2021. The best model is selected based on the lowest Watanabe-Akaike Information Criterion (WAIC) and Deviance Information Criterion (DIC) values and on the Modified Moran's I (MMI) of the residuals. The CAR BYM model with covariates and an Inverse-Gamma IG(0.5, 0.0005) prior distribution had the lowest DIC and WAIC, but because this BYM model does not converge, it cannot be used to determine the RR of Covid-19 deaths in Indonesia. Of the three remaining models that do converge, the Bayesian CAR Leroux model without covariates and with the IG(0.5, 0.0005) prior has the lowest DIC (393.76) and WAIC (400.12), and its MMI value (-0.26) is close to zero; it is therefore preferred. The provinces with the highest RR (2.76) and the lowest RR (0.22) are Yogyakarta and Papua, respectively.
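
    The BYM model referred to above has the following generic form, sketched here from the standard disease-mapping literature rather than from the paper's own notation:

        O_i \sim \mathrm{Poisson}(E_i \, \mathrm{RR}_i), \qquad \log \mathrm{RR}_i = \beta_0 + x_i^{\top}\beta + u_i + v_i,

        u_i \mid u_{-i} \sim N\!\left( \frac{\sum_j w_{ij} u_j}{\sum_j w_{ij}}, \, \frac{\sigma_u^2}{\sum_j w_{ij}} \right), \qquad v_i \sim N(0, \sigma_v^2),

    where O_i and E_i are the observed and expected death counts in province i, u_i is the CAR-structured spatial effect with adjacency weights w_{ij}, and v_i is an unstructured effect; the Leroux model instead combines the structured and unstructured variation in a single random effect.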

    BAYESIAN MODELING OF CENSORED DATA WITH APPLICATION TO META-ANALYSIS OF IMMUNOTHERAPY TRIALS

    My dissertation builds on a systematic review of 125 clinical trials reporting on treatment-related adverse events (AEs) associated with PD-1/PD-L1 inhibitors published from 2010 to 2018. The motivating dataset contained the following study-level components extracted from each publication: trial name, number of treated patients, selected immunotherapy drug, dosing schedule, cancer type, number of AEs within each category, and the pre-specified criteria for AE reporting. The numbers of AEs were reported for all-grade (Grade 1-5) and Grade 3 or higher (Grade 3-5) severity. My overall objective was to increase our understanding of the toxicity profiles of the five most common cancer immunotherapy drugs, and to evaluate AE incidence across subgroups in a meta-analysis setting. A common challenge in assessing drug safety in clinical trials, however, is that many published studies do not report rare AEs: if the number of AEs observed falls below a pre-specified cutoff value, these events may not be reported in the publication (i.e., they are censored). My doctoral dissertation research therefore proposes an innovative statistical methodology for effectively handling censored rare AEs in the context of meta-analysis of immunotherapy trials. First, by deriving exact inference and robust estimates for the missing-not-at-random data, we proposed a Bayesian multilevel regression model in the coarsened-data framework to accommodate censored rare-event data, and demonstrated that ignoring the censored information overestimates the incidence probability of AEs. Second, to select the best Bayesian censored-data model among a set of candidates in the presence of complicated or high-dimensional features, we proposed an alternative strategy for Bayesian model selection with censored data in Just Another Gibbs Sampler (JAGS). When generating deviance samples from a Bayesian model in JAGS under censoring, an existing function calculates the deviance incorrectly because of the “wrong focus”, i.e., a likelihood computed from the model specification in JAGS rather than the censored likelihood. We therefore proposed a strategy to compute the true value of the deviance function within JAGS; this strategy can be generalized to other types of data and applied in many other disciplines. Third, we developed a sparse Bayesian selection model with prior specifications for meta-analysis of censored rare AEs to select pairwise interactions between study-level factors. Because the toxicity profiles of immunotherapy drugs may not be explained comprehensively by the main effects of study-level factors, we identified the high-risk group by considering two-way interactions that impact the outcome of interest. Through simulation studies, we demonstrated that the proposed interaction selection method outperforms others in prediction accuracy and interaction identification in the presence of missing outcome data. Lastly, we applied the proposed method to our real-world motivating dataset. In sum, my dissertation work makes significant and innovative contributions to the fields of applied statistics and cancer research.
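
    The coarsened-data idea for censored rare AEs can be sketched as follows, with hypothetical notation (the dissertation's model is multilevel and more elaborate): if trial i with n_i patients reports the AE count Y_i only when it reaches a cutoff c_i, the likelihood contribution of a censored trial is the tail probability below the cutoff,

        Y_i \sim \mathrm{Binomial}(n_i, p_i), \qquad L_i = \Pr(Y_i < c_i) = \sum_{k=0}^{c_i - 1} \binom{n_i}{k} p_i^k (1 - p_i)^{n_i - k},

    which is exactly the term an off-the-shelf deviance calculation would get wrong if it evaluated the density at an imputed count instead of using the censoring probability.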