4 research outputs found

    Large-scale variational inference for Bayesian joint regression modelling of high-dimensional genetic data

    Genetic association studies have become increasingly important in understanding the molecular bases of complex human traits. The specific analysis of intermediate molecular traits, via quantitative trait locus (QTL) studies, has recently received much attention, prompted by the advance of high-throughput technologies for quantifying gene, protein and metabolite levels. Of great interest is the detection of weak trans-regulatory effects between a genetic variant and a distal gene product. In particular, hotspot genetic variants, which remotely control the levels of many molecular outcomes, may initiate decisive functional mechanisms underlying disease endpoints. This thesis proposes a Bayesian hierarchical approach for joint analysis of QTL data on a genome-wide scale. We consider a series of parallel sparse regressions combined in a hierarchical manner to flexibly accommodate high-dimensional responses (molecular levels) and predictors (genetic variants), and we present new methods for large-scale inference. Existing approaches have limitations. Conventional marginal screening does not account for local dependencies and association patterns common to multiple outcomes and genetic variants, whereas joint modelling approaches are restricted to relatively small datasets by computational constraints. Our novel framework allows information-sharing across outcomes and variants, thereby enhancing the detection of weak trans and hotspot effects, and implements tailored variational inference procedures that allow simultaneous analysis of data for an entire QTL study, comprising hundreds of thousands of predictors, and thousands of responses and samples. The present work also describes extensions to leverage spatial and functional information on the genetic variants, for example, using predictor-level covariates such as epigenomic marks. 
Moreover, we augment variational inference with simulated annealing and parallel expectation-maximisation schemes in order to enhance exploration of highly multimodal spaces and allow efficient empirical Bayes estimation. Our methods, publicly available as packages implemented in R and C++, are extensively assessed in realistic simulations. Their advantages are illustrated in several QTL applications, including a large-scale proteomic QTL study on two clinical cohorts that highlights novel candidate biomarkers for metabolic disorders.

    Big Data Analytics and Information Science for Business and Biomedical Applications

    The analysis of Big Data in biomedical as well as business and financial research has drawn much attention from researchers worldwide. This book provides a platform for in-depth discussion of state-of-the-art statistical methods developed for the analysis of Big Data in these areas. Both applied and theoretical contributions are showcased.

    Penalized Likelihood Estimation of Trivariate Additive Binary Models

    In many empirical situations, simultaneously modelling three or more outcomes as well as their dependence structure can be of considerable relevance. Trivariate modelling is continually gaining in popularity (e.g., Genest et al., 2013; Król et al., 2016; Zhong et al., 2012) because of its appealing ability to account for endogeneity and non-random sample selection bias, two issues that commonly arise in empirical studies (e.g., Zhang et al., 2015; Radice et al., 2013; Marra et al., 2017; Bärnighausen et al., 2011). The applied and methodological interest in trivariate modelling motivates the current thesis, whose aim is to develop and estimate a generalized trivariate binary regression model that accounts for several types of covariate effects (such as linear, nonlinear, random and spatial effects), as well as error correlations. In particular, the thesis focuses on the following targets. First, we address the difficulty of accurately estimating the correlation coefficients, which characterize the dependence of the binary responses conditional on regressors. We found that this difficulty is not unusual for trivariate binary models and, as far as we know, such a limitation has been neither discussed nor dealt with. Based on this framework, we develop models for dealing with data suffering from endogeneity and/or non-random sample selection. Moreover, we propose trivariate Gaussian copula models where the link functions can in principle be derived from any parametric distribution and the parameters describing the association between the responses can be made dependent on several types of covariate effects. All the coefficients of the model are estimated simultaneously within a penalized likelihood framework based on a carefully structured trust region algorithm with integrated automatic multiple smoothing parameter selection. The developments have been incorporated in the function SemiParTRIV()/gjrm() in the R package GJRM (Marra & Radice, 2017).
The extensive use of simulated data as well as real datasets illustrates each development in detail and completes the analysis.

    SIS 2017. Statistics and Data Science: new challenges, new generations

    The 2017 SIS Conference aims to highlight the crucial role of Statistics in Data Science. In this new domain of ‘meaning’ extracted from data, the increasing amount of data produced and made available in databases has brought new challenges. These involve different fields: statistics, machine learning, information and computer science, optimization, and pattern recognition. Together, these fields make a considerable contribution to the analysis of ‘Big data’, open data, and relational and complex data, both structured and unstructured. The aim is to collect contributions from the different domains of Statistics on high-dimensional data quality validation, sample extraction, dimensionality reduction, pattern selection, data modelling, hypothesis testing, and confirming conclusions drawn from the data.