Statistical methods for gene selection and genetic association studies
This dissertation includes five chapters. A brief description of each chapter follows.
In Chapter One, we propose a signed bipartite genotype and phenotype network (GPN) that links phenotypes and genotypes based on their statistical associations. It provides new insight into the genetic architecture of multiple correlated phenotypes and into where phenotypes might be related at a higher level of cellular and organismal organization. We show that multiple-phenotype association studies based on the proposed network are improved by incorporating genetic information into the phenotype clustering.
In Chapter Two, we first apply the proposed GPN to GWAS summary statistics. Then, we assess what contributes to constructing a well-defined GPN with a clear representation of genetic associations by comparing its network properties, including connectivity, centrality, and community structure, with those of a random network. The network topology annotations based on the sparse representations of the GPN can be used to understand disease heritability for highly correlated phenotypes. In applications to phenome-wide association studies, the proposed GPN can identify more significant pairs of genetic variants and phenotype categories.
In Chapter Three, a powerful and computationally efficient gene-based association test is proposed, aggregating information from different gene-based association tests and also incorporating expression quantitative trait locus information. We show that the proposed method controls type I error rates well, has higher power in simulation studies, and identifies more significant genes in real data analyses.
In Chapter Four, we develop six statistical selection methods based on penalized regression for inferring target genes of a transcription factor (TF). The proposed selection methods combine statistics, machine learning, and convex optimization, and are highly effective in identifying the true target genes. The methods fill the gap left by the lack of appropriate methods for predicting target genes of a TF, and are instrumental for validating experimental results obtained from ChIP-seq and DAP-seq and, conversely, for selecting and annotating TFs based on their target genes.
In Chapter Five, we propose a gene selection approach that captures gene-level signals in network-based regression for case-control association studies with DNA sequence data or DNA methylation data. It is inspired by the popular gene-based association tests that use a weighted combination of genetic variants to capture the combined effect of individual genetic variants within a gene. We show that the proposed gene selection approach has higher true positive rates than traditional dimension reduction techniques in simulation studies and selects potentially rheumatoid arthritis related genes that are missed by existing methods.
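The signed bipartite network described for Chapter One can be pictured with a minimal sketch. It assumes a matrix of association z-scores (variants in rows, phenotypes in columns), an edge rule of |z| above a fixed threshold, and the edge sign taken from the sign of z. The function name, the threshold value, and the toy numbers are illustrative assumptions, not the construction used in the dissertation.

```python
import numpy as np

def signed_bipartite_adjacency(z_scores, threshold=1.96):
    """Return a variants-by-phenotypes matrix with entries in {-1, 0, +1}.

    Hypothetical edge rule: link a variant and a phenotype when |z| exceeds
    the threshold, and sign the edge by the direction of the association.
    """
    z = np.asarray(z_scores, dtype=float)
    keep = np.abs(z) > threshold
    return np.where(keep, np.sign(z), 0.0).astype(int)

# Toy z-scores: 3 variants (rows) by 3 phenotypes (columns).
z = np.array([[ 2.5, -0.3,  4.1],
              [-3.2,  0.8, -0.1],
              [ 0.4,  1.1,  0.2]])

adj = signed_bipartite_adjacency(z)

# Number of variants linked to each phenotype, ignoring edge sign.
phenotype_degree = np.abs(adj).sum(axis=0)
```

Phenotypes that share many linked variants with consistent signs would be natural candidates for clustering together, which is the kind of genetic information the abstract says is incorporated into the phenotype clustering.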
Statistical Methods in Integrative Genomics
Statistical methods in integrative genomics aim to answer important biological questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions.
Bayesian indicator variable selection of multivariate response with heterogeneous sparsity for multi-trait fine mapping
Variable selection has played a critical role in contemporary statistics
and scientific discoveries. Numerous regularization and Bayesian variable
selection methods have been developed in the past two decades for variable
selection, but they mainly target only one response. As more data are being
collected nowadays, it is common to obtain and analyze multiple correlated
responses from the same study. Running separate regression for each response
ignores their correlation; thus, multivariate analysis is recommended. Existing
multivariate methods select variables related to all responses without
considering the possible heterogeneous sparsity of different responses, i.e.
some features may only predict a subset of responses but not the rest. In this
paper, we develop a novel Bayesian indicator variable selection method in
a multivariate regression model with a large number of grouped predictors
targeting multiple correlated responses with possibly heterogeneous sparsity
patterns. The method is motivated by the multi-trait fine mapping problem in
genetics to identify the variants that are causal to multiple related traits.
Our new method features selection at the individual level, at the group level,
and specific to each response. In addition, we propose a new concept of
subset posterior inclusion probability for inference, to prioritize predictors
that target subset(s) of responses. Extensive simulations with varying
sparsity and heterogeneity levels and dimension have shown the advantage of our
method in variable selection and prediction performance as compared to existing
general Bayesian multivariate variable selection methods and Bayesian fine
mapping methods. We also applied our method to a real data example in imaging
genetics and identified important causal variants for brain white matter
structural change in different regions. Comment: 29 pages, 3 figures
Large-scale variational inference for Bayesian joint regression modelling of high-dimensional genetic data
Genetic association studies have become increasingly important in understanding the molecular bases of complex human traits. The specific analysis of intermediate molecular traits, via quantitative trait locus (QTL) studies, has recently received much attention, prompted by the advance of high-throughput technologies for quantifying gene, protein and metabolite levels. Of great interest is the detection of weak trans-regulatory effects between a genetic variant and a distal gene product. In particular, hotspot genetic variants, which remotely control the levels of many molecular outcomes, may initiate decisive functional mechanisms underlying disease endpoints.
This thesis proposes a Bayesian hierarchical approach for joint analysis of QTL data on a genome-wide scale. We consider a series of parallel sparse regressions combined in a hierarchical manner to flexibly accommodate high-dimensional responses (molecular levels) and predictors (genetic variants), and we present new methods for large-scale inference.
Existing approaches have limitations. Conventional marginal screening does not account for local dependencies and association patterns common to multiple outcomes and genetic variants, whereas joint modelling approaches are restricted to relatively small datasets by computational constraints. Our novel framework allows information-sharing across outcomes and variants, thereby enhancing the detection of weak trans and hotspot effects, and implements tailored variational inference procedures that allow simultaneous analysis of data for an entire QTL study, comprising hundreds of thousands of predictors, and thousands of responses and samples.
The present work also describes extensions to leverage spatial and functional information on the genetic variants, for example, using predictor-level covariates such as epigenomic marks. Moreover, we augment variational inference with simulated annealing and parallel expectation-maximisation schemes in order to enhance exploration of highly multimodal spaces and allow efficient empirical Bayes estimation.
Our methods, publicly available as packages implemented in R and C++, are extensively assessed in realistic simulations. Their advantages are illustrated in several QTL applications, including a large-scale proteomic QTL study on two clinical cohorts that highlights novel candidate biomarkers for metabolic disorders.
Bayesian Sparse Mediation Analysis with Targeted Penalization of Natural Indirect Effects
Causal mediation analysis aims to characterize an exposure's effect on an
outcome and quantify the indirect effect that acts through a given mediator or
a group of mediators of interest. With the increasing availability of
measurements on a large number of potential mediators, like the epigenome or
the microbiome, new statistical methods are needed to simultaneously
accommodate high-dimensional mediators while directly targeting penalization of
the natural indirect effect (NIE) for active mediator identification. Here, we
develop two novel prior models for identification of active mediators in
high-dimensional mediation analysis through penalizing NIEs in a Bayesian
paradigm. Both methods specify a joint prior distribution on the
exposure-mediator effect and mediator-outcome effect with either (a) a
four-component Gaussian mixture prior or (b) a product threshold Gaussian
prior. By jointly modeling the two parameters that contribute to the NIE, the
proposed methods enable penalization on their product in a targeted way.
Resultant inference can take into account the four-component composite
structure underlying the NIE. We show through simulations that the proposed
methods improve both selection and estimation accuracy compared to other
competing methods. We applied our methods for an in-depth analysis of two
ongoing epidemiologic studies: the Multi-Ethnic Study of Atherosclerosis (MESA)
and the LIFECODES birth cohort. The identified active mediators in both studies
reveal important biological pathways for understanding disease mechanisms.
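The reason the abstract above penalizes a product can be seen in a minimal single-mediator sketch. Under linear models with no exposure-mediator interaction, the NIE equals the exposure-mediator effect times the mediator-outcome effect, so it is nonzero only when both are nonzero; the other three combinations give a zero NIE, which is one way to read the four-component structure. The simulation settings, variable names, and the plain least-squares fit are illustrative assumptions and do not reproduce the proposed Bayesian priors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Assumed true effects for the toy simulation.
alpha_true, beta_true, direct_true = 0.5, 0.8, 0.3

exposure = rng.normal(size=n)
mediator = alpha_true * exposure + rng.normal(size=n)
outcome = beta_true * mediator + direct_true * exposure + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients with an intercept in position 0."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

# Exposure-mediator effect: regress the mediator on the exposure.
alpha_hat = ols(exposure, mediator)[1]
# Mediator-outcome effect: regress the outcome on mediator and exposure.
beta_hat = ols(np.column_stack([mediator, exposure]), outcome)[1]

# Product-of-coefficients estimate of the natural indirect effect.
nie_hat = alpha_hat * beta_hat
```

With the assumed effects the true NIE is 0.5 × 0.8 = 0.4. Setting either `alpha_true` or `beta_true` to zero drives the estimate toward zero, which is the behaviour a prior placed jointly on the two effects is meant to capture.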
High-Throughput Genotyping Analyses and Image-based Phenotyping in Sorghum bicolor
Sorghum bicolor is a valuable plant grown commercially for grain, forage, sugar, and lignocellulosic biomass production. Increasing yields for these applications without increasing inputs is necessary to sustainably meet future food and fuel demand.
The generation of superior plant cultivars that produce more without increased input is facilitated by methods that can rapidly and accurately acquire plant genotypic and phenotypic data, and this dissertation describes the development and application of genomic and phenomic methods to improve crop productivity. The sensitivity and specificity with which genetic variants are called from sorghum genomic sequence data were improved by developing a variant calling workflow; this workflow interrelates different sources of genomic sequence data to inform the modern machine learning techniques implemented within the Broad Institute's Genome Analysis Toolkit (GATK). Genetic variants called in this manner have been used to dissect the genetic basis of agriculturally important traits and improve the sorghum reference genome assembly. Additionally, to increase the rate at which the morphology of plants can be evaluated, an image-based phenotyping platform was developed to acquire measurements of sorghum shoot architecture traits using a depth camera. Depth images of plants are used to generate 3D reconstructions, and these reconstructions are used to measure phenotypes, to identify the genetic bases of shoot architecture, and as input to plant and crop modeling applications. This research facilitates the rapid and accurate acquisition of the data necessary to increase the rate of crop improvement.
Generative Models of Biological Variations in Bulk and Single-cell RNA-seq
The explosive growth of next-generation sequencing data enhances our ability to understand biological processes at an unprecedented resolution. Meanwhile, organizing and utilizing this tremendous amount of data has become a major challenge. High-throughput technology provides us a snapshot of all underlying biological activities, but this kind of extremely high-dimensional data is hard to interpret. Due to the curse of dimensionality, the measurement is sparse and far from enough to shape the actual manifold in the high-dimensional space. On the other hand, the measurements may contain structured noise, such as technical or nuisance biological variation, which can interfere with downstream interpretation. Generative modeling is a powerful tool to make sense of the data and generate compact representations summarizing the embedded biological information. This thesis introduces three generative models that help amplify biological signals buried in noisy bulk and single-cell RNA-seq data.
In Chapter 2, we propose a semi-supervised deconvolution framework called PLIER, which can identify regulations in cell-type proportions and specific pathways that control gene expression. PLIER has inspired the development of MultiPLIER and has been used to infer context-specific genotype effects in the brain.
In Chapter 3, we construct a supervised transformation named DataRemix to normalize bulk gene expression profiles in order to maximize the biological findings with respect to a variety of downstream tasks. By reweighing the contribution of hidden factors, we are able to reveal the hidden biological signals without any external dataset-specific knowledge. We apply DataRemix to the ROSMAP dataset and report the first replicable trans-eQTL effect in human brain.
In Chapter 4, we focus on scRNA-seq and introduce NIFA, an unsupervised decomposition framework that combines the desired properties of PCA, ICA and NMF. It simultaneously models uni- and multi-modal factors, isolating discrete cell-type identity and continuous pathway-level variations into separate components.
The work presented in Chapter 2 has been published as a journal article. The work in Chapters 3 and 4 is under submission and is available as preprints on bioRxiv.