200 research outputs found

    Hierarchical bayesian models for genome-wide association studies

    I consider a well-known problem in the field of statistical genetics called a genome-wide association study (GWAS), where the goal is to identify a set of genetic markers that are associated with a disease. A typical GWAS data set contains, for thousands of unrelated individuals, a set of hundreds of thousands of markers, a set of other covariates such as age, gender, smoking status and other risk factors, and a response variable that indicates the presence or absence of a particular disease. Due to biological phenomena such as the recombination of DNA and linkage disequilibrium, parents are more likely to pass parts of DNA that lie close to each other on a chromosome together to their offspring; this non-random association between adjacent markers leads to strong correlation between markers in GWAS data sets. As a statistician, I reduce the complex problem of GWAS to its essentials, i.e. variable selection on a large-p-small-n data set that exhibits multicollinearity, and develop solutions that complement and advance the current state-of-the-art methods. Before outlining and explaining my contributions to the field in detail, I present a literature review that summarizes the history of GWAS and the relevant tools and techniques that researchers have developed over the years for this problem.

    Probabilistic Inference for Nucleosome Positioning with MNase-Based or Sonicated Short-Read Data

    We describe a model-based method, PING, for predicting nucleosome positions in MNase-Seq and MNase- or sonicated-ChIP-Seq data. PING compares favorably to NPS and TemplateFilter in scalability, accuracy and robustness to low read density. To demonstrate that PING predictions from widely available sonicated data can have sufficient spatial resolution to be useful for biological inference, we use Illumina H3K4me1 ChIP-seq data to detect changes in nucleosome positioning around transcription factor binding sites due to tamoxifen stimulation, to discriminate functional and non-functional transcription factor binding sites more effectively than with enrichment profiles, and to confirm that the pioneer transcription factor Foxa2 associates with the accessible major groove of nucleosomal DNA.

    Biological network models for inferring mechanism of action, characterizing cellular phenotypes, and predicting drug response

    A primary challenge in the analysis of high-throughput biological data is the abundance of correlated variables. A small change to a gene's expression or a protein's binding availability can cause significant downstream effects. The existence of such chain reactions presents challenges in numerous areas of analysis. By leveraging knowledge of the network interactions that underlie this type of data, we can often enable better understanding of biological phenomena. This dissertation will examine network-based statistical approaches to the problems of mechanism-of-action inference, characterization of gene expression changes, and prediction of drug response. First, we develop a method for multi-target perturbation detection in multi-omics biological data. We estimate a joint Gaussian graphical model across multiple data types using penalized regression, and filter for network effects. Next, we apply a set of likelihood ratio tests to identify the most likely site of the original perturbation. We also present a conditional testing procedure to allow for detection of secondary perturbations. Second, we address the problem of characterization of cellular phenotypes via Bayesian regression in the Gene Ontology (GO). In our model, we use the structure of the GO to assign changes in gene expression to functional groups, and to model the covariance between these groups. In addition to describing changes in expression, we use these functional activity estimates to predict the expression of unobserved genes. We further determine when such predictions are likely to be inaccurate by identifying GO terms with poor agreement to gene-level estimates. In a case study, we identify GO terms relevant to changes in the growth rate of S. cerevisiae. Lastly, we consider the prediction of drug sensitivity in cancer cell lines based on pathway-level activity estimates from ASSIGN, a Bayesian factor analysis model. 
We use penalized regression to predict response to various cancer treatments based on cancer subtype, pathway activity, and two-way interactions thereof. We also present network representations of these interaction models and examine common patterns in their structure across treatments.
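
The penalized regression on main effects and two-way interactions mentioned in this abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the L1 (lasso) penalty, the coordinate-descent solver, the synthetic data, and every name below (`add_pairwise_interactions`, `lasso_cd`, `lam`) are assumptions made for illustration, and no ASSIGN output is used.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def add_pairwise_interactions(rows):
    """Append every two-way product x_i * x_j (i < j) to each feature row."""
    out = []
    for r in rows:
        inter = [r[i] * r[j] for i in range(len(r)) for j in range(i + 1, len(r))]
        out.append(list(r) + inter)
    return out

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    No intercept is fitted; the synthetic data below is generated around zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, kept up to date after each update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual (feature j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta

# Synthetic stand-in data (hypothetical): three predictors per sample, with a
# response driven by one main effect and one two-way interaction.
random.seed(0)
rows = [[random.gauss(0, 1) for _ in range(3)] for _ in range(60)]
y = [2.0 * r[0] + 1.5 * r[1] * r[2] + random.gauss(0, 0.1) for r in rows]

X_full = add_pairwise_interactions(rows)  # columns: x0, x1, x2, x0*x1, x0*x2, x1*x2
beta = lasso_cd(X_full, y, lam=0.1)
print([round(b, 3) for b in beta])
```

In this toy setup the L1 penalty shrinks the coefficients of irrelevant main effects and interactions towards zero, which is what makes the fitted interaction model sparse enough to draw as a network.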

    Structured Sparse Methods for Imaging Genetics

    Imaging genetics is an emerging and promising technique that investigates how genetic variations affect brain development, structure, and function. By exploiting disorder-related neuroimaging phenotypes, this class of studies provides a novel direction to reveal and understand the complex genetic mechanisms. Imaging genetics studies are often challenging because of the relatively small number of subjects and the extremely high dimensionality of both imaging data and genomic data. In this dissertation, I pursue research on imaging genetics with a particular focus on two tasks: building predictive models between neuroimaging data and genomic data, and identifying disorder-related genetic risk factors through image-based biomarkers. To this end, I consider a suite of structured sparse methods for imaging genetics that can produce interpretable models and are robust to overfitting. With carefully designed sparsity-inducing regularizers, different biological priors are incorporated into learning models. More specifically, in the Allen brain image-gene expression study, I adopt an advanced sparse coding approach for image feature extraction and employ a multi-task learning approach for multi-class annotation. Moreover, I propose a label structure-based two-stage learning framework, which utilizes the hierarchical structure among labels, for multi-label annotation. In the Alzheimer's disease neuroimaging initiative (ADNI) imaging genetics study, I employ Lasso together with EDPP (enhanced dual polytope projections) screening rules to quickly identify Alzheimer's disease risk SNPs. I also adopt the tree-structured group Lasso with MLFre (multi-layer feature reduction) screening rules to incorporate linkage disequilibrium information into modeling. Moreover, I propose a novel absolute fused Lasso model for ADNI imaging genetics. This method utilizes SNP spatial structure and is robust to the choice of reference alleles of genotype coding. 
In addition, I propose a two-level structured sparse model that incorporates gene-level networks through a graph penalty into SNP-level model construction. Lastly, I explore a convolutional neural network approach for accurately predicting Alzheimer's disease related imaging phenotypes. Experimental results on real-world imaging genetics applications demonstrate the efficiency and effectiveness of the proposed structured sparse methods. Dissertation/Thesis, Doctoral Dissertation, Computer Science 201

    Probabilistic Models for Aggregate Analysis of Non-Gaussian Data in Biomedicine

    Aggregate association analysis is a popular approach in genome-wide association studies (GWAS) that analyzes the association between the trait of interest and regions of functionally related genes. It has the advantage of capturing the missing heritability from the joint effects of correlated genetic variants while providing a better understanding of disease etiology from a systematic perspective. However, traditional methods lose their power for biomedical data with non-Gaussian data types. We proposed innovative statistical models that derive more accurate aggregated signals and enhance power by taking the special data types into account. Based on general exponential family distribution assumptions, we developed supervised logistic PCA and supervised categorical PCA for pathway-based GWAS and rare variant analysis. A general framework, sparse exponential family PCA (SePCA), is further developed for aggregate analyses of various types of biomedical data with good interpretation. We derived an efficient algorithm to find the optimal aggregated signals by solving its equivalent dual problem with closed-form updating rules. SePCA is extended for aggregate association analysis in hierarchical levels for better biological interpretation, from groups to individual variables. Both simulation studies and real-world applications have demonstrated that our methods can achieve higher power in association analysis and population stratification by properly accounting for the correlations among the non-Gaussian variables in biomedical data. Another analytic issue in aggregate analysis is that biomedical data often have special stratified data structures arising from experimental designs intended to address confounding. We extended SePCA to low-rank and full-rank matched models to take the stratified data structures into account. The simulation study has demonstrated their capability of reconstructing more relevant PCs for the signals of interest compared to standard ePCA. 
A sparse low-rank matched PCA model outperforms the existing Bayesian methods in detecting differentially expressed genes for a benchmark spike-in gene study with technical replicates. In summary, our proposed statistical models for non-Gaussian biomedical data can derive more accurate and robust aggregated signals that help reveal underlying biological principles of human disease. Beyond bioinformatics, these probabilistic models also have rich applications in data mining, computer vision, and social science.

    Unconventional machine learning of genome-wide human cancer data

    Recent advances in high-throughput genomic technologies coupled with exponential increases in computer processing and memory have allowed us to interrogate the complex aberrant molecular underpinnings of human disease from a genome-wide perspective. While the deluge of genomic information is expected to increase, a bottleneck in conventional high-performance computing is rapidly approaching. Inspired in part by recent advances in physical quantum processors, we evaluated several unconventional machine learning (ML) strategies on actual human tumor data. Here we show for the first time the efficacy of multiple annealing-based ML algorithms for classification of high-dimensional, multi-omics human cancer data from The Cancer Genome Atlas. To assess algorithm performance, we compared these classifiers to a variety of standard ML methods. Our results indicate the feasibility of using annealing-based ML to provide competitive classification of human cancer types and associated molecular subtypes and superior performance with smaller training datasets, thus providing compelling empirical evidence for the potential future application of unconventional computing architectures in the biomedical sciences.