4 research outputs found

    Statistical Methods for Analyzing Population-scale Genomic and Transcriptomic Data

    The study of genetics is integral to understanding the biology behind complex traits and can be approached in a variety of ways. Technological advances in genomics have enabled unprecedented large-scale studies that have identified numerous statistical associations between diseases and genes. Recently, studies of gene expression have become an increasingly popular approach to understanding the biological pathways underlying these statistical associations. In this dissertation, I address specific challenges related to the study of gene expression, including meta-imputation of expression across multiple datasets when only summary-level imputation models are available, correcting for technical biases towards reference alleles in array-based expression assays, and identifying tissue-specific and population-specific regulatory variants and trait-associated loci in the context of systems genetics with whole genome sequencing, transcriptomic profiles, morphometric traits, and clinical endpoints. In Chapter 2, I develop a method that leverages multiple datasets to accurately impute tissue-specific gene expression levels. The method, Smartly Weighted Averaging across Multiple Tissues (SWAM), does not train directly from data; rather, it performs a meta-imputation by combining existing imputation models, assigning each a weight based on its predictive performance and its similarity to the tissue of interest. I demonstrate that, when using the same set of resources, SWAM improves imputation accuracy compared to existing approaches that impute tissue-specific expression by training directly from raw data. The major benefit of the SWAM meta-imputation framework is the flexibility to combine multiple pre-trained imputation models trained from privacy-protected raw datasets. Indeed, prediction accuracy improves substantially when multiple datasets are integrated.
In Chapter 3, I examine the benefits of using deep whole genome sequencing to empower and refine existing microarray-based eQTL studies. I revisit a well-known hybridization bias that arises in microarray studies and is caused by genetic polymorphisms within target probe sequences. In this chapter, I use genetic variants from whole genome sequencing to accurately identify and characterize this bias at both the probe and probeset level. I evaluate several approaches to account for hybridization bias, including methods that remove variant-overlapping probes and a novel method that adjusts for hybridization bias in each probe. I demonstrate that accounting for variant-overlapping probes when quantifying expression levels reduces reference bias and false positives in cis-eQTL analyses. I also demonstrate that adjusting for hybridization bias with deeply sequenced genomes is ideal for avoiding reference bias, although leveraging publicly available variant catalogues such as the 1000 Genomes data provides comparable benefits. In Chapter 4, I perform a systems genetics study of Pima Native Americans enrolled in a diabetic nephropathy study. I integrate whole genome sequences, transcriptomic profiles, and morphometric traits derived from two micro-dissected renal compartments – glomerular and tubulointerstitial – with clinical phenotypes to identify significant associations between these molecular and complex traits. I identified thousands of eQTLs, including kidney-specific and population-specific eQTLs. I also identified many transcriptional associations with morphometric and clinical phenotypes enriched for kidney-specific biological pathways. Moreover, I identified genome-wide significant genetic associations with a morphometric trait (podocyte volume) and with a composite trait, obtained through dimensionality reduction techniques, that represents albumin-creatinine ratio and glomerular surface volume.
Studying this unique and richly phenotyped cohort yielded many population- and tissue-specific regulatory variants, genes, and pathways implicated in renal disease progression.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/170016/1/aeyliu_1.pd
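The meta-imputation idea in the abstract above (combining pre-trained imputation models with performance-based weights) can be illustrated with a minimal Python sketch. This is not the SWAM weighting scheme, which the abstract does not specify in detail. It assumes a simplified rule in which each model's weight is its correlation with measured expression on a small reference set, clipped at zero. The function names `performance_weights` and `meta_impute` are hypothetical.

```python
import numpy as np

def performance_weights(preds_ref, observed_ref):
    """Assumed weighting rule (not SWAM's): clipped correlation per model.

    preds_ref:    (n_ref, K) predictions from K pre-trained models
    observed_ref: (n_ref,)   measured expression in the tissue of interest
    """
    k = preds_ref.shape[1]
    r = np.array([np.corrcoef(preds_ref[:, j], observed_ref)[0, 1]
                  for j in range(k)])
    # Models that are uninformative (NaN) or anti-correlated get zero weight.
    r = np.clip(np.nan_to_num(r), 0.0, None)
    if r.sum() == 0:
        # Fall back to a plain average when no model is informative.
        return np.full(k, 1.0 / k)
    return r / r.sum()

def meta_impute(preds_new, weights):
    """Weighted average of the K model predictions for new samples."""
    return preds_new @ weights
```

Because only model predictions enter the calculation, a scheme of this kind needs no access to the raw data the individual models were trained on, which is the flexibility the abstract emphasizes.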

    ‘maskBAD’ – a package to detect and remove Affymetrix probes with binding affinity differences

    Abstract. Background: Hybridization differences caused by target sequence differences can be a confounding factor in analyzing gene expression on microarrays, leading to false positives and reducing power to detect real expression differences. We prepared an R Bioconductor-compatible package to detect, characterize and remove such probes in Affymetrix 3’IVT and exon-based arrays on the basis of correlation of signal intensities from probes within probe sets. Results: Using completely sequenced mouse genomes, we determined type 1 (false negatives) and type 2 (false positives) errors with high accuracy, and we show that our method routinely outperforms previous methods. When detecting 76.2% of known SNP/indels in mouse expression data, we obtain at most 5.5% false positives. At the same level of false positives, the best previous method detected 72.6%. We also show that probes with differing binding affinity both hinder detection of differential expression and introduce artifacts in comparisons of cancer and healthy tissue. Conclusions: Detection and removal of such probes should be a routine step in Affymetrix data preprocessing. We prepared a user-friendly R package, compatible with Bioconductor, that allows filtering and improvement of data from Affymetrix microarray experiments.
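The abstract above describes detecting problematic probes from the correlation of signal intensities among probes within a probe set. The Python sketch below illustrates that general idea only. It is not the maskBAD algorithm or its R API. It assumes a simple rule: a probe is flagged when its intensities correlate poorly with the median of the other probes in the same set. The name `flag_discordant_probes` and the `min_corr` threshold are hypothetical.

```python
import numpy as np

def flag_discordant_probes(intensities, min_corr=0.5):
    """Flag probes that disagree with the rest of their probe set.

    intensities: (n_probes, n_samples) log-intensities for one probe set.
    Returns a boolean array; True marks a probe to consider removing.
    """
    n_probes = intensities.shape[0]
    flags = np.zeros(n_probes, dtype=bool)
    for i in range(n_probes):
        # Consensus signal from all other probes in the set.
        others = np.delete(intensities, i, axis=0)
        consensus = np.median(others, axis=0)
        r = np.corrcoef(intensities[i], consensus)[0, 1]
        # A NaN correlation (constant probe) fails the comparison and is flagged.
        flags[i] = not (r >= min_corr)
    return flags
```

Flagged probes would then be excluded before summarizing the probe set into an expression value. The threshold is a tuning choice rather than a value taken from the paper.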

    Computational analysis of innate and adaptive immune responses

    Both innate and adaptive immune processes rely on the activation of differentiated haematopoietic stem cell lineages to effect an appropriate response to pathogens. This thesis employs a largely network-biology-focused approach to better understand the specificity of immune cell responses in two distinct cases of pathogenic challenge. In the context of adaptive immunity, I studied the transcriptional responses of T cells during Graft-versus-Host Disease (GvHD). GvHD is one of the major complications following allogeneic haematopoietic stem cell transplantation, yet why only particular organs are damaged by this pathology is still unclear. To investigate whether key GvHD transcriptional signatures seen in effector CD8+ T cells compared to naïve T cells are triggered in target organs or in the secondary lymphoid organs, a module-based association test was developed to combine the output of gene clustering algorithms with that of differential expression analysis. This methodology significantly aided the identification of skin-specific effector T cell transcriptional programs believed to drive murine GvHD pathogenesis at this site. Turning to the innate immune response, I investigated the transcriptional profiles of resting and activated macrophages in the setting of tuberculosis (TB), the second leading cause of death from infectious disease worldwide. Regression-based analyses and clustering of macrophage expression data provided insight into how the gene expression profiles of naïve macrophages differ from those of macrophages infected with Mycobacterium tuberculosis (MTB) or a vaccine strain of mycobacteria (BCG). The availability of genotype data as part of the macrophage dataset facilitated an expression quantitative trait loci (eQTL) study, which highlighted a novel association between the cytoskeleton gene BCAR1 and TB risk, together with a previously undescribed trans-eQTL module specific to MTB-infected macrophages.
Potential genetic variants impacting expression of the aforementioned GvHD-specific T cell transcriptional signatures were additionally investigated using external trans-eQTL datasets.
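The abstract above mentions a module-based association test that combines gene clustering output with differential expression results, without giving its details. One common way to combine the two, shown here purely as an assumed stand-in and not as the thesis's method, is a hypergeometric enrichment test of differentially expressed genes within each module. The function names and the gene identifiers in the usage note are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total

def module_enrichment(modules, de_genes, universe):
    """Enrichment p-value of differentially expressed genes in each module.

    modules:  dict mapping module name -> iterable of gene IDs (clustering output)
    de_genes: iterable of differentially expressed gene IDs
    universe: iterable of all genes tested
    """
    universe = set(universe)
    de = set(de_genes) & universe
    results = {}
    for name, genes in modules.items():
        members = set(genes) & universe
        overlap = len(members & de)
        results[name] = hypergeom_upper_tail(
            overlap, len(universe), len(de), len(members))
    return results
```

For example, with a universe of 20 genes, 5 of them differentially expressed, a 4-gene module made up entirely of differentially expressed genes gets a p-value of 5/4845, while a module with no overlap gets 1.0. In practice the per-module p-values would also need correction for multiple testing.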