Computational methods to study gene regulation in humans using DNA and RNA sequencing data

Abstract

Genes work in a coordinated fashion to perform complex functions. Disruption of gene regulatory programs can result in disease, highlighting the importance of understanding them. We can leverage large-scale DNA and RNA sequencing data to decipher gene regulatory relationships in humans. In this thesis, we present three projects on regulation of gene expression by other genes and by genetic variants using two computational frameworks: co-expression networks and expression quantitative trait loci (eQTL). First, we investigate the effect of alignment errors in RNA sequencing on detecting trans-eQTLs and co-expression of genes. We demonstrate that misalignment due to sequence similarity between genes may result in over 75% false positives in a standard trans-eQTL analysis. Such misalignment also produces a higher-than-background fraction of potential false positives in a conventional co-expression study. These false-positive associations are likely to replicate between studies, giving a misleading appearance of validation. We present a metric, cross-mappability, to detect and avoid such false positives. Next, we focus on joint regulation of transcription and splicing in humans. We present a framework called transcriptome-wide networks (TWNs) for combining total expression of genes and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We build TWNs for 16 human tissues and show that the hubs with multiple isoform neighbors in these networks are candidate alternative splicing regulators. Then, we study the tissue specificity of network edges. Using these networks, we detect 20 genetic variants with distant regulatory impacts. Finally, we present a novel network inference method, SPICE, to study the regulation of transcription. Using maximum spanning trees, SPICE prioritizes potential direct regulatory relationships between genes. We also formulate a comprehensive set of metrics using biological data to establish a standard to evaluate biological networks. According to most of these metrics, SPICE performs better than current popular network inference methods when applied to RNA-sequencing data from diverse human tissues.
