Analysis of Admixed Animals using Indirect Haplotype Information from Existing Technologies

Abstract

The use of genotyping and sequencing technologies in genetic studies typically involves inspecting variants defined within a single reference genome. While this definition of genetic variation promotes a simple model of the genome that is easy to organize and analyze, it does not encompass the full breadth of variation possible between individuals. Fortunately, existing technologies capture information about genomic variation beyond the originally targeted variants. By incorporating these low-level signals, which classical methods generally regard as noise, we can make more accurate inferences about the relationship between admixed animals and their ancestral and parental strains. In this thesis, I use both genotyping microarray and RNA sequencing data to demonstrate the utility of signals from ancestral haplotype data for analyzing admixed animals. I introduce a novel method for designing a genotyping microarray that provides maximal information about ancestral haplotypes for the Collaborative Cross (CC), an admixed population. The result is the 78K-marker MegaMUGA array, which achieves high call rates and discriminatory power in a diverse set of mouse strains. Using probe intensities from microarrays such as the MegaMUGA, I develop methods for founder haplotype inference as well as quantitative trait loci (QTL) mapping. I show that these intensity-based methods outperform traditional genotype call-based methods because they capture additional information about the local sequence, which I confirm using high-throughput sequencing data within probe regions. In addition to demonstrating my thesis with microarray intensity data, I also use RNA-seq read data from parental strains to estimate allele-specific expression (ASE) in the F1 offspring.
By directly using parental read data as features in a regularized regression problem, I achieve accurate estimates of the offspring's expressed gene transcripts and allele-specific expression levels. These results show that, regardless of the data source, incorporating low-level signals directly from ancestral strains provides a more accurate template for the analysis of admixed strains.

Doctor of Philosophy
