IDENTIFYING AND CHARACTERIZING TRANSPOSABLE ELEMENTS IN THE GENOME

Abstract

A large fraction of the mammalian genome consists of transposable elements (TEs), segments of DNA that either move or are copied from one place in the genome to another. TEs are a significant source of genetic variation and are directly responsible for many diseases. Because TEs are present in numerous copies throughout the genome, it is difficult to identify, map, and characterize them, or to determine their zygosity, using current high-throughput short-read sequencing data. Existing approaches search for TE insertions (TEis) by aligning millions of mostly irrelevant short reads to either a reference genome or a TE sequence library. In this dissertation I describe two novel alignment-free TEi detection algorithms, ELITE and Frontier, which outperform existing tools in several categories. Both algorithms use local genome assembly; ELITE is template-dependent, whereas Frontier is template-free. The key idea is to focus on identifying the boundary of a TE insertion, which contains partial TE sequence and non-TE context. I use an msBWT-based data structure to store and index all the reads from a high-throughput sequencing dataset, and I leverage two additional data structures, the FM-index and the Longest Common Prefix (LCP) array, to search efficiently for TEi boundaries. I show that a combination of the two methods can identify nearly all the endogenous retrovirus (ERV) insertions (ERVis) that are segregating in a population with more than 100 samples. These methods can also be used to identify very recent or de novo TE insertions. Moreover, characterizing ERVis by their sharing pattern allows us to study phylogeny within a population.

Doctor of Philosophy
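As a rough illustration of the boundary idea in the abstract, the toy sketch below indexes a handful of reads with an FM-index and then grows a contig outward from a TE-end seed, one base at a time, using only exact k-mer counts. It is not the ELITE or Frontier implementation. The function names (`build_fm_index`, `count`, `extend_right`), the sequences, and the greedy extension rule are all assumptions made for illustration, and a naive suffix sort over a concatenated string stands in for a real msBWT construction.

```python
from collections import Counter

def build_fm_index(reads):
    """Build toy FM-index tables over a '$'-terminated read collection."""
    # '#' sorts before '$' and the bases, so it acts as a unique end sentinel.
    text = "".join(r + "$" for r in reads) + "#"
    # Naive suffix sort: quadratic, acceptable only for a toy example.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] wraps to the sentinel
    counts = Counter(bwt)
    # first[c]: number of symbols in the text that sort strictly before c.
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)

def count(index, pattern):
    """Count exact occurrences of pattern in the reads via backward search."""
    first, occ, n = index
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

def extend_right(index, seed, max_steps=20, min_support=1):
    """Greedily extend seed to the right while some read supports the next base."""
    k = len(seed)
    contig = seed
    for _ in range(max_steps):
        window = contig[-k:]
        support = {b: count(index, window + b) for b in "ACGT"}
        best = max(support, key=support.get)
        if support[best] < min_support:
            break  # no read continues past this point
        contig += best
    return contig

# Hypothetical data: the last bases of a TE followed by unique flanking sequence.
te_end = "GGTTTCA"
flank = "TACGGATC"
junction = te_end + flank
reads = [junction[i:i + 9] for i in range(len(junction) - 8)]  # overlapping 9-mers

index = build_fm_index(reads)
contig = extend_right(index, te_end)
print(contig)
```

Starting from the TE-end seed, the extension walks out of the element and into the flanking sequence without aligning any read to a reference. A practical tool would additionally have to handle sequencing errors, reverse complements, and branching where several bases are supported.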
