Mapping known and novel genetic variation in the human genome: bioinformatic tool development and applications

Abstract

The study of human genetics was greatly facilitated by the sequencing of the first human genome in 2001. This milestone was followed by a race to develop and refine DNA sequencing technologies and data analysis methods, which has since enabled the sequencing of thousands of human genomes. Based on sequencing data from many human genomes, gathered through consortia such as the 1000 Genomes Project and the Genome of the Netherlands, an average human genome was found to vary at a few million loci compared with the genome of an unrelated individual. Roughly 100 million genetic variants have been identified to date, and new variation is discovered with every sequenced genome. Thousands of genetic variants have been associated with common and/or rare disease. The processes through which genetic variation results in disease can sometimes be linked directly to alterations in the content or abundance of the product of one of the ~20,000 known genes, and such insights have even enabled new therapies. In many cases, however, the functional consequences of genetic variation have been hard to identify precisely. These functional effects may be better explained by relating the genetic variation to more distal regions that interact with a gene, or to effects on DNA organization and conformation. While information about sequence content, as well as many other relevant DNA features (such as conformation and regulation), can be retrieved through sequencing, the particular sequencing technology used can have a significant impact on the results. Current sequencing technologies that produce short but highly accurate readouts of the genome are successfully employed to determine the genetic content of most loci in the genome. Analyzing more complex structural variation within a genome, or reconstructing regions of a genome, however, requires long-range information that is cumbersome to obtain from short readouts.
Alternative technologies have emerged that are able to produce very long readouts of our genome and can offer the information necessary to reconstruct complex regions. These longer readouts are, however, currently more error-prone, which makes the analysis of short genetic variation very difficult. My work in this thesis concerns the development of appropriate methodologies to accurately extract and evaluate all the information that state-of-the-art sequencing technologies produce, and I show how different sequencing technologies are best suited for interrogating the human genome for different types of variation and information. Overall, this thesis illustrates how using the appropriate methodology and technology is key to reaching accurate and clear conclusions from large amounts of genetic data, useful in both research and diagnostic settings. Short-read, accurate sequencing technologies are the benchmark for small and/or rare genetic variation, whereas emerging long-read technologies are well suited for larger, structural variation. Furthermore, by reading longer stretches of DNA, nanopore sequencing may be instrumental for understanding the functional consequences of genetic variation, and may facilitate data integration and a paradigm shift towards analyzing an individual's genome in its entirety.

