116 research outputs found
Haplotype estimation in polyploids using DNA sequence data
Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, among others in important corps such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithm deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops. </p
Recommended from our members
Haplotype Assembly and Small Variant Calling using Emerging Sequencing Technologies
Short read DNA sequencing technologies from Illumina have made sequencing a human genome significantly more affordable, greatly accelerating studies of biological function and the association of genetic variants to disease. These technologies are frequently used to detect small genetic variants such as single nucleotide variants (SNVs) using a reference genome. However, short read sequencing technologies have several limitations. First, the human genome is diploid and short reads contain limited information for assembling haplotypes, or the sequences of alleles on homologous chromosomes. Moreover, there is significant input DNA required, which poses challenges for analyzing single cells. Further, there is limited ability to detect genetic variants inside long duplicated sequences that occur in the genome. As a result, there has been widespread development of novel methods to overcome these deficiencies using short reads. These include clone based sequencing, linked read sequencing, and proximity ligation sequencing, as well as various single cell sequencing methods. There are also entirely new sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies that produce significantly longer reads. While these emerging methods and technologies demonstrate improvements compared to short reads, they also have properties and error modalities that pose unique computational challenges. Moreover, there is a shortage of bioinformatics methods for accurate small variant detection and haplotype assembly using these approaches compared to short reads. This dissertation aims to address this problem with the introduction of several new algorithms for highly accurate haplotype assembly and SNV calling. First, it introduces HapCUT2, an algorithm that can rapidly assemble haplotypes using a broad range of sequencing technologies. Second, it introduces an algorithm for variant calling and haplotyping using SISSOR, a recently introduced microfluidics based technology for sequencing single cells. Finally, it introduces Longshot, an algorithm for detecting and phasing SNVs using error-prone long read technologies. In each case, the algorithms are benchmarked using multiple real whole-genome sequencing datasets and are found to be highly accurate. The methods introduced in this dissertation contribute to the goal of sequencing diploid genomes accurately and completely for a broad range of scientific and clinical purposes
Genome Diversity and the Origin of the Arabian Horse
The Arabian horse, one of the world\u27s oldest breeds of any domesticated animal, is characterized by natural beauty, graceful movement, athletic endurance, and, as a result of its development in the arid Middle East, the ability to thrive in a hot, dry environment. Here we studied 378 Arabian horses from 12 countries using equine single nucleotide polymorphism (SNP) arrays and whole-genome re-sequencing to examine hypotheses about genomic diversity, population structure, and the relationship of the Arabian to other horse breeds. We identified a high degree of genetic variation and complex ancestry in Arabian horses from the Middle East region. Also, contrary to popular belief, we could detect no significant genomic contribution of the Arabian breed to the Thoroughbred racehorse, including Y chromosome ancestry. However, we found strong evidence for recent interbreeding of Thoroughbreds with Arabians used for flat-racing competitions. Genetic signatures suggestive of selective sweeps across the Arabian breed contain candidate genes for combating oxidative damage during exercise, and within the Straight Egyptian subgroup, for facial morphology. Overall, our data support an origin of the Arabian horse in the Middle East, no evidence for reduced global genetic diversity across the breed, and unique genetic adaptations for both physiology and conformation
Computational haplotyping : theory and practice
Genomics has paved a new way to comprehend life and its evolution, and also to investigate causes of diseases and their treatment. One of the important problems in genomic analyses is haplotype assembly. Constructing complete and accurate haplotypes plays an essential role in understanding population genetics and how species evolve. In this thesis, we focus on computational approaches to haplotype assembly from third generation sequencing technologies. This involves huge amounts of sequencing data, and such data contain errors due to the single molecule sequencing protocols employed. Taking advantage of combinatorial formulations helps to correct for these errors to solve the haplotyping problem. Various computational techniques such as dynamic programming, parameterized algorithms, and graph algorithms are used to solve this problem. This thesis presents several contributions concerning the area of haplotyping. First, a novel algorithm based on dynamic programming is proposed to provide approximation guarantees for phasing a single individual. Second, an integrative approach is introduced to combining multiple sequencing datasets to generating complete and accurate haplotypes. The effectiveness of this integrative approach is demonstrated on a real human genome. Third, we provide a novel efficient approach to phasing pedigrees and demonstrate its advantages in comparison to phasing a single individual. Fourth, we present a generalized graph-based framework for performing haplotype-aware de novo assembly. Specifically, this generalized framework consists of a hybrid pipeline for generating accurate and complete haplotypes from data stemming from multiple sequencing technologies, one that provides accurate reads and other that provides long reads.Die Genomik hat neue Wege eröffnet, die es ermöglichen, die Evolution lebendiger Organismen zu verstehen, sowie die Ursachen zahlreicher Krankheiten zu erforschen und neue Therapien zu entwickeln. Ein wichtiges Problem ist die Assemblierung der Haplotypen eines Individuums. Diese Rekonstruktion von Haplotypen spielt eine zentrale Rolle für das Verständnis der Populationsgenetik und der Evolution einer Spezies. In der vorliegenden Arbeit werden Algorithmen zur Assemblierung von Haplotypen vorgestellt, die auf Sequenzierdaten der dritten Generation basieren. Dies erfordert große Mengen an Daten, welche wiederum Fehler enthalten, die die zugrunde liegenden Sequenzierprotokolle hervorbringen. Durch kombinatorische Formulierungen des Problems ist die Rekonstruktion von Haplotypen dennoch möglich, da Fehler erfolgreich korrigiert werden können. Verschiedene informatische Methoden, wie dynamische Programmierung, parametrisierte Algorithmen und Graph Algorithmen können verwendet werden, um dieses Problem zu lösen. Die vorliegende Arbeit stellt mehrere Lösungsansätze für die Rekonstruktion von Haplotypen vor. Als erstes wird ein neuartiger Algorithmus vorgestellt, der basierend auf dem Prinzip der dynamischen Programmierung Approximationsgarantien für das Haplotyping eines einzelnen Individuums liefert. Als zweites wird ein integrativer Ansatz präsentiert, um mehrere Sequenzierdatensätze zu kombinieren und somit akkurate Haplotypen zu generieren. Die Effektivität dieser Methode wird auf einem echten, menschlichen Datensatz demonstriert. Als drittes wird ein neuer, effzienter Algorithmus beschrieben, um Haplotypen verwandter Individuen simultan zu konstruieren und die Vorteile gegenüber der Betrachtung einzelner Individuen aufgezeigt. Als viertes präsentieren wir eine Graph-basierte Methode um mittels Haplotypinformation de-novo Assemblierung durchzuführen. Dieser Methode kombiniert Daten stammend von verschiedenen Sequenziertechnologien, welche entweder genaue oder aber lange Sequenzierreads liefern
Benchmarking phasing software with a whole-genome sequenced cattle pedigree.
peer reviewed[en] BACKGROUND: Accurate haplotype reconstruction is required in many applications in quantitative and population genomics. Different phasing methods are available but their accuracy must be evaluated for samples with different properties (population structure, marker density, etc.). We herein took advantage of whole-genome sequence data available for a Holstein cattle pedigree containing 264 individuals, including 98 trios, to evaluate several population-based phasing methods. This data represents a typical example of a livestock population, with low effective population size, high levels of relatedness and long-range linkage disequilibrium.
RESULTS: After stringent filtering of our sequence data, we evaluated several population-based phasing programs including one or more versions of AlphaPhase, ShapeIT, Beagle, Eagle and FImpute. To that end we used 98 individuals having both parents sequenced for validation. Their haplotypes reconstructed based on Mendelian segregation rules were considered the gold standard to assess the performance of population-based methods in two scenarios. In the first one, only these 98 individuals were phased, while in the second one, all the 264 sequenced individuals were phased simultaneously, ignoring the pedigree relationships. We assessed phasing accuracy based on switch error counts (SEC) and rates (SER), lengths of correctly phased haplotypes and the probability that there is no phasing error between a pair of SNPs as a function of their distance. For most evaluated metrics or scenarios, the best software was either ShapeIT4.1 or Beagle5.2, both methods resulting in particularly high phasing accuracies. For instance, ShapeIT4.1 achieved a median SEC of 50 per individual and a mean haplotype block length of 24.1 Mb (scenario 2). These statistics are remarkable since the methods were evaluated with a map of 8,400,000 SNPs, and this corresponds to only one switch error every 40,000 phased informative markers. When more relatives were included in the data (scenario 2), FImpute3.0 reconstructed extremely long segments without errors.
CONCLUSIONS: We report extremely high phasing accuracies in a typical livestock sample. ShapeIT4.1 and Beagle5.2 proved to be the most accurate, particularly for phasing long segments and in the first scenario. Nevertheless, most tools achieved high accuracy at short distances and would be suitable for applications requiring only local haplotypes
Direct chromosome-length haplotyping by single-cell sequencing
Haplotypes are fundamental to fully characterize the diploid genome of an individual, yet methods to directly chart the unique genetic makeup of each parental chromosome are lacking. Here we introduce single-cell DNA template strand sequencing (Strand-seq) as a novel approach to phasing diploid genomes along the entire length of all chromosomes. We demonstrate this by building a complete haplotype for a HapMap individual (NA12878) at high accuracy (concordance 99.3%), without using generational information or statistical inference. By use of this approach, we mapped all meiotic recombination events in a family trio with high resolution (median range ∼14 kb) and phased larger structural variants like deletions, indels, and balanced rearrangements like inversions. Lastly, the single-cell resolution of Strand-seq allowed us to observe loss of heterozygosity regions in a small number of cells, a significant advantage for studies of heterogeneous cell populations, such as cancer cells. We conclude that Strand-seq is a unique and powerful approach to completely phase individual genomes and map inheritance patterns in families, while preserving haplotype differences between single cells
Genomic insights into fine-scale recombination variation in adaptively diverging threespine stickleback fish (Gasterosteus aculeatus)
Meiotic recombination is one of the major molecular mechanisms generating
genetic diversity and influencing genome evolution. By shuffling allelic
combinations, it can directly influence the patterns and efficacy of natural
selection. Studies in various organisms have shown that the rate and placement of
recombination varies substantially within the genome, among individuals,
between sexes and among different species. It is hypothesized that this variation
plays an important role in genome evolution. In this PhD thesis, I investigated the
extent and molecular basis of recombination variation in adaptively diverging
threespine stickleback fish (Gasterosteus aculeatus) to further understand its
evolutionary implications. I used both ChIP-sequencing and whole genome
sequencing of pedigrees to empirically identify and quantify double strand breaks
(DSBs) and meiotic crossovers (COs). Whole genome sequencing of large nuclear
families was performed to identify meiotic crossovers in 36 individuals of
diverging marine and freshwater ecotypes and their hybrids. This produced the
first genome-wide high-resolution sex-specific and ecotype-specific map of
contemporary recombination events in sticklebacks. The results show striking
differences in crossover number and placement between sexes. Females recombine
nearly 1.76 times more than males and their COs are distributed all over the
chromosome while male COs predominantly occur near the chromosomal
periphery. When compared among ecotypes a significant reduction in overall
recombination rate was observed in hybrid females compared to pure forms. Even
though the known loci underlying marine-freshwater adaptive divergence tend to
fall in regions of low recombination, considerable female recombination is
observed in the regions between adaptive loci. This suggests that the sexual
dimorphism in recombination phenotype may have important evolutionary
implications.
At the fine-scale, COs and male DSBs are nonrandomly distributed
involving ‘semi-hot’ hotspots and coldspots of recombination. I report a significant
association of male DSBs and COs with functionally active open chromatin regions
like gene promoters, whereas female COs did not show an association more than
expected by chance. However, a considerable number of COs and DSBs away from
any of the tested open chromatin marks suggests possibility of additional novel
mechanisms of recombination regulation in sticklebacks.
In addition, we developed a novel method for constructing individualized
recombination maps from pooled gamete DNA using linked read sequencing
technology by 10X Genomics®. We tested the method by contrasting recombination
profiles of gametic and somatic tissue from a hybrid mouse and stickleback fish.
Our pipeline faithfully detects previously described recombination hotspots in
mice at high resolution and identify many novel hotspots across the genome in
both species and thereby demonstrate the efficiency of the novel method. This
method could be employed for large scale QTL mapping studies to further
understand the genetic basis of recombination variation reported in this thesis.
By bridging the gap between natural populations and lab organisms with
large clutch sizes and tractable genetic tools, this work shows the utility of the
stickleback system and provides important groundwork for further studies of
heterochiasmy and divergence in recombination during adaptation to differing
environments
- …