814 research outputs found

    Discrete Algorithms for Analysis of Genotype Data

    The accessibility of high-throughput genotyping technology makes genome-wide association studies of common complex diseases possible. For common diseases, it is necessary to search for and analyze multiple independent causes resulting from interactions of multiple genes scattered across the entire genome. Optimization formulations are introduced for searching for disease-associated risk and resistance factors and for predicting disease susceptibility in a given case-control study. Several discrete methods for disease association search, exploiting a greedy strategy and topological properties of case-control studies, have been developed. New disease susceptibility prediction methods based on these search methods have been validated on datasets from case-control studies of several common diseases. In our experiments, the proposed algorithms compare favorably with existing association search and susceptibility prediction methods

    Algorithms for Computational Genetics Epidemiology

    The most intriguing problems in genetic epidemiology are to predict genetic disease susceptibility and to associate single nucleotide polymorphisms (SNPs) with diseases. In such studies, it is necessary to resolve the ambiguities in genetic data. The primary obstacle to ambiguity resolution is that physical methods for separating the two haplotypes of an individual genotype (phasing) are too expensive. Although computational haplotype inference is a well-explored problem, high error rates continue to degrade association accuracy. Secondly, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs (tagging). Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all others. Recent successes in high-throughput genotyping technologies have drastically increased the length of available SNP sequences. This raises the importance of informative SNP selection for compacting huge genetic datasets so that fine genotype analysis becomes feasible. Finally, even if complete and accurate data are available, it is unclear whether common statistical methods can determine susceptibility to complex diseases. The dissertation explores the above computational problems with a variety of methods, including linear algebra, graph theory, linear programming, and greedy methods. The contributions include (1) a significant speed-up of popular phasing tools without compromising their quality, (2) state-of-the-art tagging tools applied to disease association, and (3) a graph-based method for disease tagging and predicting disease susceptibility

    Population genetics of identity by descent

    Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations and migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species. Comment: Ph.D. thesis

    FRET studies of a landscape of Lac repressor-mediated DNA loops

    DNA looping mediated by the Lac repressor is an archetypal test case for modeling protein and DNA flexibility. Understanding looping is fundamental to quantitative descriptions of gene expression. Systematic analysis of LacI•DNA looping was carried out using a landscape of DNA constructs with lac operators bracketing an A-tract bend, produced by varying helical phasings between operators and the bend. Fluorophores positioned on either side of both operators allowed direct Förster resonance energy transfer (FRET) detection of parallel (P1) and antiparallel (A1, A2) DNA looping topologies anchored by V-shaped LacI. Combining fluorophore position variant landscapes allows calculation of the P1, A1 and A2 populations from FRET efficiencies and also reveals extended low-FRET loops proposed to form via LacI opening. The addition of isopropyl-β-d-thio-galactoside (IPTG) destabilizes but does not eliminate the loops, and IPTG does not redistribute loops among high-FRET topologies. In some cases, subsequent addition of excess LacI does not reduce FRET further, suggesting that IPTG stabilizes extended or other low-FRET loops. The data align well with rod mechanics models for the energetics of DNA looping topologies. At the peaks of the predicted energy landscape for V-shaped loops, the proposed extended loops are more stable and are observed instead, showing that future models must consider protein flexibility

    The Relevance of Pedigrees in the Conservation Genomics Era

    Over the past 50 years conservation genetics has developed a substantive toolbox to inform species management. One of the most long-standing tools available to manage genetics—the pedigree—has been widely used to characterize diversity and maximize evolutionary potential in threatened populations. Now, with the ability to use high throughput sequencing to estimate relatedness, inbreeding, and genome-wide functional diversity, some have asked whether it is warranted for conservation biologists to continue collecting and collating pedigrees for species management. In this perspective, we argue that pedigrees remain a relevant tool, and when combined with genomic data, create an invaluable resource for conservation genomic management. Genomic data can address pedigree pitfalls (e.g., founder relatedness, missing data, uncertainty), and in return robust pedigrees allow for more nuanced research design, including well-informed sampling strategies and quantitative analyses (e.g., heritability, linkage) to better inform genomic inquiry. We further contend that building and maintaining pedigrees provides an opportunity to strengthen trusted relationships among conservation researchers, practitioners, Indigenous Peoples, and Local Communities

    Issues in information integration of omics data: microarray meta-analysis for candidate marker and module detection and genotype calling incorporating family information

    As more high-throughput genomic datasets become publicly available, meta-analysis that combines results from independent studies has become an essential approach to increasing statistical power, for example, in the detection of differentially expressed genes in microarray studies. In addition to meta-analysis, researchers also incorporate pathway or clinical information from external databases to perform integrative analysis. In this thesis, I present three projects that encompass three types of integrative analysis. First, we perform a comprehensive comparative study to evaluate 12 microarray meta-analysis methods in simulation studies and real examples using four quantitative criteria (detection capability, biological association, stability, and robustness), and we propose a practical guideline to help practitioners choose the most appropriate meta-analysis method in real applications. Second, we develop a meta-clustering method to construct co-expressed modules from 11 major depressive disorder transcriptome datasets, incorporating GWAS and pathway information from external databases. Third, we propose a computationally feasible algorithm that calls genotypes with higher accuracy by considering family information from next-generation sequencing data, for two purposes: (1) to propose a new genotype calling algorithm for complex families, and (2) to extend our algorithm to incorporate external reference panels to analyze family-based sequence data with a small sample size. In conclusion, we develop several integrative methods for omics data analysis; the results carry public health significance by improving biomarker detection in biomedical research and providing insights into the underlying disease mechanisms

    Towards Complete and Error-Free Genome Assemblies of all Vertebrate Species

    Haplotype-resolved genome assembly of an F1 hybrid of Eucalyptus urophylla x E. grandis

    Dissertation (MSc (Genetics))--University of Pretoria, 2021. De novo haplotype-phased genome assemblies based on long-read sequencing technologies have improved the detection and characterization of structural variants (SVs) in plant and animal genomes. As long reads are able to span across haplotypes, they also allow phased (haplo) assemblies of highly heterozygous genomes such as those of forest trees. Knowledge of SV function and the resulting impact on gene expression can be used by breeders to guide tree improvement. Eucalyptus species and hybrids are some of the most widely planted hardwood trees. Hybrids are often preferred as they combine the genetic background of two species to produce more resilient trees that can inhabit a wider environmental deployment range. For example, E. urophylla x E. grandis hybrids combine the disease resistance of E. urophylla with the fast growth and desirable wood properties of E. grandis. However, using such a strategy in eucalypt breeding first requires a high-quality reference genome (preferably phased) with which additional de novo assembled genomes can be compared. The aim of this study was to assemble high-quality haplotype-phased genomes for Eucalyptus urophylla and E. grandis. Using Nanopore sequencing data generated for an E. urophylla x E. grandis F1 hybrid and a trio-binning approach, we successfully assembled 544.51 Mb of the E. urophylla haplogenome (contig N50 of 1.93 Mb) and 566.75 Mb of the E. grandis haplogenome (contig N50 of 2.42 Mb) with a BUSCO completion score of 98.8%. Using high-density SNP genetic linkage maps of both parents, more than 88% of the haplogenome contigs could be anchored to one of the eleven chromosomes (scaffold N50 of 42.45 Mb and 43.82 Mb for the E. urophylla and E. grandis haplogenome assemblies, respectively). We also provide the first genome-wide comparison between E. urophylla and E. grandis, using the Synteny and Rearrangement Identifier (SyRI) to identify SVs, leading to the discovery of 48,729 SVs between the two haplogenomes. This study is the first step towards implementing haplotype-informed molecular breeding of Eucalyptus tree species. The National Research Foundation of South Africa, the South African Department of Science and Innovation, the Technology Innovation Agency and Technology and Human Resources for Industry Programme.
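
The contig and scaffold N50 figures quoted in the abstract above follow the usual assembly-statistics definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below illustrates that definition only; it is not taken from the dissertation, and the function name `n50` and the toy lengths are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Illustrative helper (hypothetical name, not from the dissertation).
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings coverage to >= 50% of the total.
        if 2 * running >= total:
            return length


# Toy example with made-up contig lengths (arbitrary units).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

For the toy list, the total is 54, and the three longest contigs (10, 9, 8) sum to 27, exactly half, so the sketch reports 8. Real assemblies would pass lengths in base pairs read from a FASTA index or a similar source.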