Discrete Algorithms for Analysis of Genotype Data
Accessibility of high-throughput genotyping technology makes genome-wide association studies possible for common complex diseases. When dealing with common diseases, it is necessary to search for and analyze multiple independent causes resulting from interactions of multiple genes scattered over the entire genome. Optimization formulations for searching for disease-associated risk/resistance factors and for predicting disease susceptibility in a given case-control study have been introduced. Several discrete methods for disease association search, exploiting a greedy strategy and topological properties of case-control studies, have been developed. New disease susceptibility prediction methods based on the developed search methods have been validated on datasets from case-control studies for several common diseases. In our experiments, the proposed algorithms compare favorably with existing association search and susceptibility prediction methods.
Algorithms for Computational Genetics Epidemiology
The most intriguing problems in genetic epidemiology are to predict genetic disease susceptibility and to associate single nucleotide polymorphisms (SNPs) with diseases. In such studies, it is necessary to resolve the ambiguities in genetic data. The primary obstacle to ambiguity resolution is that physical methods for separating the two haplotypes of an individual genotype (phasing) are too expensive. Although computational haplotype inference is a well-explored problem, high error rates continue to degrade association accuracy. Secondly, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs (tagging). Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs. Recent successes in high-throughput genotyping technologies have drastically increased the length of available SNP sequences. This elevates the importance of informative SNP selection for compacting huge genetic datasets in order to make fine genotype analysis feasible. Finally, even if complete and accurate data are available, it is unclear whether common statistical methods can determine susceptibility to complex diseases. The dissertation explores the above computational problems with a variety of methods, including linear algebra, graph theory, linear programming, and greedy methods. The contributions include (1) a significant speed-up of popular phasing tools without compromising their quality, (2) state-of-the-art tagging tools applied to disease association, and (3) a graph-based method for disease tagging and predicting disease susceptibility.
International meta-analysis of PTSD genome-wide association studies identifies sex- and ancestry-specific genetic risk loci.
The risk of posttraumatic stress disorder (PTSD) following trauma is heritable, but robust common variants have yet to be identified. In a multi-ethnic cohort including over 30,000 PTSD cases and 170,000 controls we conduct a genome-wide association study of PTSD. We demonstrate SNP-based heritability estimates of 5-20%, varying by sex. Three genome-wide significant loci are identified, 2 in European and 1 in African-ancestry analyses. Analyses stratified by sex implicate 3 additional loci in men. Along with other novel genes and non-coding RNAs, a Parkinson's disease gene involved in dopamine regulation, PARK2, is associated with PTSD. Finally, we demonstrate that polygenic risk for PTSD is significantly predictive of re-experiencing symptoms in the Million Veteran Program dataset, although specific loci did not replicate. These results demonstrate the role of genetic variation in the biology of risk for PTSD and highlight the necessity of conducting sex-stratified analyses and expanding GWAS beyond European ancestry populations
Population genetics of identity by descent
Recent improvements in high-throughput genotyping and sequencing technologies
have afforded the collection of massive, genome-wide datasets of DNA
information from hundreds of thousands of individuals. These datasets, in turn,
provide unprecedented opportunities to reconstruct the history of human
populations and detect genotype-phenotype association. Recently developed
computational methods can identify long-range chromosomal segments that are
identical across samples, and have been transmitted from common ancestors that
lived tens to hundreds of generations in the past. These segments reveal
genealogical relationships that are typically unknown to the carrying
individuals. In this work, we demonstrate that such identical-by-descent (IBD)
segments are informative about a number of relevant population genetics
features: they enable the inference of details about past population size
fluctuations, migration events, and they carry the genomic signature of natural
selection. We derive a mathematical model, based on coalescent theory, that
allows for a quantitative description of IBD sharing across purportedly
unrelated individuals, and develop inference procedures for the reconstruction
of recent demographic events, where classical methodologies are statistically
underpowered. We analyze IBD sharing in several contemporary human populations,
including representative communities of the Jewish Diaspora, Kenyan Maasai
samples, and individuals from several Dutch provinces, in all cases retrieving
evidence of fine-scale demographic events from recent history. Finally, we
expand the presented model to describe distributions for those sites in IBD
shared segments that harbor mutation events, showing how these may be used for
the inference of mutation rates in humans and other species.
Comment: Ph.D. thesis
Insights into the genomic histories of diverse human populations using whole-genome sequencing analysis.
Despite the progress in sampling many populations, human genomics research is still not fully reflective of the diversity found globally. Understudied populations limit our knowledge of genetic variation and population history, and their inclusion is needed to ensure they benefit from future developments in genomic medicine. In this thesis, I describe extending our understanding of global genetic diversity and population history through two main projects. The first is focused on structural variation in a diverse set of 54 human populations which are part of the Human Genome Diversity Project (HGDP-CEPH) panel. Using whole-genome sequences previously produced at the Wellcome Sanger Institute, I generated a comprehensive catalogue of structural variation identifying a total of 126,018 variants, of which 78% are novel. Some reach high frequency and are private to continental groups or even individual populations, including regionally-restricted runaway duplications and putatively introgressed variants from archaic hominins. By de novo assembly of 25 genomes using linked-read sequencing, I discovered 1643 breakpoint-resolved unique insertions, in aggregate accounting for 1.9 Mb of sequence absent from the GRCh38 reference genome, highlighting the limitation of a single human reference genome. In the second project I collected and analysed a dataset of 137 high-coverage physically-phased genome sequences from eight Middle Eastern populations using linked-read sequencing. Focusing on the population history using single nucleotide variants, I found no genetic traces of archaeologically documented early expansions out-of-Africa in present-day populations in the region. I show that Arabian populations have the lowest Neanderthal ancestry of all non-African populations tested, which is explained by them having elevated Basal Eurasian ancestry.
By comparing Levantines and Arabian historical population sizes, I find a divergence that starts before the Neolithic era, when Levantines expanded while Arabians maintained small populations that could have derived ancestry from local epipaleolithic hunter-gatherers. All populations suffered a bottleneck overlapping the archaeologically-documented aridification events, with Arabians decreasing in size with the onset of the desert climate in Arabia ~6 kya while the Levantine bottleneck overlaps the 4.2 kiloyear aridification event. I also identify an ancestry that is associated with the spread of Semitic languages across the region during the Bronze Age. Finally, I identify novel variants that show evidence of selection, including signals of polygenic selection. This thesis fills an important gap in the study of diverse human populations, although further work is needed to sequence and characterize additional genetically underrepresented groups.
Government of Dubai - Dubai Police GH
FRET studies of a landscape of Lac repressor-mediated DNA loops
DNA looping mediated by the Lac repressor is an archetypal test case for modeling protein and DNA flexibility. Understanding looping is fundamental to quantitative descriptions of gene expression. Systematic analysis of LacI•DNA looping was carried out using a landscape of DNA constructs with lac operators bracketing an A-tract bend, produced by varying helical phasings between operators and the bend. Fluorophores positioned on either side of both operators allowed direct Förster resonance energy transfer (FRET) detection of parallel (P1) and antiparallel (A1, A2) DNA looping topologies anchored by V-shaped LacI. Combining fluorophore position variant landscapes allows calculation of the P1, A1 and A2 populations from FRET efficiencies and also reveals extended low-FRET loops proposed to form via LacI opening. The addition of isopropyl-β-d-thio-galactoside (IPTG) destabilizes but does not eliminate the loops, and IPTG does not redistribute loops among high-FRET topologies. In some cases, subsequent addition of excess LacI does not reduce FRET further, suggesting that IPTG stabilizes extended or other low-FRET loops. The data align well with rod mechanics models for the energetics of DNA looping topologies. At the peaks of the predicted energy landscape for V-shaped loops, the proposed extended loops are more stable and are observed instead, showing that future models must consider protein flexibility
The Relevance of Pedigrees in the Conservation Genomics Era
Over the past 50 years conservation genetics has developed a substantive toolbox to inform species management. One of the most long-standing tools available to manage genetics—the pedigree—has been widely used to characterize diversity and maximize evolutionary potential in threatened populations. Now, with the ability to use high throughput sequencing to estimate relatedness, inbreeding, and genome-wide functional diversity, some have asked whether it is warranted for conservation biologists to continue collecting and collating pedigrees for species management. In this perspective, we argue that pedigrees remain a relevant tool, and when combined with genomic data, create an invaluable resource for conservation genomic management. Genomic data can address pedigree pitfalls (e.g., founder relatedness, missing data, uncertainty), and in return robust pedigrees allow for more nuanced research design, including well-informed sampling strategies and quantitative analyses (e.g., heritability, linkage) to better inform genomic inquiry. We further contend that building and maintaining pedigrees provides an opportunity to strengthen trusted relationships among conservation researchers, practitioners, Indigenous Peoples, and Local Communities
Issues in information integration of omics data: microarray meta-analysis for candidate marker and module detection and genotype calling incorporating family information
As more high-throughput genomic data sets become publicly available, meta-analysis that combines results from independent studies has become an essential approach to increasing statistical power, for example, in the detection of differentially expressed genes in microarray studies. In addition to meta-analysis, researchers also incorporate pathway or clinical information from external databases to perform integrative analysis. In this thesis, I will present three projects which encompass three types of integrative analysis. First, we perform a comprehensive comparative study to evaluate 12 microarray meta-analysis methods in simulation studies and real examples by using four quantitative criteria: detection capability, biological association, stability and robustness, and we propose a practical guideline for practitioners to choose the most appropriate meta-analysis method in real applications. Second, we develop a meta-clustering method to construct co-expressed modules from 11 major depressive disorder transcriptome datasets, incorporating GWAS and pathway information from external databases. Third, we propose a computationally feasible algorithm to call genotypes with higher accuracy by considering family information from next generation sequencing data for two purposes: (1) to propose a new genotype calling algorithm for complex families, and (2) to extend our algorithm to incorporate external reference panels to analyze family-based sequence data with a small sample size. In conclusion, we develop several integrative methods for omics data analysis; the results are of public health significance because they improve biomarker detection in biomedical research and provide insights into the underlying disease mechanisms.
Haplotype-resolved genome assembly of an F1 hybrid of Eucalyptus urophylla x E. grandis
Dissertation (MSc (Genetics))--University of Pretoria, 2021.
De novo haplotype-phased genome assemblies based on long-read sequencing technologies have improved the detection and characterization of structural variants (SVs) in plant and animal genomes. As long reads are able to span across haplotypes, they also allow phased (haplo) assemblies of highly heterozygous genomes such as those of forest trees. Knowledge of SV function and their resulting impact on gene expression can be used by breeders to guide tree improvement. Eucalyptus species and hybrids are some of the most widely planted hardwood trees. Hybrids are often preferred as they combine the genetic background of two species to produce more resilient trees that can inhabit a wider environmental deployment range. For example, E. urophylla x E. grandis hybrids combine the disease resistance of E. urophylla with the fast growth and desirable wood properties of E. grandis. However, using such a strategy in eucalypt breeding first requires a high-quality reference genome (preferably phased) with which additional de novo assembled genomes can be compared. The aim of this study was to assemble high-quality haplotype-phased genomes for Eucalyptus urophylla and E. grandis. Using Nanopore sequencing data generated for an E. urophylla x E. grandis F1 hybrid and a trio-binning approach, we successfully assembled 544.51 Mb of the E. urophylla haplogenome (contig N50 of 1.93 Mb) and 566.75 Mb of the E. grandis haplogenome (contig N50 of 2.42 Mb) with a BUSCO completion score of 98.8%. Using high-density SNP genetic linkage maps of both parents, more than 88% of the haplogenome contigs could be anchored to one of the eleven chromosomes (scaffold N50 of 42.45 Mb and 43.82 Mb for the E. urophylla and E. grandis haplogenome assemblies, respectively). We also provide the first genome-wide comparison between E. urophylla and E. grandis, using the Synteny and Rearrangement Identifier (SyRI) to identify SVs, leading to the discovery of 48,729 SVs between the two haplogenomes. This study is the first step towards implementing haplotype-informed molecular breeding of Eucalyptus tree species.
The National Research Foundation of South Africa, the South African Department of Science and Innovation, the Technology Innovation Agency and Technology and Human Resources for Industry Programme
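The abstract above reports contig and scaffold N50 values. For readers unfamiliar with the statistic, the sketch below shows the usual definition: the length L such that sequences of length L or longer together hold at least half of the assembled bases. The function name `n50` and the toy contig lengths are assumptions for illustration and are not taken from the dissertation.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (made up): total 32, so half is 16; the two
# longest contigs (8 + 8) already reach it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

A higher N50 for the same total assembly size means the assembly is made of fewer, longer pieces, which is why it is quoted alongside the assembled length.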