170 research outputs found

    Complex ancient genetic structure and cultural transitions in Southern African populations

    The characterization of the structure of southern African populations has been the subject of numerous genetic, medical, linguistic, archaeological, and anthropological investigations. Current diversity in the subcontinent is the result of complex events of genetic admixture and cultural contact between early inhabitants and migrants that arrived in the region over the last 2000 years. Here, we analyze 1856 individuals from 91 populations, comprising novel and published genotype data, to characterize the genetic ancestry profiles of 631 individuals from 51 southern African populations. Combining both local ancestry and allele frequency based analyses, we identify a tripartite, ancient, Khoesan-related genetic structure. This structure correlates neither with linguistic affiliation nor subsistence strategy, but with geography, revealing the importance of isolation-by-distance dynamics in the area. Fine-mapping of these components in southern African populations reveals admixture and cultural reversion involving several Khoesan groups, and highlights that Bantu speakers and Coloured individuals have different mixtures of these ancient ancestries

    Unveiling evolutionary algorithm representation with DU maps

    Evolutionary algorithms (EAs) have proven to be effective in tackling problems in many different domains. However, users are often required to spend a significant amount of effort in fine-tuning the EA parameters in order to make the algorithm work. In principle, visualization tools may be of great help in this laborious task, but current visualization tools are either EA-specific, and hence hardly available to all users, or too general to convey detailed information. In this work, we study the Diversity and Usage map (DU map), a compact visualization for analyzing a key component of every EA, the representation of solutions. In a single heat map, the DU map visualizes for entire runs how diverse the genotype is across the population and to which degree each gene in the genotype contributes to the solution. We demonstrate the generality of the DU map concept by applying it to six EAs that use different representations (bit and integer strings, trees, ensembles of trees, and neural networks). We present the results of an online user study about the usability of the DU map which confirm the suitability of the proposed tool and provide important insights on our design choices. By providing a visualization tool that can be easily tailored by specifying the diversity (D) and usage (U) functions, the DU map aims at being a powerful analysis tool for EAs practitioners, making EAs more transparent and hence lowering the barrier for their use

    Statistical and Computational Methods for Analyzing and Visualizing Large-Scale Genomic Datasets

    Advances in large-scale genomic data production have led to a need for better methods to process, interpret, and organize this data. Starting with raw sequencing data, generating results requires many complex data processing steps, from quality control, alignment, and variant calling to genome-wide association studies (GWAS) and characterization of expression quantitative trait loci (eQTL). In this dissertation, I present methods to address issues faced when working with large-scale genomic datasets. In Chapter 2, I present an analysis of 4,787 whole genomes sequenced for the study of age-related macular degeneration (AMD) as a follow-up fine-mapping study to previous work from the International AMD Genomics Consortium (IAMDGC). Through whole genome sequencing, we comprehensively characterized genetic variants associated with AMD in known loci to provide additional insights on the variants potentially responsible for the disease by leveraging 60,706 additional controls. Our study improved the understanding of loci associated with AMD and demonstrated the advantages and disadvantages of different approaches for fine-mapping studies with sequence-based genotypes. In Chapter 3, I describe a novel method and a software tool to perform Hardy-Weinberg equilibrium (HWE) tests for structured populations. In sequence-based genetic studies, HWE test statistics are important quality metrics for distinguishing true genetic variants from artifactual ones, but they become much less informative when applied to a heterogeneous and/or structured population. As next generation sequencing studies contain samples from increasingly diverse ancestries, we developed a new HWE test which addresses both the statistical and computational challenges of modern large-scale sequencing data and implemented the method in a publicly available software tool. Moreover, we extensively evaluated our proposed method against alternative methods for testing HWE in both simulated and real datasets. 
Our method has been successfully applied to the latest variant calling QC pipeline in the TOPMed project. In Chapter 4, I describe PheGET, a web application to interactively visualize Expression Quantitative Trait Loci (eQTLs) across tissues, genes, and regions to aid functional interpretations of regulatory variants. Tissue-specific expression has become increasingly important for understanding the links between genetic variation and disease. To address this need, the Genotype-Tissue Expression (GTEx) project collected and analyzed a treasure trove of expression data. However, effectively navigating this wealth of data to find signals relevant to researchers has become a major challenge. I demonstrate the functionalities of PheGET using the newest GTEx data on our eQTL browser website at https://eqtl.pheweb.org/, allowing the user to 1) view all cis-eQTLs for a single variant; 2) view and compare single-tissue, single-gene associations within any genomic region; 3) find the best eQTL signal in any given genomic region or gene; and 4) customize the plotted data in real time. PheGET is designed to handle and display the kind of complex multidimensional data often seen in our post-GWAS era, such as multi-tissue expression data, in an intuitive and convenient interface, giving researchers an additional tool to better understand the links between genetics and disease. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162918/1/amkwong_1.pd

    Processing genome-wide association studies within a repository of heterogeneous genomic datasets

    Background: Genome-wide association studies (GWAS) are based on the observation of genome-wide sets of genetic variants, typically single-nucleotide polymorphisms (SNPs), in different individuals that are associated with phenotypic traits. Research efforts have so far been directed to improving GWAS techniques rather than to making the results of GWAS interoperable with other genomic signals; this is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions. Results: To practically facilitate integrative use, we propose to include GWAS datasets within the META-BASE repository, exploiting an integration pipeline previously studied for other genomic datasets that includes several heterogeneous data types in the same format, queryable from the same systems. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort finally allows us to use these datasets within multisample processing queries that respond to important biological questions. These are then made usable for multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, epigenetic signals. Conclusions: As a result of our work on GWAS datasets, we enable 1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository; 2) their big data processing by means of the GenoMetric Query Language and associated system. 
Future large-scale tertiary data analysis may extensively benefit from the addition of GWAS results to inform several different downstream analysis workflows

    Genetic variation and health in rural Caribbean village

    Bwa Mawego is a small-scale horticultural community (~500 people) on the island of Dominica that has been the site of a longitudinal health research project for more than 30 years. Cardiovascular diseases and metabolic health are growing local concerns. Here we analyze longitudinal growth data, cardiometabolic metrics, and genome-wide single nucleotide polymorphism (SNP) data from this population to investigate sources of variation in anthropometric and cardiometabolic outcomes. Mixed effect heritability models indicate that (1) variation in body mass index (BMI) is significantly shaped by genetic variation, and (2) variation between longitudinal BMI curves has not been consistently impacted by secular environmental trends from 1997 to 2017. In order to assess genetic variation in more detail, we first characterize the population structure and admixture in this Caribbean community using high-density SNP data and global reference samples in the Human Genome Diversity Panel. We detect four distinct family clusters and admixture from African, European, and Amerindian ancestral populations that occurred 5-6 generations ago (~130-150 years). Amerindian haplotypes represented in Bwa Mawego associate with deeply diverged lineages in Karitiana and Surui peoples, highlighting the regionally variable nature of admixture throughout the Caribbean and unique historical outcomes in Dominica. Genome-wide association tests of cardiometabolic phenotypes identify a genomic region of interest downstream of the ANK3 gene that associates with BMI in Bwa Mawego, after controlling for confounding variation from ancestral population structure and relatedness. Any functional relationship between ANK3 and BMI is currently uncharacterized, and there is unique potential to further explore complex gene-environment-phenotype landscapes in Bwa Mawego. Includes bibliographical reference

    R-Gada: a fast and flexible pipeline for copy number analysis in association studies

    Background: Genome-wide association studies (GWAS) using Copy Number Variation (CNV) are becoming a central focus of genetic research. CNVs have successfully provided target genome regions for some disease conditions where simple genetic variation (i.e., SNPs) has previously failed to provide a clear association. Results: Here we present a new R package that integrates: (i) data import from most common formats of Affymetrix, Illumina and aCGH arrays; (ii) a fast and accurate segmentation algorithm to call CNVs based on Genome Alteration Detection Analysis (GADA); and (iii) functions for displaying and exporting the Copy Number calls, identification of recurrent CNVs, multivariate analysis of population structure, and tools for performing association studies. Using a large dataset containing 270 HapMap individuals (Affymetrix Human SNP Array 6.0 Sample Dataset) we demonstrate a flexible pipeline implemented with the package. It requires less than one minute per sample (3 million probe arrays) on a single core computer, and provides a flexible parallelization for very large datasets. Case-control data were generated from the HapMap dataset to demonstrate a GWAS analysis. Conclusions: The package provides the tools for creating a complete integrated pipeline from data normalization to statistical association. It can efficiently handle a massive volume of data consisting of millions of genetic markers and hundreds or thousands of samples with very accurate results.

    Conservation Genomics of Cascades Frogs (Rana cascadae) at the Southern Edge of Their Range

    Cascades frogs (Rana cascadae) in the southern Cascades Range of California have been declining over the last 30 years, primarily due to the fungal pathogen, Batrachochytrium dendrobatidis (Bd). In the Lassen Region of the southern Cascades, at least six of the eleven remaining localities face extirpation within 50 years. These small and isolated populations are prone to negative genetic effects including reduced diversity and increased inbreeding which could potentially exacerbate declines. I used a large dataset of SNP loci generated from high-throughput sequencing to characterize patterns of genetic structure and diversity in twelve R. cascadae populations in California to prioritize populations for conservation and compared these populations with three in Oregon to determine differences in diversity and population divergence. I also detected outlier loci using genome-scan methods and compared patterns of differentiation between these loci and presumably neutral loci. I found evidence of genetic structure in California creating two main groups of ancestry despite a strong pattern of isolation-by-distance (IBD), with Oregon populations forming a third group. Populations in California were highly differentiated from those in Oregon and had lower estimates of genetic diversity that support documented demographic declines. Rana cascadae was also moderately differentiated between the two main regions within California but genetic diversity was similar. Patterns of genetic differentiation were overall similar between outlier and neutral loci. These findings indicate that Cascades frogs in California should be managed by genetic ancestry and not by ecoregion, as they are currently. Source populations should be selected by choosing the nearest and demographically largest site to the donor population within the same major genetic ancestry group to maximize genetic diversity and minimize both outbreeding and inbreeding depression. 
This study provides the beginnings for understanding the spatial genetic structuring of Cascades frogs in California and provides managers a way forward for active conservation in the face of ongoing declines

    An Overview of Strategies for Detecting Genotype-Phenotype Associations Across Ancestrally Diverse Populations

    Genome-wide association studies (GWAS) have been very successful at identifying genetic variants influencing a large number of traits. Although the great majority of these studies have been performed in European-descent individuals, it has been recognised that including populations with differing ancestries enhances the potential for identifying causal SNPs due to their differing patterns of linkage disequilibrium. However, when individuals from distinct ethnicities are included in a GWAS, it is necessary to implement a number of control steps to ensure that the identified associations are real genotype-phenotype relationships. In this Review, we discuss the analyses that are required when performing multi-ethnic studies, including methods for determining ancestry at the global and local level for sample exclusion, controlling for ancestry in association testing, and post-GWAS interrogation methods such as genomic control and meta-analysis. We hope that this overview provides a primer for those researchers interested in including distinct populations in their studies