
    Second-generation PLINK: rising to the challenge of larger and richer datasets

    PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use. Comment: 2 figures, 1 additional fil
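
The "bit-level parallelism" mentioned in the abstract above can be illustrated with a small sketch. This is not PLINK's implementation: it only assumes that each genotype is stored as a 2-bit code packed into one integer, and shows how all genotype classes can be counted with a few mask and popcount operations instead of a per-sample loop. The helper names `pack`, `popcount`, and `count_genotypes` are hypothetical, and the biological meaning of each 2-bit code is deliberately left abstract.

```python
def pack(codes):
    """Pack a list of 2-bit codes (0..3) into one integer, sample i at bits 2i..2i+1."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0b11) << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_genotypes(word, n):
    """Count how many of the n packed 2-bit fields hold each code.

    Every field is examined at once: `low` keeps the low bit of each field,
    `high` keeps the high bit (shifted down onto the same positions).
    """
    mask = int("01" * n, 2) if n else 0  # 0b0101...01, one set bit per field
    low = word & mask
    high = (word >> 1) & mask
    both = popcount(low & high)        # fields equal to 0b11
    high_only = popcount(high & ~low)  # fields equal to 0b10
    low_only = popcount(low & ~high)   # fields equal to 0b01
    neither = n - both - high_only - low_only  # fields equal to 0b00
    return {"00": neither, "01": low_only, "10": high_only, "11": both}
```

For example, `count_genotypes(pack(codes), len(codes))` gives the same tallies as counting the entries of `codes` one by one, but the work is done on whole machine words, which is the property the abstract credits for much of the speedup.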

    Haplotype inference based on Hidden Markov Models in the QTL-MAS 2010 multi-generational dataset

    Background: We have previously demonstrated an approach for efficient computation of genotype probabilities, and more generally probabilities of allele inheritance in inbred as well as outbred populations. That work also included an extension for haplotype inference, or phasing, using Hidden Markov Models. Computational phasing of multi-thousand marker datasets has not become common as of yet. In this communication, we further investigate the method presented earlier for such problems, in a multi-generational dataset simulated for QTL detection. Results: When analyzing the dataset simulated for the 14th QTLMAS workshop, the phasing produced showed zero deviations compared to original simulated phase in the founder generation. In total, 99.93% of all markers were correctly phased. 97.68% of the individuals were correct in all markers over all 5 simulated chromosomes. Results were produced over a weekend on a small computational cluster. The specific algorithmic adaptations needed for the Markov model training approach in order to reach convergence are described. Conclusions: Our method provides efficient, near-perfect haplotype inference allowing the determination of completely phased genomes in dense pedigrees. These developments are of special value for applications where marker alleles are not corresponding directly to QTL alleles, thus necessitating tracking of allele origin, and in complex multi-generational crosses. The cnF2freq codebase, which is in a current state of active development, is available under a BSD-style license.

    This is just a phase: the impact of population structure on haplotype phasing and linkage disequilibrium measures at functional genetic sites.

    The block-like structure of the human genome has been the subject of many scientific papers and is of practical significance in large-scale genome-wide association studies. How stringent haplotype block boundaries are within and between populations has been the subject of ongoing debate within human population genetics. This thesis will contribute to the description of universal and population-specific haplotype blocks at functional sites, namely across the IL-10 gene family (including IL-10, IL-19, IL-20 and IL-24), which is involved in a number of immune system processes, and MAPKAP-K2, an adjacent and functionally significant kinase gene. Beyond the description of blocks across these sites in different populations, this thesis will also measure the impact of the haplotype phasing process on downstream applications of linkage disequilibrium analysis, which underlies much of the research on human haplotype blocks. The five genes in this analysis span just over 200kb on the q arm of chromosome 1. A total of 80 samples from the Coriell Institute of Medical Research are used in this analysis and represent Andean, Basque, Chinese, Iberian, Indo-Pakistani, Middle Eastern, Russian, South African and North African populations. Some haplotype block boundaries were concordant with gene boundaries with most populations showing a consistent boundary between IL-20 and IL-24 and at least half of the study populations showing consistent boundaries between MAPKAP-K2, IL-10 and IL-20. The only gene boundary lacking a persistent haplotype block boundary was between IL-19 and IL-20. The haplotype phasing programs PHASE and Beagle shared 13 of 15 haplotype block boundaries in common while MDBlocks and Beagle only shared 2 haplotype block boundaries and PHASE and MDBlocks only shared 1 block boundary. These data indicate that there are indeed population-specific differences in the distribution of LD across these five sites. 
Despite these differences, there is a general trend of high LD across each gene with a breakdown of LD at gene boundaries across all populations
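
To make concrete why phasing affects downstream LD measures, here is a minimal sketch of the standard pairwise statistics D, D' and r² computed from phased two-locus haplotypes. It is not taken from the thesis; the function name `ld_stats` and the 0/1 allele coding are assumptions for illustration. The key point is that the joint frequency `p_ab` can only be counted directly once phase is known, so different phasing results can change every value below.

```python
def ld_stats(haplotypes):
    """Return (D, D', r^2) for two biallelic loci.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1,
    one tuple per phased chromosome.
    Returns (D, None, None) if either locus is monomorphic.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # joint frequency

    d = p_ab - p_a * p_b  # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return d, None, None

    r2 = d * d / denom
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d != 0 else 0.0
    return d, d_prime, r2
```

With ten chromosomes split evenly between haplotypes `(1, 1)` and `(0, 0)`, the two loci are in complete LD (r² = 1); with the four haplotypes `(1, 1)`, `(1, 0)`, `(0, 1)`, `(0, 0)` in equal numbers, D and r² are both 0.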

    Genome-wide inference of ancestral recombination graphs

    The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. Preliminary results also indicate that our methods can be used to gain insight into complex features of human population structure, even with a noninformative prior distribution. Comment: 88 pages, 7 main figures, 22 supplementary figures. 
This version contains a substantially expanded genomic data analysi

    Genotype/Haplotype Tagging Methods and their Validation

    This study focuses on MLR-tagging for statistical covering, i.e., either maximizing the average R2 for a given number of requested tags or minimizing the number of tags such that for every non-tag SNP there exists a highly correlated (squared correlation R2 > 0.8) tag SNP. We compare it with Tagger, a software tool for selecting tags in the HapMap project. MLR-tagging needs fewer tags than Tagger in all but 2 of the 6 cases in the given test sets. Meanwhile, biologists can detect or collect data only from a small set. This raises the problem of how accurate the tag SNP estimates are when constructing the complete human haplotype map. This study investigates how MLR-tagging for statistical coverage performs in an unbiased study. The experimental results show that MLR-tagging still selects a small number of SNPs very well even without observing every SNP in the sample
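
The covering criterion in the abstract above (every non-tag SNP must have a tag SNP with squared correlation above 0.8) can be sketched with a simple greedy set cover on pairwise r². This is explicitly not MLR-tagging, which the abstract does not describe in detail; it is a simpler pairwise stand-in, and the names `r_squared` and `greedy_tags` and the genotype coding (allele counts per sample) are assumptions for illustration.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as 0 here
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_tags(genotypes, threshold=0.8):
    """Pick tag SNPs so every SNP is a tag or has r^2 > threshold with a tag.

    genotypes: dict mapping SNP name -> list of allele counts, one per sample.
    Greedy heuristic: repeatedly pick the SNP covering the most uncovered SNPs.
    """
    names = list(genotypes)
    covers = {s: {s} for s in names}  # each SNP covers itself
    for a, b in combinations(names, 2):
        if r_squared(genotypes[a], genotypes[b]) > threshold:
            covers[a].add(b)
            covers[b].add(a)

    uncovered = set(names)
    tags = []
    while uncovered:
        best = max(names, key=lambda s: len(covers[s] & uncovered))
        tags.append(best)
        uncovered -= covers[best]
    return tags
```

The greedy choice does not guarantee the minimum number of tags, which is one reason more elaborate methods such as those compared in the study exist.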

    Molecular systematics and phylogeography of the Helmeted Guineafowl (Numida meleagris)

    Includes bibliographical references (leaves 61-67)

    A method for identifying ancient introgression between caballine and non-caballine equids using whole genome high throughput data.

    Introgression is one of the main mechanisms that transfer adapted alleles between species. The advantageous variants are positively selected and retained in the recipient population, while the rest of the variants undergo negative selection. When analyzing the horse genome, two alleles were found in the CXCL16 gene, one associated with susceptibility and one with resistance to developing persistent shedding of the Equine Arteritis Virus. The two alleles differ by 4 non-synonymous variants in exon 1 of the gene. Comparison with 3 non-caballine equids (zebras, asses and hemiones) revealed that one haplotype was almost identical to the haplotype found in non-caballines, while the other had differences characteristic of 4.5 million years since a common ancestor. Based on this observation, we project that an ancient introgression event occurred between caballine and non-caballine equids. If so, we should be able to find more instances of introgression between these species. We developed a method to identify putatively introgressed segments in the horse genome. It is estimated that non-caballine equids such as zebras and asses diverged from horses between 4 and 4.5 MYA. Genomic analysis of these animals vs. the equine reference genome reveals the divergence at both the nucleotide and chromosomal level. Whole genome data for the non-caballine equids, when mapped to the caballine (Equus caballus) reference genome, show a greater frequency of single nucleotide differences than horses have relative to the same reference. We have created a Likelihood Estimate framework that uses this difference in single nucleotide frequencies to predict whether a haplotype evolved along the caballine or non-caballine lineage. Our results demonstrated that these haplotypes are between 0.5 and 2kb in length and are detectable at a rate of several hundred loci per horse. 
About 1.1% of the equine genome was introgressed and 64% of the identified putative regions were associated with either structural elements, regulatory regions, or both. These regions were responsible for gene products involved in regulation of response to stimuli, signal transduction, integral components of cell membrane and important metabolism pathways such as purine metabolism and thiamine metabolism. Furthermore, these haplotypes occur at high frequency in the horse population suggesting that they are positively selected by evolution