3,122 research outputs found
Second-generation PLINK: rising to the challenge of larger and richer datasets
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide
association studies (GWAS) and research in population genetics. However, the
steady accumulation of data from imputation and whole-genome sequencing studies
has exposed a strong need for even faster and more scalable implementations of
key functions. In addition, GWAS and population-genetic data now frequently
contain probabilistic calls, phase information, and/or multiallelic variants,
none of which can be represented by PLINK 1's primary data format.
To address these issues, we are developing a second-generation codebase for
PLINK. The first major release from this codebase, PLINK 1.9, introduces
extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space
Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic
improvements. In combination, these changes accelerate most operations by 1-4
orders of magnitude, and allow the program to handle datasets too large to fit
in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data
format capable of efficiently representing probabilities, phase, and
multiallelic variants, and (b) extensions of many functions to account for the
new types of information.
The second-generation versions of PLINK will offer dramatic improvements in
performance and compatibility. For the first time, users without access to
high-end computing resources can perform several essential analyses of the
feature-rich and very large genetic datasets coming into use.
Comment: 2 figures, 1 additional file
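The bit-level parallelism mentioned in the abstract can be illustrated with a toy version of packed genotype counting. This is a minimal sketch, not PLINK's actual C/C++ implementation: the 2-bit encoding below (genotype = alt-allele count) is a simplifying assumption, whereas PLINK's real .bed encoding also reserves a code for missing genotypes.

```python
def pack_genotypes(genos):
    """Pack a list of genotypes (0/1/2 = alt-allele count) into one
    integer, 2 bits per sample."""
    word = 0
    for i, g in enumerate(genos):
        word |= g << (2 * i)
    return word

def alt_allele_count(word, n_samples):
    """Count alt alleles for all samples in one pass using masks:
    bit 0 of each field contributes 1, bit 1 contributes 2."""
    low_mask = int("01" * n_samples, 2)  # selects bit 0 of each 2-bit field
    high = (word >> 1) & low_mask        # bit 1 of each field
    low = word & low_mask
    return 2 * bin(high).count("1") + bin(low).count("1")

genos = [0, 1, 2, 2, 1, 0, 2]
w = pack_genotypes(genos)
print(alt_allele_count(w, len(genos)))  # 8 == sum(genos)
```

Because each mask operation touches every sample's field in one machine word, the work per variant scales with the word count rather than the sample count, which is the source of speedups of the kind the abstract describes.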
Haplotype inference based on Hidden Markov Models in the QTL-MAS 2010 multi-generational dataset
Background: We have previously demonstrated an approach for efficient computation of genotype probabilities, and more generally of probabilities of allele inheritance, in inbred as well as outbred populations. That work also included an extension for haplotype inference, or phasing, using Hidden Markov Models. Computational phasing of multi-thousand-marker datasets has not yet become common. In this communication, we further investigate the method presented earlier for such problems, in a multi-generational dataset simulated for QTL detection.
Results: When analyzing the dataset simulated for the 14th QTLMAS workshop, the phasing produced showed zero deviations from the original simulated phase in the founder generation. In total, 99.93% of all markers were correctly phased, and 97.68% of the individuals were correct at all markers over all 5 simulated chromosomes. Results were produced over a weekend on a small computational cluster. The specific algorithmic adaptations needed for the Markov model training approach to reach convergence are described.
Conclusions: Our method provides efficient, near-perfect haplotype inference, allowing the determination of completely phased genomes in dense pedigrees. These developments are of special value for applications where marker alleles do not correspond directly to QTL alleles, thus necessitating tracking of allele origin, and in complex multi-generational crosses. The cnF2freq codebase, which is under active development, is available under a BSD-style license.
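The HMM machinery underlying such phasing can be illustrated with the standard forward algorithm. The toy model below, with two hypothetical phase states and invented transition/emission probabilities, is a sketch of the general technique only; it is not the cnF2freq model itself.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Standard HMM forward algorithm: total probability of the
    observation sequence, summed over all hidden-state paths."""
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        f = {s: emit_p[s][o] * sum(f[r] * trans_p[r][s] for r in states)
             for s in states}
    return sum(f.values())

# Two hypothetical phase states; a transition between them models a
# recombination/phase-flip event (illustrative numbers only).
states = ("AB", "BA")
start_p = {"AB": 0.5, "BA": 0.5}
trans_p = {"AB": {"AB": 0.9, "BA": 0.1}, "BA": {"AB": 0.1, "BA": 0.9}}
emit_p = {"AB": {"a": 0.95, "b": 0.05}, "BA": {"a": 0.05, "b": 0.95}}
print(forward(["a", "a"], states, start_p, trans_p, emit_p))  # ≈ 0.412
```

Phasing methods of this family run such recursions over thousands of markers, which is why the algorithmic adaptations for convergence mentioned in the abstract matter in practice.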
This is just a phase: the impact of population structure on haplotype phasing and linkage disequilibrium measures at functional genetic sites.
The block-like structure of the human genome has been the subject of many scientific papers and is of practical significance in large-scale genome-wide association studies. How stringent haplotype block boundaries are within and between populations has been the subject of ongoing debate in human population genetics. This thesis contributes to the description of universal and population-specific haplotype blocks at functional sites, namely across the IL-10 gene family (including IL-10, IL-19, IL-20 and IL-24), which is involved in a number of immune system processes, and MAPKAP-K2, an adjacent and functionally significant kinase gene. Beyond describing blocks across these sites in different populations, this thesis also measures the impact of the haplotype phasing process on downstream applications of linkage disequilibrium analysis, which underlies much of the research on human haplotype blocks. The five genes in this analysis span just over 200 kb on the q arm of chromosome 1. A total of 80 samples from the Coriell Institute for Medical Research are used, representing Andean, Basque, Chinese, Iberian, Indo-Pakistani, Middle Eastern, Russian, South African and North African populations. Some haplotype block boundaries were concordant with gene boundaries: most populations showed a consistent boundary between IL-20 and IL-24, and at least half of the study populations showed consistent boundaries between MAPKAP-K2, IL-10 and IL-20. The only gene boundary lacking a persistent haplotype block boundary was between IL-19 and IL-20. The haplotype phasing programs PHASE and Beagle shared 13 of 15 haplotype block boundaries, while MDBlocks and Beagle shared only 2, and PHASE and MDBlocks shared only 1. These data indicate that there are indeed population-specific differences in the distribution of LD across these five sites. Despite these differences, there is a general trend of high LD across each gene, with a breakdown of LD at gene boundaries across all populations.
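The LD measures underlying these block comparisons can be made concrete with the textbook r² computation from phased haplotypes. This is a minimal sketch assuming 0/1-coded biallelic sites, not the specific pipeline used in the thesis.

```python
def r_squared(hapA, hapB):
    """Squared correlation (r^2) between two biallelic sites,
    computed from phased haplotypes coded 0/1."""
    n = len(hapA)
    pA = sum(hapA) / n                                  # allele frequency at site A
    pB = sum(hapB) / n                                  # allele frequency at site B
    pAB = sum(a and b for a, b in zip(hapA, hapB)) / n  # joint haplotype frequency
    D = pAB - pA * pB                                   # LD coefficient
    denom = pA * (1 - pA) * pB * (1 - pB)
    return D * D / denom if denom else 0.0

print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (perfect LD)
print(r_squared([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0 (independent sites)
```

Block-definition algorithms such as those compared in the thesis differ mainly in how they aggregate pairwise values like these into block boundaries, which is why phasing errors propagate into block calls.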
Genome-wide inference of ancestral recombination graphs
The complex correlation structure of a collection of orthologous DNA
sequences is uniquely captured by the "ancestral recombination graph" (ARG), a
complete record of coalescence and recombination events in the history of the
sample. However, existing methods for ARG inference are computationally
intensive, highly approximate, or limited to small numbers of sequences, and,
as a consequence, explicit ARG inference is rarely used in applied population
genomics. Here, we introduce a new algorithm for ARG inference that is
efficient enough to apply to dozens of complete mammalian genomes. The key idea
of our approach is to sample an ARG of n chromosomes conditional on an ARG of
n-1 chromosomes, an operation we call "threading." Using techniques based on
hidden Markov models, we can perform this threading operation exactly, up to
the assumptions of the sequentially Markov coalescent and a discretization of
time. An extension allows for threading of subtrees instead of individual
sequences. Repeated application of these threading operations results in highly
efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these
methods in a computer program called ARGweaver. Experiments with simulated data
indicate that ARGweaver converges rapidly to the true posterior distribution
and is effective in recovering various features of the ARG for dozens of
sequences generated under realistic parameters for human populations. In
applications of ARGweaver to 54 human genome sequences from Complete Genomics,
we find clear signatures of natural selection, including regions of unusually
ancient ancestry associated with balancing selection and reductions in allele
age in sites under directional selection. Preliminary results also indicate
that our methods can be used to gain insight into complex features of human
population structure, even with a noninformative prior distribution.
Comment: 88 pages, 7 main figures, 22 supplementary figures. This version contains a substantially expanded genomic data analysis.
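The threading idea reads naturally as a Gibbs-style sampler. The sketch below shows only the outer control flow under stated assumptions: `sample_threading` is a stand-in callback for ARGweaver's exact HMM conditional sampler under the sequentially Markov coalescent, which is far too involved to reproduce here.

```python
import random

def mcmc_threading(chromosomes, sample_threading, n_iter, rng=random):
    """Gibbs-style sampler in the spirit of ARGweaver's 'threading':
    repeatedly remove one chromosome from the current genealogy and
    re-sample its placement conditional on the remaining ones.
    `sample_threading(others, chrom)` is an assumed user-supplied
    conditional sampler, not ARGweaver's actual routine."""
    state = {c: None for c in chromosomes}  # placement of each chromosome
    for _ in range(n_iter):
        c = rng.choice(chromosomes)          # pick a chromosome to re-thread
        others = {k: v for k, v in state.items() if k != c}
        state[c] = sample_threading(others, c)
    return state

# Dummy conditional sampler for illustration: records how many other
# chromosomes it was conditioned on.
state = mcmc_threading(["chr_a", "chr_b", "chr_c"],
                       lambda others, c: len(others), n_iter=100)
```

Because each step conditions on all but one sequence, repeated sweeps leave the joint posterior invariant, which is what makes the repeated threading operations a valid MCMC sampler for ARGs.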
Haplotype Assembly and Small Variant Calling using Emerging Sequencing Technologies
Short-read DNA sequencing technologies from Illumina have made sequencing a human genome significantly more affordable, greatly accelerating studies of biological function and of the association of genetic variants with disease. These technologies are frequently used to detect small genetic variants such as single nucleotide variants (SNVs) using a reference genome. However, short-read sequencing technologies have several limitations. First, the human genome is diploid, and short reads contain limited information for assembling haplotypes, the sequences of alleles on homologous chromosomes. Moreover, a significant amount of input DNA is required, which poses challenges for analyzing single cells. Further, the ability to detect genetic variants inside long duplicated sequences in the genome is limited. As a result, there has been widespread development of novel methods to overcome these deficiencies using short reads, including clone-based sequencing, linked-read sequencing, and proximity-ligation sequencing, as well as various single-cell sequencing methods. There are also entirely new sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies that produce significantly longer reads. While these emerging methods and technologies demonstrate improvements over short reads, they also have properties and error modalities that pose unique computational challenges, and compared to short reads there is a shortage of bioinformatics methods for accurate small-variant detection and haplotype assembly using them. This dissertation aims to address this problem by introducing several new algorithms for highly accurate haplotype assembly and SNV calling. First, it introduces HapCUT2, an algorithm that can rapidly assemble haplotypes using a broad range of sequencing technologies.
Second, it introduces an algorithm for variant calling and haplotyping using SISSOR, a recently introduced microfluidics-based technology for sequencing single cells. Finally, it introduces Longshot, an algorithm for detecting and phasing SNVs using error-prone long-read technologies. In each case, the algorithms are benchmarked using multiple real whole-genome sequencing datasets and are found to be highly accurate. The methods introduced in this dissertation contribute to the goal of sequencing diploid genomes accurately and completely for a broad range of scientific and clinical purposes.
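The haplotype assembly problem that HapCUT2 addresses is usually formalized via the minimum error correction (MEC) objective. The sketch below shows only how a candidate haplotype is scored against reads, assuming heterozygous sites coded 0/1; it illustrates the objective, not the HapCUT2 algorithm itself.

```python
def mec_score(reads, h1):
    """Minimum error correction (MEC) score: each read (a dict of
    site -> allele in {0,1}) is assigned to whichever of the two
    complementary haplotypes it matches best; the score sums the
    residual mismatches. MEC-based assemblers search for the
    haplotype pair minimizing this score."""
    h2 = {s: 1 - a for s, a in h1.items()}  # complementary haplotype
    score = 0
    for read in reads:
        d1 = sum(read[s] != h1[s] for s in read)  # mismatches vs. haplotype 1
        d2 = sum(read[s] != h2[s] for s in read)  # mismatches vs. haplotype 2
        score += min(d1, d2)
    return score

h1 = {0: 0, 1: 0, 2: 1}
reads = [{0: 0, 1: 0}, {0: 1, 1: 1}, {1: 0, 2: 1}, {1: 1, 2: 0}, {0: 0, 1: 1}]
print(mec_score(reads, h1))  # 1: one read base must be flipped to fit
```

Longer, error-prone reads span more heterozygous sites per read, which strengthens this objective's signal; this is part of why long-read technologies help haplotype assembly despite their higher error rates.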
Genotype/Haplotype Tagging Methods and their Validation
This study focuses on MLR-tagging for statistical covering, i.e., either maximizing the average R² for a given number of requested tags, or minimizing the number of tags such that for any non-tag SNP there exists a highly correlated (squared correlation R² > 0.8) tag SNP. We compare against Tagger, a software package for selecting tags in the HapMap project. MLR-tagging needed fewer tags than Tagger in all but 2 of the 6 given test sets. Meanwhile, biologists can detect or collect data from only a small set of SNPs, which raises the question of how accurately tag SNPs estimate the remainder when constructing the complete human haplotype map. This study therefore also investigates how MLR-tagging for statistical coverage performs in an unbiased study. The experimental results show that MLR-tagging still selects a small number of SNPs very well, even without observing the entire SNP set in the sample.
Molecular systematics and phylogeography of the Helmeted Guineafowl (Numida meleagris)
Includes bibliographical references (leaves 61-67)
Efficient analysis and storage of large-scale genomic data
The impending advent of population-scale sequencing cohorts involving tens of millions of individuals with matched phenotypic measurements will produce unprecedented volumes of genetic data. Storing and analysing such gargantuan datasets places computational performance in a pivotal position in medical genomics. In this thesis, I explore the potential for accelerating and parallelizing standard genetics workflows, file formats, and algorithms using hardware-accelerated vectorization, parallel and distributed algorithms, and heterogeneous computing.
First, I describe a novel bit-counting operation termed the positional population count, which can be used together with succinct representations and standard efficient operations to accelerate many genetic calculations. To enable the use of this new operator and the canonical population count on any target machine, I developed a unified low-level library using CPU dispatching to select the optimal method contingent on the available instruction set architecture and the given input size at run time. As a proof of principle, I apply the positional population-count operator to computing quality-control summary statistics for terabyte-scale sequencing readsets with >3,800-fold speed improvements. As another application, I describe a framework for efficiently computing the cardinality of set intersections using these operators, and applied it to compute genome-wide linkage disequilibrium in datasets with up to 67 million samples, achieving up to >60-fold speed improvements for dense genotypic vectors, and up to >250,000-fold savings in memory and >100,000-fold speed improvements for sparse genotypic vectors. I next describe a framework for handling the terabytes of compressed output data, along with graphical routines for visualizing long-range linkage-disequilibrium blocks as seen over many human centromeres. Finally, I describe efficient algorithms for storing and querying very large genetic datasets, and specialized algorithms for the genotype component of such datasets with >10,000-fold savings in memory compared to the current interchange format.
Wellcome Trust
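The positional population count can be stated in a few lines of scalar code. This is a reference-definition sketch only; the implementations described in the thesis rely on SIMD vectorization and CPU dispatching.

```python
def positional_popcount(words, width=8):
    """Positional population count: for a stream of machine words,
    count how many words have bit i set, for every bit position i.
    Returns a list of `width` per-position counters (contrast with
    the canonical popcount, which returns a single total)."""
    counts = [0] * width
    for w in words:
        for i in range(width):
            counts[i] += (w >> i) & 1
    return counts

print(positional_popcount([0b0001, 0b0011, 0b0101], width=4))  # [3, 1, 1, 0]
```

Applied to packed genotype or quality words, a single pass thus accumulates one summary statistic per bit position, which is what makes the operator useful for quality-control statistics over terabyte-scale readsets.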
A method for identifying ancient introgression between caballine and non-caballine equids using whole genome high throughput data.
Introgression is one of the main mechanisms that transfer adapted alleles between species. Advantageous variants are positively selected and retained in the recipient population, while the rest of the variants undergo negative selection. When analyzing the horse genome, two alleles were found in the CXCL16 gene, one associated with susceptibility and one with resistance to developing persistent shedding of the Equine Arteritis Virus. The two alleles differ by 4 non-synonymous variants in exon 1 of the gene. Comparison with 3 non-caballine equids (zebras, asses and hemiones) revealed that one haplotype was almost identical to the haplotype found in non-caballines, while the other showed differences characteristic of 4.5 million years since a common ancestor. Based on this observation, we infer that an ancient introgression event occurred between caballine and non-caballine equids. If so, we should be able to find more instances of introgression between these species. We developed a method to identify putatively introgressed segments in the horse genome. It is estimated that non-caballine equids such as zebras and asses diverged from horses between 4 and 4.5 million years ago. Genomic analysis of these animals against the equine reference genome reveals the divergence at both the nucleotide and chromosomal level. Whole-genome data for the non-caballine equids, when mapped to the caballine (Equus caballus) reference genome, show a greater frequency of single-nucleotide differences than horses have relative to the same reference. We have created a likelihood-estimation framework that uses this difference in single-nucleotide frequencies to predict whether a haplotype evolved along the caballine or non-caballine lineage. Our results demonstrate that these haplotypes are between 0.5 and 2 kb in length and are detectable at a rate of several hundred loci per horse. About 1.1% of the equine genome was introgressed, and 64% of the identified putative regions were associated with structural elements, regulatory regions, or both. These regions were responsible for gene products involved in regulation of response to stimuli, signal transduction, integral components of the cell membrane, and important metabolism pathways such as purine and thiamine metabolism. Furthermore, these haplotypes occur at high frequency in the horse population, suggesting that they are under positive selection.
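The lineage-assignment step can be sketched as a likelihood-ratio comparison under a simple binomial model of per-site differences. This is an illustration of the general approach only: the per-site rates used below are invented for the example, not the study's estimates.

```python
import math

def lineage_llr(n_diffs, n_sites, p_cab, p_noncab):
    """Log-likelihood ratio that a haplotype window evolved along the
    non-caballine rather than the caballine lineage, given the count
    of nucleotide differences to the reference in the window and
    assumed per-site difference rates for each lineage.
    Positive values favor non-caballine (introgressed) origin."""
    def loglik(p):
        # Binomial log-likelihood of n_diffs differences in n_sites
        # (the constant binomial coefficient cancels in the ratio).
        return (n_diffs * math.log(p)
                + (n_sites - n_diffs) * math.log(1 - p))
    return loglik(p_noncab) - loglik(p_cab)

# A window with many differences to the reference looks non-caballine:
print(lineage_llr(20, 1000, p_cab=0.002, p_noncab=0.02) > 0)  # True
```

Scanning such windows across the genome and keeping those with strongly positive ratios yields candidate introgressed segments, consistent with the several hundred loci per horse reported above.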