533 research outputs found

Haplotype-aware Diplotyping from Noisy Long Reads

Author: Ebler J.
Haukness M.
Marschall T.
Paten B.
Pesout T.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

MPG.PuRe

Recommended from our members

Quantifying recent variation and relatedness in human populations

Author: Gusev Alexander
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2012

Advances in the genetic analysis of humans have revealed a surprising abundance of local relatedness between purportedly unrelated individuals. Where common mutations classically inform us of ancient relationships, such segments of pairwise identical-by-descent (IBD) sharing from a common ancestor are the observable traces of recent inter-mating. Combining these two distinct sources of information can help disentangle the complex genetic structure and flux in human populations. When considered together with a heritable trait, the segments can also be used to interrogate unascertained rare variation and help in locating trait-affecting loci. This work presents methods for comprehensive analysis of population-wide IBD and explores applications to disease and the understanding of recent genetic variation. We propose several strategies for efficient detection of IBD segments in population genotype data. Our novel seed-based algorithm, GERMLINE, can reduce the computational burden of finding pairwise segments from quadratic to nearly linear time in a general population. We demonstrate that this approach is several orders of magnitude faster than the available all-pairs methods while maintaining higher accuracy. Next, we extend the GERMLINE technique to process cohorts of unlimited size by adaptively adjusting the search mechanism to meet resource restrictions. We confirm its effectiveness with an analysis of 50,000 individuals where contemporary methods can only process a few thousand. One drawback of these two algorithms is the dependence on phased haplotype data as input, a constraint that becomes harder to satisfy in large populations. We propose a solution to this problem with an algorithm that analyzes genotype data directly by exploring all potential haplotypes and scoring each putative segment based on linkage disequilibrium. This solution significantly outperforms available methods when applied to full sequence data and is computationally efficient enough to analyze thousands of sequenced genomes where current methods can only determine haplotypes for several hundred. Second, we outline two algorithms for analyzing available IBD segments to increase our understanding of rare variation and complex disease. Motivated by whole-genome sequencing, we present the INFOSTIP algorithm, which uses IBD segments to optimize the selection of individuals for complete population ascertainment. In simulations, we show that INFOSTIP selection can significantly increase variant inference accuracy over random sampling and posit inference of 60% of an isolated population from 1% optimally selected individuals. Seeking to move beyond pairwise IBD segment analysis, we describe the DASH algorithm, which groups shared segments into IBD "clusters" that are likely to be commonly co-inherited and uses them as proxies for untyped variation. In simulated disease studies, we show this reference-free approach to be much more powerful for detecting rare causal variants than either traditional single-marker analysis or imputation from a general reference panel. Applying the DASH algorithm to disease traits from different populations, we identify multiple novel loci of association. Together, these novel techniques integrate the power of population and disease genetics
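The seed-based idea in the abstract above can be illustrated with a small sketch. This is not the GERMLINE implementation: it only shows the general technique of hashing fixed-size windows of phased haplotypes so that candidate pairs are enumerated within a bucket rather than across all pairs. The function names, the window size, the requirement of exact matches, and the toy 0/1 haplotype strings are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, window):
    """Map each haplotype pair to the window indices where both are identical.

    haplotypes: dict of name -> equal-length allele string (hypothetical toy input).
    window: number of markers per non-overlapping window (an assumed parameter).
    """
    n_markers = len(next(iter(haplotypes.values())))
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_markers - window + 1, window)):
        # Hash every haplotype by its allele string in this window.
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        # Only haplotypes that fall in the same bucket are compared.
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches[(a, b)].append(w)
    return dict(matches)


def merge_runs(windows, min_windows):
    """Collapse a non-empty sorted list of window indices into (first, last)
    runs of at least min_windows consecutive windows."""
    runs = []
    start = prev = windows[0]
    for w in windows[1:]:
        if w == prev + 1:
            prev = w
            continue
        if prev - start + 1 >= min_windows:
            runs.append((start, prev))
        start = prev = w
    if prev - start + 1 >= min_windows:
        runs.append((start, prev))
    return runs


# Toy example: three haplotypes over 12 markers, windows of 4 markers.
toy = {
    "h1": "010110100111",
    "h2": "010110100000",
    "h3": "111110100111",
}
pairs = seed_matches(toy, window=4)
segments = {pair: merge_runs(ws, min_windows=2) for pair, ws in pairs.items()}
```

In this sketch the work per window is one hashing pass over the haplotypes plus pair enumeration inside each bucket, so the cost stays close to linear only when buckets are small. A practical method would also have to tolerate genotyping and phasing errors, which exact string matching does not.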

Columbia University Academic Commons

Genome-wide inference of ancestral recombination graphs

Author: Gronau Ilan
Hubisz Melissa J.
Rasmussen Matthew D.
Siepel Adam
Publication venue
Publication date: 01/01/2013

The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. Preliminary results also indicate that our methods can be used to gain insight into complex features of human population structure, even with a noninformative prior distribution.

Comment: 88 pages, 7 main figures, 22 supplementary figures.
This version contains a substantially expanded genomic data analysi

arXiv.org e-Print Archive

CiteSeerX

Cold Spring Harbor Laboratory Institutional Repository

Directory of Open Access Journals

PubMed Central

FigShare

PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population

Author: Abney Mark
Alkorta-Aranburu Gorka
Han Lide
Livne Oren E.
Nicolae Dan L.
Ober Carole
Wentworth-Sheilds William
Publication venue
Publication date: 03/01/2024

Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.

Knowledge UChicago

Haplotype estimation in polyploids using DNA sequence data

Author: Motazedi Ehsan
Publication venue: Wageningen University
Publication date: 01/01/2019

Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, among others in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.

Wageningen University & Research Publications

A Simple Genetic Architecture Underlies Morphological Variation in Dogs

Author: A Auton
A. C McPherron
A. D Gagliardi
A. R Boyko
Abdel G. Elkahloun
Abra Brisbin
Adam Auton
Adam R. Boyko
Adam Siepel
Andy Reynolds
B Weir
B. M vonHoldt
Bridgett M. vonHoldt
C Drögemüller
Carlos D. Bustamante
D Bannasch
D Falush
D. F Conrad
Dana S. Mosher
E Cadieu
E. K Karlsson
E. S Buckler
Elaine A. Ostrander
G Coop
Gary S. Johnson
H. G Parker
H. M Kang
Heidi G. Parker
Hopi E. Hoekstra
J Flint
J Pritchard
J Yu
J. A Kerns
J. M Akey
Jeffrey J. Schoenebeck
Jeremiah D. Degenhardt
John Novembre
K Zhao
K. E Lohmueller
K. G Lark
Keyan Zhao
Kirk E. Lohmueller
Lin Li
M. M Gray
M. N Weedon
Marta Castelhano
Melissa J. Hubisz
Michele Cargill
N Patterson
N. B Sutter
N. H. C Salmon Hillbertz
Nathan B. Sutter
P Jones
P Scheet
P. F Colosimo
P. M Visscher
Pascale Quignon
R Coppinger
R. K Wayne
Robert K. Wayne
S Rozen
S Wright
S. I Candille
S. M Schmutz
T. A Manolio
X Zhou
Publication venue: Public Library of Science
Publication date: 01/01/2010

The largest genetic study to date of morphology in domestic dogs identifies genes controlling nearly 100 morphological traits and identifies important trends in phenotypic variation within this species.

Crossref

Cold Spring Harbor Laboratory Institutional Repository

HAL-Inserm

Directory of Open Access Journals

PubMed Central

HAL Descartes

Edinburgh Research Explorer

Hal-Diderot

HAL-Rennes 1

Error-prone polymerase activity causes multinucleotide mutations in humans

Author: Harris Kelley
Nielsen Rasmus
Publication venue
Publication date: 29/04/2014

About 2% of human genetic polymorphisms have been hypothesized to arise via multinucleotide mutations (MNMs), complex events that generate SNPs at multiple sites in a single generation. MNMs have the potential to accelerate the pace at which single genes evolve and to confound studies of demography and selection that assume all SNPs arise independently. In this paper, we examine clustered mutations that are segregating in a set of 1,092 human genomes, demonstrating that MNMs become enriched as large numbers of individuals are sampled. We leverage the size of the dataset to deduce new information about the allelic spectrum of MNMs, estimating the percentage of linked SNP pairs that were generated by simultaneous mutation as a function of the distance between the affected sites and showing that MNMs exhibit a high percentage of transversions relative to transitions. These findings are reproducible in data from multiple sequencing platforms. Among tandem mutations that occur simultaneously at adjacent sites, we find an especially skewed distribution of ancestral and derived dinucleotides, with $\textrm{GC}\to \textrm{AA}$, $\textrm{GA}\to \textrm{TT}$ and their reverse complements making up 36% of the total. These same mutations dominate the spectrum of tandem mutations produced by the upregulation of low-fidelity Polymerase $\zeta$ in mutator strains of S. cerevisiae that have impaired DNA excision repair machinery. This suggests that low-fidelity DNA replication by Pol $\zeta$ is at least partly responsible for the MNMs that are segregating in the human population, and that useful information about the biochemistry of MNM can be extracted from ordinary population genomic data. We incorporate our findings into a mathematical model of the multinucleotide mutation process that can be used to correct phylogenetic and population genetic methods for the presence of MNMs.
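The transition/transversion distinction and the reverse-complement pairing mentioned in the abstract above can be made concrete with a short sketch. It is an illustration only and not code from the paper; the helper names are hypothetical. A transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); any other substitution is a transversion.

```python
PURINES = {"A", "G"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify(ref, alt):
    """Label a single-base change as 'none', 'transition', or 'transversion'."""
    if ref == alt:
        return "none"
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tandem_summary(ancestral, derived):
    """Classify each site of a tandem (adjacent-site) mutation."""
    return [classify(a, d) for a, d in zip(ancestral, derived)]


def strand_partner(ancestral, derived):
    """The same mutation as read from the opposite strand."""
    return reverse_complement(ancestral), reverse_complement(derived)


gc_to_aa = tandem_summary("GC", "AA")
ga_to_tt = tandem_summary("GA", "TT")
```

Under these definitions, GC→AA combines one transition (G→A) with one transversion (C→A), while GA→TT consists of two transversions. Their opposite-strand forms are GC→TT and TC→AA respectively.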

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

PubMed Central

eScholarship - University of California

FPGAs in Bioinformatics: Implementation and Evaluation of Common Bioinformatics Algorithms in Reconfigurable Logic

Author: Wienbrandt Lars
Publication venue
Publication date: 01/01/2016

Life. Much effort is taken to grant humanity a little insight into this fascinating and complex but fundamental topic. In order to understand the relations and to derive consequences, humans have begun to sequence their genomes, i.e. to determine their DNA sequences to infer information, e.g. related to genetic diseases. The process of DNA sequencing as well as subsequent analysis presents a computational challenge for current computing systems due to the large amounts of data alone. Runtimes of more than one day for analysis of simple datasets are common, even if the process is already run on a CPU cluster. This thesis shows how this general problem in the area of bioinformatics can be tackled with reconfigurable hardware, especially FPGAs. Three compute-intensive problems are highlighted: sequence alignment, SNP interaction analysis and genotype imputation. In the area of sequence alignment, the software BLASTp for protein database searches is presented as an example, implemented and evaluated. SNP interaction analysis is presented with three applications performing an exhaustive search for interactions including the corresponding statistical tests: BOOST, iLOCi and the mutual information measurement. All applications are implemented in FPGA hardware and evaluated, resulting in an impressive speedup of more than three orders of magnitude when compared to standard computers. The last topic of genotype imputation presents a two-step process composed of the phasing step and the actual imputation step. The focus lies on the phasing step, which is targeted by the SHAPEIT2 application. SHAPEIT2 is discussed with its underlying mathematical methods in detail, and finally implemented and evaluated. A remarkable speedup of 46 is reached here as well.

MACAU: Open Access Repository of Kiel University

Exome sequencing of Finnish isolates enhances rare-variant association power.

Exome-sequencing studies have generally been underpowered to identify deleterious alleles with a large effect on complex traits as such alleles are mostly rare. Because the population of northern and eastern Finland has expanded considerably and in isolation following a series of bottlenecks, individuals of these populations have numerous deleterious alleles at a relatively high frequency. Here, using exome sequencing of nearly 20,000 individuals from these regions, we investigate the role of rare coding variants in clinically relevant quantitative cardiometabolic traits. Exome-wide association studies for 64 quantitative traits identified 26 newly associated deleterious alleles. Of these 26 alleles, 19 are either unique to or more than 20 times more frequent in Finnish individuals than in other Europeans and show geographical clustering comparable to Mendelian disease mutations that are characteristic of the Finnish population. We estimate that sequencing studies of populations without this unique history would require hundreds of thousands to millions of participants to achieve comparable association power

eScholarship - University of California