Reconstructing pedigrees: some identifiability questions for a recombination-mutation model
Pedigrees are directed acyclic graphs that represent ancestral relationships
between individuals in a population. Based on a schematic recombination
process, we describe two simple Markov models for sequences evolving on
pedigrees - Model R (recombinations without mutations) and Model RM
(recombinations with mutations). For these models, we ask an identifiability
question: is it possible to construct a pedigree from the joint probability
distribution of extant sequences? We present partial identifiability results
for general pedigrees: we show that when the crossover probabilities are
sufficiently small, certain spanning subgraph sequences can be counted from the
joint distribution of extant sequences. We demonstrate how pedigrees that
earlier seemed difficult to distinguish are distinguished by counting their
spanning subgraph sequences.
Comment: 40 pages, 9 figures
Inference of Ancestral Recombination Graphs through Topological Data Analysis
The recent explosion of genomic data has underscored the need for
interpretable and comprehensive analyses that can capture complex phylogenetic
relationships within and across species. Recombination, reassortment and
horizontal gene transfer constitute examples of pervasive biological phenomena
that cannot be captured by tree-like representations. Starting from hundreds of
genomes, we are interested in the reconstruction of potential evolutionary
histories leading to the observed data. Ancestral recombination graphs
represent potential histories that explicitly accommodate recombination and
mutation events across orthologous genomes. However, they are computationally
costly to reconstruct, usually being infeasible for more than a few tens of
genomes. Recently, Topological Data Analysis (TDA) methods have been proposed
as robust and scalable methods that can capture the genetic scale and frequency
of recombination. We build upon previous TDA developments for detecting and
quantifying recombination, and present a novel framework that can be applied to
hundreds of genomes and can be interpreted in terms of minimal histories of
mutation and recombination events, quantifying the scales and identifying the
genomic locations of recombinations. We implement this framework in a software
package, called TARGet, and apply it to several examples, including small
migration between different populations, human recombination, and horizontal
evolution in finches inhabiting the Galápagos Islands.
Comment: 33 pages, 12 figures. The accompanying software, instructions and
example files used in the manuscript can be obtained from
https://github.com/RabadanLab/TARGe
An efficiently computed lower bound on the number of recombinations in phylogenetic networks: Theory and empirical study
Phylogenetic networks are models of sequence evolution that go beyond trees, allowing biological operations that are not tree-like. One of the most important biological operations is recombination between two sequences. An established problem [J. Hein, Reconstructing evolution of sequences subject to recombination using parsimony, Math. Biosci. 98 (1990) 185–200; J. Hein, A heuristic method to reconstruct the history of sequences subject to recombination, J. Molecular Evolution 36 (1993) 396–405; Y. Song, J. Hein, Parsimonious reconstruction of sequence evolution and haplotype blocks: finding the minimum number of recombination events, in: Proceedings of 2003 Workshop on Algorithms in Bioinformatics, Berlin, Germany, 2003, Lecture Notes in Computer Science, Springer, Berlin; Y. Song, J. Hein, On the minimum number of recombination events in the evolutionary history of DNA sequences, J. Math. Biol. 48 (2003) 160–186; L. Wang, K. Zhang, L. Zhang, Perfect phylogenetic networks with recombination, J. Comput. Biol. 8 (2001) 69–78; S.R. Myers, R.C. Griffiths, Bounds on the minimum number of recombination events in a sample history, Genetics 163 (2003) 375–394; V. Bafna, V. Bansal, Improved recombination lower bounds for haplotype data, in: Proceedings of RECOMB, 2005; Y. Song, Y. Wu, D. Gusfield, Efficient computation of close lower and upper bounds on the minimum number of needed recombinations in the evolution of biological sequences, Bioinformatics 21 (2005) i413–i422, Bioinformatics (Suppl. 1), Proceedings of ISMB, 2005; D. Gusfield, S. Eddhu, C. Langley, Optimal, efficient reconstruction of phylogenetic networks with constrained recombination, J. Bioinform. Comput. Biol. 2(1) (2004) 173–213; D. Gusfield, Optimal, efficient reconstruction of root-unknown phylogenetic networks with constrained and structured recombination, J. Comput. Systems Sci. 70 (2005) 381–398] is to find a phylogenetic network that derives an input set of sequences, minimizing the number of recombinations used. No efficient, general algorithm is known for this problem. Several papers consider the problem of computing a lower bound on the number of recombinations needed. In this paper we establish a new, efficiently computed lower bound. This result is useful in methods to estimate the number of needed recombinations, and also to prove the optimality of algorithms for constructing phylogenetic networks under certain conditions [D. Gusfield, S. Eddhu, C. Langley, Optimal, efficient reconstruction of phylogenetic networks with constrained recombination, J. Bioinform. Comput. Biol. 2(1) (2004) 173–213; D. Gusfield, Optimal, efficient reconstruction of root-unknown phylogenetic networks with constrained and structured recombination, J. Comput. Systems Sci. 70 (2005) 381–398; D. Gusfield, Optimal, efficient reconstruction of root-unknown phylogenetic networks with constrained recombination, Technical Report, Department of Computer Science, University of California, Davis, CA, 2004]. The lower bound is based on a structural, combinatorial insight, using only the site conflicts and incompatibilities, and hence it is fundamental and applicable to many biological phenomena other than recombination, for example, when gene conversions or recurrent or back mutations or cross-species hybridizations cause the phylogenetic history to deviate from a tree structure. In addition to establishing the bound, we examine its use in more complex lower bound methods, and compare the bounds obtained to those obtained by other established lower bound methods.
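As an illustration of what "site conflicts and incompatibilities" can mean for binary sequences, the sketch below applies the classical four-gamete test to every pair of sites. It is not the lower bound established in the paper; the function names and the example data are hypothetical, and the input is assumed to be equal-length strings over the characters 0 and 1.

```python
from itertools import combinations


def incompatible(col_a, col_b):
    """Four-gamete test: True if the two binary sites show all of 00, 01, 10, 11."""
    return len(set(zip(col_a, col_b))) == 4


def conflicting_site_pairs(sequences):
    """Return index pairs (i, j), i < j, of sites that fail the four-gamete test.

    `sequences` is assumed to be a list of equal-length binary strings.
    """
    columns = list(zip(*sequences))  # one tuple of characters per site
    return [
        (i, j)
        for i, j in combinations(range(len(columns)), 2)
        if incompatible(columns[i], columns[j])
    ]


# Hypothetical example: sites 0 and 1 carry all four gametes, so they conflict.
example = ["000", "010", "100", "111"]
pairs = conflicting_site_pairs(example)
```

Under the assumption that each site mutates at most once, a conflicting pair cannot be explained by a single tree, so at least one recombination (or another non-tree-like event) must be placed between the two sites.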
Genome-wide inference of ancestral recombination graphs
The complex correlation structure of a collection of orthologous DNA
sequences is uniquely captured by the "ancestral recombination graph" (ARG), a
complete record of coalescence and recombination events in the history of the
sample. However, existing methods for ARG inference are computationally
intensive, highly approximate, or limited to small numbers of sequences, and,
as a consequence, explicit ARG inference is rarely used in applied population
genomics. Here, we introduce a new algorithm for ARG inference that is
efficient enough to apply to dozens of complete mammalian genomes. The key idea
of our approach is to sample an ARG of n chromosomes conditional on an ARG of
n-1 chromosomes, an operation we call "threading." Using techniques based on
hidden Markov models, we can perform this threading operation exactly, up to
the assumptions of the sequentially Markov coalescent and a discretization of
time. An extension allows for threading of subtrees instead of individual
sequences. Repeated application of these threading operations results in highly
efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these
methods in a computer program called ARGweaver. Experiments with simulated data
indicate that ARGweaver converges rapidly to the true posterior distribution
and is effective in recovering various features of the ARG for dozens of
sequences generated under realistic parameters for human populations. In
applications of ARGweaver to 54 human genome sequences from Complete Genomics,
we find clear signatures of natural selection, including regions of unusually
ancient ancestry associated with balancing selection and reductions in allele
age in sites under directional selection. Preliminary results also indicate
that our methods can be used to gain insight into complex features of human
population structure, even with a noninformative prior distribution.
Comment: 88 pages, 7 main figures, 22 supplementary figures. This version
contains a substantially expanded genomic data analysis
Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees
Genomic regions participating in recombination events may support distinct topologies, and phylogenetic analyses should incorporate this heterogeneity. Existing phylogenetic methods for recombination detection are challenged by the enormous number of possible topologies, even for a moderate number of taxa. If, however, the detection analysis is conducted independently between each putative recombinant sequence and a set of reference parentals, potential recombinations between the recombinants are neglected. In this context, a recombination hotspot can be inferred in phylogenetic analyses if we observe several consecutive breakpoints. We developed a distance measure between unrooted topologies that closely resembles the number of recombinations. By introducing a prior distribution on these recombination distances, a Bayesian hierarchical model was devised to detect phylogenetic inconsistencies occurring due to recombinations. This model relaxes the assumption of known parental sequences, still common in HIV analysis, allowing the entire dataset to be analyzed at once. On simulated datasets with up to 16 taxa, our method correctly detected recombination breakpoints and the number of recombination events for each breakpoint. The procedure is robust to rate and transition∶transversion heterogeneities for simulations with and without recombination. This recombination distance is related to recombination hotspots. Applying this procedure to a genomic HIV-1 dataset, we found evidence for hotspots and de novo recombination
Population genetics models of local ancestry
Migrations have played an important role in shaping the genetic diversity of
human populations. Understanding genomic data thus requires careful modeling of
historical gene flow. Here we consider the effect of relatively recent
population structure and gene flow, and interpret genomes of individuals that
have ancestry from multiple source populations as mosaics of segments
originating from each population. We propose general and tractable models for
describing the evolution of these patterns of local ancestry and their impact
on genetic diversity. We focus on the length distribution of continuous
ancestry tracts, and the variance in total ancestry proportions among
individuals. The proposed models offer improved agreement with Wright-Fisher
simulation data when compared to state-of-the-art models, and can be used to
infer various demographic parameters in gene flow models. Considering HapMap
African-American (ASW) data, we find that a model with two distinct phases of
'European' gene flow significantly improves the modeling of both tract lengths
and ancestry variances.
Comment: 25 pages with 7 figures; Genetics: Published online before print
April 4, 201
Recoverability of Ancestral Recombination Graph Topologies
Recombination is a powerful evolutionary process that shapes the genetic
diversity observed in the populations of many species. Reconstructing
genealogies in the presence of recombination from sequencing data is a very
challenging problem, as this relies on mutations having occurred on the correct
lineages in order to detect the recombination and resolve the placement of
edges in the local trees. We investigate the probability of recovering the true
topology of ancestral recombination graphs (ARGs) under the coalescent with
recombination and gene conversion. We explore how sample size and mutation rate
affect the inherent uncertainty in reconstructed ARGs; this sheds light on the
theoretical limitations of ARG reconstruction methods. We illustrate our
results using estimates of evolutionary rates for several biological organisms;
in particular, we find that for parameter values that are realistic for
SARS-CoV-2, the probability of reconstructing genealogies that are close to the
truth is low.
Parametric inference of recombination in HIV genomes
Recombination is an important event in the evolution of HIV. It affects the
global spread of the pandemic as well as evolutionary escape from host immune
response and from drug therapy within single patients. Comprehensive
computational methods are needed for detecting recombinant sequences in large
databases, and for inferring the parental sequences.
We present a hidden Markov model to annotate a query sequence as a
recombinant of a given set of aligned sequences. Parametric inference is used
to determine all optimal annotations for all parameters of the model. We show
that the inferred annotations recover most features of established hand-curated
annotations. Thus, parametric analysis of the hidden Markov model is feasible
for HIV full-length genomes, and it improves the detection and annotation of
recombinant forms.
All computational results, reference alignments, and C++ source code are
available at http://bio.math.berkeley.edu/recombination/.
Comment: 20 pages, 5 figures
A Survey of Combinatorial Methods for Phylogenetic Networks
The evolutionary history of a set of species is usually described by a rooted phylogenetic tree. Although it is generally undisputed that bifurcating speciation events and descent with modifications are major forces of evolution, there is a growing belief that reticulate events also have a role to play. Phylogenetic networks provide an alternative to phylogenetic trees and may be more suitable for data sets where evolution involves significant amounts of reticulate events, such as hybridization, horizontal gene transfer, or recombination. In this article, we give an introduction to the topic of phylogenetic networks, very briefly describing the fundamental concepts and summarizing some of the most important combinatorial methods that are available for their computation