Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication
Horseshoe crabs are marine arthropods with a fossil record extending back
approximately 450 million years. They exhibit remarkable morphological
stability over their long evolutionary history, retaining a number of ancestral
arthropod traits, and are often cited as examples of "living fossils." As
arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose
sequenced genomes (including insects and nematodes) have thus far shown more
divergence from the ancestral pattern of eumetazoan genome organization than
cnidarians, deuterostomes, and lophotrochozoans. However, much of ecdysozoan
diversity remains unrepresented in comparative genomic analyses. Here we use a
new strategy of combined de novo assembly and genetic mapping to examine the
chromosome-scale genome organization of the Atlantic horseshoe crab Limulus
polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by
sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their
parents at a mean redundancy of 1.1x per sample. The map includes 84,307
sequence markers and 5,775 candidate conserved protein coding genes. Comparison
to other metazoan genomes shows that the L. polyphemus genome preserves
ancestral bilaterian linkage groups, and that a common ancestor of modern
horseshoe crabs underwent one or more ancient whole genome duplications (WGDs)
~300 MYA, followed by extensive chromosome fusion.
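As a rough illustration of the sequencing design described in this abstract, the short sketch below works out the pooled sequencing depth implied by the stated figures. It assumes that "their parents" means exactly two parents (36 samples in total) and that depth adds linearly across samples; the function name `pooled_depth` is hypothetical and not taken from the paper.

```python
# Sketch only: pooled depth implied by low-coverage sequencing of a family.
# Assumption: two parents, so 34 embryos + 2 parents = 36 samples.

GENOME_SIZE_BP = 2.7e9      # genome size stated in the abstract (2.7 Gbp)
DEPTH_PER_SAMPLE = 1.1      # mean redundancy per sample stated in the abstract


def pooled_depth(n_offspring: int, n_parents: int, depth_per_sample: float) -> float:
    """Return the combined depth if all samples are pooled."""
    return (n_offspring + n_parents) * depth_per_sample


total_depth = pooled_depth(34, 2, DEPTH_PER_SAMPLE)
bases_per_sample = GENOME_SIZE_BP * DEPTH_PER_SAMPLE

print(f"pooled depth: about {total_depth:.1f}x")
print(f"bases sequenced per sample: about {bases_per_sample / 1e9:.2f} Gbp")
```

Under these assumptions each sample is covered only about once, while the family as a whole is covered at roughly 40x, which is consistent with the abstract's strategy of combining assembly and mapping across samples.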
Inference of Ancestral Recombination Graphs through Topological Data Analysis
The recent explosion of genomic data has underscored the need for
interpretable and comprehensive analyses that can capture complex phylogenetic
relationships within and across species. Recombination, reassortment and
horizontal gene transfer constitute examples of pervasive biological phenomena
that cannot be captured by tree-like representations. Starting from hundreds of
genomes, we are interested in the reconstruction of potential evolutionary
histories leading to the observed data. Ancestral recombination graphs
represent potential histories that explicitly accommodate recombination and
mutation events across orthologous genomes. However, they are computationally
costly to reconstruct, usually being infeasible for more than a few tens of
genomes. Recently, Topological Data Analysis (TDA) methods have been proposed
as robust and scalable methods that can capture the genetic scale and frequency
of recombination. We build upon previous TDA developments for detecting and
quantifying recombination, and present a novel framework that can be applied to
hundreds of genomes and can be interpreted in terms of minimal histories of
mutation and recombination events, quantifying the scales and identifying the
genomic locations of recombinations. We implement this framework in a software
package, called TARGet, and apply it to several examples, including small
migration between different populations, human recombination, and horizontal
evolution in finches inhabiting the Galápagos Islands.
Comment: 33 pages, 12 figures. The accompanying software, instructions and
example files used in the manuscript can be obtained from
https://github.com/RabadanLab/TARGe
A Bayesian phylogenetic hidden Markov model for B cell receptor sequence analysis.
The human body generates a diverse set of high affinity antibodies, the soluble form of B cell receptors (BCRs), that bind to and neutralize invading pathogens. The natural development of BCRs must be understood in order to design vaccines for highly mutable pathogens such as influenza and HIV. BCR diversity is induced by naturally occurring combinatorial "V(D)J" rearrangement, mutation, and selection processes. Most current methods for BCR sequence analysis focus on separately modeling the above processes. Statistical phylogenetic methods are often used to model the mutational dynamics of BCR sequence data, but these techniques do not consider all the complexities associated with B cell diversification such as the V(D)J rearrangement process. In particular, standard phylogenetic approaches assume the DNA bases of the progenitor (or "naive") sequence arise independently and according to the same distribution, ignoring the complexities of V(D)J rearrangement. In this paper, we introduce a novel approach to Bayesian phylogenetic inference for BCR sequences that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique not only integrates a naive rearrangement model with a phylogenetic model for BCR sequence evolution but also naturally accounts for uncertainty in all unobserved variables, including the phylogenetic tree, via posterior distribution sampling
Lineage specific recombination rates and microevolution in Listeria monocytogenes
Background: The bacterium Listeria monocytogenes is a saprotroph as well as an opportunistic human foodborne pathogen, which has previously been shown to consist of at least two widespread lineages (termed lineages I and II) and an uncommon lineage (lineage III). While some L. monocytogenes strains show evidence for considerable diversification by homologous recombination, our understanding of the contribution of recombination to L. monocytogenes evolution is still limited. We therefore used STRUCTURE and ClonalFrame, two programs that model the effect of recombination, to make inferences about the population structure and different aspects of the recombination process in L. monocytogenes. Analyses were performed using sequences for seven loci (including the house-keeping genes gap, prs, purM and ribC, the stress response gene sigB, and the virulence genes actA and inlA) for 195 L. monocytogenes isolates.
Results: Sequence analyses with ClonalFrame and Sawyer's test showed that recombination is more prevalent in lineage II than lineage I and is most frequent in two house-keeping genes (ribC and purM) and the two virulence genes (actA and inlA). The relative occurrence of recombination versus point mutation is about six times higher in lineage II than in lineage I, which causes a higher genetic variability in lineage II. Unlike lineage I, lineage II represents a genetically heterogeneous population with a relatively high proportion (30% average) of genetic material imported from external sources. Phylograms constructed with correction for recombination, as well as Tajima's D data, suggest that both lineages I and II have suffered a population bottleneck.
Conclusion: Our study shows that evolutionary lineages within a single bacterial species can differ considerably in the relative contributions of recombination to genetic diversification. Accounting for recombination in phylogenetic studies is critical, and new evolutionary models that account for the possibility of changes in the rate of recombination would be required. While previous studies suggested that only L. monocytogenes lineage I has experienced a recent bottleneck, our analyses clearly show that lineage II experienced a bottleneck at about the same time, which was subsequently obscured by abundant homologous recombination after the lineage II bottleneck. While lineage I and lineage II should be considered separate species from an evolutionary viewpoint, maintaining a single species name may be warranted since both lineages cause the same type of human disease.
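The abstract above cites Tajima's D as evidence for a bottleneck. The following is a minimal sketch of the standard Tajima (1989) statistic computed from an alignment of equal-length sequences. It is not the authors' pipeline; the function name `tajimas_d` and the toy alignments are hypothetical, and the sketch ignores gaps, ambiguous bases, and any correction for recombination.

```python
# Sketch only: textbook Tajima's D from aligned, equal-length sequences.
from itertools import combinations
from math import sqrt
from typing import List, Optional


def tajimas_d(seqs: List[str]) -> Optional[float]:
    """Return Tajima's D, or None when there are no segregating sites."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences for a non-zero variance")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")

    # S: number of alignment columns with more than one allele.
    num_seg = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if num_seg == 0:
        return None

    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * num_seg + e2 * num_seg * (num_seg - 1)
    return (pi - num_seg / a1) / sqrt(variance)


# Toy alignments (hypothetical data): singletons only vs. balanced variants.
excess_rare = ["AAAA", "TAAA", "ATAA", "AATA"]
balanced = ["AA", "AA", "TT", "TT"]
print(tajimas_d(excess_rare), tajimas_d(balanced))
```

An excess of rare variants gives a negative D and an excess of intermediate-frequency variants gives a positive D; how either sign maps onto a bottleneck depends on its timing and on the other evidence discussed in the abstract.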
Genome-wide inference of ancestral recombination graphs
The complex correlation structure of a collection of orthologous DNA
sequences is uniquely captured by the "ancestral recombination graph" (ARG), a
complete record of coalescence and recombination events in the history of the
sample. However, existing methods for ARG inference are computationally
intensive, highly approximate, or limited to small numbers of sequences, and,
as a consequence, explicit ARG inference is rarely used in applied population
genomics. Here, we introduce a new algorithm for ARG inference that is
efficient enough to apply to dozens of complete mammalian genomes. The key idea
of our approach is to sample an ARG of n chromosomes conditional on an ARG of
n-1 chromosomes, an operation we call "threading." Using techniques based on
hidden Markov models, we can perform this threading operation exactly, up to
the assumptions of the sequentially Markov coalescent and a discretization of
time. An extension allows for threading of subtrees instead of individual
sequences. Repeated application of these threading operations results in highly
efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these
methods in a computer program called ARGweaver. Experiments with simulated data
indicate that ARGweaver converges rapidly to the true posterior distribution
and is effective in recovering various features of the ARG for dozens of
sequences generated under realistic parameters for human populations. In
applications of ARGweaver to 54 human genome sequences from Complete Genomics,
we find clear signatures of natural selection, including regions of unusually
ancient ancestry associated with balancing selection and reductions in allele
age in sites under directional selection. Preliminary results also indicate
that our methods can be used to gain insight into complex features of human
population structure, even with a noninformative prior distribution.
Comment: 88 pages, 7 main figures, 22 supplementary figures. This version
contains a substantially expanded genomic data analysis.
Error-prone polymerase activity causes multinucleotide mutations in humans
About 2% of human genetic polymorphisms have been hypothesized to arise via
multinucleotide mutations (MNMs), complex events that generate SNPs at multiple
sites in a single generation. MNMs have the potential to accelerate the pace at
which single genes evolve and to confound studies of demography and selection
that assume all SNPs arise independently. In this paper, we examine clustered
mutations that are segregating in a set of 1,092 human genomes, demonstrating
that MNMs become enriched as large numbers of individuals are sampled. We
leverage the size of the dataset to deduce new information about the allelic
spectrum of MNMs, estimating the percentage of linked SNP pairs that were
generated by simultaneous mutation as a function of the distance between the
affected sites and showing that MNMs exhibit a high percentage of transversions
relative to transitions. These findings are reproducible in data from multiple
sequencing platforms. Among tandem mutations that occur simultaneously at
adjacent sites, we find an especially skewed distribution of ancestral and
derived dinucleotides, with GC → AA, GA → TT and their reverse complements
making up 36% of the total. These
same mutations dominate the spectrum of tandem mutations produced by the
upregulation of low-fidelity Polymerase ζ in mutator strains of S.
cerevisiae that have impaired DNA excision repair machinery. This suggests that
low-fidelity DNA replication by Pol ζ is at least partly responsible for
the MNMs that are segregating in the human population, and that useful
information about the biochemistry of MNM can be extracted from ordinary
population genomic data. We incorporate our findings into a mathematical model
of the multinucleotide mutation process that can be used to correct
phylogenetic and population genetic methods for the presence of MNMs.
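The abstract above relies on two elementary classifications: whether a substitution is a transition or a transversion, and how a dinucleotide change relates to its reverse complement. The sketch below illustrates both. The function names and the example inputs are hypothetical and chosen only for illustration; this is not the authors' analysis code.

```python
# Sketch only: helpers for classifying single and tandem substitutions.
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)


def transversion_fraction(subs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of substitutions that are transversions."""
    subs = list(subs)
    n_tv = sum(1 for ref, alt in subs if not is_transition(ref, alt))
    return n_tv / len(subs)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_tandem(ref: str, alt: str) -> Tuple[str, str]:
    """Merge a dinucleotide change with its reverse-complement equivalent."""
    rc = (reverse_complement(ref), reverse_complement(alt))
    return min((ref, alt), rc)


# Arbitrary example inputs (not data from the study).
print(transversion_fraction([("A", "G"), ("A", "C"), ("C", "T"), ("G", "T")]))
print(canonical_tandem("TC", "AA"), canonical_tandem("GA", "TT"))
```

Collapsing each tandem change with its reverse complement means that the same event observed on either strand is counted once, which is the convention implied by reporting mutation types "and their reverse complements" together.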