1,194 research outputs found
Hidden breakpoints in genome alignments
During the course of evolution, an organism's genome can undergo changes that
affect the large-scale structure of the genome. These changes include gene
gain, loss, duplication, chromosome fusion, fission, and rearrangement. When
gene gain and loss occurs in addition to other types of rearrangement,
breakpoints of rearrangement can exist that are only detectable by comparison
of three or more genomes. An arbitrarily large number of these "hidden"
breakpoints can exist among genomes that exhibit no rearrangements in pairwise
comparisons.
We present an extension of the multichromosomal breakpoint median problem to
genomes that have undergone gene gain and loss. We then demonstrate that the
median distance among three genomes can be used to calculate a lower bound on
the number of hidden breakpoints present. We provide an implementation of this
calculation including the median distance, along with some practical
improvements on the time complexity of the underlying algorithm.
We apply our approach to measure the abundance of hidden breakpoints in
simulated data sets under a wide range of evolutionary scenarios. We
demonstrate that in simulations the hidden breakpoint counts depend strongly on
relative rates of inversion and gene gain/loss. Finally we apply current
multiple genome aligners to the simulated genomes, and show that all aligners
introduce a high degree of error in hidden breakpoint counts, and that this
error grows with evolutionary distance in the simulation. Our results suggest
that hidden breakpoint error may be pervasive in genome alignments.Comment: 13 pages, 4 figure
Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans
We have used whole genome paired-end Illumina sequence data to identify
tandem duplications in 20 isofemale lines of D. yakuba, and 20 isofemale lines
of D. simulans and performed genome wide validation with PacBio long molecule
sequencing. We identify 1,415 tandem duplications that are segregating in D.
yakuba as well as 975 duplications in D. simulans, indicating greater variation
in D. yakuba. Additionally, we observe high rates of secondary deletions at
duplicated sites, with 8% of duplicated sites in D. simulans and 17% of sites
in D. yakuba modified with deletions. These secondary deletions are consistent
with the action of the large loop mismatch repair system acting to remove
polymorphic tandem duplication, resulting in rapid dynamics of gain and loss in
duplicated alleles and a richer substrate of genetic novelty than has been
previously reported. Most duplications are present in only single strains,
suggesting deleterious impacts are common. D. simulans shows larger numbers of
whole gene duplications in comparison to larger proportions of gene fragments
in D. yakuba. D. simulans displays an excess of high frequency variants on the
X chromosome, consistent with adaptive evolution through duplications on the D.
simulans X or demographic forces driving duplicates to high frequency. We
identify 78 chimeric genes in D. yakuba and 38 chimeric genes in D. simulans,
as well as 143 cases of recruited non-coding sequence in D. yakuba and 96 in D.
simulans, in agreement with rates of chimeric gene origination in D.
melanogaster. Together, these results suggest that tandem duplications often
result in complex variation beyond whole gene duplications that offers a rich
substrate of standing variation that is likely to contribute both to
detrimental phenotypes and disease, as well as to adaptive evolutionary change.Comment: Revised Version- Accepted at Molecular Biology and Evolutio
De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse populations.
The human reference genome is used extensively in modern biological research. However, a single consensus representation is inadequate to provide a universal reference structure because it is a haplotype among many in the human population. Using 10× Genomics (10×G) "Linked-Read" technology, we perform whole genome sequencing (WGS) and de novo assembly on 17 individuals across five populations. We identify 1842 breakpoint-resolved non-reference unique insertions (NUIs) that, in aggregate, add up to 2.1 Mb of so far undescribed genomic content. Among these, 64% are considered ancestral to humans since they are found in non-human primate genomes. Furthermore, 37% of the NUIs can be found in the human transcriptome and 14% likely arose from Alu-recombination-mediated deletion. Our results underline the need of a set of human reference genomes that includes a comprehensive list of alternative haplotypes to depict the complete spectrum of genetic diversity across populations
Accurate Detection of Recombinant Breakpoints in Whole-Genome Alignments
We propose a novel method for detecting sites of molecular recombination in multiple alignments. Our approach is a compromise between previous extremes of computationally prohibitive but mathematically rigorous methods and imprecise heuristic methods. Using a combined algorithm for estimating tree structure and hidden Markov model parameters, our program detects changes in phylogenetic tree topology over a multiple sequence alignment. We evaluate our method on benchmark datasets from previous studies on two recombinant pathogens, Neisseria and HIV-1, as well as simulated data. We show that we are not only able to detect recombinant regions of vastly different sizes but also the location of breakpoints with great accuracy. We show that our method does well inferring recombination breakpoints while at the same time maintaining practicality for larger datasets. In all cases, we confirm the breakpoint predictions of previous studies, and in many cases we offer novel predictions
Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement
Multiple genome alignment remains a challenging problem. Effects of
recombination including rearrangement, segmental duplication, gain, and loss
can create a mosaic pattern of homology even among closely related organisms.
We describe a method to align two or more genomes that have undergone
large-scale recombination, particularly genomes that have undergone substantial
amounts of gene gain and loss (gene flux). The method utilizes a novel
alignment objective score, referred to as a sum-of-pairs breakpoint score. We
also apply a probabilistic alignment filtering method to remove erroneous
alignments of unrelated sequences, which are commonly observed in other genome
alignment methods. We describe new metrics for quantifying genome alignment
accuracy which measure the quality of rearrangement breakpoint predictions and
indel predictions. The progressive genome alignment algorithm demonstrates
markedly improved accuracy over previous approaches in situations where genomes
have undergone realistic amounts of genome rearrangement, gene gain, loss, and
duplication. We apply the progressive genome alignment algorithm to a set of 23
completely sequenced genomes from the genera Escherichia, Shigella, and
Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content
conserved among all taxa and total unique content of 15.2Mbp. We document
substantial population-level variability among these organisms driven by
homologous recombination, gene gain, and gene loss. Free, open-source software
implementing the described genome alignment approach is available from
http://gel.ahabs.wisc.edu/mauve .Comment: Revision dated June 19, 200
Parametric inference of recombination in HIV genomes
Recombination is an important event in the evolution of HIV. It affects the
global spread of the pandemic as well as evolutionary escape from host immune
response and from drug therapy within single patients. Comprehensive
computational methods are needed for detecting recombinant sequences in large
databases, and for inferring the parental sequences.
We present a hidden Markov model to annotate a query sequence as a
recombinant of a given set of aligned sequences. Parametric inference is used
to determine all optimal annotations for all parameters of the model. We show
that the inferred annotations recover most features of established hand-curated
annotations. Thus, parametric analysis of the hidden Markov model is feasible
for HIV full-length genomes, and it improves the detection and annotation of
recombinant forms.
All computational results, reference alignments, and C++ source code are
available at http://bio.math.berkeley.edu/recombination/.Comment: 20 pages, 5 figure
An HMM-based Comparative Genomic Framework for Detecting Introgression in Eukaryotes
One outcome of interspecific hybridization and subsequent effects of
evolutionary forces is introgression, which is the integration of genetic
material from one species into the genome of an individual in another species.
The evolution of several groups of eukaryotic species has involved
hybridization, and cases of adaptation through introgression have been already
established. In this work, we report on a new comparative genomic framework for
detecting introgression in genomes, called PhyloNet-HMM, which combines
phylogenetic networks, that capture reticulate evolutionary relationships among
genomes, with hidden Markov models (HMMs), that capture dependencies within
genomes. A novel aspect of our work is that it also accounts for incomplete
lineage sorting and dependence across loci.
Application of our model to variation data from chromosome 7 in the mouse
(Mus musculus domesticus) genome detects a recently reported adaptive
introgression event involving the rodent poison resistance gene Vkorc1, in
addition to other newly detected introgression regions. Based on our analysis,
it is estimated that about 12% of all sites withinchromosome 7 are of
introgressive origin (these cover about 18 Mbp of chromosome 7, and over 300
genes). Further, our model detects no introgression in two negative control
data sets. Our work provides a powerful framework for systematic analysis of
introgression while simultaneously accounting for dependence across sites,
point mutations, recombination, and ancestral polymorphism
A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes
BACKGROUND: Jumping alignments have recently been proposed as a strategy to search a given multiple sequence alignment A against a database. Instead of comparing a database sequence S to the multiple alignment or profile as a whole, S is compared and aligned to individual sequences from A. Within this alignment, S can jump between different sequences from A, so different parts of S can be aligned to different sequences from the input multiple alignment. This approach is particularly useful for dealing with recombination events. RESULTS: We developed a jumping profile Hidden Markov Model (jpHMM), a probabilistic generalization of the jumping-alignment approach. Given a partition of the aligned input sequence family into known sequence subtypes, our model can jump between states corresponding to these different subtypes, depending on which subtype is locally most similar to a database sequence. Jumps between different subtypes are indicative of intersubtype recombinations. We applied our method to a large set of genome sequences from human immunodeficiency virus (HIV) and hepatitis C virus (HCV) as well as to simulated recombined genome sequences. CONCLUSION: Our results demonstrate that jumps in our jumping profile HMM often correspond to recombination breakpoints; our approach can therefore be used to detect recombinations in genomic sequences. The recombination breakpoints identified by jpHMM were found to be significantly more accurate than breakpoints defined by traditional methods based on comparing single representative sequences
The inference of gene trees with species trees
Molecular phylogeny has focused mainly on improving models for the
reconstruction of gene trees based on sequence alignments. Yet, most
phylogeneticists seek to reveal the history of species. Although the histories
of genes and species are tightly linked, they are seldom identical, because
genes duplicate, are lost or horizontally transferred, and because alleles can
co-exist in populations for periods that may span several speciation events.
Building models describing the relationship between gene and species trees can
thus improve the reconstruction of gene trees when a species tree is known, and
vice-versa. Several approaches have been proposed to solve the problem in one
direction or the other, but in general neither gene trees nor species trees are
known. Only a few studies have attempted to jointly infer gene trees and
species trees. In this article we review the various models that have been used
to describe the relationship between gene trees and species trees. These models
account for gene duplication and loss, transfer or incomplete lineage sorting.
Some of them consider several types of events together, but none exists
currently that considers the full repertoire of processes that generate gene
trees along the species tree. Simulations as well as empirical studies on
genomic data show that combining gene tree-species tree models with models of
sequence evolution improves gene tree reconstruction. In turn, these better
gene trees provide a better basis for studying genome evolution or
reconstructing ancestral chromosomes and ancestral gene sequences. We predict
that gene tree-species tree methods that can deal with genomic data sets will
be instrumental to advancing our understanding of genomic evolution.Comment: Review article in relation to the "Mathematical and Computational
Evolutionary Biology" conference, Montpellier, 201
jpHMM at GOBICS: a web server to detect genomic recombinations in HIV-1
Detecting recombinations in the genome sequence of human immunodeficiency virus (HIV-1) is crucial for epidemiological studies and for vaccine development. Herein, we present a web server for subtyping and localization of phylogenetic breakpoints in HIV-1. Our software is based on a jumping profile Hidden Markov Model (jpHMM), a probabilistic generalization of the jumping-alignment approach proposed by Spang et al. The input data for our server is a partial or complete genome sequence from HIV-1; our tool assigns regions of the input sequence to known subtypes of HIV-1 and predicts phylogenetic breakpoints. jpHMM is available online at
- …