An HMM-based Comparative Genomic Framework for Detecting Introgression in Eukaryotes
One outcome of interspecific hybridization and subsequent effects of
evolutionary forces is introgression, which is the integration of genetic
material from one species into the genome of an individual in another species.
The evolution of several groups of eukaryotic species has involved
hybridization, and cases of adaptation through introgression have already been
established. In this work, we report on a new comparative genomic framework for
detecting introgression in genomes, called PhyloNet-HMM, which combines
phylogenetic networks, which capture reticulate evolutionary relationships among
genomes, with hidden Markov models (HMMs), which capture dependencies within
genomes. A novel aspect of our work is that it also accounts for incomplete
lineage sorting and dependence across loci.
Application of our model to variation data from chromosome 7 in the mouse
(Mus musculus domesticus) genome detects a recently reported adaptive
introgression event involving the rodent poison resistance gene Vkorc1, in
addition to other newly detected introgression regions. Based on our analysis,
it is estimated that about 12% of all sites within chromosome 7 are of
introgressive origin (these cover about 18 Mbp of chromosome 7, and over 300
genes). Further, our model detects no introgression in two negative control
data sets. Our work provides a powerful framework for systematic analysis of
introgression while simultaneously accounting for dependence across sites,
point mutations, recombination, and ancestral polymorphism.
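The idea of using an HMM to capture dependence between neighbouring sites can be illustrated with a toy decoder. The sketch below is not PhyloNet-HMM: it assumes just two hidden states (0 = background ancestry, 1 = introgressed), a binary per-site observation, and made-up transition and emission probabilities, and it ignores phylogenetic networks and incomplete lineage sorting entirely. All names and numbers are hypothetical.

```python
import math


def viterbi(obs, init, trans, emit):
    """Most likely hidden-state path for a discrete HMM, computed in log space.

    obs   : list of observation indices, one per site
    init  : init[s]     = P(first state is s)
    trans : trans[p][s] = P(next state is s | current state is p)
    emit  : emit[s][o]  = P(observation o | state s)
    """
    n_states = len(init)
    # Best log-probability of any path ending in each state at the first site.
    score = [math.log(init[s]) + math.log(emit[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at site t + 1
    for o in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + math.log(trans[p][s]))
            new_score.append(score[best_prev]
                             + math.log(trans[best_prev][s])
                             + math.log(emit[s][o]))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical parameters: states tend to persist along the chromosome,
# and each state usually emits its "own" signal.
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.1, 0.9]]
EMIT = [[0.9, 0.1], [0.1, 0.9]]

tract = viterbi([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], INIT, TRANS, EMIT)
blip = viterbi([0, 0, 0, 1, 0, 0, 0], INIT, TRANS, EMIT)
```

With these assumed parameters, a run of four consecutive signals is decoded as an introgressed tract, while a single isolated signal is absorbed into the background state, which is the smoothing effect that modelling dependence across sites provides.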
Integration of Alignment and Phylogeny in the Whole-Genome Era
With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g. genome recombination and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). Currently, there is a large amount of MSA data on the internet, such as the UCSC genome database, but how to use this information effectively to solve classical and new problems remains underexplored. In this thesis, we explored how to use this information in four problems: the sequence orthology search problem, the multiple alignment improvement problem, the short read mapping problem, and the genome rearrangement inference problem.
For the first problem, we developed an EM algorithm to iteratively align a query against a multiple alignment database using information from a phylogeny relating the query species and the species in the multiple alignment. We also infer the query's location in the phylogeny. We showed that by doing alignment and phylogeny inference together, we can improve accuracy on both problems.
For the second problem, we developed an optimization algorithm to iteratively refine the multiple alignment quality. Experimental results showed that our algorithm is very stable in terms of the resulting alignments. The results showed that our method is more accurate than existing methods, namely Mafft, Clustal-O, and Mavid, on test data from three sets of species from the UCSC genome database.
For the third problem, we developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. Both theoretical computation and experimental results show that our model can differentiate between orthologous and paralogous alignments better than other popular short read mapping tools (BWA, BOWTIE and BLAST).
For the fourth problem, we gave a simple genome recombination model which can express insertions, deletions, inversions, translocations and inverted translocations on aligned genome segments. We also developed an MCMC algorithm to infer the order of the query segments. We proved that using any Euclidean metric to measure the distance between two sequence orders in the tree optimization goal function leads to a degenerate solution in which the inferred order is the order of one of the leaf nodes. We also gave a graph-based formulation of the problem which can represent the probability distribution of the order of the query sequences.
Methods for Assessing Population Relationships and History Using Genomic Data
Genetic data contain a record of our evolutionary history. The availability of
large-scale datasets of human populations from various geographic areas and
timescales, coupled with advances in the computational methods to analyze
these data, has transformed our ability to use genetic data to learn about
our evolutionary past. Here, we review some of the widely used statistical
methods to explore and characterize population relationships and history
using genomic data. We describe the intuition behind commonly used approaches, their interpretation, and important limitations. For illustration, we
apply some of these techniques to genome-wide autosomal data from 929 individuals representing 53 worldwide populations that are part of the Human
Genome Diversity Project. Finally, we discuss the new frontiers in genomic
methods to learn about population history. In sum, this review highlights
the power (and limitations) of DNA to infer features of human evolutionary
history, complementing the knowledge gleaned from other disciplines, such
as archaeology, anthropology, and linguistics.
Genome-wide inference of ancestral recombination graphs
The complex correlation structure of a collection of orthologous DNA
sequences is uniquely captured by the "ancestral recombination graph" (ARG), a
complete record of coalescence and recombination events in the history of the
sample. However, existing methods for ARG inference are computationally
intensive, highly approximate, or limited to small numbers of sequences, and,
as a consequence, explicit ARG inference is rarely used in applied population
genomics. Here, we introduce a new algorithm for ARG inference that is
efficient enough to apply to dozens of complete mammalian genomes. The key idea
of our approach is to sample an ARG of n chromosomes conditional on an ARG of
n-1 chromosomes, an operation we call "threading." Using techniques based on
hidden Markov models, we can perform this threading operation exactly, up to
the assumptions of the sequentially Markov coalescent and a discretization of
time. An extension allows for threading of subtrees instead of individual
sequences. Repeated application of these threading operations results in highly
efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these
methods in a computer program called ARGweaver. Experiments with simulated data
indicate that ARGweaver converges rapidly to the true posterior distribution
and is effective in recovering various features of the ARG for dozens of
sequences generated under realistic parameters for human populations. In
applications of ARGweaver to 54 human genome sequences from Complete Genomics,
we find clear signatures of natural selection, including regions of unusually
ancient ancestry associated with balancing selection and reductions in allele
age in sites under directional selection. Preliminary results also indicate
that our methods can be used to gain insight into complex features of human
population structure, even with a noninformative prior distribution.
Comment: 88 pages, 7 main figures, 22 supplementary figures. This version
contains a substantially expanded genomic data analysis.
Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes
Demographic events shape a population's genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.
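For readers unfamiliar with particle filters, the basic propagate, weight, resample loop can be sketched on a much simpler problem. The example below is a plain discrete-time bootstrap particle filter for a one-dimensional Gaussian random walk observed with noise. It is not the continuous-time Markov jump process filter, the waypoint construction, or the generalised Auxiliary Particle Filter described in the abstract, and the model, function name, and parameter values are all assumptions chosen for illustration.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_sd=0.5, obs_sd=0.5, seed=0):
    """Posterior-mean estimates of a hidden random walk, one per observation.

    Assumed toy model (hypothetical):
        x_t = x_{t-1} + Normal(0, process_sd)
        y_t = x_t     + Normal(0, obs_sd)
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the process model.
        particles = [x + rng.gauss(0.0, process_sd) for x in particles]
        # 2. Weight by the observation likelihood, in log space for stability.
        log_w = [-0.5 * ((y - x) / obs_sd) ** 2 for x in particles]
        max_log_w = max(log_w)
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        # 3. Record the weighted posterior mean.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # 4. Multinomial resampling: draw particles in proportion to weight.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# A constant signal at 5.0: the filter starts near 0 and should move toward it.
estimates = bootstrap_particle_filter([5.0] * 30)
```

Resampling at every step keeps the particle population concentrated on plausible states, at the cost of some extra Monte Carlo noise. The auxiliary variants mentioned in the abstract are one family of refinements of this basic scheme.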
The era of the ARG: an empiricist's guide to ancestral recombination graphs
In the presence of recombination, the evolutionary relationships between a
set of sampled genomes cannot be described by a single genealogical tree.
Instead, the genomes are related by a complex, interwoven collection of
genealogies formalized in a structure called an ancestral recombination graph
(ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is
replete with valuable information for addressing diverse questions in
evolutionary biology. Despite its potential utility, technological and
methodological limitations, along with a lack of approachable literature, have
severely restricted awareness and application of ARGs in empirical evolution
research. Excitingly, recent progress in ARG reconstruction and simulation has
made ARG-based approaches feasible for many questions and systems. In this
review, we provide an accessible introduction and exploration of ARGs, survey
recent methodological breakthroughs, and describe the potential for ARGs to
further existing goals and open avenues of inquiry that were previously
inaccessible in evolutionary genomics. Through this discussion, we aim to more
widely disseminate the promise of ARGs in evolutionary genomics and encourage
the broader development and adoption of ARG-based inference.
Comment: 34 pages, 3 figures, 3 tables
Decoding coalescent hidden Markov models in linear time
In many areas of computational biology, hidden Markov models (HMMs) have been
used to model local genomic features. In particular, coalescent HMMs have been
used to infer ancient population sizes, migration rates, divergence times, and
other parameters such as mutation and recombination rates. As more loci,
sequences, and hidden states are added to the model, however, the runtime of
coalescent HMMs can quickly become prohibitive. Here we present a new algorithm
for reducing the runtime of coalescent HMMs from quadratic in the number of
hidden time states to linear, without making any additional approximations. Our
algorithm can be incorporated into various coalescent HMMs, including the
popular method PSMC for inferring variable effective population sizes. Here we
implement this algorithm to speed up our demographic inference method diCal,
which is equivalent to PSMC when applied to a sample of two haplotypes. We
demonstrate that the linear-time method can reconstruct a population size
change history more accurately than the quadratic-time method, given similar
computation resources. We also apply the method to data from the 1000 Genomes
project, inferring a high-resolution history of size changes in the European
population.
Comment: 18 pages, 5 figures. To appear in the Proceedings of the 18th Annual
International Conference on Research in Computational Molecular Biology
(RECOMB 2014). The final publication is available at link.springer.co
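To see why structure in the transition matrix can turn a quadratic forward update into a linear one, consider a deliberately simplified case. The sketch below is not the algorithm from the paper: it assumes a hypothetical transition model in which the chain either stays in its current state with probability `stay` or redraws its state from a fixed distribution `jump_dist`. Under that assumption the sum over previous states collapses to a single shared total, so one forward step costs O(K) instead of O(K^2) for K hidden states, with no approximation.

```python
def build_transition(stay, jump_dist):
    """Dense K x K matrix: keep the state with prob. `stay`, else redraw from jump_dist."""
    k = len(jump_dist)
    return [[(stay if i == j else 0.0) + (1.0 - stay) * jump_dist[j]
             for j in range(k)]
            for i in range(k)]


def forward_step_quadratic(alpha, trans, emit):
    """Generic forward update: O(K^2) work for K hidden states."""
    k = len(alpha)
    return [emit[j] * sum(alpha[i] * trans[i][j] for i in range(k))
            for j in range(k)]


def forward_step_linear(alpha, stay, jump_dist, emit):
    """Same update, exploiting the assumed structure: O(K) work.

    sum_i alpha[i] * trans[i][j]
        = stay * alpha[j] + (1 - stay) * jump_dist[j] * sum_i alpha[i]
    """
    total = sum(alpha)  # computed once and shared by every target state
    return [emit[j] * (stay * alpha[j] + (1.0 - stay) * total * jump_dist[j])
            for j in range(len(alpha))]


# Hypothetical numbers for three discretized time states.
STAY = 0.8
JUMP = [0.5, 0.3, 0.2]
ALPHA = [0.2, 0.5, 0.3]
EMIT = [0.9, 0.4, 0.1]

dense = build_transition(STAY, JUMP)
slow = forward_step_quadratic(ALPHA, dense, EMIT)
fast = forward_step_linear(ALPHA, STAY, JUMP, EMIT)
```

Both functions return the same forward probabilities up to floating-point rounding. Real coalescent HMM transition matrices are more intricate than this stay-or-redraw form, so the example only conveys the general principle of replacing a dense matrix product with a few shared running sums.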