Boosting forward-time population genetic simulators through genotype compression
Background:
Forward-time population genetic simulations play a central role in deriving and testing evolutionary
hypotheses. Such simulations may be data-intensive, depending on the settings of the various parameters
controlling them. In particular, for certain settings, the data footprint may quickly exceed the
memory of a single compute node.
Results:
We develop a novel and general method for addressing the memory issue inherent in forward-time
simulations by compressing and decompressing, in real time, active and ancestral genotypes, while
carefully accounting for the time overhead. We propose a general graph data structure for compressing
the genotype space explored during a simulation run, along with efficient algorithms for constructing
and updating compressed genotypes which support both mutation and recombination. We tested the
performance of our method in very large-scale simulations. Results show that our method not only
scales well, but that it also overcomes memory issues that would cripple existing tools.
Conclusions:
As evolutionary analyses are being increasingly performed on genomes, pathways, and networks,
particularly in the era of systems biology, scaling population genetic simulators to handle large-scale
simulations is crucial. We believe our method offers a significant step in that direction. Further, the
techniques we provide are generic and can be integrated with existing population genetic simulators
to boost their performance in terms of memory usage.
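As an illustration of the general idea of storing genotypes as nodes in a graph, the following is a minimal sketch, not the data structure from the paper. It assumes biallelic (0/1) sites, a single crossover point per recombination, and hypothetical names (`Node`, `GenotypeGraph`, `mutate`, `recombine`, `decompress`). Only root genotypes are stored in full; every other genotype keeps references to its parent(s) plus a small delta, and is rebuilt on demand.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    """One compressed genotype: parent reference(s) plus a small delta."""
    parents: Tuple[int, ...] = ()   # 0 parents: root, 1: mutation, 2: recombination
    flips: Tuple[int, ...] = ()     # sites flipped relative to the single parent
    crossover: Optional[int] = None # sites < crossover from parents[0], rest from parents[1]
    sequence: Optional[Tuple[int, ...]] = None  # full sequence, roots only


class GenotypeGraph:
    """Hypothetical container; an assumption for illustration only."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self._next_id = 0

    def _add(self, node: Node) -> int:
        node_id = self._next_id
        self.nodes[node_id] = node
        self._next_id += 1
        return node_id

    def add_root(self, sequence: Sequence[int]) -> int:
        return self._add(Node(sequence=tuple(sequence)))

    def mutate(self, parent: int, sites: Sequence[int]) -> int:
        # Store only the mutated sites, not a copy of the genotype.
        return self._add(Node(parents=(parent,), flips=tuple(sites)))

    def recombine(self, left: int, right: int, crossover: int) -> int:
        # Store only the two parent ids and the crossover position.
        return self._add(Node(parents=(left, right), crossover=crossover))

    def decompress(self, node_id: int) -> List[int]:
        """Rebuild the full genotype by walking back to the root(s)."""
        node = self.nodes[node_id]
        if node.sequence is not None:
            return list(node.sequence)
        if len(node.parents) == 1:
            seq = self.decompress(node.parents[0])
            for site in node.flips:
                seq[site] ^= 1  # flip the biallelic state
            return seq
        left = self.decompress(node.parents[0])
        right = self.decompress(node.parents[1])
        return left[:node.crossover] + right[node.crossover:]


graph = GenotypeGraph()
root = graph.add_root([0, 0, 0, 0])
a = graph.mutate(root, [1])
b = graph.mutate(root, [3])
child = graph.recombine(a, b, crossover=2)
print(graph.decompress(child))
```

The trade-off this sketch makes visible is the one the abstract mentions: memory shrinks because each genotype stores only a delta, while decompression costs time proportional to the length of the ancestry chain. The recursive walk is kept for brevity; very deep ancestries would need an iterative version or periodic re-rooting.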
ncDNA and drift drive binding site accumulation
Background: The number of transcription factor binding sites (TFBS) in an organism's genome positively correlates
with the complexity of the regulatory network of the organism. However, the manner by which TFBS arise and
accumulate in genomes and the effects of regulatory network complexity on the organism's fitness are far from being
known. The availability of TFBS data from many organisms provides an opportunity to explore these issues, particularly
from an evolutionary perspective.
Results: We analyzed TFBS data from five model organisms -- E. coli K12, S. cerevisiae, C. elegans, D. melanogaster, A.
thaliana -- and found a positive correlation between the amount of non-coding DNA (ncDNA) in the organism's
genome and regulatory complexity. Based on this finding, we hypothesize that the amount of ncDNA, combined with
the population size, can explain the patterns of regulatory complexity across organisms. To test this hypothesis, we
devised a genome-based regulatory pathway model and subjected it to the forces of evolution through population
genetic simulations. The results support our hypothesis, showing that neutral evolutionary forces alone can explain TFBS
patterns, and that selection on the regulatory network function does not alter this finding.
Conclusions: The cis-regulome is not a clean functional network crafted by adaptive forces alone, but instead a data
source filled with the noise of non-adaptive forces. From a regulatory perspective, this evolutionary noise manifests as
complexity on both the binding site and pathway level, which has significant implications for many directions in
microbiology, genetics, and synthetic biology.
Convergent evolution of modularity in metabolic networks through different community structures
Background: It has been reported that the modularity of metabolic networks of bacteria is closely related to the
variability of their living habitats. However, given the dependency of the modularity score on the community structure,
it remains unknown whether organisms achieve certain modularity via similar or different community structures.
Results: In this work, we studied the relationship between similarities in modularity scores and similarities in
community structures of the metabolic networks of 1021 species. Both similarities are then compared against the
genetic distances. We revisited the association between modularity and variability of the microbial living
environments and extended the analysis to other aspects of their lifestyle, such as temperature and oxygen
requirements. We also tested both topological and biological intuition of the community structures identified and
investigated the extent of their conservation with respect to the taxonomy.
Conclusions: We find that similar modularities are realized by different community structures. We find that such
convergent evolution of modularity is closely associated with the number of (distinct) enzymes in the organism's
metabolome, a consequence of different lifestyles of the species. We find that the order of modularity is the same as
the order of the number of the enzymes under the classification based on the temperature preference but not on the
oxygen requirement. Besides, inspection of modularity-based communities reveals that these communities are
graph-theoretically meaningful yet not reflective of specific biological functions. From an evolutionary perspective,
we find that the community structures are conserved only at the level of kingdoms. Our results call for more
investigation into the interplay between evolution and modularity: how evolution shapes modularity, and how
modularity affects evolution (mainly in terms of fitness and evolvability). Further, our results call for exploring new
measures of modularity and network communities that better correspond to functional categorizations.
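To make the dependence of a modularity score on the chosen community structure concrete, here is a minimal sketch that computes the standard Newman-Girvan modularity Q for a given partition. The abstract does not state which modularity measure was used, so this choice is an assumption, as are the function name `modularity` and the toy graph. The sketch assumes an undirected, unweighted graph given as an edge list.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               community: Dict[Hashable, Hashable]) -> float:
    """Newman-Girvan modularity: Q = sum_c [ L_c / m - (d_c / (2m))^2 ],
    where m is the number of edges, L_c the number of edges inside
    community c, and d_c the total degree of the nodes in c."""
    edges = list(edges)
    m = len(edges)
    intra = defaultdict(int)   # L_c: edges with both endpoints in c
    degree = defaultdict(int)  # d_c: summed degree of nodes in c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy network (an assumption): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_block = {node: "A" for node in range(6)}

print(modularity(edges, by_triangle))  # partition into the two triangles
print(modularity(edges, one_block))    # everything in one community
```

The same network yields different scores under different partitions, which is why comparing modularity values across organisms says nothing by itself about whether the underlying community structures are similar.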
Empirical Performance of Tree-Based Inference of Phylogenetic Networks
Phylogenetic networks extend the phylogenetic tree structure and allow for modeling vertical and horizontal evolution in a single framework. Statistical inference of phylogenetic networks is prohibitive and currently limited to small networks. An approach that could significantly improve phylogenetic network space exploration is based on first inferring an evolutionary tree of the species under consideration, and then augmenting the tree into a network by adding a set of "horizontal" edges to better fit the data.
In this paper, we study the performance of such an approach on networks generated under a birth-hybridization model and explore its feasibility as an alternative to approaches that search the phylogenetic network space directly (without relying on a fixed underlying tree). We find that the concatenation method does poorly at obtaining a "backbone" tree that could be augmented into the correct network, whereas the popular species tree inference method ASTRAL does significantly better at such a task. We then evaluated the tree-to-network augmentation phase under the minimizing deep coalescence and pseudo-likelihood criteria. We find that even though this is a much faster approach than the direct search of the network space, the accuracy is much poorer, even when the backbone tree is a good starting tree.
Our results show that tree-based inference of phylogenetic networks could yield very poor results. As exploration of the network space directly in search of maximum likelihood estimates or a representative sample of the posterior is very expensive, significant improvements to the computational complexity of phylogenetic network inference are imperative if analyses of large data sets are to be performed. We show that a recently developed divide-and-conquer approach significantly outperforms tree-based inference in terms of accuracy, albeit still at a higher computational cost.
A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference
Single-cell sequencing provides a powerful approach for elucidating intratumor heterogeneity by resolving cell-to-cell variability. However, it also poses additional challenges including elevated error rates, allelic dropout and non-uniform coverage. A recently introduced single-cell-specific mutation detection algorithm leverages the evolutionary relationship between cells for denoising the data. However, due to its probabilistic nature, this method does not scale well with the number of cells. Here, we develop a novel combinatorial approach for utilizing the genealogical relationship of cells in detecting mutations from noisy single-cell sequencing data. Our method, called scVILP, jointly detects mutations in individual cells and reconstructs a perfect phylogeny among these cells. We employ a novel Integer Linear Program algorithm for deterministically and efficiently solving the joint inference problem. We show that scVILP achieves similar or better accuracy but significantly better runtime than existing methods on simulated data. We also applied scVILP to an empirical human cancer dataset from a high-grade serous ovarian cancer patient.
Progress on Constructing Phylogenetic Networks for Languages
In 2006, Warnow, Evans, Ringe, and Nakhleh proposed a stochastic model
(hereafter, the WERN 2006 model) of multi-state linguistic character evolution
that allowed for homoplasy and borrowing. They proved that if there is no
borrowing between languages and homoplastic states are known in advance, then
the phylogenetic tree of a set of languages is statistically identifiable under
this model, and they presented statistically consistent methods for estimating
these phylogenetic trees. However, they left open the question of whether a
phylogenetic network -- which would explicitly model borrowing between
languages that are in contact -- can be estimated under the model of character
evolution. Here, we establish that under some mild additional constraints on
the WERN 2006 model, the phylogenetic network topology is statistically
identifiable, and we present algorithms to infer the phylogenetic network. We
discuss the ramifications for linguistic phylogenetic network estimation in
practice, and suggest directions for future research. Comment: 16 pages, 2 figure
Bootstrap-based Support of HGT Inferred by Maximum Parsimony
Background: Maximum parsimony is one of the most commonly used criteria for reconstructing phylogenetic trees. Recently, Nakhleh and co-workers extended this criterion to enable reconstruction of phylogenetic networks, and demonstrated its application to detecting reticulate evolutionary relationships. However, one of the major problems with this extension has been that it favors more complex evolutionary relationships over simpler ones, thus having the potential for overestimating the amount of reticulation in the data. An ad hoc solution to this problem that has been used entails inspecting the improvement in the parsimony length as more reticulation events are added to the model, and stopping when the improvement is below a certain threshold. Results: In this paper, we address this problem in a more systematic way, by proposing a nonparametric bootstrap-based measure of support of inferred reticulation events, and using it to determine the number of those events, as well as their placements. A number of samples is generated from the given sequence alignment, and reticulation events are inferred based on each sample. Finally, the support of each reticulation event is quantified based on the inferences made over all samples. Conclusions: We have implemented our method in the NEPAL software tool (available publicly a
An HMM-based Comparative Genomic Framework for Detecting Introgression in Eukaryotes
One outcome of interspecific hybridization and subsequent effects of
evolutionary forces is introgression, which is the integration of genetic
material from one species into the genome of an individual in another species.
The evolution of several groups of eukaryotic species has involved
hybridization, and cases of adaptation through introgression have already been
established. In this work, we report on a new comparative genomic framework for
detecting introgression in genomes, called PhyloNet-HMM, which combines
phylogenetic networks, that capture reticulate evolutionary relationships among
genomes, with hidden Markov models (HMMs), that capture dependencies within
genomes. A novel aspect of our work is that it also accounts for incomplete
lineage sorting and dependence across loci.
Application of our model to variation data from chromosome 7 in the mouse
(Mus musculus domesticus) genome detects a recently reported adaptive
introgression event involving the rodent poison resistance gene Vkorc1, in
addition to other newly detected introgression regions. Based on our analysis,
it is estimated that about 12% of all sites within chromosome 7 are of
introgressive origin (these cover about 18 Mbp of chromosome 7, and over 300
genes). Further, our model detects no introgression in two negative control
data sets. Our work provides a powerful framework for systematic analysis of
introgression while simultaneously accounting for dependence across sites,
point mutations, recombination, and ancestral polymorphism.
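The role an HMM plays in labeling sites along a genome can be sketched with a toy two-state Viterbi decoder. This is not PhyloNet-HMM: in that framework the hidden states and emissions are derived from phylogenetic networks and sequence evolution models, whereas everything below (the state names, the observation symbols `"a"` and `"b"`, and all probabilities) is an assumption chosen only to show how dependence between neighboring sites shapes the inferred labeling.

```python
import math
from typing import Dict, List, Sequence


def viterbi(observations: Sequence[str],
            states: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable hidden-state path (log-space Viterbi)."""
    # Best log-probability of any path ending in each state at the first site.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this site.
            best = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[best] + math.log(trans_p[best][s])
                            + math.log(emit_p[s][obs]))
            pointer[s] = best
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Toy parameters (assumptions): "a" sites look native, "b" sites look foreign.
states = ["native", "introgressed"]
start_p = {"native": 0.9, "introgressed": 0.1}
trans_p = {"native": {"native": 0.9, "introgressed": 0.1},
           "introgressed": {"native": 0.1, "introgressed": 0.9}}
emit_p = {"native": {"a": 0.9, "b": 0.1},
          "introgressed": {"a": 0.1, "b": 0.9}}

print(viterbi(list("aabbbaa"), states, start_p, trans_p, emit_p))
```

Because switching states is penalized by the transition probabilities, a run of foreign-looking sites is labeled as one contiguous region rather than site by site, which is the sense in which the HMM captures dependencies within genomes.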