Boosting forward-time population genetic simulators through genotype compression
Background:
Forward-time population genetic simulations play a central role in deriving and testing evolutionary
hypotheses. Such simulations may be data-intensive, depending on the settings of the various parameters
controlling them. In particular, for certain settings, the data footprint may quickly exceed the
memory of a single compute node.
Results:
We develop a novel and general method for addressing the memory issue inherent in forward-time
simulations by compressing and decompressing, in real time, active and ancestral genotypes, while
carefully accounting for the time overhead. We propose a general graph data structure for compressing
the genotype space explored during a simulation run, along with efficient algorithms for constructing
and updating compressed genotypes which support both mutation and recombination. We tested the
performance of our method in very large-scale simulations. Results show that our method not only
scales well, but that it also overcomes memory issues that would cripple existing tools.
Conclusions:
As evolutionary analyses are being increasingly performed on genomes, pathways, and networks,
particularly in the era of systems biology, scaling population genetic simulators to handle large-scale
simulations is crucial. We believe our method offers a significant step in that direction. Further, the
techniques we provide are generic and can be integrated with existing population genetic simulators
to boost their performance in terms of memory usage.
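To make the idea of a compressed genotype graph concrete, the following is a minimal sketch and not the authors' data structure. It assumes binary alleles (a genotype is the set of sites carrying the derived allele) and single-crossover recombination; the names `GenotypeGraph`, `mutate`, `recombine`, and `decompress` are hypothetical. Each genotype is stored only as a reference to its parent(s) plus one integer, and the full genotype is rebuilt on demand.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Node:
    kind: str                  # "root", "mutation", or "recombination"
    parents: Tuple[int, ...]   # ids of the parent genotype(s)
    value: int                 # mutated site, or recombination breakpoint


class GenotypeGraph:
    """Hypothetical compressed store: one small node per genotype."""

    def __init__(self) -> None:
        # Node 0 is the ancestral genotype with no derived alleles.
        self.nodes: Dict[int, Node] = {0: Node("root", (), 0)}

    def _add(self, node: Node) -> int:
        node_id = len(self.nodes)
        self.nodes[node_id] = node
        return node_id

    def mutate(self, parent_id: int, site: int) -> int:
        """Record a genotype that differs from its parent at one site."""
        return self._add(Node("mutation", (parent_id,), site))

    def recombine(self, left_id: int, right_id: int, breakpoint: int) -> int:
        """Record a genotype taking sites < breakpoint from left_id
        and sites >= breakpoint from right_id."""
        return self._add(Node("recombination", (left_id, right_id), breakpoint))

    def decompress(self, node_id: int) -> FrozenSet[int]:
        """Rebuild the full set of derived sites by walking the graph."""
        node = self.nodes[node_id]
        if node.kind == "root":
            return frozenset()
        if node.kind == "mutation":
            # Toggle the site, so a repeated mutation acts as a back mutation.
            return self.decompress(node.parents[0]) ^ {node.value}
        left = self.decompress(node.parents[0])
        right = self.decompress(node.parents[1])
        return frozenset(s for s in left if s < node.value) | frozenset(
            s for s in right if s >= node.value
        )
```

In this sketch the memory cost per genotype is constant, while decompression time grows with the depth of the ancestry, which is the kind of time overhead the abstract says must be accounted for. A practical version would likely cache decompressed genotypes and avoid deep recursion.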
ncDNA and drift drive binding site accumulation
Background: The number of transcription factor binding sites (TFBS) in an organism's genome positively correlates
with the complexity of the regulatory network of the organism. However, the manner by which TFBS arise and
accumulate in genomes and the effects of regulatory network complexity on the organism's fitness are far from being
known. The availability of TFBS data from many organisms provides an opportunity to explore these issues, particularly
from an evolutionary perspective.
Results: We analyzed TFBS data from five model organisms -- E. coli K12, S. cerevisiae, C. elegans, D. melanogaster, A.
thaliana -- and found a positive correlation between the amount of non-coding DNA (ncDNA) in the organism's
genome and regulatory complexity. Based on this finding, we hypothesize that the amount of ncDNA, combined with
the population size, can explain the patterns of regulatory complexity across organisms. To test this hypothesis, we
devised a genome-based regulatory pathway model and subjected it to the forces of evolution through population
genetic simulations. The results support our hypothesis, showing neutral evolutionary forces alone can explain TFBS
patterns, and that selection on the regulatory network function does not alter this finding.
Conclusions: The cis-regulome is not a clean functional network crafted by adaptive forces alone, but instead a data
source filled with the noise of non-adaptive forces. From a regulatory perspective, this evolutionary noise manifests as
complexity on both the binding site and pathway level, which has significant implications for many directions in
microbiology, genetics, and synthetic biology.
Convergent evolution of modularity in metabolic networks through different community structures
Background: It has been reported that the modularity of metabolic networks of bacteria is closely related to the
variability of their living habitats. However, given the dependency of the modularity score on the community structure,
it remains unknown whether organisms achieve certain modularity via similar or different community structures.
Results: In this work, we studied the relationship between similarities in modularity scores and similarities in
community structures of the metabolic networks of 1021 species. Both similarities are then compared against the
genetic distances. We revisited the association between modularity and variability of the microbial living
environments and extended the analysis to other aspects of their lifestyle such as temperature and oxygen
requirements. We also tested both topological and biological intuition of the community structures identified and
investigated the extent of their conservation with respect to the taxonomy.
Conclusions: We find that similar modularities are realized by different community structures. We find that such
convergent evolution of modularity is closely associated with the number of (distinct) enzymes in the organism's
metabolome, a consequence of different lifestyles of the species. We find that the order of modularity is the same as
the order of the number of the enzymes under the classification based on the temperature preference but not on the
oxygen requirement. Besides, inspection of modularity-based communities reveals that these communities are
graph-theoretically meaningful yet not reflective of specific biological functions. From an evolutionary perspective,
we find that the community structures are conserved only at the level of kingdoms. Our results call for more
investigation into the interplay between evolution and modularity: how evolution shapes modularity, and how
modularity affects evolution (mainly in terms of fitness and evolvability). Further, our results call for exploring new
measures of modularity and network communities that better correspond to functional categorizations.
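The dependence of a modularity score on the chosen community structure can be illustrated with a short sketch. The abstract does not say which modularity measure was used; the code below assumes the widely used Newman-Girvan modularity for an undirected, unweighted graph without self-loops, and the function name `modularity` and its inputs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    community_of: Dict[Hashable, Hashable],
) -> float:
    """Newman-Girvan modularity Q = sum_c [ e_c / m - (d_c / 2m)^2 ],
    where m is the number of edges, e_c the number of edges inside
    community c, and d_c the total degree of the nodes in c."""
    edge_list = list(edges)
    m = len(edge_list)
    intra = defaultdict(int)    # e_c: edges with both ends in community c
    degree = defaultdict(int)   # d_c: summed degree of community c
    for u, v in edge_list:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
```

On the same toy graph, `modularity(toy_edges, two_groups)` and `modularity(toy_edges, one_group)` give different scores, which shows that the score is a property of the graph together with a partition rather than of the graph alone.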
Empirical Performance of Tree-Based Inference of Phylogenetic Networks
Phylogenetic networks extend the phylogenetic tree structure and allow for modeling vertical and horizontal evolution in a single framework. Statistical inference of phylogenetic networks is prohibitive and currently limited to small networks. An approach that could significantly improve phylogenetic network space exploration is based on first inferring an evolutionary tree of the species under consideration, and then augmenting the tree into a network by adding a set of "horizontal" edges to better fit the data.
In this paper, we study the performance of such an approach on networks generated under a birth-hybridization model and explore its feasibility as an alternative to approaches that search the phylogenetic network space directly (without relying on a fixed underlying tree). We find that the concatenation method does poorly at obtaining a "backbone" tree that could be augmented into the correct network, whereas the popular species tree inference method ASTRAL does significantly better at such a task. We then evaluated the tree-to-network augmentation phase under the minimizing deep coalescence and pseudo-likelihood criteria. We find that even though this is a much faster approach than the direct search of the network space, the accuracy is much poorer, even when the backbone tree is a good starting tree.
Our results show that tree-based inference of phylogenetic networks could yield very poor results. As exploration of the network space directly in search of maximum likelihood estimates or a representative sample of the posterior is very expensive, significant improvements to the computational complexity of phylogenetic network inference are imperative if analyses of large data sets are to be performed. We show that a recently developed divide-and-conquer approach significantly outperforms tree-based inference in terms of accuracy, albeit still at a higher computational cost.
A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference
Single-cell sequencing provides a powerful approach for elucidating intratumor heterogeneity by resolving cell-to-cell variability. However, it also poses additional challenges including elevated error rates, allelic dropout and non-uniform coverage. A recently introduced single-cell-specific mutation detection algorithm leverages the evolutionary relationship between cells for denoising the data. However, due to its probabilistic nature, this method does not scale well with the number of cells. Here, we develop a novel combinatorial approach for utilizing the genealogical relationship of cells in detecting mutations from noisy single-cell sequencing data. Our method, called scVILP, jointly detects mutations in individual cells and reconstructs a perfect phylogeny among these cells. We employ a novel Integer Linear Program algorithm for deterministically and efficiently solving the joint inference problem. We show that scVILP achieves similar or better accuracy but significantly better runtime over existing methods on simulated data. We also applied scVILP to an empirical human cancer dataset from a high-grade serous ovarian cancer patient.
Progress on Constructing Phylogenetic Networks for Languages
In 2006, Warnow, Evans, Ringe, and Nakhleh proposed a stochastic model
(hereafter, the WERN 2006 model) of multi-state linguistic character evolution
that allowed for homoplasy and borrowing. They proved that if there is no
borrowing between languages and homoplastic states are known in advance, then
the phylogenetic tree of a set of languages is statistically identifiable under
this model, and they presented statistically consistent methods for estimating
these phylogenetic trees. However, they left open the question of whether a
phylogenetic network -- which would explicitly model borrowing between
languages that are in contact -- can be estimated under the model of character
evolution. Here, we establish that under some mild additional constraints on
the WERN 2006 model, the phylogenetic network topology is statistically
identifiable, and we present algorithms to infer the phylogenetic network. We
discuss the ramifications for linguistic phylogenetic network estimation in
practice, and suggest directions for future research.
Comment: 16 pages, 2 figures
Bootstrap-based Support of HGT Inferred by Maximum Parsimony
Background: Maximum parsimony is one of the most commonly used criteria for reconstructing phylogenetic trees. Recently, Nakhleh and co-workers extended this criterion to enable reconstruction of phylogenetic networks, and demonstrated its application to detecting reticulate evolutionary relationships. However, one of the major problems with this extension has been that it favors more complex evolutionary relationships over simpler ones, thus having the potential for overestimating the amount of reticulation in the data. An ad hoc solution to this problem that has been used entails inspecting the improvement in the parsimony length as more reticulation events are added to the model, and stopping when the improvement is below a certain threshold.
Results: In this paper, we address this problem in a more systematic way, by proposing a nonparametric bootstrap-based measure of support of inferred reticulation events, and using it to determine the number of those events, as well as their placements. A number of samples is generated from the given sequence alignment, and reticulation events are inferred based on each sample. Finally, the support of each reticulation event is quantified based on the inferences made over all samples.
Conclusions: We have implemented our method in the NEPAL software tool (available publicly a
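The bootstrap procedure described in the Results of this abstract can be sketched as follows. This is an illustration under stated assumptions, not the NEPAL implementation: the alignment is taken to be a list of equal-length strings, columns are resampled with replacement, and `infer_events` is a hypothetical caller-supplied function that returns the reticulation events inferred from one alignment as hashable values.

```python
import random
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, List


def bootstrap_alignment(alignment: List[str], rng: random.Random) -> List[str]:
    """Resample alignment columns with replacement (nonparametric bootstrap)."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(
    alignment: List[str],
    infer_events: Callable[[List[str]], Iterable[Hashable]],  # hypothetical
    n_replicates: int = 100,
    seed: int = 0,
) -> Dict[Hashable, float]:
    """Fraction of bootstrap replicates in which each event is inferred."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n_replicates):
        sample = bootstrap_alignment(alignment, rng)
        # Count each event at most once per replicate.
        counts.update(set(infer_events(sample)))
    return {event: count / n_replicates for event, count in counts.items()}
```

Events whose support falls below a chosen cutoff could then be discarded, which is one way to decide how many reticulation events to keep; the cutoff itself is a choice the abstract does not specify.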
Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis
The role of hybridization and subsequent introgression has been demonstrated in an increasing number of species. Recently, Fontaine et al. (Science, 347, 2015, 1258524) conducted a phylogenomic analysis of six members of the Anopheles gambiae species complex. Their analysis revealed a reticulate evolutionary history and pointed to extensive introgression on all four autosomal arms. The study further highlighted the complex evolutionary signals that the co-occurrence of incomplete lineage sorting (ILS) and introgression can give rise to in phylogenomic analyses. While tree-based methodologies were used in the study, phylogenetic networks provide a more natural model to capture reticulate evolutionary histories. In this work, we reanalyse the Anopheles data using a recently devised framework that combines the multispecies coalescent with phylogenetic networks. This framework allows us to capture ILS and introgression simultaneously, and forms the basis for statistical methods for inferring reticulate evolutionary histories. The new analysis reveals a phylogenetic network with multiple hybridization events, some of which differ from those reported in the original study. To elucidate the extent and patterns of introgression across the genome, we devise a new method that quantifies the use of reticulation branches in the phylogenetic network by each genomic region. Applying the method to the mosquito data set reveals the evolutionary history of all the chromosomes. This study highlights the utility of ‘network thinking’ and the new insights it can uncover, in particular in phylogenomic analyses of large data sets with extensive gene tree incongruence.