Boosting forward-time population genetic simulators through genotype compression
Background:
Forward-time population genetic simulations play a central role in deriving and testing evolutionary
hypotheses. Such simulations may be data-intensive, depending on the settings of the various
parameters controlling them. In particular, for certain settings, the data footprint may quickly exceed the
memory of a single compute node.
Results:
We develop a novel and general method for addressing the memory issue inherent in forward-time
simulations by compressing and decompressing, in real-time, active and ancestral genotypes, while
carefully accounting for the time overhead. We propose a general graph data structure for compressing
the genotype space explored during a simulation run, along with efficient algorithms for constructing
and updating compressed genotypes which support both mutation and recombination. We tested the
performance of our method in very large-scale simulations. Results show that our method not only
scales well, but that it also overcomes memory issues that would cripple existing tools.
Conclusions:
As evolutionary analyses are being increasingly performed on genomes, pathways, and networks,
particularly in the era of systems biology, scaling population genetic simulators to handle large-scale
simulations is crucial. We believe our method offers a significant step in that direction. Further, the
techniques we provide are generic and can be integrated with existing population genetic simulators
to boost their performance in terms of memory usage.
ncDNA and drift drive binding site accumulation
Background: The number of transcription factor binding sites (TFBS) in an organism's genome positively correlates
with the complexity of the regulatory network of the organism. However, the manner by which TFBS arise and
accumulate in genomes and the effects of regulatory network complexity on the organism's fitness are far from being
known. The availability of TFBS data from many organisms provides an opportunity to explore these issues, particularly
from an evolutionary perspective.
Results: We analyzed TFBS data from five model organisms -- E. coli K12, S. cerevisiae, C. elegans, D. melanogaster, A.
thaliana -- and found a positive correlation between the amount of non-coding DNA (ncDNA) in the organism's
genome and regulatory complexity. Based on this finding, we hypothesize that the amount of ncDNA, combined with
the population size, can explain the patterns of regulatory complexity across organisms. To test this hypothesis, we
devised a genome-based regulatory pathway model and subjected it to the forces of evolution through population
genetic simulations. The results support our hypothesis, showing neutral evolutionary forces alone can explain TFBS
patterns, and that selection on the regulatory network function does not alter this finding.
Conclusions: The cis-regulome is not a clean functional network crafted by adaptive forces alone, but instead a data
source filled with the noise of non-adaptive forces. From a regulatory perspective, this evolutionary noise manifests as
complexity at both the binding-site and pathway levels, which has significant implications for many directions in
microbiology, genetics, and synthetic biology.
A Sequence-Based, Population Genetic Model of Regulatory Pathway Evolution
Complex phenotypes with a genetic cause are understood through many processes, including regulatory pathways, but our evolutionary understanding of these critical structures is undermined by poor models that fail to preserve the underlying sequence structure and to incorporate population genetics. In response, this thesis builds a pathway model of evolution from its underlying sequence structure and validates it against a pertinent problem in genome evolution that uniquely leverages the developed model. Specifically, my model preserves sequence characteristics through a novel data structure and through pathway-level mutation and recombination rates that are functions of sequence properties. The utility of the model is validated with a study quantifying the advantages and disadvantages of expansive non-coding DNA regions for the establishment of optimal pathways. Because the model presented in this thesis rectifies many fundamental problems in previous models, it may serve as a critical tool for future work in pathway evolution.
Population Regulomics: Applying population genetics to the cis-regulome
Population genetics provides a mathematical and computational framework for understanding
and modeling evolutionary processes, and so it is vital for the investigation
of biological systems. In its current state, molecular population genetics is exclusively
focused on molecular sequences (DNA, RNA, or amino acid sequences), where
all application-ready simulators and analytic measures work only on sequence data.
Consequently, in the early 2000s, when technologies became available to sequence
entire genomes, population genetic approaches were naturally applied to mine out
signatures of selection and conservation, resulting in the subfield of population genomics.
Nearly every present genome project applies population genomic techniques
to identify functional information and genome structure.
Recent technologies have ushered in a similar wave of genetic information, this
time focusing on biological mechanisms operating above the genome, most notably
on gene regulation (regulatory networks). In this work, I develop a molecular population
genetics approach for gene regulation, called population regulomics, which
includes simulators and analytic measurements that operate on populations of regulatory
networks. I conducted extensive data analyses to connect the genome with the
cis-regulome, developed computationally efficient simulators, and adapted population
genetic measurements on sequence to the regulatory network. By connecting genomic
information to cis-regulation, we may apply the wealth of knowledge at the genome
level to observed patterns at the regulatory level with unknown evolutionary origin.
I demonstrate that by applying population regulomics to the E. coli cis-regulatory
network, for the first time we are able to quantify the evolutionary origins of topological
patterns and reveal the surprising amount of neutral signal in the bacterial
cis-regulome. Since regulatory networks play a central role in cellular functioning and,
consequently, organismal fitness, this new subfield of population regulomics promises
to shed the light of evolution on regulatory mechanisms and, more broadly, on the
genetic mechanisms underlying the various phenotypes.
Indirect and suboptimal control of gene expression is widespread in bacteria
Gene regulation in bacteria is usually described as an adaptive response to an environmental change so that genes are expressed when they are required. We instead propose that most genes are under indirect control: their expression responds to signal(s) that are not directly related to the genes' function. Indirect control should perform poorly in artificial conditions, and we show that gene regulation is often maladaptive in the laboratory. In Shewanella oneidensis MR-1, 24% of genes are detrimental to fitness in some conditions, and detrimental genes tend to be highly expressed instead of being repressed when not needed. In diverse bacteria, there is little correlation between when genes are important for optimal growth or fitness and when those genes are upregulated. Two common types of indirect control are constitutive expression and regulation by growth rate; these occur for genes with diverse functions and often seem to be suboptimal. Because genes that have closely related functions can have dissimilar expression patterns, regulation may be suboptimal in the wild as well as in the laboratory.